o Provide a new VOP_GETPAGES_ASYNC(), which works like VOP_GETPAGES(), but
doesn't sleep. It returns immediately, and will execute the I/O done handler
function that must be supplied as argument.
o Provide VOP_GETPAGES_ASYNC() for the FFS, which uses vnode_pager.
o Extend pagertab to support pgo_getpages_async method, and implement this
method for vnode_pager.
Reviewed by: kib
Tested by: pho
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
rename it to ufs_dirhashreclaimpercent, as suggested
by jhb@. As an added bonus this avoids divide-by-zero
errors.
Requested by: jhb, markj
Reviewied by: jhb, markj
are not suspended. In particular, on the SU-enabled vulumes, there is
no reason why, between the call to softdep_flushfiles() and
softdep_waitidle(), SU work items cannot be queued.
Correct the condition to trigger the panic by only checking when
forced operation is done. Convert direct panic() call into KASSERT(),
there is no invalid on-disk data structures directly involved, so
follow the usual debugging vs. non-debugging approach.
Reported and tested by: pho
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
two.
nullfs and unionfs need to request suspension if underlying filesystem(s)
use it. Utilize mnt_kern_flag for this purpose.
This is a fixup for 273271.
No strong objections from: kib
Pointy hat to: mjg
MFC after: 2 weeks
survives remount in rw, also it is set for vnodes on rootfs before
noatime can be set or clock is adjusted. All conditions result in
wrong atime for accessed vnodes.
Submitted by: bde
MFC after: 1 week
by ffs and ext2fs. Remove duplicated call to vm_page_zero_invalid(),
done by VOP and by vm_pager_getpages(). Use vm_pager_free_nonreq().
Reviewed by: alc (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 6 weeks (after r271596)
conflict with the quota metadata access. Mark quota vnode lock as
recursive and always exclusive to avoid the problem.
Reported by: hrs
Tested by: hrs, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
for dirhash cache, setting a target percent to reclaim (exposed via
SYSCTL). This allows to always make some amount of progress keeping the maximum
reclaim age dynamic.
Tested by: pho
Reviewed by: jhb
kern_renameat() already starts write on the mp, which prevents
parallel unmount from proceed. Busying mp after vn_start_write()
deadlocks the unmount.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
unmount. There is no need to suspend read-only filesystem, while we
need suspension on modificable mount point.
Reported by: rwatson
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
r269533, which was tested before r269457 was committed, implicitely
relied on the Giant to protect the manipulations of the softdepmounts
list. Use softdep global lock consistently to guarantee the list
structure now.
Insert the new struct mount_softdeps into the softdepmounts only after
it is sufficiently initialized, to prevent softdep_speedup() from
accessing bare memory. Similarly, remove struct mount_softdeps for
the unmounted filesystem from the tailq before destroying structure
rwlock.
Reported and tested by: pho
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
have to adjust freeblk records to reflect the change to a full-size block.
For example, suppose we have a block made up of fragments 8-15 and
want to free its last two fragments. We are given a request that says:
FREEBLK ino=5, blkno=14, lbn=0, frags=2, oldfrags=0
where frags are the number of fragments to free and oldfrags are the
number of fragments to keep. To block align it, we have to change it to
have a valid full-size blkno, so it becomes:
FREEBLK ino=5, blkno=8, lbn=0, frags=2, oldfrags=6
Submitted by: Mikihito Takehara
Tested by: Mikihito Takehara
Reviewed by: Jeff Roberson
MFC after: 1 week
Replace a single soft updates thread with a thread per FFS-filesystem
mount point. The threads are associated with the bufdaemon process.
Reviewed by: kib
Tested by: Peter Holm and Scott Long
MFC after: 2 weeks
Sponsored by: Netflix
forcing filesystem VOP_LINK() methods to repeat the code. In
tmpfs_link(), remove redundand check for the type of the source,
already done by VFS.
Note that NFS server already performs this check before calling
VOP_LINK().
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
unmount time) in the helper vfs_write_suspend_umnt(). Use it instead
of two inline copies in FFS.
Fix the bug in the FFS unmount, when suspension failed, the ufs
extattrs were not reinitialized.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
permissions test, forgotten in r164033.
Refactor the permission checks for utimes(2) into vnode helper
function vn_utimes_perm(9), and simplify its code comparing with the
UFS origin, by writing the call to VOP_ACCESSX only once. Use the
helper for UFS(5), tmpfs(5), devfs(5) and msdosfs(5).
Reported by: bde
Reviewed by: bde, trasz
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
using a direct hook called from kern_vfs_bio_buffer_alloc().
Mark ffs_rawread.c as requiring both ffs and directio options to be
compiled into the kernel. Add ffs_rawread.c to the list of ufs.ko
module' sources.
In addition to stopping breaking the layering violation, it also
allows to link kernel when FFS is configured as module and DIRECTIO is
enabled.
One consequence of the change is that ffs_rawread.o is always linked
into the module regardless of the DIRECTIO option. This is similar to
the option QUOTA and ufs_quota.c.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
If you had a UFS2 FS that didn't have it's super block at SBLOCK_UFS2,
you'll end up corrupting your FS as the superblock is updated and written
to a different location...
makefs used to put the superblock at SBLOCK_UFS1 for UFS 2 FS's causing
this issue...
Reviewed by: silience from mckusick
MFC after: 1 week
a journal block even when there are no journal entries to be written.
Until the root cause is found, handle this case by ensuring that a
valid journal segment is always written.
Second, the data buffer used for writing journal entries was never
being scrubbed of old data. Fix this.
Submitted by: Takehara Mikihito
Obtained from: Netflix, Inc.
MFC after: 3 days
further refinement is required as some device drivers intended to be
portable over FreeBSD versions rely on __FreeBSD_version to decide whether
to include capability.h.
MFC after: 3 weeks
waiting sync pass we need to do a blocking acquire and restart.
Another thread, typically the buf daemon, may have this buf locked and
if we don't wait we can fail to sync the file. This lead to a great
variety of softdep panics because we rely on all dependencies being
flushed before proceeding in several cases.
Reported by: pho
Discussed with: mckusick
Sponsored by: EMC / Isilon Storage Division
MFC after: 2 weeks
This could cause dirhash panics since the dirhash state would be
successfully truncated while the directory was not.
Reported by: pho
Discussed with: mckusick
Sponsored by: EMC / Isilon Storage Division
MFC after: 2 weeks
We had previously tried to flush all MKDIR_PARENT dependencies (and
all the NEWBLOCK pagedeps) by calling ffs_update(). However this will
only resolve these dependencies in direct blocks. So very large
directories with MKDIR_PARENT dependencies in indirect blocks had
not yet gotten flushed. As the directory is in the midst of doing a
complete sync, we simply defer the checking of the MKDIR_PARENT
dependencies until the indirect blocks have been sync'ed.
Reported by: Shawn Wallbridge of imaginaryforces.com
Tested by: John-Mark Gurney <jmg@funkthat.com>
PR: 183424
MFC after: 2 weeks
di_extsize is the EA size and as such it should be unsigned.
Adjust related types for consistency.
Reviewed by: mckusick (previous version)
MFC after: 3 weeks
single kernel-wide soft update lock can be replaced with a
per-filesystem soft-updates lock. This per-filesystem lock will
allow each filesystem to have its own soft-updates flushing thread
rather than being limited to a single soft-updates flushing thread
for the entire kernel.
Move soft update variables out of the ufsmount structure and into
their own mount_softdeps structure referenced by ufsmount field
um_softdep. Eventually the per-filesystem lock will be in this
structure. For now there is simply a pointer to the kernel-wide
soft updates lock.
Change all instances of ACQUIRE_LOCK and FREE_LOCK to pass the lock
pointer in the mount_softdeps structure instead of a pointer to the
kernel-wide soft-updates lock.
Replace the five hash tables used by soft updates with per-filesystem
copies of these tables allocated in the mount_softdeps structure.
Several functions that flush dependencies when too many are allocated
in the kernel used to operate across all filesystems. They are now
parameterized to flush dependencies from a specified filesystem.
For now, we stick with the round-robin flushing strategy when the
kernel as a whole has too many dependencies allocated.
While there are many lines of changes, there should be no functional
change in the operation of soft updates.
Tested by: Peter Holm and Scott Long
Sponsored by: Netflix
Add KASSERTS that soft dependency functions only get called
for filesystems running with soft dependencies. Calling these
functions when soft updates are not compiled into the system
become panic's.
No functional change.
Tested by: Peter Holm and Scott Long
Sponsored by: Netflix
Ensure that softdep_unmount() and softdep_setup_sbupdate()
only get called for filesystems running with soft dependencies.
No functional change.
Tested by: Peter Holm and Scott Long
Sponsored by: Netflix
Convert three functions exported from ffs_softdep.c to static
functions as they are not used outside of ffs_softdep.c.
No functional change.
Tested by: Peter Holm and Scott Long
Sponsored by: Netflix
Most of the code between UFS1 and UFS2 is shared so this change
is pretty safe. Not only this makes UFS1 and 2 consistent but it
also matches what NetBSD and MacOS X have for some years now.
Reviewed by: mckusick
MFC after: 1 month
in the future in a backward compatible (API and ABI) way.
The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides place for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.
The structure definition looks like this:
struct cap_rights {
uint64_t cr_rights[CAP_RIGHTS_VERSION + 2];
};
The initial CAP_RIGHTS_VERSION is 0.
The top two bits in the first element of the cr_rights[] array contain total
number of elements in the array - 2. This means if those two bits are equal to
0, we have 2 array elements.
The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain array index. Only one bit is
used and bit position in this five-bits range defines array index. This means
there can be at most five array elements in the future.
To define new right the CAPRIGHT() macro must be used. The macro takes two
arguments - an array index and a bit to set, eg.
#define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL)
We still support aliases that combine few rights, but the rights have to belong
to the same array element, eg:
#define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL)
#define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL)
#define CAP_FCHMODAT (CAP_FCHMOD | CAP_LOOKUP)
There is new API to manage the new cap_rights_t structure:
cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
void cap_rights_set(cap_rights_t *rights, ...);
void cap_rights_clear(cap_rights_t *rights, ...);
bool cap_rights_is_set(const cap_rights_t *rights, ...);
bool cap_rights_is_valid(const cap_rights_t *rights);
void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);
Capability rights to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions are provided by
separating them with commas, eg:
cap_rights_t rights;
cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);
There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:
#define cap_rights_set(rights, ...) \
__cap_rights_set((rights), __VA_ARGS__, 0ULL)
void __cap_rights_set(cap_rights_t *rights, ...);
Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:
cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);
Providing several rights that belongs to the same array's element this way is
correct, but is not advised. It should only be used for aliases definition.
This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.
Sponsored by: The FreeBSD Foundation
allocations under low free-space conditions (-r254995), determine
that old block-preference search order used before -r249782 worked
a bit better. This change reverts to that block-preference search order.
MFC after: 2 weeks
I have 25TB Dell PERC 6 RAID5 array. When it becomes almost
full (10-20GB free), processes which write data to it start
eating 100% CPU and write speed drops below 1MB/sec (normally
to gives 400MB/sec). The revision at which it first became
apparent was http://svnweb.freebsd.org/changeset/base/249782.
The offending change reserved an area in each cylinder group to
store metadata. The new algorithm attempts to save this area for
metadata and allows its use for non-metadata only after all the
data areas have been exhausted. The size of the reserved area
defaults to half of minfree, so the filesystem reports full before
the data area can completely fill. However, in this report, the
filesystem has had minfree reduced to 1% thus forcing the metadata
area to be used for data. As the filesystem approached full, it
had only metadata areas left to allocate. The result was that
every block allocation had to scan summary data for 30,000 cylinder
groups before falling back to searching up to 30,000 metadata areas.
The fix is to give up on saving the metadata areas once the free
space reserve drops below 2%. The effect of this change is to use
the old algorithm of just accepting the first available block that
we find. Since most filesystems use the default 5% minfree, this
will have no effect on their operation. For those that want to push
to the limit, they will get their crappy block placements quickly.
Submitted by: Dmitry Sivachenko
Fix Tested by: Dmitry Sivachenko
PR: kern/181226
MFC after: 2 weeks