Commit Graph

2017 Commits

Author SHA1 Message Date
mckusick
ae1163bd55 The UFS/FFS filesystem checks directory link counts when doing
directory create and delete operations. If it ever finds a directory
with a link count less than 2, it panics. Thus, an rm -rf that
encounters a directory with a link count below 2 causes a kernel
panic. The proposed fix is to return the error EINVAL rather than
panicing. The effect is that the requested operation is not done,
but the system continues to run. At a more convenient later time,
the filesystem can be unmounted and cleaned (with fsck or journal
run). Once cleaned, the operation can be rerun to successful
completion.

This fix takes that approach. The panic message has been converted
into a uprintf(9) to provide the user with the inode number and
filesystem mount point of the offending directory and EINVAL is
returned for the operation.

The long (three year) delay in fixing this problem occurred because
the bug was misclassified when originally assigned and only this week
was found during a sweep of old unresolved bug reports.

PR:          180894
Reviewed by: kib
MFC after:   2 weeks
2016-10-26 20:28:23 +00:00
marcel
03c2e26593 Include <sys/types.h> explicitly instead of depending on that
header being included by <sys/param.h>. When compiled as part
of makefs(8) and on macOS or Linux, <sys/param.h> is not our
own.
2016-10-24 18:12:57 +00:00
kib
23dcf48f60 Add FFS pager, which uses buffer cache read operation to validate pages.
See the comments for more detailed description of the algorithm.

The pager is used unconditionally when the block size of the
underlying device is larger than the machine page size, since local
vnode pager cannot handle the configuration [1].  Otherwise, the
vfs.ffs.use_buf_pager sysctl allows to switch to the local pager.

Measurements demonstrated no regression in the ever-important
buildworld benchmark, and small (~5%) throughput improvements in the
special microbenchmark configuration for dbench over swap-backed
md(4).

Code can be generalized and reused for other filesystems which use
buffer cache.

Reported by:	Anton Yuzhaninov <citrin@citrin.ru> [1]
Tested by:	pho
Benchmarked by:	mjg, pho
Reviewed by:	alc, markj, mckusick (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D8198
2016-10-19 11:09:29 +00:00
mjg
6a50fe29a5 vfs: remove the __bo_vnode field from struct vnode
The pointer can be obtained using __containerof instead.

Reviewed by:	kib
2016-09-30 17:11:03 +00:00
kib
800b88629e Be more strict when selecting between snapshot/regular mount.
Reclaimed vnode type is VBAD, so succesful comparision like
devvp->v_type != VREG does not imply that the devvp references
snapshot, it might be due to a reclaimed vnode.  Explicitely check the
vnode type.

In the the most important case of ffs_blkfree(), the devfs vnode is
locked and its type is stable.  In other cases, if the vnode is
reclaimed right after the check, hopefully the buffer methods return
right error codes.

Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-09-19 15:58:33 +00:00
kib
ce2b0686e5 Fix libprocstat build after r305902.
- Use _Bool to not require userspace to include stdbool.h.
- Make extattr.h usable without vnode_if.h.
- Follow i_ump to get cdev pointer.

Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-09-17 18:14:31 +00:00
kib
20f1e8ac63 Reduce size of ufs inode.
Remove redunand i_dev and i_fs pointers, which are available as
ip->i_ump->um_dev and ip->i_ump->um_fs, and reorder members by size to
reduce padding.  To compensate added derefences, the most often i_ump
access to differentiate between UFS1 and UFS2 dinode layout is
removed, by addition of the new i_flag IN_UFS2.  Overall, this
actually reduces the amount of memory dereferences.

On 64bit machine, original struct inode size is 176, reduced to 152
bytes with the change.

Tested by:	pho (previous version)
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-09-17 16:47:34 +00:00
bde
4000d6ac7e Sprinkle DOINGASYNC() checks so as to do delayed writes for async
mounts in almost all cases instead of in most cases.  Don't override
DOINGASYNC() by any condition except IO_SYNC.

Fix previous sprinking of DOINGASYNC() checks.  Don't override IO_SYNC
by DOINGASYNC().  In ffs_write() and ffs_extwrite(), there were
intentional overrides that just broke O_SYNC of data.  In
ffs_truncate(), there are 5 calls to ffs_update(), 4 with
apparently-unintentional overrides and 1 without; this had no effect
due to the main async mount hack descibed below.

Fix 1 place in ffs_truncate() where the caller's IO_ASYNC was overridden
for the soft updates case too (to do a delayed write instead of a sync
write).  This is supposed to be the only change that affects anything
except async mounts.

In ffs_update(), remove the 19 year old efficiency hack of ignoring
the waitfor flag for async mounts, so that fsync() almost works for
async mounts.  All callers are supposed to be fixed to not ask for a
sync update unless they are for fsync() or [I]O_SYNC operations.
fsync() now almost works for async mounts.  It used to sync the data
but not the most important metdata (the inode).  It still doesn't sync
associated directories.

This gave 10-20% fewer writes for my makeworld benchmark with async
mounted tmp and obj directories from an already small number.

Style fixes:
- in ffs_balloc.c, remove rotted quadruplicated comments about the
  simplest part of the DOING*() decisions and rearrange the nearly-
  quadruplicated code to be more nearly so.
- in ufs_vnops.c, use a consistent style with less negative logic and
  no manual "optimization" of || to | in DOING*() expressions.

Reviewed by:	kib (previous version)
2016-09-08 17:40:40 +00:00
kib
8d2c7fdc8d On rename, do not perform truncation of dirhash if the vnode
truncation failed.

Doing so resulted in inconsistent state of the ufs dirhash with regard
to the actual directory inode state, and could lead to spurious ENOENT
errors for lookups of existing files in production kernels, or
assertion failures in the debugging kernels.

Change the logic of calling ufsdirhash_dirtrunc() to be same as in
ufs_direnter().  Execute UFS_TRUNCATE() first, log error, and only do
dirtrunc() if UFS_TRUNCATE() succeeded.

Note that the problem was exacerbated by the bug in the
flush_newblk_dep() function (see r305599), which caused in the spurios
errors from ffs_sync() and then ffs_truncate().

In collaboration with:	pho
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-09-08 12:09:34 +00:00
kib
03f5910da5 Do not leak transient ENOLCK error from flush_newblk_dep() loop.
The buffer lock is retried on failed LK_SLEEPFAIL attempt, and error
from the failed attempt is irrelevant.  But since there is path after
retry which does not clear error, it is possible to return spurious
error from the function.

The issue resulted in a spurious failure of softdep_sync_buf(),
causing further spurious failure of ffs_sync().

In collaboration with:	pho
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-09-08 12:08:54 +00:00
kib
835bb1130c When logging unlikely UFS_TRUNCATE() failure in ufs_direnter(),
include error code.

Reported and tested by:	pho
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-09-08 12:08:08 +00:00
kib
666d76bef3 When externding directory inode in ufs_direnter(), adjust i_endoff.
This change is formally not needed, since i_endoff not used in all
code paths after the call to ufs_direnter(), and i_endoff is
recalculated by the next lookup.  But having the value correct makes
the reasoning about code simpler.

Reported and tested by:	pho
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-09-08 12:07:25 +00:00
kib
dc190244af In dqsync(), when called from quotactl(), um_quotas entry might appear
cleared since nothing prevents completion of the parallel quotaoff.
There is nothing to sync in this case, and no reason to panic.

Reported and tested by:	pho
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-09-08 12:06:43 +00:00
kib
c8c9eee1f7 In softdep_prealloc(), return early not only for snapshots, but for
the quota files as well.

Reported and tested by:	pho
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-09-08 12:05:13 +00:00
kib
a483a2c0bd There is no need to upgrade the last dvp lock on lookups for modifying
operations.  Instead of upgrading, assert that the lock is exclusive.
Explain the cause in comments.

This effectively reverts r209367.

Tested by:	pho
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-09-08 12:04:45 +00:00
kib
ded52104dc Partially lift suspension when ffs_reload() finished with cgs and
going to re-read inodes.

Secondary write initiators, e.g. ufs_inactive(), might need to start a
write while owning the vnode lock.  Since the suspended state
established by /dev/ufssuspend prevents them from entering
vn_start_secondary_write(), we get deadlock otherwise.

Note that it is arguably not very useful to re-read inodes after
/dev/ufssuspend suspension, because the suspension does not block
readers, and other threads might read existing files in parallel with
suspension owner (for now, only growfs(8)) operations.  This
effectively means that suspension owner cannot safely modify existing
inodes, and then there is no sense in re-reading.  But keep the code
enabled for now.

Reported and tested by:	pho
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-09-08 12:01:28 +00:00
imp
a88b45a2a8 Renumber the advertising clause. 2016-09-06 15:17:35 +00:00
mckusick
d1132818f9 Bug 211013 reports that a write error to a UFS filesystem running
with softupdates panics the kernel. The problem that has been pointed
out is that when there is a transient write error on certain metadata
blocks, specifically directory blocks (PAGEDEP), inode blocks
(INODEDEP), indirect pointer blocks (INDIRDEPS), and cylinder group
(BMSAFEMAP, but only when journaling is enabled), we get a panic
in one of the routines called by softdep_disk_io_initiation that
the I/O is "already started" when we retry the write.

These dependency types potentially need to do roll-backs when called
by softdep_disk_io_initiation before doing a write and then a
roll-forward when called by softdep_disk_write_complete after the
I/O completes.  The panic happens when there is a transient error.
At the top of softdep_disk_write_complete we check to see if the
write had an error and if an error occurred we just return.  This
return is correct most of the time because the main role of the routines
called by softdep_disk_write_complete is to process the now-completed
dependencies so that the next I/O steps can happen.

But for the four types listed above, they do not get to do their
rollback operations. This causes the panic when softdep_disk_io_initiation
gets called on the second attempt to do the write and the roll-back
routines find that the roll-backs have already been done. As an
aside I note that there is also the problem that the buffer will
have been unlocked and thus made visible to the filesystem and to
user applications with the roll-backs in place.

The way to resolve the problem is to add a flag to the routines called
by softdep_disk_write_complete for the four dependency types noted
that indicates whether the write was successful (WRITESUCCEEDED).
If the write does not succeed, they do just the roll-backs and then
return. If the write was successful they also do their usual
processing of the now-completed dependencies.

The fix was tested by selectively injecting write errors for buffers
holding dependencies of each of the four types noted above and then
verifying that the kernel no longer paniced and that following the
successful retry of the write that the filesystem could be unmounted
and successfully checked cleanly.

PR: 211013
Reviewed by: kib
2016-08-16 21:02:30 +00:00
kib
73de606a86 In UFS_BALLOC(), invalidate pages of indirect buffers on failed block
allocation unwinding.

Dandling buffers are released on UFS_BALLOC() failure to ensure that
later attempt to allocate blocks in close range do not find the blocks
with invalid content, since possible partial block allocations are
unwound.  As such, it is not enough to just release the buffers, the
pages must also invalidated and removed from the vnode vm_object
queue.  Otherwise the pages might be found later and used to
reconstruct indirect buffers when doing allocations at offset close to
the failure point, and their stale content compromise the filesystem
integrity.

Note that just marking the buffer as B_INVAL is not enough, B_NOCACHE
is required.  To be sure, clear the B_CACHE flag as well.  This
complements the r174973, which started releasing buffers.

Reported and tested by:	pho
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-08-16 17:30:58 +00:00
kib
fd0eaea7aa On unwind after failed block allocation in ffs_balloc_ufs{1,2}, assert
that recorded allocated blocks numbers match the physical block
numbers of dandling buffers which are released.

When finally freeing the blocks during unwind, assert that dandling
buffers where not re-allocated.  They shouldn't, because the vnode lock
is owned exclusive.

Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-08-16 17:18:38 +00:00
kib
e037bf91c2 When looking up dandling buffers for unwing after failing block
allocation in UFS_BALLOC(), there is no need to map them.

Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-08-16 17:05:15 +00:00
kib
4e01efcc0c When block allocation fails in UFS_BALLOC(), and the volume does not
have SU enabled, there is no point in calling softdep_request_cleanup().

The call cannot produce free blocks, but we unecessarily lock ufsmount
and do inode block write.  Usual point of not doing optimizations for
the corner case of the full volume is not applicable there, the work
is easily avoidable, and the addition CPU and disk io load do not lead
to succeeding retry.

Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-08-16 16:50:48 +00:00
kib
03c6968a37 In ffs_balloc_ufs{1,2} routines, assert that unwind records do not
overflow local arrays.  This is not immediately obvious from the
static code inspection, due to retry logic.

Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-08-16 16:49:56 +00:00
kib
20b89ec291 Implement VOP_FDATASYNC() for UFS.
Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D7471
2016-08-15 19:22:23 +00:00
trasz
255ed885fa Replace all remaining calls to vprint(9) with vn_printf(9), and remove
the old macro.

MFC after:	1 month
2016-08-10 16:12:31 +00:00
kib
9009fc003e Ensure that the UFS directory vnode' vm_object is properly sized
before UFS_BALLOC() is called.  I do not believe that this caused any
real issue on FreeBSD because the exclusive vnode lock is held over
the balloc/resize, the change is to make formally correct KPI use.

Based on:	the Matthew Dillon' patch from DragonFly BSD
PR:	93942
Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-07-20 14:40:56 +00:00
kevlo
1006f009c6 arc4random() returns 0 to (2**32)−1, use an alternative to initialize
i_gen if it's zero rather than a divide by 2.

With inputs from  delphij, mckusick, rmacklem

Reviewed by:	mckusick
2016-05-22 14:31:20 +00:00
kib
e006a8a1fd Stop dropping and reacquiring Giant around geom calls in UFS.
Sponsored by:	The FreeBSD Foundation
2016-05-21 10:13:25 +00:00
kib
93315c1cff Improve handling of rdev->si_mountpt on mount and unmount of FFS
volumes.  Treat the field as a semaphore protecting availability of
the device for mounting.  Do no access devvp->v_rdev without the vnode
lock owned.

Protect change of the devvp->v_bufobj bo_ops vector with the vnode
lock.

Reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-05-21 09:49:35 +00:00
kib
54336eef75 Ensure that ftruncate(2) is performed synchronously when file is
opened in O_SYNC mode, at least for UFS.  This also handles
truncation, done due to the O_SYNC | O_TRUNC flags combination to
open(2), in synchronous way.

Noted by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-05-18 12:03:57 +00:00
kib
393eee329b Do enable io accounting for read-only mounts and mounts which are
remounted to writeable after initial read-only.  Assign to
dev->si_mountpt earlier to account the accesses done at the mount
time.

Based on submission by:	bde
MFC after:	1 week
2016-05-17 21:35:35 +00:00
kib
6c68797acf If IO_SYNC was passed to ffs_truncate(), request synchronous inode
update from the final ffs_update().

Noted by:	bde
MFC after:	1 week
2016-05-17 21:30:58 +00:00
kib
b27a7ca7b1 For async UFS mounts, shrink the directory asynchronously, at least do
not pass IO_SYNC to ffs_truncate() unneccessary.

Submitted by:	bde
MFC after:	1 week
2016-05-17 21:28:28 +00:00
kib
e1e51d237c Fix comments.
Submitted by:	bde
MFC after:	1 week
2016-05-17 08:24:27 +00:00
kib
67809d96fd Fix typo in the message.
Submitted by:	bde
MFC after:	1 week
2016-05-17 08:19:20 +00:00
pfg
6ac8bdb320 UFS: spelling fixes on comments.
No functional change.
2016-04-29 20:43:51 +00:00
ngie
35bd8367fe Add FEATURE knob for testing for UFS extended attribute kernel support
Support can be verified via `feature_present("ufs_extattr")`, etc.

Differential Revision: https://reviews.freebsd.org/D6053
MFC after: 2 weeks
Relnotes: yes
Reviewed by: asomers, kib
Sponsored by: EMC / Isilon Storage Division
2016-04-22 08:09:27 +00:00
pfg
729533413f sys: use our roundup2/rounddown2() macros when param.h is available.
rounddown2 tends to produce longer lines than the original code
and when the code has a high indentation level it was not really
advantageous to do the replacement.

This tries to strike a balance between readability using the macros
and flexibility of having the expressions, so not everything is
converted.
2016-04-21 19:57:40 +00:00
pfg
858f336801 ufs: replace 0 with NULL for pointers.
While here also do late initialization of the variables we are
changing.

Found with devel/coccinelle.

Reviewed by:	mckusick
MFC after:	2 weeks
2016-04-10 21:48:11 +00:00
trasz
825d80e01c Add four new RCTL resources - readbps, readiops, writebps and writeiops,
for limiting disk (actually filesystem) IO.

Note that in some cases these limits are not quite precise. It's ok,
as long as it's within some reasonable bounds.

Testing - and review of the code, in particular the VFS and VM parts - is
very welcome.

MFC after:	1 month
Relnotes:	yes
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D5080
2016-04-07 04:23:25 +00:00
trasz
ca92bb3067 Remove some NULL checks for M_WAITOK allocations.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-03-29 13:56:59 +00:00
kib
dae532b324 Split the global taskqueue used to process all UFS trim completions,
into per-mount taskqueue with the private taskqueue processing thread.
This allows to drain the taskqueue on unmount, to ensure that all
TRIMs are finished before mount structures are freed.

But just draining the taskqueue where TRIM biodone geom-up completions
are processed is not enough, since ffs_blkfree(), called by the task,
might result in more writes.  Count inflight delayed blkfree's and
pause() unmount until the counter drains as well.

Reported by:	Nick Evans <nevans@talkpoint.com>
Tested by:	Nick Evans <nevans@talkpoint.com>, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-03-27 08:21:17 +00:00
kib
6fb3ce9534 Style: wrap long lines.
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-03-27 08:07:12 +00:00
kib
e8766307c3 Fix locking mistake in softdep_waitidle(). The surrounding code
expects that the loop is always exited with the SU lock owned, even on
error.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2016-03-23 09:58:51 +00:00
mckusick
6a020d6c89 The UFS filesystem requires that the last block of a file always be
allocated. When shortening the length of a file in which the new end
of the file contains a hole, the hole must have a block allocated.

Reported by: Maxim Sobolev
Reviewed by: kib
Tested by:   Peter Holm
2016-02-24 01:58:40 +00:00
trasz
0fe7e27aea Remove ffs_mountroot() prototype; seems to be long gone.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-01-28 12:21:23 +00:00
mckusick
32cb8b928a This fixes a bug in UFS2 exported NFS volumes. An NFS client can
crash a server that has exported UFS2 by presenting a filehandle
with an inode number that references an uninitialized inode in a
cylinder group. The problem is that UFS2 only initializes blocks
of inodes as they are first allocated and ffs_fhtovp() does not
validate that the inode is in a range of inodes that have been
initialized. Attempting to read an uninitialized inode gets random
data from the disk. When the kernel tries to interpret it as an
inode, panics often arise.

Reported by: Christos Zoulas (from NetBSD)
Reviewed by: kib
2016-01-27 21:27:05 +00:00
mckusick
0b10a802f8 The bread() function was inconsistent about whether it would return
a buffer pointer in the event of an error (for some errors it would
return a buffer pointer and for other errors it would not return a
buffer pointer). The cluster_read() function was similarly inconsistent.

Clients of these functions were inconsistent in handling errors.
Some would assume that no buffer was returned after an error and
would thus lose buffers under certain error conditions. Others would
assume that brelse() should always be called after an error and
would thus panic the system under certain error conditions.

To correct both of these problems with minimal code churn, bread()
and cluster_write() now always free the buffer when returning an
error thus ensuring that buffers will never be lost. The brelse()
routine checks for being passed a NULL buffer pointer and silently
returns to avoid panics. Thus both approaches to handling error
returns from bread() and cluster_read() will work correctly.

Future code should be written assuming that bread() and cluster_read()
will never return a buffer with an error, so should not attempt to
brelse() the buffer when an error is returned.

Reviewed by: kib
2016-01-27 21:23:01 +00:00
kib
70adc1e216 Recheck curthread->td_su after the VFS_SYNC() call, and re-sync if the
ast was rescheduled during VFS_SYNC().  It is possible that enough
parallel writes or slow/hung volume result in VFS_SYNC() deferring to
the ast flushing of workqueue.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-12-21 11:50:32 +00:00
kib
29c1a1655d Update ctime when atime or birthtime are updated.
Cleanup setting of ctime/mtime/birthtime: do not set IN_ACCESS or
IN_UPDATE, then clear them with ufs_itimes(), making transient
(possibly inconsistent) change to the times, and then copy
user-supplied times into the inode.  Instead, directly clear IN_ACCESS
or IN_UPDATE when user supplied the time, and copy the value into the
inode.

Minor inconsistency left is that the inode ctime is updated even when
birthtime update attempt is performed on a UFS1 volume.

Submitted by:	bde
MFC after:	2 weeks
2015-12-07 12:09:04 +00:00