Commit Graph

2000 Commits

Author SHA1 Message Date
Kirk McKusick
988fd417a0 Bug 211013 reports that a write error to a UFS filesystem running
with softupdates panics the kernel. The problem that has been pointed
out is that when there is a transient write error on certain metadata
blocks, specifically directory blocks (PAGEDEP), inode blocks
(INODEDEP), indirect pointer blocks (INDIRDEPS), and cylinder group
(BMSAFEMAP, but only when journaling is enabled), we get a panic
in one of the routines called by softdep_disk_io_initiation that
the I/O is "already started" when we retry the write.

These dependency types potentially need to do roll-backs when called
by softdep_disk_io_initiation before doing a write and then a
roll-forward when called by softdep_disk_write_complete after the
I/O completes.  The panic happens when there is a transient error.
At the top of softdep_disk_write_complete we check to see if the
write had an error and if an error occurred we just return.  This
return is correct most of the time because the main role of the routines
called by softdep_disk_write_complete is to process the now-completed
dependencies so that the next I/O steps can happen.

But for the four types listed above, they do not get to do their
rollback operations. This causes the panic when softdep_disk_io_initiation
gets called on the second attempt to do the write and the roll-back
routines find that the roll-backs have already been done. As an
aside I note that there is also the problem that the buffer will
have been unlocked and thus made visible to the filesystem and to
user applications with the roll-backs in place.

The way to resolve the problem is to add a flag to the routines called
by softdep_disk_write_complete for the four dependency types noted
that indicates whether the write was successful (WRITESUCCEEDED).
If the write does not succeed, they do just the roll-backs and then
return. If the write was successful they also do their usual
processing of the now-completed dependencies.

The fix was tested by selectively injecting write errors for buffers
holding dependencies of each of the four types noted above and then
verifying that the kernel no longer paniced and that following the
successful retry of the write that the filesystem could be unmounted
and successfully checked cleanly.

PR: 211013
Reviewed by: kib
2016-08-16 21:02:30 +00:00
Konstantin Belousov
948137db10 In UFS_BALLOC(), invalidate pages of indirect buffers on failed block
allocation unwinding.

Dandling buffers are released on UFS_BALLOC() failure to ensure that
later attempt to allocate blocks in close range do not find the blocks
with invalid content, since possible partial block allocations are
unwound.  As such, it is not enough to just release the buffers, the
pages must also invalidated and removed from the vnode vm_object
queue.  Otherwise the pages might be found later and used to
reconstruct indirect buffers when doing allocations at offset close to
the failure point, and their stale content compromise the filesystem
integrity.

Note that just marking the buffer as B_INVAL is not enough, B_NOCACHE
is required.  To be sure, clear the B_CACHE flag as well.  This
complements the r174973, which started releasing buffers.

Reported and tested by:	pho
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-08-16 17:30:58 +00:00
Konstantin Belousov
2da4bcb82b On unwind after failed block allocation in ffs_balloc_ufs{1,2}, assert
that recorded allocated blocks numbers match the physical block
numbers of dandling buffers which are released.

When finally freeing the blocks during unwind, assert that dandling
buffers where not re-allocated.  They shouldn't, because the vnode lock
is owned exclusive.

Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-08-16 17:18:38 +00:00
Konstantin Belousov
39f7cbe9ab When looking up dandling buffers for unwing after failing block
allocation in UFS_BALLOC(), there is no need to map them.

Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-08-16 17:05:15 +00:00
Konstantin Belousov
554248587e When block allocation fails in UFS_BALLOC(), and the volume does not
have SU enabled, there is no point in calling softdep_request_cleanup().

The call cannot produce free blocks, but we unecessarily lock ufsmount
and do inode block write.  Usual point of not doing optimizations for
the corner case of the full volume is not applicable there, the work
is easily avoidable, and the addition CPU and disk io load do not lead
to succeeding retry.

Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-08-16 16:50:48 +00:00
Konstantin Belousov
f1503c5a49 In ffs_balloc_ufs{1,2} routines, assert that unwind records do not
overflow local arrays.  This is not immediately obvious from the
static code inspection, due to retry logic.

Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-08-16 16:49:56 +00:00
Konstantin Belousov
2f514f92cf Implement VOP_FDATASYNC() for UFS.
Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D7471
2016-08-15 19:22:23 +00:00
Edward Tomasz Napierala
411455a8fb Replace all remaining calls to vprint(9) with vn_printf(9), and remove
the old macro.

MFC after:	1 month
2016-08-10 16:12:31 +00:00
Konstantin Belousov
e73b16ae7a Ensure that the UFS directory vnode' vm_object is properly sized
before UFS_BALLOC() is called.  I do not believe that this caused any
real issue on FreeBSD because the exclusive vnode lock is held over
the balloc/resize, the change is to make formally correct KPI use.

Based on:	the Matthew Dillon' patch from DragonFly BSD
PR:	93942
Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-07-20 14:40:56 +00:00
Kevin Lo
57d2ac2f90 arc4random() returns 0 to (2**32)−1, use an alternative to initialize
i_gen if it's zero rather than a divide by 2.

With inputs from  delphij, mckusick, rmacklem

Reviewed by:	mckusick
2016-05-22 14:31:20 +00:00
Konstantin Belousov
5f36cc5bfa Stop dropping and reacquiring Giant around geom calls in UFS.
Sponsored by:	The FreeBSD Foundation
2016-05-21 10:13:25 +00:00
Konstantin Belousov
c70b3cd28d Improve handling of rdev->si_mountpt on mount and unmount of FFS
volumes.  Treat the field as a semaphore protecting availability of
the device for mounting.  Do no access devvp->v_rdev without the vnode
lock owned.

Protect change of the devvp->v_bufobj bo_ops vector with the vnode
lock.

Reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-05-21 09:49:35 +00:00
Konstantin Belousov
3f7ca894de Ensure that ftruncate(2) is performed synchronously when file is
opened in O_SYNC mode, at least for UFS.  This also handles
truncation, done due to the O_SYNC | O_TRUNC flags combination to
open(2), in synchronous way.

Noted by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-05-18 12:03:57 +00:00
Konstantin Belousov
bfe89a8ee1 Do enable io accounting for read-only mounts and mounts which are
remounted to writeable after initial read-only.  Assign to
dev->si_mountpt earlier to account the accesses done at the mount
time.

Based on submission by:	bde
MFC after:	1 week
2016-05-17 21:35:35 +00:00
Konstantin Belousov
f30ddc49e6 If IO_SYNC was passed to ffs_truncate(), request synchronous inode
update from the final ffs_update().

Noted by:	bde
MFC after:	1 week
2016-05-17 21:30:58 +00:00
Konstantin Belousov
3d0691acdc For async UFS mounts, shrink the directory asynchronously, at least do
not pass IO_SYNC to ffs_truncate() unneccessary.

Submitted by:	bde
MFC after:	1 week
2016-05-17 21:28:28 +00:00
Konstantin Belousov
98cbffd7bd Fix comments.
Submitted by:	bde
MFC after:	1 week
2016-05-17 08:24:27 +00:00
Konstantin Belousov
5b6526413f Fix typo in the message.
Submitted by:	bde
MFC after:	1 week
2016-05-17 08:19:20 +00:00
Pedro F. Giffuni
08e941833d UFS: spelling fixes on comments.
No functional change.
2016-04-29 20:43:51 +00:00
Enji Cooper
028d33c96a Add FEATURE knob for testing for UFS extended attribute kernel support
Support can be verified via `feature_present("ufs_extattr")`, etc.

Differential Revision: https://reviews.freebsd.org/D6053
MFC after: 2 weeks
Relnotes: yes
Reviewed by: asomers, kib
Sponsored by: EMC / Isilon Storage Division
2016-04-22 08:09:27 +00:00
Pedro F. Giffuni
d9c9c81c08 sys: use our roundup2/rounddown2() macros when param.h is available.
rounddown2 tends to produce longer lines than the original code
and when the code has a high indentation level it was not really
advantageous to do the replacement.

This tries to strike a balance between readability using the macros
and flexibility of having the expressions, so not everything is
converted.
2016-04-21 19:57:40 +00:00
Pedro F. Giffuni
abafa4db03 ufs: replace 0 with NULL for pointers.
While here also do late initialization of the variables we are
changing.

Found with devel/coccinelle.

Reviewed by:	mckusick
MFC after:	2 weeks
2016-04-10 21:48:11 +00:00
Edward Tomasz Napierala
ae34b6ff96 Add four new RCTL resources - readbps, readiops, writebps and writeiops,
for limiting disk (actually filesystem) IO.

Note that in some cases these limits are not quite precise. It's ok,
as long as it's within some reasonable bounds.

Testing - and review of the code, in particular the VFS and VM parts - is
very welcome.

MFC after:	1 month
Relnotes:	yes
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D5080
2016-04-07 04:23:25 +00:00
Edward Tomasz Napierala
35030a5dd4 Remove some NULL checks for M_WAITOK allocations.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-03-29 13:56:59 +00:00
Konstantin Belousov
c79dff0fbe Split the global taskqueue used to process all UFS trim completions,
into per-mount taskqueue with the private taskqueue processing thread.
This allows to drain the taskqueue on unmount, to ensure that all
TRIMs are finished before mount structures are freed.

But just draining the taskqueue where TRIM biodone geom-up completions
are processed is not enough, since ffs_blkfree(), called by the task,
might result in more writes.  Count inflight delayed blkfree's and
pause() unmount until the counter drains as well.

Reported by:	Nick Evans <nevans@talkpoint.com>
Tested by:	Nick Evans <nevans@talkpoint.com>, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-03-27 08:21:17 +00:00
Konstantin Belousov
d016708959 Style: wrap long lines.
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-03-27 08:07:12 +00:00
Konstantin Belousov
6336aefc1d Fix locking mistake in softdep_waitidle(). The surrounding code
expects that the loop is always exited with the SU lock owned, even on
error.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2016-03-23 09:58:51 +00:00
Kirk McKusick
24b6748c42 The UFS filesystem requires that the last block of a file always be
allocated. When shortening the length of a file in which the new end
of the file contains a hole, the hole must have a block allocated.

Reported by: Maxim Sobolev
Reviewed by: kib
Tested by:   Peter Holm
2016-02-24 01:58:40 +00:00
Edward Tomasz Napierala
50102a6342 Remove ffs_mountroot() prototype; seems to be long gone.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-01-28 12:21:23 +00:00
Kirk McKusick
abe53f7e91 This fixes a bug in UFS2 exported NFS volumes. An NFS client can
crash a server that has exported UFS2 by presenting a filehandle
with an inode number that references an uninitialized inode in a
cylinder group. The problem is that UFS2 only initializes blocks
of inodes as they are first allocated and ffs_fhtovp() does not
validate that the inode is in a range of inodes that have been
initialized. Attempting to read an uninitialized inode gets random
data from the disk. When the kernel tries to interpret it as an
inode, panics often arise.

Reported by: Christos Zoulas (from NetBSD)
Reviewed by: kib
2016-01-27 21:27:05 +00:00
Kirk McKusick
d2a28cb080 The bread() function was inconsistent about whether it would return
a buffer pointer in the event of an error (for some errors it would
return a buffer pointer and for other errors it would not return a
buffer pointer). The cluster_read() function was similarly inconsistent.

Clients of these functions were inconsistent in handling errors.
Some would assume that no buffer was returned after an error and
would thus lose buffers under certain error conditions. Others would
assume that brelse() should always be called after an error and
would thus panic the system under certain error conditions.

To correct both of these problems with minimal code churn, bread()
and cluster_write() now always free the buffer when returning an
error thus ensuring that buffers will never be lost. The brelse()
routine checks for being passed a NULL buffer pointer and silently
returns to avoid panics. Thus both approaches to handling error
returns from bread() and cluster_read() will work correctly.

Future code should be written assuming that bread() and cluster_read()
will never return a buffer with an error, so should not attempt to
brelse() the buffer when an error is returned.

Reviewed by: kib
2016-01-27 21:23:01 +00:00
Konstantin Belousov
44de1ba141 Recheck curthread->td_su after the VFS_SYNC() call, and re-sync if the
ast was rescheduled during VFS_SYNC().  It is possible that enough
parallel writes or slow/hung volume result in VFS_SYNC() deferring to
the ast flushing of workqueue.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-12-21 11:50:32 +00:00
Konstantin Belousov
dc15d94fd9 Update ctime when atime or birthtime are updated.
Cleanup setting of ctime/mtime/birthtime: do not set IN_ACCESS or
IN_UPDATE, then clear them with ufs_itimes(), making transient
(possibly inconsistent) change to the times, and then copy
user-supplied times into the inode.  Instead, directly clear IN_ACCESS
or IN_UPDATE when user supplied the time, and copy the value into the
inode.

Minor inconsistency left is that the inode ctime is updated even when
birthtime update attempt is performed on a UFS1 volume.

Submitted by:	bde
MFC after:	2 weeks
2015-12-07 12:09:04 +00:00
Kirk McKusick
43a993bb7d For performance reasons, it is useful to have a single string used as
the name of a filesystem when setting it as the first parameter to the
getnewvnode() function. Most filesystems call getnewvnode from just one
place so can use a literal string as the first parameter. However, NFS
calls getnewvnode from two places, so we create a global constant string
that can be used by the two instances. This change also collapses two
instances of getnewvnode() in the UFS filesystem to a single call.

Reviewed by: kib
Tested by:   Peter Holm
2015-11-29 21:01:02 +00:00
Konstantin Belousov
fe920528c1 Do not perform read-ahead for BA_CLRBUF request when we are low on
memory or when dirty buffer queue is too large.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-10-27 13:44:13 +00:00
Warner Losh
476dfdc3bf Do not relocate extents to make them contiguous if the underlying drive can do
deletions. Ability to do deletions is a strong indication that this
optimization will not help performance. It will only generate extra write
traffic. These devices are typically flash based and have a limited number of
write cycles. In addition, making the file contiguous in LBA space doesn't
improve the access times from flash devices because they have no seek time.

Reviewed by: mckusick@
2015-10-16 03:06:02 +00:00
Gleb Smirnoff
20e8b32db4 In softdep_setup_freeblocks():
- Move the bread() to the beginning of function.
- Return if it fails, otherwise we will panic.

Submitted by:	mckusick
Sponsored by:	Netflix
2015-10-07 12:36:28 +00:00
Konstantin Belousov
7a82f35c9d Do not consume extra reference. This is a bug in r287479.
Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-09-05 12:28:18 +00:00
Konstantin Belousov
c65f759837 Declare the writes around the call to VFS_SYNC() in
softdep_ast_cleanup_proc().

Tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-09-05 08:48:24 +00:00
Konstantin Belousov
76cae067b8 By doing file extension fast, it is possible to create excess supply
of the D_NEWBLK kinds of dependencies (i.e. D_ALLOCDIRECT and
D_ALLOCINDIR), which can exhaust kmem.

Handle excess of D_NEWBLK in the same way as excess of D_INODEDEP and
D_DIRREM, by scheduling ast to flush dependencies, after the thread,
which created new dep, left the VFS/FFS innards.  For D_NEWBLK, the
only way to get rid of them is to do full sync, since items are
attached to data blocks of arbitrary vnodes.  The check for D_NEWBLK
excess in softdep_ast_cleanup_proc() is unlocked.

For 32bit arches, reduce the total amount of allowed dependencies by
two.  It could be considered increasing the limit for 64 bit platforms
with direct maps.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-09-01 13:07:27 +00:00
Jeff Roberson
98082691bb - Make 'struct buf *buf' private to vfs_bio.c. Having a global variable
'buf' is inconvenient and has lead me to some irritating to discover
   bugs over the years.  It also makes it more challenging to refactor
   the buf allocation system.
 - Move swbuf and declare it as an extern in vfs_bio.c.  This is still
   not perfect but better than it was before.
 - Eliminate the unused ffs function that relied on knowledge of the buf
   array.
 - Move the shutdown code that iterates over the buf array into vfs_bio.c.

Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
2015-07-29 02:26:57 +00:00
Jeff Roberson
fade8dd714 Refactor unmapped buffer address handling.
- Use pointer assignment rather than a combination of pointers and
   flags to switch buffers between unmapped and mapped.  This eliminates
   multiple flags and generally simplifies the logic.
 - Eliminate b_saveaddr since it is only used with pager bufs which have
   their b_data re-initialized on each allocation.
 - Gather up some convenience routines in the buffer cache for
   manipulating buf space and buf malloc space.
 - Add an inline, buf_mapped(), to standardize checks around unmapped
   buffers.

In collaboration with: mlaier
Reviewed by:	kib
Tested by:	pho (many small revisions ago)
Sponsored by:	EMC / Isilon Storage Division
2015-07-23 19:13:41 +00:00
Mateusz Guzik
f0725a8e1e Move chdir/chroot-related fdp manipulation to kern_descrip.c
Prefix exported functions with pwd_.

Deduplicate some code by adding a helper for setting fd_cdir.

Reviewed by:	kib
2015-07-11 16:19:11 +00:00
Mark Johnston
5f34e93c58 Check suspendability on the mountpoint returned by VOP_GETWRITEMOUNT.
This obviates the need for a MNTK_SUSPENDABLE flag, since passthrough
filesystems like nullfs and unionfs no longer need to inherit this
information from their lower layer(s). This change also restores the
pre-r273336 behaviour of using the presence of a susp_clean VFS method to
request suspension support.

Reviewed by:	kib, mjg
Differential Revision:	https://reviews.freebsd.org/D2937
2015-07-05 22:37:33 +00:00
Mark Murray
d1b06863fb Huge cleanup of random(4) code.
* GENERAL
- Update copyright.
- Make kernel options for RANDOM_YARROW and RANDOM_DUMMY. Set
  neither to ON, which means we want Fortuna
- If there is no 'device random' in the kernel, there will be NO
  random(4) device in the kernel, and the KERN_ARND sysctl will
  return nothing. With RANDOM_DUMMY there will be a random(4) that
  always blocks.
- Repair kern.arandom (KERN_ARND sysctl). The old version went
  through arc4random(9) and was a bit weird.
- Adjust arc4random stirring a bit - the existing code looks a little
  suspect.
- Fix the nasty pre- and post-read overloading by providing explictit
  functions to do these tasks.
- Redo read_random(9) so as to duplicate random(4)'s read internals.
  This makes it a first-class citizen rather than a hack.
- Move stuff out of locked regions when it does not need to be
  there.
- Trim RANDOM_DEBUG printfs. Some are excess to requirement, some
  behind boot verbose.
- Use SYSINIT to sequence the startup.
- Fix init/deinit sysctl stuff.
- Make relevant sysctls also tunables.
- Add different harvesting "styles" to allow for different requirements
  (direct, queue, fast).
- Add harvesting of FFS atime events. This needs to be checked for
  weighing down the FS code.
- Add harvesting of slab allocator events. This needs to be checked for
  weighing down the allocator code.
- Fix the random(9) manpage.
- Loadable modules are not present for now. These will be re-engineered
  when the dust settles.
- Use macros for locks.
- Fix comments.

* src/share/man/...
- Update the man pages.

* src/etc/...
- The startup/shutdown work is done in D2924.

* src/UPDATING
- Add UPDATING announcement.

* src/sys/dev/random/build.sh
- Add copyright.
- Add libz for unit tests.

* src/sys/dev/random/dummy.c
- Remove; no longer needed. Functionality incorporated into randomdev.*.

* live_entropy_sources.c live_entropy_sources.h
- Remove; content moved.
- move content to randomdev.[ch] and optimise.

* src/sys/dev/random/random_adaptors.c src/sys/dev/random/random_adaptors.h
- Remove; plugability is no longer used. Compile-time algorithm
  selection is the way to go.

* src/sys/dev/random/random_harvestq.c src/sys/dev/random/random_harvestq.h
- Add early (re)boot-time randomness caching.

* src/sys/dev/random/randomdev_soft.c src/sys/dev/random/randomdev_soft.h
- Remove; no longer needed.

* src/sys/dev/random/uint128.h
- Provide a fake uint128_t; if a real one ever arrived, we can use
  that instead. All that is needed here is N=0, N++, N==0, and some
  localised trickery is used to manufacture a 128-bit 0ULLL.

* src/sys/dev/random/unit_test.c src/sys/dev/random/unit_test.h
- Improve unit tests; previously the testing human needed clairvoyance;
  now the test will do a basic check of compressibility. Clairvoyant
  talent is still a good idea.
- This is still a long way off a proper unit test.

* src/sys/dev/random/fortuna.c src/sys/dev/random/fortuna.h
- Improve messy union to just uint128_t.
- Remove unneeded 'static struct fortuna_start_cache'.
- Tighten up up arithmetic.
- Provide a method to allow eternal junk to be introduced; harden
  it against blatant by compress/hashing.
- Assert that locks are held correctly.
- Fix the nasty pre- and post-read overloading by providing explictit
  functions to do these tasks.
- Turn into self-sufficient module (no longer requires randomdev_soft.[ch])

* src/sys/dev/random/yarrow.c src/sys/dev/random/yarrow.h
- Improve messy union to just uint128_t.
- Remove unneeded 'staic struct start_cache'.
- Tighten up up arithmetic.
- Provide a method to allow eternal junk to be introduced; harden
  it against blatant by compress/hashing.
- Assert that locks are held correctly.
- Fix the nasty pre- and post-read overloading by providing explictit
  functions to do these tasks.
- Turn into self-sufficient module (no longer requires randomdev_soft.[ch])
- Fix some magic numbers elsewhere used as FAST and SLOW.

Differential Revision: https://reviews.freebsd.org/D2025
Reviewed by: vsevolod,delphij,rwatson,trasz,jmg
Approved by: so (delphij)
2015-06-30 17:00:45 +00:00
Konstantin Belousov
521987c3e5 Simplify code, no need to test the flag before clearing it.
Submitted by:	ed
MFC after:	12 days
2015-06-29 13:06:24 +00:00
Konstantin Belousov
b2c3df842b Handle errors from background write of the cylinder group blocks.
First, on the write error, bufdone() call from ffs_backgroundwrite()
panics because pbrelvp() cleared bp->b_bufobj, while brelse() would
try to re-dirty the copy of the cg buffer.  Handle this by setting
B_INVAL for the case of BIO_ERROR.

Second, we must re-dirty the real buffer containing the cylinder group
block data when background write failed.  Real cg buffer was already
marked clean in ffs_bufwrite(). After the BV_BKGRDINPROG flag is
cleared on the real cg buffer in ffs_backgroundwrite(), buffer scan
may reuse the buffer at any moment. The result is lost write, and if
the write error was only transient, we get corrupted bitmaps.

We cannot re-dirty the original cg buffer in the
ffs_backgroundwritedone(), since the context is not sleepable,
preventing us from sleeping for origbp' lock.  Add BV_BKGDERR flag
(protected by the buffer object lock), which is converted into delayed
write by brelse(), bqrelse() and buffer scan.

In collaboration with:	Conrad Meyer <cse.cem@gmail.com>
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation (kib),
	  EMC/Isilon storage division (Conrad)
MFC after:	2 weeks
2015-06-27 09:44:14 +00:00
Konstantin Belousov
1eabd96728 vfs_msync(), called from syncer vnode fsync VOP, only iterates over
the active vnode list for the given mount point, with the assumption
that vnodes with dirty pages are active.  This is enforced by
vinactive() doing vm_object_page_clean() pass over the vnode pages.

The issue is, if vinactive() cannot be called during vput() due to the
vnode being only shared-locked, we might end up with the dirty pages
for the vnode on the free list.  Such vnode is invisible to syncer,
and pages are only cleaned on the vnode reactivation.  In other words,
the race results in the broken guarantee that user data, written
through the mmap(2), is written to the disk not later than in 30
seconds after the write.

Fix this by keeping the vnode which is freed but still owing
inactivation, on the active list.  When syncer loops find such vnode,
it is deactivated and cleaned by the final vput() call.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-06-17 04:46:58 +00:00
Mateusz Guzik
4da8456f0a Replace struct filedesc argument in getvnode with struct thread
This is is a step towards removal of spurious arguments.
2015-06-16 13:09:18 +00:00
Konstantin Belousov
46c3d3ac99 Syncing a directory vnode might drop the vnode lock in the
softdep_sync() similarly to the regular vnode sync.  Allow retry for
both vnode types.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-06-03 20:48:00 +00:00