Commit Graph

2097 Commits

Author SHA1 Message Date
Konstantin Belousov
39f7cbe9ab When looking up dandling buffers for unwing after failing block
allocation in UFS_BALLOC(), there is no need to map them.

Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-08-16 17:05:15 +00:00
Konstantin Belousov
554248587e When block allocation fails in UFS_BALLOC(), and the volume does not
have SU enabled, there is no point in calling softdep_request_cleanup().

The call cannot produce free blocks, but we unecessarily lock ufsmount
and do inode block write.  Usual point of not doing optimizations for
the corner case of the full volume is not applicable there, the work
is easily avoidable, and the addition CPU and disk io load do not lead
to succeeding retry.

Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-08-16 16:50:48 +00:00
Konstantin Belousov
f1503c5a49 In ffs_balloc_ufs{1,2} routines, assert that unwind records do not
overflow local arrays.  This is not immediately obvious from the
static code inspection, due to retry logic.

Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-08-16 16:49:56 +00:00
Konstantin Belousov
2f514f92cf Implement VOP_FDATASYNC() for UFS.
Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D7471
2016-08-15 19:22:23 +00:00
Edward Tomasz Napierala
411455a8fb Replace all remaining calls to vprint(9) with vn_printf(9), and remove
the old macro.

MFC after:	1 month
2016-08-10 16:12:31 +00:00
Konstantin Belousov
e73b16ae7a Ensure that the UFS directory vnode' vm_object is properly sized
before UFS_BALLOC() is called.  I do not believe that this caused any
real issue on FreeBSD because the exclusive vnode lock is held over
the balloc/resize, the change is to make formally correct KPI use.

Based on:	the Matthew Dillon' patch from DragonFly BSD
PR:	93942
Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-07-20 14:40:56 +00:00
Kevin Lo
57d2ac2f90 arc4random() returns 0 to (2**32)−1, use an alternative to initialize
i_gen if it's zero rather than a divide by 2.

With inputs from  delphij, mckusick, rmacklem

Reviewed by:	mckusick
2016-05-22 14:31:20 +00:00
Konstantin Belousov
5f36cc5bfa Stop dropping and reacquiring Giant around geom calls in UFS.
Sponsored by:	The FreeBSD Foundation
2016-05-21 10:13:25 +00:00
Konstantin Belousov
c70b3cd28d Improve handling of rdev->si_mountpt on mount and unmount of FFS
volumes.  Treat the field as a semaphore protecting availability of
the device for mounting.  Do no access devvp->v_rdev without the vnode
lock owned.

Protect change of the devvp->v_bufobj bo_ops vector with the vnode
lock.

Reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-05-21 09:49:35 +00:00
Konstantin Belousov
3f7ca894de Ensure that ftruncate(2) is performed synchronously when file is
opened in O_SYNC mode, at least for UFS.  This also handles
truncation, done due to the O_SYNC | O_TRUNC flags combination to
open(2), in synchronous way.

Noted by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-05-18 12:03:57 +00:00
Konstantin Belousov
bfe89a8ee1 Do enable io accounting for read-only mounts and mounts which are
remounted to writeable after initial read-only.  Assign to
dev->si_mountpt earlier to account the accesses done at the mount
time.

Based on submission by:	bde
MFC after:	1 week
2016-05-17 21:35:35 +00:00
Konstantin Belousov
f30ddc49e6 If IO_SYNC was passed to ffs_truncate(), request synchronous inode
update from the final ffs_update().

Noted by:	bde
MFC after:	1 week
2016-05-17 21:30:58 +00:00
Konstantin Belousov
3d0691acdc For async UFS mounts, shrink the directory asynchronously, at least do
not pass IO_SYNC to ffs_truncate() unneccessary.

Submitted by:	bde
MFC after:	1 week
2016-05-17 21:28:28 +00:00
Konstantin Belousov
98cbffd7bd Fix comments.
Submitted by:	bde
MFC after:	1 week
2016-05-17 08:24:27 +00:00
Konstantin Belousov
5b6526413f Fix typo in the message.
Submitted by:	bde
MFC after:	1 week
2016-05-17 08:19:20 +00:00
Pedro F. Giffuni
08e941833d UFS: spelling fixes on comments.
No functional change.
2016-04-29 20:43:51 +00:00
Enji Cooper
028d33c96a Add FEATURE knob for testing for UFS extended attribute kernel support
Support can be verified via `feature_present("ufs_extattr")`, etc.

Differential Revision: https://reviews.freebsd.org/D6053
MFC after: 2 weeks
Relnotes: yes
Reviewed by: asomers, kib
Sponsored by: EMC / Isilon Storage Division
2016-04-22 08:09:27 +00:00
Pedro F. Giffuni
d9c9c81c08 sys: use our roundup2/rounddown2() macros when param.h is available.
rounddown2 tends to produce longer lines than the original code
and when the code has a high indentation level it was not really
advantageous to do the replacement.

This tries to strike a balance between readability using the macros
and flexibility of having the expressions, so not everything is
converted.
2016-04-21 19:57:40 +00:00
Pedro F. Giffuni
abafa4db03 ufs: replace 0 with NULL for pointers.
While here also do late initialization of the variables we are
changing.

Found with devel/coccinelle.

Reviewed by:	mckusick
MFC after:	2 weeks
2016-04-10 21:48:11 +00:00
Edward Tomasz Napierala
ae34b6ff96 Add four new RCTL resources - readbps, readiops, writebps and writeiops,
for limiting disk (actually filesystem) IO.

Note that in some cases these limits are not quite precise. It's ok,
as long as it's within some reasonable bounds.

Testing - and review of the code, in particular the VFS and VM parts - is
very welcome.

MFC after:	1 month
Relnotes:	yes
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D5080
2016-04-07 04:23:25 +00:00
Edward Tomasz Napierala
35030a5dd4 Remove some NULL checks for M_WAITOK allocations.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-03-29 13:56:59 +00:00
Konstantin Belousov
c79dff0fbe Split the global taskqueue used to process all UFS trim completions,
into per-mount taskqueue with the private taskqueue processing thread.
This allows to drain the taskqueue on unmount, to ensure that all
TRIMs are finished before mount structures are freed.

But just draining the taskqueue where TRIM biodone geom-up completions
are processed is not enough, since ffs_blkfree(), called by the task,
might result in more writes.  Count inflight delayed blkfree's and
pause() unmount until the counter drains as well.

Reported by:	Nick Evans <nevans@talkpoint.com>
Tested by:	Nick Evans <nevans@talkpoint.com>, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-03-27 08:21:17 +00:00
Konstantin Belousov
d016708959 Style: wrap long lines.
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-03-27 08:07:12 +00:00
Konstantin Belousov
6336aefc1d Fix locking mistake in softdep_waitidle(). The surrounding code
expects that the loop is always exited with the SU lock owned, even on
error.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2016-03-23 09:58:51 +00:00
Kirk McKusick
24b6748c42 The UFS filesystem requires that the last block of a file always be
allocated. When shortening the length of a file in which the new end
of the file contains a hole, the hole must have a block allocated.

Reported by: Maxim Sobolev
Reviewed by: kib
Tested by:   Peter Holm
2016-02-24 01:58:40 +00:00
Edward Tomasz Napierala
50102a6342 Remove ffs_mountroot() prototype; seems to be long gone.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-01-28 12:21:23 +00:00
Kirk McKusick
abe53f7e91 This fixes a bug in UFS2 exported NFS volumes. An NFS client can
crash a server that has exported UFS2 by presenting a filehandle
with an inode number that references an uninitialized inode in a
cylinder group. The problem is that UFS2 only initializes blocks
of inodes as they are first allocated and ffs_fhtovp() does not
validate that the inode is in a range of inodes that have been
initialized. Attempting to read an uninitialized inode gets random
data from the disk. When the kernel tries to interpret it as an
inode, panics often arise.

Reported by: Christos Zoulas (from NetBSD)
Reviewed by: kib
2016-01-27 21:27:05 +00:00
Kirk McKusick
d2a28cb080 The bread() function was inconsistent about whether it would return
a buffer pointer in the event of an error (for some errors it would
return a buffer pointer and for other errors it would not return a
buffer pointer). The cluster_read() function was similarly inconsistent.

Clients of these functions were inconsistent in handling errors.
Some would assume that no buffer was returned after an error and
would thus lose buffers under certain error conditions. Others would
assume that brelse() should always be called after an error and
would thus panic the system under certain error conditions.

To correct both of these problems with minimal code churn, bread()
and cluster_write() now always free the buffer when returning an
error thus ensuring that buffers will never be lost. The brelse()
routine checks for being passed a NULL buffer pointer and silently
returns to avoid panics. Thus both approaches to handling error
returns from bread() and cluster_read() will work correctly.

Future code should be written assuming that bread() and cluster_read()
will never return a buffer with an error, so should not attempt to
brelse() the buffer when an error is returned.

Reviewed by: kib
2016-01-27 21:23:01 +00:00
Konstantin Belousov
44de1ba141 Recheck curthread->td_su after the VFS_SYNC() call, and re-sync if the
ast was rescheduled during VFS_SYNC().  It is possible that enough
parallel writes or slow/hung volume result in VFS_SYNC() deferring to
the ast flushing of workqueue.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-12-21 11:50:32 +00:00
Konstantin Belousov
dc15d94fd9 Update ctime when atime or birthtime are updated.
Cleanup setting of ctime/mtime/birthtime: do not set IN_ACCESS or
IN_UPDATE, then clear them with ufs_itimes(), making transient
(possibly inconsistent) change to the times, and then copy
user-supplied times into the inode.  Instead, directly clear IN_ACCESS
or IN_UPDATE when user supplied the time, and copy the value into the
inode.

Minor inconsistency left is that the inode ctime is updated even when
birthtime update attempt is performed on a UFS1 volume.

Submitted by:	bde
MFC after:	2 weeks
2015-12-07 12:09:04 +00:00
Kirk McKusick
43a993bb7d For performance reasons, it is useful to have a single string used as
the name of a filesystem when setting it as the first parameter to the
getnewvnode() function. Most filesystems call getnewvnode from just one
place so can use a literal string as the first parameter. However, NFS
calls getnewvnode from two places, so we create a global constant string
that can be used by the two instances. This change also collapses two
instances of getnewvnode() in the UFS filesystem to a single call.

Reviewed by: kib
Tested by:   Peter Holm
2015-11-29 21:01:02 +00:00
Konstantin Belousov
fe920528c1 Do not perform read-ahead for BA_CLRBUF request when we are low on
memory or when dirty buffer queue is too large.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-10-27 13:44:13 +00:00
Warner Losh
476dfdc3bf Do not relocate extents to make them contiguous if the underlying drive can do
deletions. Ability to do deletions is a strong indication that this
optimization will not help performance. It will only generate extra write
traffic. These devices are typically flash based and have a limited number of
write cycles. In addition, making the file contiguous in LBA space doesn't
improve the access times from flash devices because they have no seek time.

Reviewed by: mckusick@
2015-10-16 03:06:02 +00:00
Gleb Smirnoff
20e8b32db4 In softdep_setup_freeblocks():
- Move the bread() to the beginning of function.
- Return if it fails, otherwise we will panic.

Submitted by:	mckusick
Sponsored by:	Netflix
2015-10-07 12:36:28 +00:00
Konstantin Belousov
7a82f35c9d Do not consume extra reference. This is a bug in r287479.
Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-09-05 12:28:18 +00:00
Konstantin Belousov
c65f759837 Declare the writes around the call to VFS_SYNC() in
softdep_ast_cleanup_proc().

Tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-09-05 08:48:24 +00:00
Konstantin Belousov
76cae067b8 By doing file extension fast, it is possible to create excess supply
of the D_NEWBLK kinds of dependencies (i.e. D_ALLOCDIRECT and
D_ALLOCINDIR), which can exhaust kmem.

Handle excess of D_NEWBLK in the same way as excess of D_INODEDEP and
D_DIRREM, by scheduling ast to flush dependencies, after the thread,
which created new dep, left the VFS/FFS innards.  For D_NEWBLK, the
only way to get rid of them is to do full sync, since items are
attached to data blocks of arbitrary vnodes.  The check for D_NEWBLK
excess in softdep_ast_cleanup_proc() is unlocked.

For 32bit arches, reduce the total amount of allowed dependencies by
two.  It could be considered increasing the limit for 64 bit platforms
with direct maps.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-09-01 13:07:27 +00:00
Jeff Roberson
98082691bb - Make 'struct buf *buf' private to vfs_bio.c. Having a global variable
'buf' is inconvenient and has lead me to some irritating to discover
   bugs over the years.  It also makes it more challenging to refactor
   the buf allocation system.
 - Move swbuf and declare it as an extern in vfs_bio.c.  This is still
   not perfect but better than it was before.
 - Eliminate the unused ffs function that relied on knowledge of the buf
   array.
 - Move the shutdown code that iterates over the buf array into vfs_bio.c.

Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
2015-07-29 02:26:57 +00:00
Jeff Roberson
fade8dd714 Refactor unmapped buffer address handling.
- Use pointer assignment rather than a combination of pointers and
   flags to switch buffers between unmapped and mapped.  This eliminates
   multiple flags and generally simplifies the logic.
 - Eliminate b_saveaddr since it is only used with pager bufs which have
   their b_data re-initialized on each allocation.
 - Gather up some convenience routines in the buffer cache for
   manipulating buf space and buf malloc space.
 - Add an inline, buf_mapped(), to standardize checks around unmapped
   buffers.

In collaboration with: mlaier
Reviewed by:	kib
Tested by:	pho (many small revisions ago)
Sponsored by:	EMC / Isilon Storage Division
2015-07-23 19:13:41 +00:00
Mateusz Guzik
f0725a8e1e Move chdir/chroot-related fdp manipulation to kern_descrip.c
Prefix exported functions with pwd_.

Deduplicate some code by adding a helper for setting fd_cdir.

Reviewed by:	kib
2015-07-11 16:19:11 +00:00
Mark Johnston
5f34e93c58 Check suspendability on the mountpoint returned by VOP_GETWRITEMOUNT.
This obviates the need for a MNTK_SUSPENDABLE flag, since passthrough
filesystems like nullfs and unionfs no longer need to inherit this
information from their lower layer(s). This change also restores the
pre-r273336 behaviour of using the presence of a susp_clean VFS method to
request suspension support.

Reviewed by:	kib, mjg
Differential Revision:	https://reviews.freebsd.org/D2937
2015-07-05 22:37:33 +00:00
Mark Murray
d1b06863fb Huge cleanup of random(4) code.
* GENERAL
- Update copyright.
- Make kernel options for RANDOM_YARROW and RANDOM_DUMMY. Set
  neither to ON, which means we want Fortuna
- If there is no 'device random' in the kernel, there will be NO
  random(4) device in the kernel, and the KERN_ARND sysctl will
  return nothing. With RANDOM_DUMMY there will be a random(4) that
  always blocks.
- Repair kern.arandom (KERN_ARND sysctl). The old version went
  through arc4random(9) and was a bit weird.
- Adjust arc4random stirring a bit - the existing code looks a little
  suspect.
- Fix the nasty pre- and post-read overloading by providing explictit
  functions to do these tasks.
- Redo read_random(9) so as to duplicate random(4)'s read internals.
  This makes it a first-class citizen rather than a hack.
- Move stuff out of locked regions when it does not need to be
  there.
- Trim RANDOM_DEBUG printfs. Some are excess to requirement, some
  behind boot verbose.
- Use SYSINIT to sequence the startup.
- Fix init/deinit sysctl stuff.
- Make relevant sysctls also tunables.
- Add different harvesting "styles" to allow for different requirements
  (direct, queue, fast).
- Add harvesting of FFS atime events. This needs to be checked for
  weighing down the FS code.
- Add harvesting of slab allocator events. This needs to be checked for
  weighing down the allocator code.
- Fix the random(9) manpage.
- Loadable modules are not present for now. These will be re-engineered
  when the dust settles.
- Use macros for locks.
- Fix comments.

* src/share/man/...
- Update the man pages.

* src/etc/...
- The startup/shutdown work is done in D2924.

* src/UPDATING
- Add UPDATING announcement.

* src/sys/dev/random/build.sh
- Add copyright.
- Add libz for unit tests.

* src/sys/dev/random/dummy.c
- Remove; no longer needed. Functionality incorporated into randomdev.*.

* live_entropy_sources.c live_entropy_sources.h
- Remove; content moved.
- move content to randomdev.[ch] and optimise.

* src/sys/dev/random/random_adaptors.c src/sys/dev/random/random_adaptors.h
- Remove; plugability is no longer used. Compile-time algorithm
  selection is the way to go.

* src/sys/dev/random/random_harvestq.c src/sys/dev/random/random_harvestq.h
- Add early (re)boot-time randomness caching.

* src/sys/dev/random/randomdev_soft.c src/sys/dev/random/randomdev_soft.h
- Remove; no longer needed.

* src/sys/dev/random/uint128.h
- Provide a fake uint128_t; if a real one ever arrived, we can use
  that instead. All that is needed here is N=0, N++, N==0, and some
  localised trickery is used to manufacture a 128-bit 0ULLL.

* src/sys/dev/random/unit_test.c src/sys/dev/random/unit_test.h
- Improve unit tests; previously the testing human needed clairvoyance;
  now the test will do a basic check of compressibility. Clairvoyant
  talent is still a good idea.
- This is still a long way off a proper unit test.

* src/sys/dev/random/fortuna.c src/sys/dev/random/fortuna.h
- Improve messy union to just uint128_t.
- Remove unneeded 'static struct fortuna_start_cache'.
- Tighten up up arithmetic.
- Provide a method to allow eternal junk to be introduced; harden
  it against blatant by compress/hashing.
- Assert that locks are held correctly.
- Fix the nasty pre- and post-read overloading by providing explictit
  functions to do these tasks.
- Turn into self-sufficient module (no longer requires randomdev_soft.[ch])

* src/sys/dev/random/yarrow.c src/sys/dev/random/yarrow.h
- Improve messy union to just uint128_t.
- Remove unneeded 'staic struct start_cache'.
- Tighten up up arithmetic.
- Provide a method to allow eternal junk to be introduced; harden
  it against blatant by compress/hashing.
- Assert that locks are held correctly.
- Fix the nasty pre- and post-read overloading by providing explictit
  functions to do these tasks.
- Turn into self-sufficient module (no longer requires randomdev_soft.[ch])
- Fix some magic numbers elsewhere used as FAST and SLOW.

Differential Revision: https://reviews.freebsd.org/D2025
Reviewed by: vsevolod,delphij,rwatson,trasz,jmg
Approved by: so (delphij)
2015-06-30 17:00:45 +00:00
Konstantin Belousov
521987c3e5 Simplify code, no need to test the flag before clearing it.
Submitted by:	ed
MFC after:	12 days
2015-06-29 13:06:24 +00:00
Konstantin Belousov
b2c3df842b Handle errors from background write of the cylinder group blocks.
First, on the write error, bufdone() call from ffs_backgroundwrite()
panics because pbrelvp() cleared bp->b_bufobj, while brelse() would
try to re-dirty the copy of the cg buffer.  Handle this by setting
B_INVAL for the case of BIO_ERROR.

Second, we must re-dirty the real buffer containing the cylinder group
block data when background write failed.  Real cg buffer was already
marked clean in ffs_bufwrite(). After the BV_BKGRDINPROG flag is
cleared on the real cg buffer in ffs_backgroundwrite(), buffer scan
may reuse the buffer at any moment. The result is lost write, and if
the write error was only transient, we get corrupted bitmaps.

We cannot re-dirty the original cg buffer in the
ffs_backgroundwritedone(), since the context is not sleepable,
preventing us from sleeping for origbp' lock.  Add BV_BKGDERR flag
(protected by the buffer object lock), which is converted into delayed
write by brelse(), bqrelse() and buffer scan.

In collaboration with:	Conrad Meyer <cse.cem@gmail.com>
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation (kib),
	  EMC/Isilon storage division (Conrad)
MFC after:	2 weeks
2015-06-27 09:44:14 +00:00
Konstantin Belousov
1eabd96728 vfs_msync(), called from syncer vnode fsync VOP, only iterates over
the active vnode list for the given mount point, with the assumption
that vnodes with dirty pages are active.  This is enforced by
vinactive() doing vm_object_page_clean() pass over the vnode pages.

The issue is, if vinactive() cannot be called during vput() due to the
vnode being only shared-locked, we might end up with the dirty pages
for the vnode on the free list.  Such vnode is invisible to syncer,
and pages are only cleaned on the vnode reactivation.  In other words,
the race results in the broken guarantee that user data, written
through the mmap(2), is written to the disk not later than in 30
seconds after the write.

Fix this by keeping the vnode which is freed but still owing
inactivation, on the active list.  When syncer loops find such vnode,
it is deactivated and cleaned by the final vput() call.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-06-17 04:46:58 +00:00
Mateusz Guzik
4da8456f0a Replace struct filedesc argument in getvnode with struct thread
This is is a step towards removal of spurious arguments.
2015-06-16 13:09:18 +00:00
Konstantin Belousov
46c3d3ac99 Syncing a directory vnode might drop the vnode lock in the
softdep_sync() similarly to the regular vnode sync.  Allow retry for
both vnode types.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-06-03 20:48:00 +00:00
Konstantin Belousov
aef373ce38 Remove unused variable.
When deallocate_dependencies() is performed,
softdep_journal_freeblocks() already called cancel_allocdirect() which
should have eliminated direct dependencies for all truncated full
blocks.  The indirect dependencies are allowed above, since second-
and third-level dependencies are only dealt with by the code which
frees indirect block, which happens after the inode write.

Discussed with:	mckusick, jeff
Reviewed by:	jeff
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-05-31 15:50:54 +00:00
Konstantin Belousov
69baeadc31 Remove several write-only variables, all reported by the gcc 4.9
buildkernel run.

Some of them were write-only under some kernel options, e.g. variables
keeping values only used by CTR() macros.  It costs nothing to the
code readability and correctness to eliminate the warnings in those
cases too by removing the local cached values used only for
single-access.

Review:	https://reviews.freebsd.org/D2665
Reviewed by:	rodrigc
Looked at by:	bjk
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-05-29 13:24:17 +00:00
Konstantin Belousov
0bc4fe10ad After r283600, NODELAY flag to inodedep_lookup() function is unused.
Eliminate it, and simplify code by removing the local dflags variable
always initialized to DEPALLOC.

Noted by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-05-27 09:49:04 +00:00
Konstantin Belousov
1bc93bb7b9 Currently, softupdate code detects overstepping on the workitems
limits in the code which is deep in the call stack, and owns several
critical system resources, like vnode locks.  Attempt to wait while
the per-mount softupdate thread cleans up the backlog may deadlock,
because the thread might need to lock the same vnode which is owned by
the waiting thread.

Instead of synchronously waiting for the worker, perform the worker'
tickle and pause until the backlog is cleaned, at the safe point
during return from kernel to usermode.  A new ast request to call
softdep_ast_cleanup() is created, the SU code now only checks the size
of queue and schedules ast.

There is no ast delivery for the kernel threads, so they are exempted
from the mechanism, except NFS daemon threads.  NFS server loop
explicitely checks for the request, and informs the schedule_cleanup()
that it is capable of handling the requests by the process P2_AST_SU
flag.  This is needed because nfsd may be the sole cause of the SU
workqueue overflow.  But, to not cause nsfd to spawn additional
threads just because we slow down existing workers, only tickle su
threads, without waiting for the backlog cleanup.

Reviewed by:	jhb, mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-05-27 09:20:42 +00:00
Kirk McKusick
74a87c3804 Limit the number of cylinder groups that will be searched when
trying to build a cluster. The limit is tunable using the sysctl
vfs.ffs.maxclustersearch. The current limit is 10 cylinder groups
per block allocation. It was previously limited to the number of
cylinder groups in the filesystem per block allocation. When there
were no clusters of the needed size left, it repeatedly searched
the whole filesystem for a non-existent cluster on every block
allocation. The result was very slow filesystem allocation with
100% CPU utilization. The old behavior can be had by setting
vfs.ffs.maxclustersearch to a huge number (1,000,000).

This change affects only the layout policy routines so is not able
to interfere with the integrity of the filesystem.

Reported by: Dmitry Sivachenko (demon@)
Tested by:   Dmitry Sivachenko (demon@)
MFC after:   2 weeks
2015-04-24 23:27:50 +00:00
Rick Macklem
dda11d4ab9 File systems that do not use the buffer cache (such as ZFS) must
use VOP_FSYNC() to perform the NFS server's Commit operation.
This patch adds a mnt_kern_flag called MNTK_USES_BCACHE which
is set by file systems that use the buffer cache. If this flag
is not set, the NFS server always does a VOP_FSYNC().
This should be ok for old file system modules that do not set
MNTK_USES_BCACHE, since calling VOP_FSYNC() is correct, although
it might not be optimal for file systems that use the buffer cache.

Reviewed by:	kib
MFC after:	2 weeks
2015-04-15 20:16:31 +00:00
Konstantin Belousov
9928931186 Fix build (with gcc).
Reported by:	bz, ian
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-03-27 15:49:21 +00:00
Konstantin Belousov
4af9f77ecb Fix the hand after the immediate reboot when the following command
sequence is performed on UFS SU+J rootfs:
cp -Rp /sbin/init /sbin/init.old
mv -f /sbin/init.old /sbin/init

Hang occurs on the rootfs unmount.  There are two issues:

1. Removed init binary, which is still mapped, creates a reference to
the removed vnode. The inodeblock for such vnode must have active
inodedep, which is (eventually) linked through the unlinked list. This
means that ffs_sync(MNT_SUSPEND) cannot succeed, because number of
softdep workitems for the mp is always > 0.  FFS is suspended during
unmount, so unmount just hangs.

2. As noted above, the inodedep is linked eventually.  It is not
linked until the superblock is written.  But at the vfs_unmountall()
time, when the rootfs is unmounted, the call is made to
ffs_unmount()->ffs_sync() before vflush(), and ffs_sync() only calls
ffs_sbupdate() after all workitems are flushed.  It is masked for
normal system operations, because syncer works in parallel and
eventually flushes superblock.  Syncer is stopped when rootfs
unmounted, so ffs_sync() must do sb update on its own.

Correct the issues listed above. For MNT_SUSPEND, count the number of
linked unlinked inodedeps (this is not a typo) and substract the count
of such workitems from the total. For the second issue, the
ffs_sbupdate() is called right after device sync in ffs_sync() loop.

There is third problem, occuring with both SU and SU+J. The
softdep_waitidle() loop, which waits for softdep_flush() thread to
clear the worklist, only waits 20ms max. It seems that the 1 tick,
specified for msleep(9), was a typo.

Add fsync(devvp, MNT_WAIT) call to softdep_waitidle(), which seems to
significantly help the softdep thread, and change the MNT_LAZY update
at the reboot time to MNT_WAIT for similar reasons.  Note that
userspace cannot create more work while devvp is flushed, since the
mount point is always suspended before the call to softdep_waitidle()
in unmount or remount path.

PR:	195458
In collaboration with:	gjb, pho
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-03-27 13:55:56 +00:00
Konstantin Belousov
f6f76e1747 Partially revert r277922, avoid sleeping and do flush if we a awaken,
instead of waiting for the FLUSH_* flags.  Also, when requesting
flush, do the wakeups unconditionally even when FLUSH_CLEANUP flag was
already set.

Reported and tested by:	dim,
	"Lundberg, Johannes" <johannes@brilliantservice.co.jp>
Bisected by:	dim
MFC after:	2 weeks
2015-02-05 13:00:27 +00:00
Konstantin Belousov
1c9b585695 When mounting SU-enabled mount point, wait until the softdep_flush()
thread started and incremented the stat_flush_threads [1].

Unconditionally wakeup softdep_flush threads when needed, do not try
to check wchan, which is racy and breaks abstraction.

Reported by and discussed with:	glebius, neel
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-01-30 11:41:46 +00:00
Konstantin Belousov
cda374fe00 The sys_quotactl() contract demands that the mount point is
vfs_unbusy()ed when the cmd is Q_QUOTAON, regardless of other input
parameters or error return.

Submitted by:	Conrad Meyer
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:  https://reviews.freebsd.org/D1684
Tested by:	pho
MFC after:	1 week
2015-01-27 10:32:49 +00:00
Konstantin Belousov
789bdfdbc6 Handle MAKEENTRY cnp flag in the VOP_CREATE(). Curiously, some
fs, e.g. smbfs, already did it.

Tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-12-21 13:29:33 +00:00
Konstantin Belousov
6c21f6edb8 The VOP_LOOKUP() implementations for CREATE op do not put the name
into namecache, to avoid cache trashing when doing large operations.
E.g., tar archive extraction is not usually followed by access to many
of the files created.

Right now, each VOP_LOOKUP() implementation explicitely knowns about
this quirk and tests for both MAKEENTRY flag presence and op != CREATE
to make the call to cache_enter().  Centralize the handling of the
quirk into VFS, by deciding to cache only by MAKEENTRY flag in VOP.
VFS now sets NOCACHE flag for CREATE namei() calls.

Note that the change in semantic is backward-compatible and could be
merged to the stable branch, and is compatible with non-changed
third-party filesystems which correctly handle MAKEENTRY.

Suggested by:	Chris Torek <torek@pi-coral.com>
Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-12-18 10:01:12 +00:00
Gleb Kurtsou
dde58752db Adjust printf format specifiers for dev_t and ino_t in kernel.
ino_t and dev_t are about to become uint64_t.

Reviewed by:	kib, mckusick
2014-12-17 07:27:19 +00:00
Gleb Smirnoff
90effb2341 Merge from projects/sendfile:
o Provide a new VOP_GETPAGES_ASYNC(), which works like VOP_GETPAGES(), but
  doesn't sleep. It returns immediately, and will execute the I/O done handler
  function that must be supplied as argument.
o Provide VOP_GETPAGES_ASYNC() for the FFS, which uses vnode_pager.
o Extend pagertab to support pgo_getpages_async method, and implement this
  method for vnode_pager.

Reviewed by:	kib
Tested by:	pho
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-11-23 12:01:52 +00:00
Gleb Smirnoff
dbfd8ef2e4 buf.h is not needed here, and pollutes when ufsmount.h is included
from userland code.

Sponsored by:	Nginx, Inc.
2014-11-23 01:02:19 +00:00
Gleb Smirnoff
8c7f0b92b4 Include required files directly instead of pollution via ufs/ufsmount.h.
Sponsored by:	Nginx, Inc.
2014-11-23 01:01:14 +00:00
Davide Italiano
0a4102656e Use the correct variable name. 2014-11-22 00:42:30 +00:00
Davide Italiano
c38b12220f Make ufs_dirhashreclaimperc a percentage for real and
rename it to ufs_dirhashreclaimpercent, as suggested
by jhb@. As an added bonus this avoids divide-by-zero
errors.

Requested by:	jhb, markj
Reviewied by:	jhb, markj
2014-11-22 00:37:37 +00:00
Konstantin Belousov
ca109b01cf When non-forced unmount or remount rw->ro is performed, writes on UFS
are not suspended.  In particular, on the SU-enabled vulumes, there is
no reason why, between the call to softdep_flushfiles() and
softdep_waitidle(), SU work items cannot be queued.

Correct the condition to trigger the panic by only checking when
forced operation is done.  Convert direct panic() call into KASSERT(),
there is no invalid on-disk data structures directly involved, so
follow the usual debugging vs. non-debugging approach.

Reported and tested by:	pho
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-11-02 13:14:55 +00:00
Mateusz Guzik
4fce16e4c9 Provide vfs suspension support only for filesystems which need it, take
two.

nullfs and unionfs need to request suspension if underlying filesystem(s)
use it. Utilize mnt_kern_flag for this purpose.

This is a fixup for 273271.

No strong objections from: kib
Pointy hat to: mjg
MFC after:	2 weeks
2014-10-20 18:00:50 +00:00
Mateusz Guzik
8f889ce7c8 Use lockless quota checks in qsync and qsyncvp.
No strong objections from: kib, mckusick
MFC after:	1 week
2014-10-16 12:41:14 +00:00
Konstantin Belousov
df5c9c0411 Do not set IN_ACCESS flag for read-only mounts. The IN_ACCESS
survives remount in rw, also it is set for vnodes on rootfs before
noatime can be set or clock is adjusted.  All conditions result in
wrong atime for accessed vnodes.

Submitted by:	bde
MFC after:	1 week
2014-10-11 19:09:56 +00:00
Warner Losh
34eed6d3de Restore the backed-out change, using __offsetof instead. 2014-10-10 00:35:08 +00:00
Baptiste Daroussin
16b2833f90 Backout r272825 every useland usage of ufs/ufs/dir.h are now broken with that change 2014-10-09 17:26:29 +00:00
Baptiste Daroussin
92a1d420ba Use offsetof() from sys/types.h instead of a custom one
This fixes build with recent gcc versions
2014-10-09 15:26:22 +00:00
Konstantin Belousov
d15b55c554 Provide the unique implementation for the VOP_GETPAGES() method used
by ffs and ext2fs.  Remove duplicated call to vm_page_zero_invalid(),
done by VOP and by vm_pager_getpages().  Use vm_pager_free_nonreq().

Reviewed by:	alc (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	6 weeks (after r271596)
2014-09-15 12:28:29 +00:00
Alan Cox
3e5c84e292 We don't need an exclusive object lock on the expected execution path
through {ext2,ffs}_getpages().

Reviewed by:	kib, pfg
MFC after:	6 weeks
Sponsored by:	EMC / Isilon Storage Division
2014-09-13 18:26:13 +00:00
Konstantin Belousov
c6fef2d49a Direct access to the quota files, in particular, lookup, causes lock
conflict with the quota metadata access.  Mark quota vnode lock as
recursive and always exclusive to avoid the problem.

Reported by:	hrs
Tested by:	hrs, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-08-29 09:04:24 +00:00
Davide Italiano
266d427bc0 Rather than using an hardcoded reclaim age, rely on an LRU-like approach
for dirhash cache, setting a target percent to reclaim (exposed via
SYSCTL). This allows to always make some amount of progress keeping the maximum
reclaim age dynamic.

Tested by:      pho
Reviewed by:	jhb
2014-08-25 17:06:18 +00:00
Konstantin Belousov
29d03af1e4 Do not busy the UFS mount point inside VOP_RENAME(). The
kern_renameat() already starts write on the mp, which prevents
parallel unmount from proceed.  Busying mp after vn_start_write()
deadlocks the unmount.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-08-20 08:15:23 +00:00
Konstantin Belousov
4ce90426e0 Correct the test for condition to suspend UFS filesystem during
unmount.  There is no need to suspend read-only filesystem, while we
need suspension on modificable mount point.

Reported by:	rwatson
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-08-20 08:13:03 +00:00
Konstantin Belousov
a6b5e6e32f Revision r269457 removed the Giant around mount and unmount code, but
r269533, which was tested before r269457 was committed, implicitely
relied on the Giant to protect the manipulations of the softdepmounts
list.  Use softdep global lock consistently to guarantee the list
structure now.

Insert the new struct mount_softdeps into the softdepmounts only after
it is sufficiently initialized, to prevent softdep_speedup() from
accessing bare memory.  Similarly, remove struct mount_softdeps for
the unmounted filesystem from the tailq before destroying structure
rwlock.

Reported and tested by:	pho
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-08-12 09:33:00 +00:00
Kirk McKusick
3da0b29d99 The SUJ journal is only prepared to handle full-size block numbers, so we
have to adjust freeblk records to reflect the change to a full-size block.
For example, suppose we have a block made up of fragments 8-15 and
want to free its last two fragments. We are given a request that says:
    FREEBLK ino=5, blkno=14, lbn=0, frags=2, oldfrags=0
where frags are the number of fragments to free and oldfrags are the
number of fragments to keep. To block align it, we have to change it to
have a valid full-size blkno, so it becomes:
    FREEBLK ino=5, blkno=8, lbn=0, frags=2, oldfrags=6

Submitted by: Mikihito Takehara
Tested by:    Mikihito Takehara
Reviewed by:  Jeff Roberson
MFC after:    1 week
2014-08-07 16:53:07 +00:00
Kirk McKusick
5f9500c358 Add support for multi-threading of soft updates.
Replace a single soft updates thread with a thread per FFS-filesystem
mount point. The threads are associated with the bufdaemon process.

Reviewed by:  kib
Tested by:    Peter Holm and Scott Long
MFC after:    2 weeks
Sponsored by: Netflix
2014-08-04 22:03:58 +00:00
Warner Losh
ddd812b850 Simplify comment to remove multiple negative and passive voice. 2014-07-23 16:18:54 +00:00
Konstantin Belousov
65589a29f4 Check for the cross-device cross-link attempt in the VFS, instead of
forcing filesystem VOP_LINK() methods to repeat the code.  In
tmpfs_link(), remove redundand check for the type of the source,
already done by VFS.

Note that NFS server already performs this check before calling
VOP_LINK().

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-07-16 14:04:46 +00:00
Konstantin Belousov
895b3782c6 Extract the code to put a filesystem into the suspended state (at the
unmount time) in the helper vfs_write_suspend_umnt().  Use it instead
of two inline copies in FFS.

Fix the bug in the FFS unmount, when suspension failed, the ufs
extattrs were not reinitialized.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-07-14 09:10:00 +00:00
Konstantin Belousov
7b81a399a4 In msdosfs_setattr(), add a check for result of the utimes(2)
permissions test, forgotten in r164033.

Refactor the permission checks for utimes(2) into vnode helper
function vn_utimes_perm(9), and simplify its code comparing with the
UFS origin, by writing the call to VOP_ACCESSX only once.  Use the
helper for UFS(5), tmpfs(5), devfs(5) and msdosfs(5).

Reported by:	bde
Reviewed by:	bde, trasz
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-06-17 07:11:00 +00:00
Konstantin Belousov
23f6698fbd Initialize the pbuf counter for directio using SYSINIT, instead of
using a direct hook called from kern_vfs_bio_buffer_alloc().
Mark ffs_rawread.c as requiring both ffs and directio options to be
compiled into the kernel.  Add ffs_rawread.c to the list of ufs.ko
module' sources.

In addition to stopping breaking the layering violation, it also
allows to link kernel when FFS is configured as module and DIRECTIO is
enabled.

One consequence of the change is that ffs_rawread.o is always linked
into the module regardless of the DIRECTIO option.  This is similar to
the option QUOTA and ufs_quota.c.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-06-08 10:55:06 +00:00
John-Mark Gurney
e9838c1140 don't check fs_flags for _FLAGS_UPDATED as it is stored in fs_old_flags..
If you had a UFS2 FS that didn't have it's super block at SBLOCK_UFS2,
you'll end up corrupting your FS as the superblock is updated and written
to a different location...

makefs used to put the superblock at SBLOCK_UFS1 for UFS 2 FS's causing
this issue...

Reviewed by:	silience from mckusick
MFC after:	1 week
2014-06-03 21:46:13 +00:00
Scott Long
7d155880ee Due to reasons unknown at this time, the system can be forced to write
a journal block even when there are no journal entries to be written.
Until the root cause is found, handle this case by ensuring that a
valid journal segment is always written.

Second, the data buffer used for writing journal entries was never
being scrubbed of old data.  Fix this.

Submitted by:	Takehara Mikihito
Obtained from:	Netflix, Inc.
MFC after:	3 days
2014-05-06 20:40:16 +00:00
Kirk McKusick
e24784d32a Update comment to explain search order reverted to historical order
in -r254996.

Suggested by: Pedro Giffuni <pfg@FreeBSD.org>
MFC:          3 days
2014-03-22 11:26:39 +00:00
Robert Watson
4a14441044 Update kernel inclusions of capability.h to use capsicum.h instead; some
further refinement is required as some device drivers intended to be
portable over FreeBSD versions rely on __FreeBSD_version to decide whether
to include capability.h.

MFC after:	3 weeks
2014-03-16 10:55:57 +00:00
Jeff Roberson
4803948fe2 - If we fail to do a non-blocking acquire of a buf lock while doing a
waiting sync pass we need to do a blocking acquire and restart.
   Another thread, typically the buf daemon, may have this buf locked and
   if we don't wait we can fail to sync the file.  This lead to a great
   variety of softdep panics because we rely on all dependencies being
   flushed before proceeding in several cases.

Reported by:	pho
Discussed with:	mckusick
Sponsored by:	EMC / Isilon Storage Division
MFC after:	2 weeks
2014-03-06 00:13:21 +00:00
Jeff Roberson
c13a58b022 - Gracefully handle truncation failures when trying to shrink directories.
This could cause dirhash panics since the dirhash state would be
   successfully truncated while the directory was not.

Reported by:	pho
Discussed with:	mckusick
Sponsored by:	EMC / Isilon Storage Division
MFC after:	2 weeks
2014-03-06 00:10:07 +00:00
Pedro F. Giffuni
4896af9f55 ufs: small formatting fixes.
Cleanup some extra space.
Use of tabs vs. spaces.
No functional change.

MFC after:	3 days
Reviewed by:	mckusick
2014-03-02 02:52:34 +00:00
Kirk McKusick
be1fd82346 Fine tune filesystem block allocations under low free-space
conditions (-r254995) based on further operational experience.

Submitted by:  Dmitry Sivachenko
Fix Tested by: Dmitry Sivachenko
MFC after: 2 weeks
2013-12-30 17:04:24 +00:00
Kirk McKusick
9cd7cb1bf1 Properly handle unsigned comparison.
MFC after: 2 weeks
2013-12-30 06:19:42 +00:00
Kirk McKusick
2e436d88a3 We needlessly panic when trying to flush MKDIR_PARENT dependencies.
We had previously tried to flush all MKDIR_PARENT dependencies (and
all the NEWBLOCK pagedeps) by calling ffs_update(). However this will
only resolve these dependencies in direct blocks. So very large
directories with MKDIR_PARENT dependencies in indirect blocks had
not yet gotten flushed. As the directory is in the midst of doing a
complete sync, we simply defer the checking of the MKDIR_PARENT
dependencies until the indirect blocks have been sync'ed.

Reported by: Shawn Wallbridge of imaginaryforces.com
Tested by:   John-Mark Gurney <jmg@funkthat.com>
PR:          183424
MFC after:   2 weeks
2013-12-01 07:34:21 +00:00
John-Mark Gurney
e0ce310797 fix white space...
MFC after:	1 week
2013-11-20 21:21:29 +00:00
John-Mark Gurney
b6ffc3b567 fix a use after free, jsegdep_merge will free wk, avoid the next check...
CID:		1006098
Sponsored by:	Imaginary Forces
Reviewed by:	mckusick
MFC after:	1 week
2013-11-20 21:16:53 +00:00
Pedro F. Giffuni
4b367145f7 UFS2: make di_extsize unsigned.
di_extsize is the EA size and as such it should be unsigned.
Adjust related types for consistency.

Reviewed by:	mckusick (previous version)
MFC after:	3 weeks
2013-10-24 00:33:29 +00:00
Brooks Davis
cf058082cd Allow kernels without options SOFTUPDATES to build. This should fix the
embedded tinderboxes.

Reviewed by:	emaste
2013-10-21 20:51:08 +00:00
Kirk McKusick
07599ccb28 Fix build problem on ARM (which defaults to building without soft updates).
Reported by:  Tinderbox
Sponsored by: Netflix
2013-10-21 13:09:09 +00:00
Kirk McKusick
58941b9f15 Restructuring of the soft updates code to set it up so that the
single kernel-wide soft update lock can be replaced with a
per-filesystem soft-updates lock. This per-filesystem lock will
allow each filesystem to have its own soft-updates flushing thread
rather than being limited to a single soft-updates flushing thread
for the entire kernel.

Move soft update variables out of the ufsmount structure and into
their own mount_softdeps structure referenced by ufsmount field
um_softdep.  Eventually the per-filesystem lock will be in this
structure. For now there is simply a pointer to the kernel-wide
soft updates lock.

Change all instances of ACQUIRE_LOCK and FREE_LOCK to pass the lock
pointer in the mount_softdeps structure instead of a pointer to the
kernel-wide soft-updates lock.

Replace the five hash tables used by soft updates with per-filesystem
copies of these tables allocated in the mount_softdeps structure.

Several functions that flush dependencies when too many are allocated
in the kernel used to operate across all filesystems. They are now
parameterized to flush dependencies from a specified filesystem.
For now, we stick with the round-robin flushing strategy when the
kernel as a whole has too many dependencies allocated.

While there are many lines of changes, there should be no functional
change in the operation of soft updates.

Tested by:    Peter Holm and Scott Long
Sponsored by: Netflix
2013-10-21 00:28:02 +00:00
Kirk McKusick
cc76ac5a6c Fourth of several cleanups to soft dependency implementation.
Add KASSERTS that soft dependency functions only get called
for filesystems running with soft dependencies. Calling these
functions when soft updates are not compiled into the system
become panic's.

No functional change.

Tested by:    Peter Holm and Scott Long
Sponsored by: Netflix
2013-10-20 22:21:01 +00:00
Kirk McKusick
519e3c3b9f Third of several cleanups to soft dependency implementation.
Ensure that softdep_unmount() and softdep_setup_sbupdate()
only get called for filesystems running with soft dependencies.

No functional change.

Tested by:    Peter Holm and Scott Long
Sponsored by: Netflix
2013-10-20 21:11:40 +00:00
Kirk McKusick
90a306d8af Second of several cleanups to soft dependency implementation.
Delete two unused functions in ffs_sofdep.c.

No functional change.

Tested by:    Peter Holm and Scott Long
Sponsored by: Netflix
2013-10-20 20:52:07 +00:00
Kirk McKusick
8850120f48 First of several cleanups to soft dependency implementation.
Convert three functions exported from ffs_softdep.c to static
functions as they are not used outside of ffs_softdep.c.

No functional change.

Tested by:    Peter Holm and Scott Long
Sponsored by: Netflix
2013-10-20 20:41:38 +00:00
Pedro F. Giffuni
b876c780d0 Make di_blocks unsigned in UFS1 as is the case already for UFS2.
Most of the code between UFS1 and UFS2 is shared so this change
is pretty safe. Not only this makes UFS1 and 2 consistent but it
also matches what NetBSD and MacOS X have for some years now.

Reviewed by:	mckusick
MFC after:	1 month
2013-10-14 18:17:09 +00:00
Pawel Jakub Dawidek
7008be5bd7 Change the cap_rights_t type from uint64_t to a structure that we can extend
in the future in a backward compatible (API and ABI) way.

The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides place for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.

The structure definition looks like this:

	struct cap_rights {
		uint64_t	cr_rights[CAP_RIGHTS_VERSION + 2];
	};

The initial CAP_RIGHTS_VERSION is 0.

The top two bits in the first element of the cr_rights[] array contain total
number of elements in the array - 2. This means if those two bits are equal to
0, we have 2 array elements.

The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain array index. Only one bit is
used and bit position in this five-bits range defines array index. This means
there can be at most five array elements in the future.

To define new right the CAPRIGHT() macro must be used. The macro takes two
arguments - an array index and a bit to set, eg.

	#define	CAP_PDKILL	CAPRIGHT(1, 0x0000000000000800ULL)

We still support aliases that combine few rights, but the rights have to belong
to the same array element, eg:

	#define	CAP_LOOKUP	CAPRIGHT(0, 0x0000000000000400ULL)
	#define	CAP_FCHMOD	CAPRIGHT(0, 0x0000000000002000ULL)

	#define	CAP_FCHMODAT	(CAP_FCHMOD | CAP_LOOKUP)

There is new API to manage the new cap_rights_t structure:

	cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
	void cap_rights_set(cap_rights_t *rights, ...);
	void cap_rights_clear(cap_rights_t *rights, ...);
	bool cap_rights_is_set(const cap_rights_t *rights, ...);

	bool cap_rights_is_valid(const cap_rights_t *rights);
	void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
	void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
	bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);

Capability rights to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions are provided by
separating them with commas, eg:

	cap_rights_t rights;

	cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);

There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:

	#define	cap_rights_set(rights, ...)				\
		__cap_rights_set((rights), __VA_ARGS__, 0ULL)
	void __cap_rights_set(cap_rights_t *rights, ...);

Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:

	cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);

Providing several rights that belongs to the same array's element this way is
correct, but is not advised. It should only be used for aliases definition.

This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.

Sponsored by:	The FreeBSD Foundation
2013-09-05 00:09:56 +00:00
Kirk McKusick
2ce451f089 In looking at block layouts as part of fixing filesystem block
allocations under low free-space conditions (-r254995), determine
that old block-preference search order used before -r249782 worked
a bit better. This change reverts to that block-preference search order.

MFC after:	2 weeks
2013-08-28 17:46:32 +00:00
Kirk McKusick
28702816d8 A performance problem was reported in PR kern/181226:
I have 25TB Dell PERC 6 RAID5 array. When it becomes almost
    full (10-20GB free), processes which write data to it start
    eating 100% CPU and write speed drops below 1MB/sec (normally
    to gives 400MB/sec). The revision at which it first became
    apparent was http://svnweb.freebsd.org/changeset/base/249782.

The offending change reserved an area in each cylinder group to
store metadata. The new algorithm attempts to save this area for
metadata and allows its use for non-metadata only after all the
data areas have been exhausted. The size of the reserved area
defaults to half of minfree, so the filesystem reports full before
the data area can completely fill. However, in this report, the
filesystem has had minfree reduced to 1% thus forcing the metadata
area to be used for data. As the filesystem approached full, it
had only metadata areas left to allocate. The result was that
every block allocation had to scan summary data for 30,000 cylinder
groups before falling back to searching up to 30,000 metadata areas.

The fix is to give up on saving the metadata areas once the free
space reserve drops below 2%. The effect of this change is to use
the old algorithm of just accepting the first available block that
we find. Since most filesystems use the default 5% minfree, this
will have no effect on their operation. For those that want to push
to the limit, they will get their crappy block placements quickly.

Submitted by:  Dmitry Sivachenko
Fix Tested by: Dmitry Sivachenko
PR:            kern/181226
MFC after:     2 weeks
2013-08-28 17:38:05 +00:00
Ivan Voras
60dd37465a Take a very small step toward the Century of the Anchovy by increasing the
time dirhash entries stay in memory before being considered for eviction to
1 minute.
2013-08-28 10:06:20 +00:00
Kenneth D. Merry
7da1a731c6 Expand the use of stat(2) flags to allow storing some Windows/DOS
and CIFS file attributes as BSD stat(2) flags.

This work is intended to be compatible with ZFS, the Solaris CIFS
server's interaction with ZFS, somewhat compatible with MacOS X,
and of course compatible with Windows.

The Windows attributes that are implemented were chosen based on
the attributes that ZFS already supports.

The summary of the flags is as follows:

UF_SYSTEM:	Command line name: "system" or "usystem"
		ZFS name: XAT_SYSTEM, ZFS_SYSTEM
		Windows: FILE_ATTRIBUTE_SYSTEM

		This flag means that the file is used by the
		operating system.  FreeBSD does not enforce any
		special handling when this flag is set.

UF_SPARSE:	Command line name: "sparse" or "usparse"
		ZFS name: XAT_SPARSE, ZFS_SPARSE
		Windows: FILE_ATTRIBUTE_SPARSE_FILE

		This flag means that the file is sparse.  Although
		ZFS may modify this in some situations, there is
		not generally any special handling for this flag.

UF_OFFLINE:	Command line name: "offline" or "uoffline"
		ZFS name: XAT_OFFLINE, ZFS_OFFLINE
		Windows: FILE_ATTRIBUTE_OFFLINE

		This flag means that the file has been moved to
		offline storage.  FreeBSD does not have any special
		handling for this flag.

UF_REPARSE:	Command line name: "reparse" or "ureparse"
		ZFS name: XAT_REPARSE, ZFS_REPARSE
		Windows: FILE_ATTRIBUTE_REPARSE_POINT

		This flag means that the file is a Windows reparse
		point.  ZFS has special handling code for reparse
		points, but we don't currently have the other
		supporting infrastructure for them.

UF_HIDDEN:	Command line name: "hidden" or "uhidden"
		ZFS name: XAT_HIDDEN, ZFS_HIDDEN
		Windows: FILE_ATTRIBUTE_HIDDEN

		This flag means that the file may be excluded from
		a directory listing if the application honors it.
		FreeBSD has no special handling for this flag.

		The name and bit definition for UF_HIDDEN are
		identical to the definition in MacOS X.

UF_READONLY:	Command line name: "urdonly", "rdonly", "readonly"
		ZFS name: XAT_READONLY, ZFS_READONLY
		Windows: FILE_ATTRIBUTE_READONLY

		This flag means that the file may not written or
		appended, but its attributes may be changed.

		ZFS currently enforces this flag, but Illumos
		developers have discussed disabling enforcement.

		The behavior of this flag is different than MacOS X.
		MacOS X uses UF_IMMUTABLE to represent the DOS
		readonly permission, but that flag has a stronger
		meaning than the semantics of DOS readonly permissions.

UF_ARCHIVE:	Command line name: "uarch", "uarchive"
		ZFS_NAME: XAT_ARCHIVE, ZFS_ARCHIVE
		Windows name: FILE_ATTRIBUTE_ARCHIVE

		The UF_ARCHIVED flag means that the file has changed and
		needs to be archived.  The meaning is same as
		the Windows FILE_ATTRIBUTE_ARCHIVE attribute, and
		the ZFS XAT_ARCHIVE and ZFS_ARCHIVE attribute.

		msdosfs and ZFS have special handling for this flag.
		i.e. they will set it when the file changes.

sys/param.h:		Bump __FreeBSD_version to 1000047 for the
			addition of new stat(2) flags.

chflags.1:		Document the new command line flag names
			(e.g. "system", "hidden") available to the
			user.

ls.1:			Reference chflags(1) for a list of file flags
			and their meanings.

strtofflags.c:		Implement the mapping between the new
			command line flag names and new stat(2)
			flags.

chflags.2:		Document all of the new stat(2) flags, and
			explain the intended behavior in a little
			more detail.  Explain how they map to
			Windows file attributes.

			Different filesystems behave differently
			with respect to flags, so warn the
			application developer to take care when
			using them.

zfs_vnops.c:		Add support for getting and setting the
			UF_ARCHIVE, UF_READONLY, UF_SYSTEM, UF_HIDDEN,
			UF_REPARSE, UF_OFFLINE, and UF_SPARSE flags.

			All of these flags are implemented using
			attributes that ZFS already supports, so
			the on-disk format has not changed.

			ZFS currently doesn't allow setting the
			UF_REPARSE flag, and we don't really have
			the other infrastructure to support reparse
			points.

msdosfs_denode.c,
msdosfs_vnops.c:	Add support for getting and setting
			UF_HIDDEN, UF_SYSTEM and UF_READONLY
			in MSDOSFS.

			It supported SF_ARCHIVED, but this has been
			changed to be UF_ARCHIVE, which has the same
			semantics as the DOS archive attribute instead
			of inverse semantics like SF_ARCHIVED.

			After discussion with Bruce Evans, change
			several things in the msdosfs behavior:

			Use UF_READONLY to indicate whether a file
			is writeable instead of file permissions, but
			don't actually enforce it.

			Refuse to change attributes on the root
			directory, because it is special in FAT
			filesystems, but allow most other attribute
			changes on directories.

			Don't set the archive attribute on a directory
			when its modification time is updated.
			Windows and DOS don't set the archive attribute
			in that scenario, so we are now bug-for-bug
			compatible.

smbfs_node.c,
smbfs_vnops.c:		Add support for UF_HIDDEN, UF_SYSTEM,
			UF_READONLY and UF_ARCHIVE in SMBFS.

			This is similar to changes that Apple has
			made in their version of SMBFS (as of
			smb-583.8, posted on opensource.apple.com),
			but not quite the same.

			We map SMB_FA_READONLY to UF_READONLY,
			because UF_READONLY is intended to match
			the semantics of the DOS readonly flag.
			The MacOS X code maps both UF_IMMUTABLE
			and SF_IMMUTABLE to SMB_FA_READONLY, but
			the immutable flags have stronger meaning
			than the DOS readonly bit.

stat.h:			Add definitions for UF_SYSTEM, UF_SPARSE,
			UF_OFFLINE, UF_REPARSE, UF_ARCHIVE, UF_READONLY
			and UF_HIDDEN.

			The definition of UF_HIDDEN is the same as
			the MacOS X definition.

			Add commented-out definitions of
			UF_COMPRESSED and UF_TRACKED.  They are
			defined in MacOS X (as of 10.8.2), but we
			do not implement them (yet).

ufs_vnops.c:		Add support for getting and setting
			UF_ARCHIVE, UF_HIDDEN, UF_OFFLINE, UF_READONLY,
			UF_REPARSE, UF_SPARSE, and UF_SYSTEM in UFS.
			Alphabetize the flags that are supported.

			These new flags are only stored, UFS does
			not take any action if the flag is set.

Sponsored by:	Spectra Logic
Reviewed by:	bde (earlier version)
2013-08-21 23:04:48 +00:00
Kirk McKusick
824009a16a This bug fix is in a code path in rename taken when there is a
collision between a rename and an open system call for the same
target file. Here, rename releases its vnode references, waits for
the open to finish, and then restarts by reacquiring its needed
vnode locks. In this case, rename was unlocking but failing to
release its reference to one of its held vnodes. The effect was
that even after all the actual references to the vnode had gone,
the vnode still showed active references. For files that had been
removed, their space was not reclaimed until the filesystem was
forcibly unmounted.

This bug manifested itself in the Postgres server which would
leak/lose hundreds of files per day amounting to many gigabytes of
disk space. This bug required shutting down Postgres, forcibly
unmounting its filesystem, remounting its filesystem and restarting
Postgres every few days to recover the lost space.

Reported by: Dan Thomas and Palle Girgensohn
Bug-fix by:  kib
Tested by:   Dan Thomas and Palle Girgensohn
MFC after:   2 weeks
2013-08-06 16:50:05 +00:00
Kirk McKusick
8cf85cf292 With the addition of journalled soft updates, the "newblk" structures
persist much longer than previously. Historically we had at most 100
entries; now the count may reach a million. With the increased count
we spent far too much time looking them up in the grossly undersized
newblk hash table. Configure the newblk hash table to accurately reflect
the number of entries that it must index.

Reviewed by: kib
Tested by:   Peter Holm
MFC after:   2 weeks
2013-08-05 22:02:45 +00:00
Kirk McKusick
57591d8e78 To better understand performance problems with journalled soft updates,
we need to collect the highest level of allocation for each of the
different soft update dependency structures. This change collects these
statistics and makes them available using `sysctl debug.softdep.highuse'.

Reviewed by: kib
Tested by:   Peter Holm
MFC after:   2 weeks
2013-08-05 22:01:16 +00:00
Kirk McKusick
fdd9a6f478 Update to comments describing block allocation policy.
Submitted by: Bruce Evans
2013-07-14 18:44:33 +00:00
Konstantin Belousov
2aea094f65 Only copy as much bytes as there in superblock, instead of the full
block copy, when copying the superblock into the snapshot.  UFS1 does
not align superblock on the block boundary, and bcopy runs off the end
of the buffer.

Reported by:	Andre Albsmeier <Andre.Albsmeier@siemens.com>
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-07-12 18:52:33 +00:00
Pedro F. Giffuni
53aa3d1a99 Change i_gen in UFS to an unsigned type.
Missing type change from r252435.

This fixes a "Stale NFS file handle" error.

Reported by:	Claude Bisson
Tested by:	Claude Bisson
Pointed hat:	pfg
2013-07-10 18:19:48 +00:00
Konstantin Belousov
cc3d8c35f5 There are several code sequences like
vfs_busy(mp);
      vfs_write_suspend(mp);
which are problematic if other thread starts unmount between two
calls.  The unmount starts a write, while vfs_write_suspend() drain
writers.  On the other hand, unmount drains busy references, causing
the deadlock.

Add a flag argument to vfs_write_suspend and require the callers of it
to specify VS_SKIP_UNMOUNT flag, when the call is performed not in the
mount path, i.e. the covered vnode is not locked.  The suspension is
not attempted if VS_SKIP_UNMOUNT is specified and unmount is in
progress.

Reported and tested by:	Andreas Longwitz <longwitz@incore.de>
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
2013-07-09 20:49:32 +00:00
Kirk McKusick
3f8db5b147 Make better use of metadata area by avoiding using it for data blocks
that no should no longer immediately follow their indirect blocks.

MFC after:	2 weeks
2013-07-02 21:07:08 +00:00
Pedro F. Giffuni
5f3a9c4000 Style fix: spaces.
Cleanup the incomplete revert.

Reported by:	bde
MFC after:	4 weeks
2013-07-02 18:45:37 +00:00
Pedro F. Giffuni
9a0aea4625 Change i_gen in UFS to an unsigned type.
Revert the simplification of the i_gen calculation.
It is still a good idea to avoid zero values and for the case
of old filesystems there is probably no advantage in using
the complete 32 bits anyways.

Discussed with:	bde
MFC after:	4 weeks
2013-07-01 21:43:40 +00:00
Pedro F. Giffuni
bcb2f550be Change i_gen in UFS to an unsigned type.
Further simplify the i_gen calculation for older disks.
Having a zero here is not really a problem and this is more
similar to what is done in newfs_random().

Reported by:	Xin Li
MFC after:	4 weeks
2013-07-01 14:49:23 +00:00
Gleb Kurtsou
e5a5b5b64e Don't assume that UFS on-disk format of a directory is the same as
defined by <sys/dirent.h>

Always start parsing at DIRBLKSIZ aligned offset, skip first entries if
uio_offset is not DIRBLKSIZ aligned. Return EINVAL if buffer is too
small for single entry.

Preallocate buffer for cookies. Cookies will be replaced with d_off
field in struct dirent at later point.

Skip entries with zero inode number.

Stop mangling dirent in ufs_extattr_iterate_directory().

Reviewed by:	kib
Sponsored by:	Google Summer Of Code 2011
2013-07-01 04:06:40 +00:00
Pedro F. Giffuni
60f7df4c1b Change i_gen in UFS to an unsigned type.
Missed format specifier.

Reported by:	mdf
MFC after:	4 weeks
2013-07-01 03:31:19 +00:00
Pedro F. Giffuni
eee4072f13 Change i_gen in UFS to an unsigned type.
In UFS, i_gen is a random generated value and there is not way for
it to be negative. Actually, the value of i_gen is just used to
match bit patterns and it is of not consequence if the values are
signed or not.

Following other filesystems, set it to unsigned and use it as such,

Discussed by:	mckusick
Reviewed by:	mckusick (previous version)
MFC after:	4 weeks
2013-07-01 03:00:15 +00:00
Jeff Roberson
22a722605d - Convert the bufobj lock to rwlock.
- Use a shared bufobj lock in getblk() and inmem().
 - Convert softdep's lk to rwlock to match the bufobj lock.
 - Move INFREECNT to b_flags and protect it with the buf lock.
 - Remove unnecessary locking around bremfree() and BKGRDINPROG.

Sponsored by:	EMC / Isilon Storage Division
Discussed with:	mckusick, kib, mdf
2013-05-31 00:43:41 +00:00
Kirk McKusick
97371fa56d Properly spell sentinel (missed in 250891)
No functional changes.

Spotted by:  Navdeep Parhar and Alexey Dokuchaev
MFC after:   2 weeks
2013-05-22 05:07:55 +00:00
Kirk McKusick
b1bd9340fa Add missing buffer releases (brelse) after bread calls that return
an error. One could argue that returning a buffer even when it is
not valid is incorrect, but bread has always returned a buffer
valid or not.

Reviewed by: kib
MFC after:   2 weeks
2013-05-22 00:57:22 +00:00
Kirk McKusick
21844a3d5d Add missing 28th element to softdep types name array.
Found by:    Coverity Scan, CID 1007621
Reviewed by: kib
MFC after:   2 weeks
2013-05-22 00:48:24 +00:00
Kirk McKusick
d80dbbdb4a Null a pointer after it is freed so that when it is returned
the return value is NULL. Based on the returned flags, the
return value should never be inspected in the case where NULL
is returned, but it is good coding practice not to return a
pointer to freed memory.

Found by:    Coverity Scan, CID 1006096
Reviewed by: kib
MFC after:   2 weeks
2013-05-22 00:40:26 +00:00
Kirk McKusick
64e2b0887c Remove a bogus check for a NULL buffer pointer.
Add a KASSERT that it is not NULL.

Found by:    Coverity Scan, CID 1009114
Reviewed by: kib
MFC after:   2 weeks
2013-05-22 00:30:34 +00:00
Kirk McKusick
13e369a747 Properly spell sentinel (not sintenel or sentinal).
No functional changes.

Spotted by:  kib
MFC after:   2 weeks
2013-05-22 00:17:50 +00:00
Eitan Adler
a164074fc4 Fix several typos
PR:		kern/176054
Submitted by:	Christoph Mallon <christoph.mallon@gmx.de>
MFC after:	3 days
2013-05-12 16:43:26 +00:00
Gabor Kovesdan
ab3f6b347e - Correct mispellings of the word occurrence
Submitted by:	Christoph Mallon <christoph.mallon@gmx.de> (via private mail)
2013-04-17 11:40:10 +00:00
Jeff Roberson
26089666b6 Prepare to replace the buf splay with a trie:
- Don't insert BKGRDMARKER bufs into the splay or dirty/clean buf lists.
   No consumers need to find them there and it complicates the tree.
   These flags are all FFS specific and could be moved out of the buf
   cache.
 - Use pbgetvp() and pbrelvp() to associate the background and journal
   bufs with the vp.  Not only is this much cheaper it makes more sense
   for these transient bufs.
 - Fix the assertions in pbget* and pbrel*.  It's not safe to check list
   pointers which were never initialized.  Use the BX flags instead.  We
   also check B_PAGING in reassignbuf() so this should cover all cases.

Discussed with:	kib, mckusick, attilio
Sponsored by:	EMC / Isilon Storage Division
2013-04-06 22:21:23 +00:00
Kirk McKusick
cd861931e6 The code in clear_remove() and clear_inodedeps() skips one entry
in the pagedep and inodedep hash tables. An entry in the table is
skipped because 'pagedep_hash' and 'inodedep_hash' hold the size
of the hash tables - 1.

The chance that this would have any operational failure is extremely
unlikely. These funtions only need to find a single entry and are
only called when there are too many entries. The chance that they
would fail because all the entries are on the single skipped hash
chain are remote.

Submitted by: Pedro Martelletto
Reviewed by:  kib
MFC after:    2 weeks
2013-04-03 19:26:32 +00:00
Kirk McKusick
baa12a84a7 The purpose of this change to the FFS layout policy is to reduce the
running time for a full fsck. It also reduces the random access time
for large files and speeds the traversal time for directory tree walks.

The key idea is to reserve a small area in each cylinder group
immediately following the inode blocks for the use of metadata,
specifically indirect blocks and directory contents. The new policy
is to preferentially place metadata in the metadata area and
everything else in the blocks that follow the metadata area.

The size of this area can be set when creating a filesystem using
newfs(8) or changed in an existing filesystem using tunefs(8).
Both utilities use the `-k held-for-metadata-blocks' option to
specify the amount of space to be held for metadata blocks in each
cylinder group. By default, newfs(8) sets this area to half of
minfree (typically 4% of the data area).

This work was inspired by a paper presented at Usenix's FAST '13:
www.usenix.org/conference/fast13/ffsck-fast-file-system-checker

Details of this implementation appears in the April 2013 of ;login:
www.usenix.org/publications/login/april-2013-volume-38-number-2.
A copy of the April 2013 ;login: paper can also be downloaded
from: www.mckusick.com/publications/faster_fsck.pdf.

Reviewed by: kib
Tested by:   Peter Holm
MFC after:   4 weeks
2013-03-22 21:45:28 +00:00
Kirk McKusick
3289d5877a When renaming a directory from one parent directory to another,
we need to call ufs_checkpath() to walk from our new location to
the root of the filesystem to ensure that we do not encounter
ourselves along the way. Until now, we accomplished this by reading
the ".." entries of each directory in our path until we reached
the root (or encountered an error). This change tries to avoid the
I/O of reading the ".." entries by first looking them up in the
name cache and only doing the I/O when the name cache lookup fails.

Reviewed by: kib
Tested by:   Peter Holm
MFC after:   4 weeks
2013-03-20 17:57:00 +00:00
Konstantin Belousov
59a01b70af UFS support of the unmapped i/o for the user data buffers.
Sponsored by:	The FreeBSD Foundation
Tested by:	pho, scottl, jhb, bf
2013-03-19 15:08:15 +00:00
Konstantin Belousov
e81ff91e62 Do not remap usermode pages into KVA for physio.
Sponsored by:	The FreeBSD Foundation
Tested by:	pho
2013-03-19 14:43:57 +00:00
Konstantin Belousov
0d3bb4afa8 Remove negative name cache entry pointing to the target name, which
could be instantiated while tdvp was unlocked.

Reported by:	Rick Miller <vmiller at hostileadmin com>
Tested by:	pho
MFC after:	1 week
2013-03-17 15:11:37 +00:00
Konstantin Belousov
70e198dd07 Some style fixes.
Sponsored by:	The FreeBSD Foundation
2013-03-14 20:31:39 +00:00
Konstantin Belousov
c535690b33 Add currently unused flag argument to the cluster_read(),
cluster_write() and cluster_wbuild() functions.  The flags to be
allowed are a subset of the GB_* flags for getblk().

Sponsored by:	The FreeBSD Foundation
Tested by:	pho
2013-03-14 20:28:26 +00:00
Attilio Rao
89f6b8632c Switch the vm_object mutex to be a rwlock. This will enable in the
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.

The change is mostly mechanical but few notes are reported:
* The KPI changes as follow:
  - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
  - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
  - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
  - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
    (in order to avoid visibility of implementation details)
  - The read-mode operations are added:
    VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
    VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
  sys/mutex.h in consumers directly to cater its inlining functions
  using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
  consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
  the compat layer because the name clash between FreeBSD and solaris
  versions must be avoided.
  At this purpose zfs redefines the vm_object locking functions
  directly, isolating the FreeBSD components in specific compat stubs.

The KPI results heavilly broken by this commit.  Thirdy part ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).

Sponsored by:	EMC / Isilon storage division
Reviewed by:	jeff
Reviewed by:	pjd (ZFS specific review)
Discussed with:	alc
Tested by:	pho
2013-03-09 02:32:23 +00:00
Konstantin Belousov
ba05dec5a4 The softdep freeblks workitem might hold a reference on the dquot.
Current dqflush() panics when a dquot with with non-zero refcount is
encountered.  The situation is possible, because quotas are turned off
before softdep workitem queue if flushed, due to the quota file writes
might create softdep workitems.

Make the encountering an active dquot in dqflush() not fatal, return
the error from quotaoff() instead.  Ignore the quotaoff() failures
when ffs_flushfiles() is called in the course of softdep_flushfiles()
loop, until the last iteration.  At the last loop, the quotas must be
closed, and because SU workitems should be already flushed, the
references to dquot are gone.

Sponsored by:	The FreeBSD Foundation
Reported and tested by:	pho
Reviewed by:	mckusick
MFC after:	2 weeks
2013-02-27 07:32:39 +00:00
Konstantin Belousov
94f4ac214c An inode block must not be blockingly read while cg block is owned.
The order is inode buffer lock -> snaplk -> cg buffer lock, reversing
the order causes deadlocks.

Inode block must not be written while cg block buffer is owned. The
FFS copy on write needs to allocate a block to copy the content of the
inode block, and the cylinder group selected for the allocation might
be the same as the owned cg block.  The reserved block detection code
in the ffs_copyonwrite() and ffs_bp_snapblk() is unable to detect the
situation, because the locked cg buffer is not exposed to it.

In order to maintain the dependency between initialized inode block
and the cg_initediblk pointer, look up the inode buffer in
non-blocking mode. If succeeded, brelse cg block, initialize the inode
block and write it.  After the write is finished, reread cg block and
update the cg_initediblk.

If inode block is already locked by another thread, let the another
thread initialize it.  If another thread raced with us after we
started writing inode block, the situation is detected by an update of
cg_initediblk.  Note that double-initialization of the inode block is
harmless, the block cannot be used until cg_initediblk is incremented.

Sponsored by:	The FreeBSD Foundation
In collaboration with:	pho
Reviewed by:	mckusick
MFC after:	1 month
X-MFC-note:	after r246877
2013-02-27 07:31:23 +00:00
Kirk McKusick
7839b23f00 The UFS2 filesystem allocates new blocks of inodes as they are needed.
When a cylinder group runs short of inodes, a new block for inodes is
allocated, zero'ed, and written to the disk. The zero'ed inodes must
be on the disk before the cylinder group can be updated to claim them.
If the cylinder group claiming the new inodes were written before the
zero'ed block of inodes, the system could crash with the filesystem in
an unrecoverable state.

Rather than adding a soft updates dependency to ensure that the new
inode block is written before it is claimed by the cylinder group
map, we just do a barrier write of the zero'ed inode block to ensure
that it will get written before the updated cylinder group map can
be written. This change should only slow down bulk loading of newly
created filesystems since that is the primary time that new inode
blocks need to be created.

Reported by: Robert Watson
Reviewed by: kib
Tested by:   Peter Holm
2013-02-16 15:11:40 +00:00
Konstantin Belousov
9604a7f1b8 Fix several unsafe pointer dereferences in the buffered_write()
function, implementing the sysctl vfs.ffs.set_bufoutput (not used in
the tree yet).

- The current directory vnode dereference is unsafe since fd_cdir
  could be changed and unreferenced, lock the filedesc around and vref
  the fd_cdir.
- The VTOI() conversion of the fd_cdir is unsafe without first
  checking that the vnode is indeed from an FFS mount, otherwise
  the code dereferences a random memory.
- The cdir could be reclaimed from under us, lock it around the
  checks.
- The type of the fp vnode might be not a disk, or it might have
  changed while the thread was in flight, check the type.

Reviewed and tested by:	mckusick
MFC after:	2 weeks
2013-02-10 10:17:33 +00:00