Commit Graph

2235 Commits

Author SHA1 Message Date
kib
9cca77fa4f Do not free(9) uninitialized pointer.
Reported and tested by:	allanjude
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
2018-02-19 19:08:25 +00:00
markj
65a57d392f Fix a memory leak introduced in r328426.
ffs_sbget() may return a superblock buffer even if it fails, so the
caller must be prepared to free it in this case. Moreover, when tasting
alternate superblock locations in a loop, ffs_sbget()'s readfunc
callback must free the previously allocated buffer.
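
The ownership rule described above can be sketched in userland C. This is a minimal illustration, not the kernel code: `try_read_sb()`, `release_sb()`, `find_sb()` and the `live_buffers` counter are hypothetical names, and the reader is assumed to hand back a buffer whether or not it succeeds.

```c
#include <stdlib.h>

/* Illustration only: counts buffers that are allocated but not yet freed. */
static int live_buffers;

/*
 * Hypothetical stand-in for a superblock reader. Like the behaviour the
 * commit message describes, it may return a buffer in *bufp even when it
 * reports failure (nonzero return).
 */
static int
try_read_sb(long loc, long good_loc, char **bufp)
{
    *bufp = malloc(64);
    if (*bufp == NULL)
        return (-1);
    live_buffers++;
    return (loc == good_loc ? 0 : 1);
}

/* Free the buffer if one was returned, and forget the pointer. */
static void
release_sb(char **bufp)
{
    if (*bufp != NULL) {
        free(*bufp);
        *bufp = NULL;
        live_buffers--;
    }
}

/* Taste each candidate location; free the buffer of every failed attempt. */
static int
find_sb(const long *locs, int nlocs, long good_loc, char **bufp)
{
    int i;

    *bufp = NULL;
    for (i = 0; i < nlocs; i++) {
        if (try_read_sb(locs[i], good_loc, bufp) == 0)
            return (0);         /* caller now owns *bufp */
        release_sb(bufp);       /* a failed read may still have allocated */
    }
    return (1);
}
```

Without the `release_sb()` call inside the loop, every failed probe would leak one buffer, which is the shape of the leak this commit fixes.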

Reported and tested by:	pho
Reviewed by:		kib (previous version)
Differential Revision:	https://reviews.freebsd.org/D14390
2018-02-16 15:41:03 +00:00
mckusick
95a01a8bcc The goal of this change is to prevent accidental foot shooting by
folks running filesystems created on check-hash enabled kernels
(which I will call "new") on non-check-hash enabled kernels (which
I will call "old"). The idea here is to detect when a filesystem is
run on an old kernel and flag the filesystem so that when it gets
moved back to a new kernel, it will not start getting a slew of
check-hash errors.

Back when the UFS version 2 filesystem was created, it added a file
flag FS_INDEXDIRS that was to be set on any filesystem that kept
some sort of on-disk indexing for directories. The idea was precisely
to solve the issue we have today. Specifically that a newer kernel
that supported indexing would be able to tell that the filesystem
had been run on an older non-indexing kernel and that the indexes
should not be used until they had been rebuilt. Since we have never
implemented on-disk directory indices, the FS_INDEXDIRS flag is
cleared every time any UFS version 2 filesystem ever created is
mounted for writing.

This commit repurposes the FS_INDEXDIRS flag as the FS_METACKHASH
flag. Thus, the FS_METACKHASH is definitively known to have always
been cleared. The FS_INDEXDIRS flag has been moved to a new block
of flags that will always be cleared starting with this commit
(until they get used to implement some future feature which needs
to detect that the filesystem was mounted on a kernel that predates
the new feature).

If a filesystem with check-hashes enabled is mounted on an old
kernel the FS_METACKHASH flag is cleared. When that filesystem is
mounted on a new kernel it will see that the FS_METACKHASH has been
cleared and clears all of the fs_metackhash flags. To get them
re-enabled the user must run fsck (in interactive mode without the
-y flag) which will ask for each supported check hash whether it
should be rebuilt and enabled. When fsck is run in its default preen
mode, it will just ignore the check hashes so they will remain
disabled.
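
The flag handshake described above can be modelled in a few lines of C. This is a sketch under stated assumptions: the bit values, `struct ex_sb`, and all three functions are hypothetical, and having fsck restore the marker flag when it re-enables a hash is an assumption rather than something this message states.

```c
#include <stdint.h>

/* Illustrative bit values, not the real fs.h definitions. */
#define EX_FS_METACKHASH 0x0200u  /* marker: last written by a new kernel */
#define EX_CK_CYLGRP     0x0001u  /* one kind of metadata check hash */

struct ex_sb {
    uint32_t flags;       /* stands in for fs_flags */
    uint32_t metackhash;  /* stands in for fs_metackhash */
};

/*
 * An old kernel clears the bit on a read-write mount, because it still
 * treats it as the never-implemented FS_INDEXDIRS flag.
 */
static void
old_kernel_mount(struct ex_sb *sb)
{
    sb->flags &= ~EX_FS_METACKHASH;
}

/* A new kernel trusts the check hashes only if the marker survived. */
static void
new_kernel_mount(struct ex_sb *sb)
{
    if ((sb->flags & EX_FS_METACKHASH) == 0)
        sb->metackhash = 0;
}

/* Assumed fsck step: rebuild one hash type and restore the marker. */
static void
fsck_enable(struct ex_sb *sb, uint32_t which)
{
    sb->metackhash |= which;
    sb->flags |= EX_FS_METACKHASH;
}
```

In this model a round trip through `old_kernel_mount()` followed by `new_kernel_mount()` leaves `metackhash` at zero until `fsck_enable()` is called, matching the behaviour the paragraph describes.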

The kernel has always disabled any check hash functions that it
does not support, so as more types of check hashes are added, we
will get a non-surprising result. Specifically if filesystems get
moved to kernels supporting fewer of the check hashes, those that
are not supported will be disabled. If the filesystem is moved back
to a kernel with more of the check-hashes available and fsck is run
interactively to rebuild them, then their checking will resume.
Otherwise just the smaller subset will be checked.

A side effect of this commit is that filesystems running with
cylinder-group check hashes will stop having them checked until
fsck is run to re-enable them (since none of them currently have
the FS_METACKHASH flag set). So, if you want check hashes enabled
on your filesystems after booting a kernel with these changes, you
need to run fsck to enable them. Any newly created filesystems will
have check hashes enabled. If in doubt as to whether you have check
hashes enabled, run dumpfs and look at the list of enabled flags
at the end of the superblock details.
2018-02-08 23:06:58 +00:00
pfg
999ae367a8 {ext2|ufs}_readdir: Avoid setting negative ncookies.
ncookies cannot be negative or the allocator will fail. This should only
happen if a caller is very broken but we can still try to survive the
event.
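
A minimal sketch of the clamping idea, assuming the cookie count is derived from a signed residual byte count. `cookie_count()` and its `min_entry` parameter are hypothetical and do not reproduce the actual readdir code.

```c
#include <stddef.h>
#include <sys/types.h>

/*
 * Hypothetical helper: derive an upper bound on the number of directory
 * cookies from the residual byte count. min_entry is an assumed minimum
 * size of one directory entry and must be nonzero.
 */
static size_t
cookie_count(ssize_t resid, size_t min_entry)
{
    if (resid < 0)
        return (0);     /* very broken caller: survive instead of failing */
    return ((size_t)resid / min_entry + 1);
}
```

Returning an unsigned count that is never derived from a negative value means the allocator is never asked for a nonsensical size.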

We should probably also verify for uio_resid > MAXPHYS but in that case
it is not clear that just clipping the ncookies value is an adequate
response.

MFC after:	2 weeks
2018-02-06 22:38:19 +00:00
mckusick
a7ac77d8a7 Occasional cylinder-group check-hash errors were being reported on
systems running with a heavy filesystem load. Tracking down this
bug was elusive because there were actually two problems. Sometimes
the in-memory check hash was wrong and sometimes the check hash
computed when doing the read was wrong. The occurrence of either
error caused a check-hash mismatch to be reported.

The first error was that the check hash in the in-memory cylinder
group was incorrect. This error was caused by the following
sequence of events:

- We read a cylinder-group buffer and the check hash is valid.
- We update its cg_time and cg_old_time which makes the in-memory
  check-hash value invalid but we do not mark the cylinder group dirty.
- We do not make any other changes to the cylinder group, so we
  never mark it dirty, thus do not write it out, and hence never
  update the incorrect check hash for the in-memory buffer.
- Later, the buffer gets freed, but the page with the old incorrect
  check hash is still in the VM cache.
- Later, we read the cylinder group again, and the first page with
  the old check hash is still in the VM cache, but some other pages
  are not, so we have to do a read.
- The read does not actually get the first page from disk, but rather
  from the VM cache, resulting in the old check hash in the buffer.
- The value computed after doing the read does not match causing the
  error to be printed.

The fix for this problem is to only set cg_time and cg_old_time as
the cylinder group is being written to disk. This keeps the in-memory
check-hash valid unless the cylinder group has had other modifications
which will require it to be written with a new check hash calculated.
It also requires that the check hash be recalculated in the in-memory
cylinder group when it is marked clean after doing a background write.
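
The invariant behind the first fix can be shown with a toy model. Everything here is hypothetical (`struct ex_cg`, the trivial `ex_hash()`, and the read/write helpers); the real code hashes the cylinder group with crc32c.

```c
#include <stdint.h>

struct ex_cg {
    uint32_t time;     /* stands in for cg_time */
    uint32_t payload;  /* stands in for the allocation maps */
    uint32_t ckhash;   /* stored check hash */
    int      dirty;
};

/* Toy hash over every field that the stored check hash covers. */
static uint32_t
ex_hash(const struct ex_cg *cg)
{
    return (cg->time * 31u + cg->payload);
}

static int
ex_hash_ok(const struct ex_cg *cg)
{
    return (cg->ckhash == ex_hash(cg));
}

/* Old behaviour: stamp the time on read without marking the group dirty. */
static void
ex_read_old(struct ex_cg *cg, uint32_t now)
{
    cg->time = now;     /* the in-memory check hash is now stale */
}

/* Fixed behaviour: a read leaves the hashed contents untouched. */
static void
ex_read_new(struct ex_cg *cg, uint32_t now)
{
    (void)cg;
    (void)now;
}

/* Fixed behaviour: stamp the time only on the write path, then rehash. */
static void
ex_write(struct ex_cg *cg, uint32_t now)
{
    cg->time = now;
    cg->ckhash = ex_hash(cg);
    cg->dirty = 0;
}
```

After `ex_read_old()` the stored hash no longer matches the contents even though nothing will ever write the group out; with `ex_read_new()` and `ex_write()` the stored hash matches whenever the group is clean.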

The second problem was that the check hash computed at the end of the
read was incorrect because the calculation of the check hash on
completion of the read was being done too soon.

- When a read completes we had the following sequence:

  - bufdone()
  -- b_ckhashcalc (calculates check hash)
  -- bufdone_finish()
  --- vfs_vmio_iodone() (replaces bogus pages with the cached ones)

- When we are reading a buffer where one or more pages are already
  in memory (but not all pages, or we wouldn't be doing the read),
  the I/O is done with bogus_page mapped in for the pages that exist
  in the VM cache. This mapping is done to avoid corrupting the
  cached pages if there is any I/O overrun. The vfs_vmio_iodone()
  function is responsible for replacing the bogus_page(s) with the
  cached ones. But we were calculating the check hash before the
  bogus_page(s) were replaced. Hence, when we were calculating the
  check hash, we were partly reading from bogus_page, which means
  we calculated a bad check hash (e.g., because multiple pages have
  been mapped to bogus_page, so its contents are indeterminate).

The second fix is to move the check-hash calculation from bufdone()
to bufdone_finish() after the call to vfs_vmio_iodone() so that it
computes the check hash over the correct set of pages.
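
The ordering problem can be illustrated with a small model in which a buffer maps either real pages or a shared placeholder. The types and functions below are hypothetical stand-ins; `ex_replace_bogus()` plays the role the message assigns to vfs_vmio_iodone().

```c
#include <stddef.h>
#include <stdint.h>

#define EX_NPAGES 2

struct ex_buf {
    const uint8_t *pages[EX_NPAGES];   /* what the buffer maps right now */
    const uint8_t *cached[EX_NPAGES];  /* valid cached page, or NULL */
};

/* Toy hash over the first byte of each mapped page. */
static uint32_t
ex_ckhash(const struct ex_buf *bp)
{
    uint32_t h = 0;
    size_t i;

    for (i = 0; i < EX_NPAGES; i++)
        h = h * 131u + bp->pages[i][0];
    return (h);
}

/* Swap placeholder mappings for the cached pages they were shielding. */
static void
ex_replace_bogus(struct ex_buf *bp)
{
    size_t i;

    for (i = 0; i < EX_NPAGES; i++)
        if (bp->cached[i] != NULL)
            bp->pages[i] = bp->cached[i];
}

/* Old order: hash first, so placeholder contents leak into the hash. */
static uint32_t
ex_done_old(struct ex_buf *bp)
{
    uint32_t h = ex_ckhash(bp);

    ex_replace_bogus(bp);
    return (h);
}

/* New order: fix up the mappings first, then hash the real data. */
static uint32_t
ex_done_new(struct ex_buf *bp)
{
    ex_replace_bogus(bp);
    return (ex_ckhash(bp));
}
```

Only `ex_done_new()` produces a hash of the data the filesystem will actually see, which is the point of moving the calculation after the page replacement.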

With these two changes, the occasional cylinder-group check-hash
errors are gone.

Submitted by: David Pfitzner <dpfitzner@netflix.com>
Reviewed by: kib
Tested by: David Pfitzner
2018-02-06 00:19:46 +00:00
mckusick
174ee5526c When reading a cylinder group, break out reporting of check hash errors
from other types of errors so that the error is correctly reported.
2018-01-31 23:13:37 +00:00
pfg
434dd3e7fe Revert r328479:
{ext2|ufs}_readdir: Set limit on valid ncookies values.

We aren't allowed to set resid like this.

Pointed out by:	kib, imp
2018-01-27 16:34:00 +00:00
pfg
55c3e327b4 {ext2|ufs}_readdir: Set limit on valid ncookies values.
Sanitize the values that will be assigned to ncookies so that we ensure
they are sane and we can handle them.

Leave ncookies signed as it was before r328346. The valid range is such
that unsigned values are not required and we are not able to avoid at
least one cast anyways.

Hinted by:	bde
2018-01-27 15:33:52 +00:00
mckusick
f5e73a2c14 Refactoring of reading and writing of the UFS/FFS superblock.
Specifically reading is done in ffs_sbget() and writing is done
in ffs_sbput(). These functions are exported to libufs via the
sbget() and sbput() functions which are then used in the various
filesystem utilities. This work is in preparation for adding
superblock check hashes.

No functional change intended.

Reviewed by: kib
2018-01-26 00:58:32 +00:00
pfg
944d693f04 ext2fs|ufs:Unsign some values related to allocation.
When allocating memory through malloc(9), we always expect the amount of
memory requested to be unsigned as a negative value would either stand for
an error or an overflow.
Unsign some values, found when considering the use of mallocarray(9), to
avoid unnecessary casting. Also consider that indexes should be of
at least the same size/type as the upper limit they pretend to index.

MFC after:	2 weeks
2018-01-24 17:58:48 +00:00
pfg
ca690ecdf9 Revert r327781, r328093, r328056:
ufs|ext2fs: Revert uses of mallocarray(9).

These aren't really useful: drop them.
Variable unsigning will be brought again later.
2018-01-24 16:44:57 +00:00
pfg
036ebddf97 ufs: use mallocarray(9).
Basic use of mallocarray to prevent overflows: static analyzers are also
likely to perform additional checks.

Since mallocarray expects unsigned parameters, unsign some
related variables to minimize sign conversions.
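
The overflow check that an array allocator performs can be sketched in userland C. `ex_allocarray()` is a hypothetical name, and this sketch simply returns NULL on overflow; how the kernel's mallocarray(9) reacts to an overflow is not stated in this message.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Userland sketch of an overflow-checked array allocation. Taking both
 * arguments as size_t is what makes unsigned counts at the call sites
 * convenient: no sign conversion is needed before the check.
 */
static void *
ex_allocarray(size_t nmemb, size_t size)
{
    if (size != 0 && nmemb > SIZE_MAX / size)
        return (NULL);          /* nmemb * size would wrap around */
    return (malloc(nmemb * size));
}
```

A call such as `ex_allocarray(SIZE_MAX, 2)` is rejected up front, whereas a plain `malloc(nmemb * size)` would silently allocate a buffer far smaller than the caller expects.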

Reviewed by:	mckusick
2018-01-17 18:18:33 +00:00
kib
75be4853cd Softlink inodes can own buffers with dependencies.
At least, softlinks longer than 120 bytes have data fragments.

Submitted by:	mckusick
MFC after:	5 days
2018-01-11 13:37:45 +00:00
pfg
d269f13cdd Use mallocarray(9) in dirhash.
Basic use of mallocarray to prevent overflows. Here allocation is done
with M_NOWAIT so the code is prepared for the possibility of returning
NULL values. Since mallocarray expects unsigned parameters, unsign some
related variables to minimize sign conversions.

Reviewed by:	mckusick
2018-01-10 19:45:38 +00:00
kib
681df22eda Generalize the fix from r322757 and apply it to several more places.
The code accesses bp->b_dep without owning the ufs mount softdep lock,
which makes it possible for the dereferenced workitem to be freed in
parallel.  In particular, the deallocate_dependencies(),
softdep_disk_io_initiation() and softdep_disk_write_complete() are
affected.

Move the code to safely calculate ump from the buffer with
dependencies into the helper softdep_bp_to_mp() and use it for all
found cases.

Tested by:	pho (as part of the bigger patch)
Reviewed by:	mckusick (as part of the bigger patch)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-01-09 10:51:44 +00:00
kib
3d0a89393a When handling write completion, take SU lock around calls to
handle_written_XXX() in case of processing the buffer with an error.

Tested by:	pho (as part of the bigger patch)
Reviewed by:	mckusick (as part of the bigger patch)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-01-09 10:44:17 +00:00
kib
e27bd7cfd5 Postpone the disassociation of the background write buffer with devvp
so that buf_complete() sees fully constructed buffer.

This is a NOP right now, but will be needed by the forthcoming SU change.

Reported and tested by:	pho
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-01-09 10:33:11 +00:00
markj
6d12293ce2 Add missing newlines to a couple of error messages.
Keep error messages on a single line so that they're easier to grep for.

Reported by:	pho
MFC after:	1 week
2018-01-03 18:19:47 +00:00
pfg
3299e14e14 SPDX: Complete License ID tags for UFS.
2017-12-27 19:13:50 +00:00
eadler
421a929b1e kernel: Fix several typos and minor errors
- duplicate words
- typos
- references to old versions of FreeBSD

Reviewed by:	imp, benno
2017-12-27 03:23:21 +00:00
kan
c8da6fae2c Do a pass removing some write-only variables from the kernel.
This reduces noise when kernel is compiled by newer GCC versions,
such as one used by external toolchain ports.

Reviewed by: kib, andrew(sys/arm and sys/arm64), emaste(partial), erj(partial)
Reviewed by: jhb (sys/dev/pci/* sys/kern/vfs_aio.c and sys/kern/kern_synch.c)
Differential Revision: https://reviews.freebsd.org/D10385
2017-12-25 04:48:39 +00:00
kan
1c662cab33 Remove dead initialization of the inode pointer.
The pointer gets initialized again later in the code. This also
improves code style(9).
2017-12-23 16:24:02 +00:00
jhb
e09154bf75 Rework pathconf handling for FIFOs.
On the one hand, FIFOs should respect other variables not supported by
the fifofs vnode operation (such as _PC_NAME_MAX, _PC_LINK_MAX, etc.).
These values are fs-specific and must come from a fs-specific method.
On the other hand, filesystems that support FIFOs are required to
support _PC_PIPE_BUF on directory vnodes that can contain FIFOs.
Given this latter requirement, once the fs-specific VOP_PATHCONF
method supports _PC_PIPE_BUF for directories, it is also suitable for
FIFOs permitting a single VOP_PATHCONF method to be used for both
FIFOs and non-FIFOs.

To that end, retire all of the FIFO-specific pathconf methods from
filesystems and change FIFO-specific vnode operation switches to use
the existing fs-specific VOP_PATHCONF method.  For fifofs, set its
VOP_PATHCONF to VOP_PANIC since it should no longer be used.

While here, move _PC_PIPE_BUF handling out of vop_stdpathconf() so that
only filesystems supporting FIFOs will report a value.  In addition,
only report a valid _PC_PIPE_BUF for directories and FIFOs.

Discussed with:	bde
Reviewed by:	kib (part of a larger patch)
MFC after:	1 month
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D12572
2017-12-19 22:39:05 +00:00
jhb
3efec8ad25 Move NAME_MAX, LINK_MAX, and CHOWN_RESTRICTED out of vop_stdpathconf().
Having all filesystems fall through to default values isn't always correct
and these values can vary for different filesystem implementations.  Most
of these changes just use the existing default values with a few exceptions:
- Don't report CHOWN_RESTRICTED for ZFS since it doesn't do the exact
  permissions check this claims for chown().
- Use NANDFS_NAME_LEN for NAME_MAX for nandfs.
- Don't report a LINK_MAX of 0 on smbfs.  Now fail with EINVAL to
  indicate hard links aren't supported.

Requested by:	bde (though perhaps not this exact implementation)
Reviewed by:	kib (earlier version)
MFC after:	1 month
Sponsored by:	Chelsio Communications
2017-12-19 19:51:36 +00:00
markj
b0e68228fd Provide a sysctl to force synchronous initialization of inode blocks.
FFS performs asynchronous inode initialization, using a barrier write
to ensure that the inode block is written before the corresponding
cylinder group header update. Some GEOMs do not appear to handle
BIO_ORDERED correctly, meaning that the barrier write may not work as
intended. The sysctl allows one to work around this problem at the
cost of expensive file creation on new filesystems. The default
behaviour is unchanged.

Reviewed by:	kib, mckusick
MFC after:	1 weeks
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D13428
2017-12-09 15:44:30 +00:00
kib
153d3ae3ae Fix livelock in ufsdirhash_create().
When more than one thread enters ufsdirhash_create() for the same
directory and the inode dirhash is instantiated, but the dirhash's hash
is not, all of them lock the dirhash shared and then try to upgrade.
Since there are several threads owning the lock shared, upgrade fails
and the same attempt is repeated, ad infinitum.

To break the lockstep, lock the dirhash in exclusive mode after the
failed try-upgrade.

Reported and tested by:	pho
Sponsored by:	Mellanox Technologies
MFC after:	1 week
2017-12-07 09:05:34 +00:00
pfg
78a6b08618 sys: general adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
supersede or replace the license texts.

No functional change intended.
2017-11-27 15:23:17 +00:00
pfg
4736ccfd9c sys: further adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
supersede or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
2017-11-20 19:43:44 +00:00
kib
d4fa3017bf Improve the message printed when the cylinder group checksum is wrong.
Mention the device path and mount point path, handle snapshots.

Tested by:	imp
Sponsored by:	The FreeBSD Foundation
2017-11-05 13:28:48 +00:00
markj
6e5d1bfd1d Remove a stale and incorrect comment.
MFC after:	1 week
Sponsored by:	Dell EMC Isilon
2017-10-28 02:51:27 +00:00
markj
6795d68368 Remove workqueue items after updating the workqueue tail pointer.
When QUEUE_MACRO_DEBUG_TRASH is configured, the queue linkage fields
are trashed upon removal of the item, so be sure to only read them before
removing the item.

No functional change intended.

MFC after:	1 week
Sponsored by:	Dell EMC Isilon
2017-10-28 02:48:37 +00:00
markj
668ea833d1 Make drain_output() use bufobj_wwait().
No functional change intended.

Reviewed by:	kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D12790
2017-10-25 17:20:18 +00:00
jhb
c991591342 Don't defer wakeup()s for completed journal workitems.
Normally wakeups() are performed for completed softupdates work items
in workitem_free() before the underlying memory is free()'d.
complete_jseg() was clearing the "wakeup needed" flag in work items to
defer the wakeup until the end of each loop iteration.  However, this
resulted in the item being free'd before its address was used with
wakeup().  As a result, another part of the kernel could allocate this
memory from malloc() and use it as a wait channel for a different
"event" with a different lock.  This triggered an assertion failure
when the lock passed to sleepq_add() did not match the existing lock
associated with the sleep queue.  Fix this by removing the code to
defer the wakeup in complete_jseg() allowing the wakeup to occur
slightly earlier in workitem_free() before free() is called.

The main reason I can think of for deferring a wakeup() would be to
avoid waking up a waiter while holding a lock that the waiter would
need.  However, no locks are dropped in between the wakeup() in
workitem_free() and the end of the loop in complete_jseg() as far as I
can tell.

In general I think it is not safe to do a wakeup() after free() as one
cannot control how other parts of the kernel that might reuse the
address for a different wait channel will handle spurious wakeups.

Reported by:	pho
Reviewed by:	kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D12494
2017-09-26 23:24:15 +00:00
kib
7abf2e5170 Fix 32bit build.
Reported by:	emaste
Sponsored by:	The FreeBSD Foundation
2017-09-22 16:42:41 +00:00
mckusick
4c3c44cdd8 Continuing efforts to provide hardening of FFS, this change adds a
check hash to cylinder groups. If a check hash fails when a cylinder
group is read, no further allocations are attempted in that cylinder
group until it has been fixed by fsck. This avoids a class of
filesystem panics related to corrupted cylinder group maps. The
hash is done using crc32c.
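
A minimal sketch of the verification step, using a bitwise CRC-32C. The layout of `struct ex_cg` is hypothetical, and the sketch uses the conventional initial value and final inversion for CRC-32C; the seed and finalisation used by the kernel may differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t
ex_crc32c(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;
    size_t i;
    int k;

    for (i = 0; i < len; i++) {
        crc ^= p[i];
        for (k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return (crc ^ 0xFFFFFFFFu);
}

/* Hypothetical miniature cylinder group: a check hash plus a bitmap. */
struct ex_cg {
    uint32_t ckhash;
    uint8_t  map[32];
};

/* Hash the whole structure with the ckhash field treated as zero. */
static uint32_t
ex_cg_hash(const struct ex_cg *cg)
{
    struct ex_cg tmp = *cg;

    tmp.ckhash = 0;
    return (ex_crc32c(&tmp, sizeof(tmp)));
}

/* A group whose stored hash does not match is skipped for allocation. */
static int
ex_cg_usable(const struct ex_cg *cg)
{
    return (cg->ckhash == ex_cg_hash(cg));
}
```

Zeroing the hash field before hashing lets the hash live inside the structure it protects; an allocator in this model would consult `ex_cg_usable()` and move on to another group on a mismatch.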

Check hashes are added only to UFS2 and not to UFS1 as UFS1 is primarily
used in embedded systems with small memories and low-powered processors
which need as light-weight a filesystem as possible.

Specifics of the changes:

sys/sys/buf.h:
    Add BX_FSPRIV to reserve a set of eight b_xflags that may be used
    by individual filesystems for their own purpose. Their specific
    definitions are found in the header files for each filesystem
    that uses them. Also add fields to struct buf as noted below.

sys/kern/vfs_bio.c:
    It is only necessary to compute a check hash for a cylinder
    group when it is actually read from disk. When calling bread,
    you do not know whether the buffer was found in the cache or
    read. So a new flag (GB_CKHASH) and a pointer to a function to
    perform the hash has been added to breadn_flags to say that the
    function should be called to calculate a hash if the data has
    been read. The check hash is placed in b_ckhash and the B_CKHASH
    flag is set to indicate that a read was done and a check hash
    calculated. Though a rather elaborate mechanism, it should
    also work for check hashing other metadata in the future. A
    kernel internal API change was to change breada into a static
    function and add flags and a function pointer to a check-hash
    function.

sys/ufs/ffs/fs.h:
    Add flags for types of check hashes; stored in a new word in the
    superblock. Define corresponding BX_ flags for the different types
    of check hashes. Add a check hash word in the cylinder group.

sys/ufs/ffs/ffs_alloc.c:
    In ffs_getcg do the dance with breadn_flags to get a check hash and
    if one is provided, check it.

sys/ufs/ffs/ffs_vfsops.c:
    Copy across the BX_FFSTYPES flags in background writes.
    Update the check hash when writing out buffers that need them.

sys/ufs/ffs/ffs_snapshot.c:
    Recompute check hash when updating snapshot cylinder groups.

sys/libkern/crc32.c:
lib/libufs/Makefile:
lib/libufs/libufs.h:
lib/libufs/cgroup.c:
    Include libkern/crc32.c in libufs and use it to compute check
    hashes when updating cylinder groups.

Four utilities are affected:

sbin/newfs/mkfs.c:
    Add the check hashes when building the cylinder groups.

sbin/fsck_ffs/fsck.h:
sbin/fsck_ffs/fsutil.c:
    Verify and update check hashes when checking and writing cylinder groups.

sbin/fsck_ffs/pass5.c:
    Offer to add check hashes to existing filesystems.
    Precompute check hashes when rebuilding cylinder group
    (although this will be done when it is written in fsutil.c
    it is necessary to do it early before comparing with the old
    cylinder group)

sbin/dumpfs/dumpfs.c
    Print out the new check hash flag(s)

sbin/fsdb/Makefile:
    Needs to add libufs now used by pass5.c imported from fsck_ffs.

Reviewed by: kib
Tested by: Peter Holm (pho)
2017-09-22 12:45:15 +00:00
jhb
6ae33bb87c Add UFS_LINK_MAX for the UFS-specific limit on link counts.
ino64 expanded nlink_t to 64 bits, but the on-disk format for UFS is still
limited to 16 bits.  This is a nop currently but will matter if LINK_MAX is
increased in the future.
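
The reason for a filesystem-specific limit can be sketched as follows. `EX_LINK_MAX` and `ex_link_add()` are hypothetical; the constant is simply the largest value a 16-bit on-disk field can hold, not the real UFS_LINK_MAX value.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical limit: the most a 16-bit on-disk link count can represent. */
#define EX_LINK_MAX UINT16_MAX

/*
 * The in-memory count is 64 bits wide, so the check must use the on-disk
 * limit rather than the range of the in-memory type.
 */
static int
ex_link_add(uint64_t *nlink)
{
    if (*nlink >= EX_LINK_MAX)
        return (EMLINK);
    (*nlink)++;
    return (0);
}
```

If the check were made against a larger system-wide limit instead, the in-memory count could grow past what the on-disk field can store.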

Reviewed by:	kib
Sponsored by:	Chelsio Communications
2017-09-18 23:30:39 +00:00
mckusick
85aaecff93 The new fsck recovery information to enable it to find backup
superblocks created in revision 322297 only works on disks
with sector sizes up to 4K. This update allows the recovery
information to be created by newfs and used by fsck on disks
with sector sizes up to 64K. Note that FFS currently limits
filesystems to being mounted from disks with sectors of up to 8K.
Expanding this limitation will be the subject of another
commit.

Reported by: Peter Holm
Reviewed with: kib
2017-09-04 20:19:36 +00:00
kib
0cda26ac73 Protect v_rdev dereference with the vnode interlock instead of the
vnode lock.

Caller of softdep_count_dependencies() may own a buffer lock, which
might conflict with the lock order.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	10 days
2017-08-25 09:51:22 +00:00
kib
3149ed68c4 Avoid dereferencing potentially freed workitem in
softdep_count_dependencies().

Buffer's b_dep list is protected by the SU mount lock.  Owning the
buffer lock is not enough to guarantee the stability of the list.

Calculation of the UFS mount owning the workitems from the buffer must
be much more careful to not dereference the work item which might be
freed meantime.  To get to ump, use the pointers chain which does not
involve workitems at all.

Reported and tested by:	pho
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2017-08-21 16:23:44 +00:00
kib
77de7ac78a Style.
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2017-08-21 16:16:02 +00:00
mckusick
7898ca2150 Since the switch to GPT disk labels, fsck for UFS/FFS has been
unable to automatically find alternate superblocks. This checkin
places the information needed to find alternate superblocks to the
end of the area reserved for the boot block.

Filesystems created with a newfs of this vintage or later will
create the recovery information. If you have a filesystem created
prior to this change and wish to have a recovery block created for
your filesystem, you can do so by running fsck in foreground mode
(i.e., do not use the -p or -y options). As it starts, fsck will
ask ``SAVE DATA TO FIND ALTERNATE SUPERBLOCKS'' to which you should
answer yes.

Discussed with: kib, imp
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D11589
2017-08-09 05:17:21 +00:00
mckusick
940856bc57 Avoid reading a snapshot block when it is already in the cache.
Update the use of the B_CACHE flag (since the May 1999 commit
that made it the correct test here).

Reported by: Andreas Longwitz <longwitz@incore.de>
Reviewed by: kib
Tested by: Peter Holm
MFC after: 1 week
2017-07-31 20:41:45 +00:00
kib
24abfe4267 Improve publication of the newly allocated snapdata.
For freshly allocated snapdata, lock sn_lock in advance, so
si_snapdata readers see the locked snapdata and do not race.

For existing snapdata, if the thread was put to sleep waiting for
sn_lock, re-read si_snapdata.  This either closes the race or makes
the reliance on LK_DRAIN less important.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2017-07-21 18:42:35 +00:00
kib
4047279466 Unlock correct lock in ffs_snapblkfree().
It is possible for ffs_snapblkfree() to race and lock snaplock while
the devvp snapdata is instantiated, but no snapshots exist.  In this
case the loop over snapshots in ffs_snapblkfree() is not executed, and
the local variable vp is left initialized to NULL.

Unlock using &sn->sn_lock and not vp->v_vnlock.  For the inodes on the
snapshot list, the locks are same.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2017-07-21 18:36:17 +00:00
kib
9a191031a0 Account for lock recursion when transferring snaplock to the vnode lock
in ffs_snapremove().

Apparently ffs_snapremove() may be called with the snap lock recursed,
at least one trace demonstrated this when snapshot vnode was unlinked
while synced.  It was inactivated from the syncer thread.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2017-07-21 18:28:27 +00:00
kib
cce23ef417 Remove write-only variable.
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2017-07-16 07:12:04 +00:00
kib
3daaee64c3 A followup to r320453, correct removal of the blocks from UFS snapshots.
Tested by:	pho
PR:    220693
Sponsored by:	The FreeBSD Foundation
2017-07-16 07:11:29 +00:00
jhb
b4c54ab330 Consistently use vop_stdpathconf() for default pathconf values.
Update filesystems not currently using vop_stdpathconf() in pathconf
VOPs to use vop_stdpathconf() for any configuration variables that do
not have filesystem-specific values.  vop_stdpathconf() is used for
variables that have system-wide settings as well as providing default
values for some values based on system limits.  Filesystems can still
explicitly override individual settings.

PR:		219851
Reported by:	cem
Reviewed by:	cem, kib, ngie
MFC after:	1 month
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D11541
2017-07-11 21:55:20 +00:00
mckusick
6d2611016c Create a new function ffs_getcg() to read in and verify a cylinder
group. Change all code points that open-coded this functionality
to use the new function. This commit is a refactoring with no
change in functionality.

In the future this change allows more robust checking of cylinder
group reads along the lines discussed in the hardening UFS session
at BSDCan (retry I/O, add checksums, etc). For more detail see the
session notes at https://wiki.freebsd.org/DevSummit/201706/HardeningUFS

Reviewed by: kib
2017-06-28 17:32:09 +00:00
kib
d8402b7b53 Mitigate several problems with the softdep_request_cleanup() on busy
host.

Problems start appearing when there are several threads all doing
operations on a UFS volume and the SU workqueue needs a cleanup.  It is
possible that each thread calling softdep_request_cleanup() owns the
lock for some dirty vnode (e.g. all of them are executing mkdir(2),
mknod(2), creat(2) etc) and all vnodes which must be flushed are locked
by corresponding thread. Then, we get all the threads simultaneously
entering softdep_request_cleanup().

There are two problems:
- Several threads execute MNT_VNODE_FOREACH_ALL() loops in parallel.  Due
  to the locking, they quickly start executing 'in phase' with the speed
  of the slowest thread.
- Since each thread already owns the lock for a dirty vnode, other threads'
  non-blocking attempts to lock the vnode owned by another thread fail,
  and the loops execute without making progress.
Retry logic does not allow the situation to recover.  The result is
a livelock.

Fix these problems by making the following changes:
- Allow only one thread to enter MNT_VNODE_FOREACH_ALL() loop per mp.
  A new flag FLUSH_RC_ACTIVE guards the loop.
- If there were failed locking attempts during the loop, abort retry
  even if there are still work items on the mp work list.  An
  assumption is that the items will be cleaned when another thread
  either fsyncs its vnode, or unlocks and allows yet another thread to
  make progress.
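
The first change, admitting only one scanner per mount, follows a simple guard-flag pattern. The sketch below is single-threaded and hypothetical: `struct ex_mount`, the bit value, and both functions are illustrative, and the flag is assumed to be protected by a per-mount lock that callers already hold.

```c
#include <stdbool.h>

#define EX_FLUSH_RC_ACTIVE 0x1   /* illustrative bit value */

struct ex_mount {
    int flags;   /* assumed to be protected by a per-mount lock */
    int scans;   /* number of scans started, for illustration */
};

/* Returns true if this caller may run the scan, false if one is active. */
static bool
ex_cleanup_enter(struct ex_mount *mp)
{
    if (mp->flags & EX_FLUSH_RC_ACTIVE)
        return (false);
    mp->flags |= EX_FLUSH_RC_ACTIVE;
    mp->scans++;
    return (true);
}

/* Called by the scanning thread when its pass over the vnodes is done. */
static void
ex_cleanup_exit(struct ex_mount *mp)
{
    mp->flags &= ~EX_FLUSH_RC_ACTIVE;
}
```

Callers that get `false` return immediately instead of starting a parallel loop, so the threads can no longer fall into lockstep with each other.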

It is possible now that some calls would get undeserved ENOSPC from
ffs_alloc(), because the cleanup is not aggressive enough. But I do
not see how we can reliably clean up workitems if calling
softdep_request_cleanup() while still owning the vnode lock. I thought
about scheme where ffs_alloc() returns ERESTART and saves the retry
counter somewhere in struct thread, to return to the top level, unlock
the vnode and retry.  But IMO the very rare (and unproven) spurious
ENOSPC is not worth the complications.

Reported and tested by:	pho
Style and comments by:	mckusick
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2017-06-03 16:18:50 +00:00
kib
98a2e55649 Clean possible td_su reference on the struct mount being unmounted as
the last step of ffs_unmount().

It is possible that the mount point is recorded for cleanup in AST
context while softdep flush is executed during unmount.  The workitems
are flushed by other means for the unmount, but the stray reference to
struct mount blocks destruction of mount.  Check for the situation and
manually call vfs_rel() before returning from ffs_unmount().

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-06-03 14:15:14 +00:00
kib
a0ed3ee5d2 Remove spl() calls from UFS code.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-05-07 14:59:45 +00:00
emaste
1d5c0f311c UFS fs.h: clear warning from use in makefs(1)
makefs(1) has a number of signedness warnings (when built with higher
WARNS), most of which can be addressed by careful application of casts
in makefs itself.

There is one case where a signedness warning arises from the blksize
macro, so must be addressed in the macro itself.

Reviewed by:	kib, mckusick
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D10589
2017-05-05 15:26:55 +00:00
glebius
5763443023 All these files need sys/vmmeter.h, but now they got it implicitly
included via sys/pcpu.h.
2017-04-17 17:07:00 +00:00
cem
1860c3b5c0 ufs: Export UFS_MAXNAMLEN to pathconf, statfs
Rather than the global NAME_MAX constant.  This change is required to
support systems with a NAME_MAX/MAXNAMLEN that differs from UFS_MAXNAMLEN.

This was missed in r313475 due to the alternative spelling ("NAME_MAX") of
MAXNAMLEN.  This change is also similar in spirit to r313780.

Reported by:	ngie@
Sponsored by:	Dell EMC Isilon
2017-04-05 01:44:03 +00:00
imp
7e6cabd06e Renumber copyright clause 4
Renumber clause 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistence on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.

Submitted by:	Jan Schaumann <jschauma@stevens.edu>
Pull Request:	https://github.com/freebsd/freebsd/pull/96
2017-02-28 23:42:47 +00:00
kib
9973ae075e Do not leak mount references for dying threads.
Thread might create a condition for delayed SU cleanup, which creates
a reference to the mount point in td_su, but exit without returning
through userret(), e.g. when terminating due to single-threading or
process exit.  In this case, td_su reference is not dropped and mount
point cannot be freed.

Handle the situation by clearing td_su also in the thread destructor
and in exit1().  softdep_ast_cleanup() has to receive the thread as
argument, since e.g. thread destructor is executed in different
context.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2017-02-25 10:38:18 +00:00
emaste
8e79b56e85 prefix UFS symbols with UFS_ to reduce namespace pollution
Specifically:
  ROOTINO -> UFS_ROOTINO
  WINO -> UFS_WINO
  NXADDR -> UFS_NXADDR
  NDADDR -> UFS_NDADDR
  NIADDR -> UFS_NIADDR
  MAXSYMLINKLEN_UFS[12] -> UFS[12]_MAXSYMLINKLEN (for consistency)

Also prefix ext2's and nandfs's NDADDR and NIADDR with EXT2_ and NANDFS_

Reviewed by:	kib, mckusick
Obtained from:	NetBSD
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D9536
2017-02-15 19:50:26 +00:00
cem
d21c7f090e ufs: Use UFS_MAXNAMLEN constant
(like NFS, EXT2FS, SVR4, IBCS2) instead of redefining the MAXNAMLEN
constant.

No functional change.

Reviewed by:	kib@, markj@
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D9500
2017-02-09 17:47:01 +00:00
cem
c3bd748d53 ffs_vnops: Simplify extattr access
As suggested in r167010, use the structure type and macros to access and
modify UFS2 extended attributes.  Add assertions that pointers are
aligned in places where we now access the data through a structure
pointer, instead of character-by-character.

PR:		216127
Reported by:	dewayne at heuristicsystems.com.au
Reviewed by:	kib@
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D9225
2017-01-19 16:46:05 +00:00
cem
63708c4468 restore(8): Handle extended attribute names correctly
UFS2 extended attribute names are not NUL-terminated.  Handle
appropriately.

Correct the EXTATTR_BASE_LENGTH() macro, which handled ea_namelength ==
one (mod eight) extended attributes incorrectly.

PR:		216127
Reported by:	dewayne at heuristicsystems.com.au
Reviewed by:	kib@
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D9208
2017-01-18 18:16:57 +00:00
cem
4d542f9fe8 ufs/extattr.h: Fix documentation of ea_name termination
The ea_name string is not nul-terminated.  Correct the documentation.

Because the subsequent field is padded to 8 bytes, and the padding is
zeroed, the ea_name string will appear to be nul-terminated whenever the
length isn't exactly one (mod eight).
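
The "one (mod eight)" case can be checked with a little arithmetic. The sketch
below assumes a 7-byte fixed header in front of ea_name (a 4-byte length plus
three 1-byte fields) and 8-byte alignment of what follows; the constant and
function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define EA_HEADER_BYTES	7	/* assumed size of the fields before ea_name */
#define EA_ALIGN	8	/* the following field is padded to 8 bytes */

/* Number of zeroed pad bytes that follow a name of the given length. */
static size_t
ea_name_padding_sketch(size_t namelen)
{
	size_t used;

	assert(namelen >= 1 && namelen <= 255);
	used = EA_HEADER_BYTES + namelen;
	return ((EA_ALIGN - used % EA_ALIGN) % EA_ALIGN);
}
```

Under these assumptions a name of length 1, 9, 17, ... ends exactly on an
8-byte boundary, so no zero byte follows it; every other length gets at least
one pad byte, which looks like a terminator.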

This was introduced in r167010 (2007).

Additionally, mark the length fields as unsigned.  This particularly
matters for the single byte ea_namelength field, which can represent
extended attribute names up to 255 bytes long.

No functional change.

PR:		216127
Reported by:	dewayne at heuristicsystems.com.au
Reviewed by:	kib@
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D9206
2017-01-18 17:55:49 +00:00
kib
7c67dd5f60 Use type-independent formats for printing nlink_t and ino_t.
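
A common way to print such types without depending on their width is to cast
to uintmax_t and use %ju. The helper below is a userland sketch with a made-up
name, not code from the commit.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>

/* Format an inode number and link count independently of their widths. */
static int
format_inode_sketch(char *buf, size_t len, ino_t ino, nlink_t nlink)
{
	assert(buf != NULL && len > 0);
	return (snprintf(buf, len, "ino %ju nlink %ju",
	    (uintmax_t)ino, (uintmax_t)nlink));
}
```

The format string stays valid whether ino_t is 32 or 64 bits wide, which is
what a later widening of the type needs.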
Extracted from:	ino64 work by gleb, mckusick
Discussed with:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-01-06 16:59:33 +00:00
markj
4159d33f6b Release laundered vnode pages to the head of the inactive queue.
The swap pager enqueues laundered pages near the head of the inactive queue
to avoid another trip through LRU before reclamation. This change adds
support for this behaviour to the vnode pager and makes use of it in UFS and
ext2fs. Some ioflag handling is consolidated into a common subroutine so
that this support can be easily extended to other filesystems which make use
of the buffer cache. No changes are needed for ZFS since its putpages
routine always undirties the pages before returning, and the laundry
thread requeues the pages appropriately in this case.

Reviewed by:	alc, kib
Differential Revision:	https://reviews.freebsd.org/D8589
2016-11-23 17:53:07 +00:00
kib
aef0061353 Provide simple mutual exclusion between mount point update and unmount.
Currently mount update keeps vfs_busy(9) reference on the mount point
during MNT_UPDATE VFS_MOUNT() vfsops call.  This already provides the
exclusion, but is problematic for filesystems which need to perform
namei(9) during VFS_MOUNT(MNT_UPDATE) operations, e.g. to refresh
mnt_from path, because namei(9) must not be called while the
vfs_busy(9) reference is owned.

Check for MNT_UPDATE flag before setting MNTK_UNMOUNT, and for
MNTK_UNMOUNT before entering innards of vfs_domount_update(), failing
syscalls with EBUSY if conflict is detected.  Keep vfs_busy(9)
reference around VFS_MOUNT(MNT_UPDATE) calls still to not change VFS
KPI.

In the update path in ffs_mount(), drop vfs_busy() reference around
namei(), which is now safe due to unmount never executing in parallel
with VFS_MOUNT(MNT_UPDATE), and which avoids the deadlock.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-11-13 21:49:51 +00:00
emaste
242aa658d8 ANSIfy ffs_subr.c
Also renumber license clause to avoid skipping #3
2016-10-31 20:43:43 +00:00
mckusick
7e51aa668a Avoid possible overflow when calculating malloc size for auxiliary
data structure sizes when mounting and reloading UFS/FFS
filesystems by using a u_long rather than an int for the size.
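
The effect of the wider type can be shown with fixed-width integers. This
sketch uses uint32_t and uint64_t instead of int and u_long so that the
wraparound is well defined; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Size computed in 32 bits: the product wraps modulo 2^32. */
static uint32_t
table_bytes_narrow_sketch(uint32_t nentries, uint32_t entry_size)
{
	assert(entry_size != 0);
	return (nentries * entry_size);
}

/* Size computed in 64 bits: widen one operand before multiplying. */
static uint64_t
table_bytes_wide_sketch(uint32_t nentries, uint32_t entry_size)
{
	assert(entry_size != 0);
	return ((uint64_t)nentries * entry_size);
}
```

With 65536 entries of 65536 bytes each, the narrow version yields 0 while the
wide version yields 4294967296, so an allocation sized by the narrow result
would be far too small.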

Reported by: Mariusz Zaborski <oshogbo@>
MFC after:   1 week
2016-10-28 20:15:19 +00:00
kib
1005ab8477 Generalize UFS buffer pager to allow it serving other filesystems
which also use buffer cache.

Most important addition to the code is the handling of filesystems
where the block size is less than the machine page size, which might
require reading several buffers to validate single page.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-10-28 11:43:59 +00:00
mckusick
ae1163bd55 The UFS/FFS filesystem checks directory link counts when doing
directory create and delete operations. If it ever finds a directory
with a link count less than 2, it panics. Thus, an rm -rf that
encounters a directory with a link count below 2 causes a kernel
panic. The proposed fix is to return the error EINVAL rather than
panicking. The effect is that the requested operation is not done,
but the system continues to run. At a more convenient later time,
the filesystem can be unmounted and cleaned (with fsck or journal
run). Once cleaned, the operation can be rerun to successful
completion.

This fix takes that approach. The panic message has been converted
into a uprintf(9) to provide the user with the inode number and
filesystem mount point of the offending directory and EINVAL is
returned for the operation.
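
The shape of the change, report and fail instead of panic, can be sketched as
follows. The structure, the message text and the use of stderr in place of
uprintf(9) are all assumptions made for the illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-in for the directory inode fields involved. */
struct dir_sketch {
	unsigned long ino;
	int nlink;
};

/* Returns 0 if the link count is sane, EINVAL otherwise. */
static int
check_dir_nlink_sketch(const struct dir_sketch *dp, const char *mntpt)
{
	assert(dp != NULL && mntpt != NULL);
	if (dp->nlink < 2) {
		/* The kernel would use uprintf(9) here. */
		fprintf(stderr, "%s: bad link count %d on directory inode %lu\n",
		    mntpt, dp->nlink, dp->ino);
		return (EINVAL);
	}
	return (0);
}
```

The caller abandons the create or delete operation on a nonzero return, and
the system keeps running until the filesystem can be checked.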

The long (three year) delay in fixing this problem occurred because
the bug was misclassified when originally assigned and only this week
was found during a sweep of old unresolved bug reports.

PR:          180894
Reviewed by: kib
MFC after:   2 weeks
2016-10-26 20:28:23 +00:00
marcel
03c2e26593 Include <sys/types.h> explicitly instead of depending on that
header being included by <sys/param.h>. When compiled as part
of makefs(8) and on macOS or Linux, <sys/param.h> is not our
own.
2016-10-24 18:12:57 +00:00
kib
23dcf48f60 Add FFS pager, which uses buffer cache read operation to validate pages.
See the comments for more detailed description of the algorithm.

The pager is used unconditionally when the block size of the
underlying device is larger than the machine page size, since local
vnode pager cannot handle the configuration [1].  Otherwise, the
vfs.ffs.use_buf_pager sysctl allows switching to the local pager.

Measurements demonstrated no regression in the ever-important
buildworld benchmark, and small (~5%) throughput improvements in the
special microbenchmark configuration for dbench over swap-backed
md(4).

Code can be generalized and reused for other filesystems which use
buffer cache.

Reported by:	Anton Yuzhaninov <citrin@citrin.ru> [1]
Tested by:	pho
Benchmarked by:	mjg, pho
Reviewed by:	alc, markj, mckusick (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D8198
2016-10-19 11:09:29 +00:00
mjg
6a50fe29a5 vfs: remove the __bo_vnode field from struct vnode
The pointer can be obtained using __containerof instead.
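
The idea behind __containerof is to recover a pointer to the enclosing
structure from a pointer to one of its members. The macro and structures below
are simplified stand-ins written for this sketch, not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified equivalent of a container-of macro, built on offsetof(). */
#define	containerof_sketch(ptr, type, member)				\
	((type *)((char *)(ptr) - offsetof(type, member)))

struct bufobj_sketch {
	int bo_dummy;
};

struct vnode_sketch {
	int v_type;
	struct bufobj_sketch v_bufobj;	/* embedded, so no back pointer needed */
};

/* Map an embedded bufobj back to the vnode that contains it. */
static struct vnode_sketch *
bo2vnode_sketch(struct bufobj_sketch *bo)
{
	assert(bo != NULL);
	return (containerof_sketch(bo, struct vnode_sketch, v_bufobj));
}
```

Because the member is embedded at a fixed offset, the back pointer carries no
extra information and can be dropped from the structure.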

Reviewed by:	kib
2016-09-30 17:11:03 +00:00
kib
800b88629e Be more strict when selecting between snapshot/regular mount.
Reclaimed vnode type is VBAD, so successful comparison like
devvp->v_type != VREG does not imply that the devvp references
snapshot, it might be due to a reclaimed vnode.  Explicitly check the
vnode type.

In the most important case of ffs_blkfree(), the devfs vnode is
locked and its type is stable.  In other cases, if the vnode is
reclaimed right after the check, hopefully the buffer methods return
right error codes.

Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-09-19 15:58:33 +00:00
kib
ce2b0686e5 Fix libprocstat build after r305902.
- Use _Bool to not require userspace to include stdbool.h.
- Make extattr.h usable without vnode_if.h.
- Follow i_ump to get cdev pointer.

Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-09-17 18:14:31 +00:00
kib
20f1e8ac63 Reduce size of ufs inode.
Remove redundant i_dev and i_fs pointers, which are available as
ip->i_ump->um_dev and ip->i_ump->um_fs, and reorder members by size to
reduce padding.  To compensate for the added dereferences, the most
frequent i_ump access, used to differentiate between UFS1 and UFS2
dinode layout, is removed by adding the new i_flag IN_UFS2.  Overall,
this actually reduces the number of memory dereferences.

On 64bit machine, original struct inode size is 176, reduced to 152
bytes with the change.
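
How ordering members by size reduces padding can be seen on a toy structure.
Both layouts below are invented for the illustration and are unrelated to the
real struct inode; exact sizes depend on the ABI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Small members interleaved with large ones force padding. */
struct inode_loose_sketch {
	uint8_t flag;
	void *ump;
	uint16_t mode;
	uint64_t size;
	uint8_t kind;
};

/* Same members, largest first: the small ones share one slot. */
struct inode_tight_sketch {
	uint64_t size;
	void *ump;
	uint16_t mode;
	uint8_t flag;
	uint8_t kind;
};

/* Bytes saved per object by the reordering on this ABI. */
static size_t
bytes_saved_sketch(void)
{
	assert(sizeof(struct inode_tight_sketch) <=
	    sizeof(struct inode_loose_sketch));
	return (sizeof(struct inode_loose_sketch) -
	    sizeof(struct inode_tight_sketch));
}
```

On a typical LP64 ABI the loose layout needs padding after each small member,
while the tight one needs padding only at the end.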

Tested by:	pho (previous version)
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-09-17 16:47:34 +00:00
bde
4000d6ac7e Sprinkle DOINGASYNC() checks so as to do delayed writes for async
mounts in almost all cases instead of in most cases.  Don't override
DOINGASYNC() by any condition except IO_SYNC.

Fix previous sprinkling of DOINGASYNC() checks.  Don't override IO_SYNC
by DOINGASYNC().  In ffs_write() and ffs_extwrite(), there were
intentional overrides that just broke O_SYNC of data.  In
ffs_truncate(), there are 5 calls to ffs_update(), 4 with
apparently-unintentional overrides and 1 without; this had no effect
due to the main async mount hack described below.

Fix 1 place in ffs_truncate() where the caller's IO_ASYNC was overridden
for the soft updates case too (to do a delayed write instead of a sync
write).  This is supposed to be the only change that affects anything
except async mounts.

In ffs_update(), remove the 19 year old efficiency hack of ignoring
the waitfor flag for async mounts, so that fsync() almost works for
async mounts.  All callers are supposed to be fixed to not ask for a
sync update unless they are for fsync() or [I]O_SYNC operations.
fsync() now almost works for async mounts.  It used to sync the data
but not the most important metadata (the inode).  It still doesn't sync
associated directories.

This gave 10-20% fewer writes for my makeworld benchmark with async
mounted tmp and obj directories from an already small number.

Style fixes:
- in ffs_balloc.c, remove rotted quadruplicated comments about the
  simplest part of the DOING*() decisions and rearrange the nearly-
  quadruplicated code to be more nearly so.
- in ufs_vnops.c, use a consistent style with less negative logic and
  no manual "optimization" of || to | in DOING*() expressions.

Reviewed by:	kib (previous version)
2016-09-08 17:40:40 +00:00
kib
8d2c7fdc8d On rename, do not perform truncation of dirhash if the vnode
truncation failed.

Doing so resulted in inconsistent state of the ufs dirhash with regard
to the actual directory inode state, and could lead to spurious ENOENT
errors for lookups of existing files in production kernels, or
assertion failures in the debugging kernels.

Change the logic of calling ufsdirhash_dirtrunc() to be same as in
ufs_direnter().  Execute UFS_TRUNCATE() first, log error, and only do
dirtrunc() if UFS_TRUNCATE() succeeded.

Note that the problem was exacerbated by the bug in the
flush_newblk_dep() function (see r305599), which caused spurious
errors from ffs_sync() and then ffs_truncate().

In collaboration with:	pho
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-09-08 12:09:34 +00:00
kib
03f5910da5 Do not leak transient ENOLCK error from flush_newblk_dep() loop.
The buffer lock is retried on failed LK_SLEEPFAIL attempt, and error
from the failed attempt is irrelevant.  But since there is path after
retry which does not clear error, it is possible to return spurious
error from the function.

The issue resulted in a spurious failure of softdep_sync_buf(),
causing further spurious failure of ffs_sync().

In collaboration with:	pho
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-09-08 12:08:54 +00:00
kib
835bb1130c When logging unlikely UFS_TRUNCATE() failure in ufs_direnter(),
include error code.

Reported and tested by:	pho
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-09-08 12:08:08 +00:00
kib
666d76bef3 When extending directory inode in ufs_direnter(), adjust i_endoff.
This change is formally not needed, since i_endoff is not used in all
code paths after the call to ufs_direnter(), and i_endoff is
recalculated by the next lookup.  But having the value correct makes
the reasoning about code simpler.

Reported and tested by:	pho
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-09-08 12:07:25 +00:00
kib
dc190244af In dqsync(), when called from quotactl(), um_quotas entry might appear
cleared since nothing prevents completion of the parallel quotaoff.
There is nothing to sync in this case, and no reason to panic.

Reported and tested by:	pho
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-09-08 12:06:43 +00:00
kib
c8c9eee1f7 In softdep_prealloc(), return early not only for snapshots, but for
the quota files as well.

Reported and tested by:	pho
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-09-08 12:05:13 +00:00
kib
a483a2c0bd There is no need to upgrade the last dvp lock on lookups for modifying
operations.  Instead of upgrading, assert that the lock is exclusive.
Explain the cause in comments.

This effectively reverts r209367.

Tested by:	pho
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-09-08 12:04:45 +00:00
kib
ded52104dc Partially lift suspension when ffs_reload() finished with cgs and
going to re-read inodes.

Secondary write initiators, e.g. ufs_inactive(), might need to start a
write while owning the vnode lock.  Since the suspended state
established by /dev/ufssuspend prevents them from entering
vn_start_secondary_write(), we get deadlock otherwise.

Note that it is arguably not very useful to re-read inodes after
/dev/ufssuspend suspension, because the suspension does not block
readers, and other threads might read existing files in parallel with
suspension owner (for now, only growfs(8)) operations.  This
effectively means that suspension owner cannot safely modify existing
inodes, and then there is no sense in re-reading.  But keep the code
enabled for now.

Reported and tested by:	pho
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-09-08 12:01:28 +00:00
imp
a88b45a2a8 Renumber the advertising clause.
2016-09-06 15:17:35 +00:00
mckusick
d1132818f9 Bug 211013 reports that a write error to a UFS filesystem running
with softupdates panics the kernel. The problem that has been pointed
out is that when there is a transient write error on certain metadata
blocks, specifically directory blocks (PAGEDEP), inode blocks
(INODEDEP), indirect pointer blocks (INDIRDEPS), and cylinder group
(BMSAFEMAP, but only when journaling is enabled), we get a panic
in one of the routines called by softdep_disk_io_initiation that
the I/O is "already started" when we retry the write.

These dependency types potentially need to do roll-backs when called
by softdep_disk_io_initiation before doing a write and then a
roll-forward when called by softdep_disk_write_complete after the
I/O completes.  The panic happens when there is a transient error.
At the top of softdep_disk_write_complete we check to see if the
write had an error and if an error occurred we just return.  This
return is correct most of the time because the main role of the routines
called by softdep_disk_write_complete is to process the now-completed
dependencies so that the next I/O steps can happen.

But for the four types listed above, they do not get to do their
rollback operations. This causes the panic when softdep_disk_io_initiation
gets called on the second attempt to do the write and the roll-back
routines find that the roll-backs have already been done. As an
aside I note that there is also the problem that the buffer will
have been unlocked and thus made visible to the filesystem and to
user applications with the roll-backs in place.

The way to resolve the problem is to add a flag to the routines called
by softdep_disk_write_complete for the four dependency types noted
that indicates whether the write was successful (WRITESUCCEEDED).
If the write does not succeed, they do just the roll-backs and then
return. If the write was successful they also do their usual
processing of the now-completed dependencies.

The fix was tested by selectively injecting write errors for buffers
holding dependencies of each of the four types noted above and then
verifying that the kernel no longer panicked and that following the
successful retry of the write that the filesystem could be unmounted
and successfully checked cleanly.

PR: 211013
Reviewed by: kib
2016-08-16 21:02:30 +00:00
kib
73de606a86 In UFS_BALLOC(), invalidate pages of indirect buffers on failed block
allocation unwinding.

Dangling buffers are released on UFS_BALLOC() failure to ensure that
later attempts to allocate blocks in close range do not find the blocks
with invalid content, since possible partial block allocations are
unwound.  As such, it is not enough to just release the buffers, the
pages must also be invalidated and removed from the vnode vm_object
queue.  Otherwise the pages might be found later and used to
reconstruct indirect buffers when doing allocations at offset close to
the failure point, and their stale content compromises the filesystem
integrity.

Note that just marking the buffer as B_INVAL is not enough, B_NOCACHE
is required.  To be sure, clear the B_CACHE flag as well.  This
complements the r174973, which started releasing buffers.

Reported and tested by:	pho
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-08-16 17:30:58 +00:00
kib
fd0eaea7aa On unwind after failed block allocation in ffs_balloc_ufs{1,2}, assert
that recorded allocated blocks numbers match the physical block
numbers of dangling buffers which are released.

When finally freeing the blocks during unwind, assert that dangling
buffers were not re-allocated.  They shouldn't, because the vnode lock
is owned exclusive.

Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-08-16 17:18:38 +00:00
kib
e037bf91c2 When looking up dangling buffers for unwinding after failing block
allocation in UFS_BALLOC(), there is no need to map them.

Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-08-16 17:05:15 +00:00
kib
4e01efcc0c When block allocation fails in UFS_BALLOC(), and the volume does not
have SU enabled, there is no point in calling softdep_request_cleanup().

The call cannot produce free blocks, but we unnecessarily lock ufsmount
and do inode block write.  Usual point of not doing optimizations for
the corner case of the full volume is not applicable there, the work
is easily avoidable, and the additional CPU and disk io load do not lead
to succeeding retry.

Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-08-16 16:50:48 +00:00
kib
03c6968a37 In ffs_balloc_ufs{1,2} routines, assert that unwind records do not
overflow local arrays.  This is not immediately obvious from the
static code inspection, due to retry logic.

Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-08-16 16:49:56 +00:00
kib
20b89ec291 Implement VOP_FDATASYNC() for UFS.
Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D7471
2016-08-15 19:22:23 +00:00
trasz
255ed885fa Replace all remaining calls to vprint(9) with vn_printf(9), and remove
the old macro.

MFC after:	1 month
2016-08-10 16:12:31 +00:00
kib
9009fc003e Ensure that the UFS directory vnode's vm_object is properly sized
before UFS_BALLOC() is called.  I do not believe that this caused any
real issue on FreeBSD because the exclusive vnode lock is held over
the balloc/resize, the change is to make formally correct KPI use.

Based on:	Matthew Dillon's patch from DragonFly BSD
PR:	93942
Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-07-20 14:40:56 +00:00
kevlo
1006f009c6 arc4random() returns 0 to (2**32)−1, use an alternative to initialize
i_gen if it's zero rather than a divide by 2.

With inputs from  delphij, mckusick, rmacklem

Reviewed by:	mckusick
2016-05-22 14:31:20 +00:00
kib
e006a8a1fd Stop dropping and reacquiring Giant around geom calls in UFS.
Sponsored by:	The FreeBSD Foundation
2016-05-21 10:13:25 +00:00
kib
93315c1cff Improve handling of rdev->si_mountpt on mount and unmount of FFS
volumes.  Treat the field as a semaphore protecting availability of
the device for mounting.  Do not access devvp->v_rdev without the vnode
lock owned.

Protect change of the devvp->v_bufobj bo_ops vector with the vnode
lock.

Reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-05-21 09:49:35 +00:00
kib
54336eef75 Ensure that ftruncate(2) is performed synchronously when file is
opened in O_SYNC mode, at least for UFS.  This also handles
truncation, done due to the O_SYNC | O_TRUNC flags combination to
open(2), in synchronous way.

Noted by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-05-18 12:03:57 +00:00
kib
393eee329b Do enable io accounting for read-only mounts and mounts which are
remounted to writeable after initial read-only.  Assign to
dev->si_mountpt earlier to account the accesses done at the mount
time.

Based on submission by:	bde
MFC after:	1 week
2016-05-17 21:35:35 +00:00
kib
6c68797acf If IO_SYNC was passed to ffs_truncate(), request synchronous inode
update from the final ffs_update().

Noted by:	bde
MFC after:	1 week
2016-05-17 21:30:58 +00:00
kib
b27a7ca7b1 For async UFS mounts, shrink the directory asynchronously, at least do
not pass IO_SYNC to ffs_truncate() unnecessarily.

Submitted by:	bde
MFC after:	1 week
2016-05-17 21:28:28 +00:00
kib
e1e51d237c Fix comments.
Submitted by:	bde
MFC after:	1 week
2016-05-17 08:24:27 +00:00
kib
67809d96fd Fix typo in the message.
Submitted by:	bde
MFC after:	1 week
2016-05-17 08:19:20 +00:00
pfg
6ac8bdb320 UFS: spelling fixes on comments.
No functional change.
2016-04-29 20:43:51 +00:00
ngie
35bd8367fe Add FEATURE knob for testing for UFS extended attribute kernel support
Support can be verified via `feature_present("ufs_extattr")`, etc.

Differential Revision: https://reviews.freebsd.org/D6053
MFC after: 2 weeks
Relnotes: yes
Reviewed by: asomers, kib
Sponsored by: EMC / Isilon Storage Division
2016-04-22 08:09:27 +00:00
pfg
729533413f sys: use our roundup2/rounddown2() macros when param.h is available.
rounddown2 tends to produce longer lines than the original code
and when the code has a high indentation level it was not really
advantageous to do the replacement.

This tries to strike a balance between readability using the macros
and flexibility of having the expressions, so not everything is
converted.
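
For reference, rounding to a power-of-two boundary reduces to a mask
operation. The functions below are a sketch of that technique under
hypothetical names; they are written as functions rather than macros so the
power-of-two precondition can be asserted.

```c
#include <assert.h>
#include <stdint.h>

/* Round x down to a multiple of y; y must be a power of two. */
static uint64_t
rounddown2_sketch(uint64_t x, uint64_t y)
{
	assert(y != 0 && (y & (y - 1)) == 0);
	return (x & ~(y - 1));
}

/* Round x up to a multiple of y; y must be a power of two. */
static uint64_t
roundup2_sketch(uint64_t x, uint64_t y)
{
	assert(y != 0 && (y & (y - 1)) == 0);
	return ((x + (y - 1)) & ~(y - 1));
}
```

The open-coded form, x & ~(y - 1), is shorter than a call with two arguments,
which is why replacing it in deeply indented code can lengthen the line.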
2016-04-21 19:57:40 +00:00
pfg
858f336801 ufs: replace 0 with NULL for pointers.
While here also do late initialization of the variables we are
changing.

Found with devel/coccinelle.

Reviewed by:	mckusick
MFC after:	2 weeks
2016-04-10 21:48:11 +00:00
trasz
825d80e01c Add four new RCTL resources - readbps, readiops, writebps and writeiops,
for limiting disk (actually filesystem) IO.

Note that in some cases these limits are not quite precise. It's ok,
as long as it's within some reasonable bounds.

Testing - and review of the code, in particular the VFS and VM parts - is
very welcome.

MFC after:	1 month
Relnotes:	yes
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D5080
2016-04-07 04:23:25 +00:00
trasz
ca92bb3067 Remove some NULL checks for M_WAITOK allocations.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-03-29 13:56:59 +00:00
kib
dae532b324 Split the global taskqueue used to process all UFS trim completions,
into per-mount taskqueue with the private taskqueue processing thread.
This allows draining the taskqueue on unmount, to ensure that all
TRIMs are finished before mount structures are freed.

But just draining the taskqueue where TRIM biodone geom-up completions
are processed is not enough, since ffs_blkfree(), called by the task,
might result in more writes.  Count inflight delayed blkfree's and
pause() unmount until the counter drains as well.

Reported by:	Nick Evans <nevans@talkpoint.com>
Tested by:	Nick Evans <nevans@talkpoint.com>, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-03-27 08:21:17 +00:00
kib
6fb3ce9534 Style: wrap long lines.
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-03-27 08:07:12 +00:00
kib
e8766307c3 Fix locking mistake in softdep_waitidle(). The surrounding code
expects that the loop is always exited with the SU lock owned, even on
error.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2016-03-23 09:58:51 +00:00
mckusick
6a020d6c89 The UFS filesystem requires that the last block of a file always be
allocated. When shortening the length of a file in which the new end
of the file contains a hole, the hole must have a block allocated.

Reported by: Maxim Sobolev
Reviewed by: kib
Tested by:   Peter Holm
2016-02-24 01:58:40 +00:00
trasz
0fe7e27aea Remove ffs_mountroot() prototype; seems to be long gone.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-01-28 12:21:23 +00:00
mckusick
32cb8b928a This fixes a bug in UFS2 exported NFS volumes. An NFS client can
crash a server that has exported UFS2 by presenting a filehandle
with an inode number that references an uninitialized inode in a
cylinder group. The problem is that UFS2 only initializes blocks
of inodes as they are first allocated and ffs_fhtovp() does not
validate that the inode is in a range of inodes that have been
initialized. Attempting to read an uninitialized inode gets random
data from the disk. When the kernel tries to interpret it as an
inode, panics often arise.

Reported by: Christos Zoulas (from NetBSD)
Reviewed by: kib
2016-01-27 21:27:05 +00:00
mckusick
0b10a802f8 The bread() function was inconsistent about whether it would return
a buffer pointer in the event of an error (for some errors it would
return a buffer pointer and for other errors it would not return a
buffer pointer). The cluster_read() function was similarly inconsistent.

Clients of these functions were inconsistent in handling errors.
Some would assume that no buffer was returned after an error and
would thus lose buffers under certain error conditions. Others would
assume that brelse() should always be called after an error and
would thus panic the system under certain error conditions.

To correct both of these problems with minimal code churn, bread()
and cluster_read() now always free the buffer when returning an
error thus ensuring that buffers will never be lost. The brelse()
routine checks for being passed a NULL buffer pointer and silently
returns to avoid panics. Thus both approaches to handling error
returns from bread() and cluster_read() will work correctly.

Future code should be written assuming that bread() and cluster_read()
will never return a buffer with an error, so should not attempt to
brelse() the buffer when an error is returned.
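
The resulting contract can be modelled in a few lines of userland C. The
structure and both functions are simplified stand-ins with made-up names and
signatures; they only mirror the behaviour described above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct buf. */
struct buf_sketch {
	int b_released;
};

/* On error no buffer is handed back: *bpp is set to NULL. */
static int
bread_sketch(int simulate_error, struct buf_sketch *backing,
    struct buf_sketch **bpp)
{
	assert(backing != NULL && bpp != NULL);
	if (simulate_error != 0) {
		*bpp = NULL;	/* buffer already freed on the error path */
		return (simulate_error);
	}
	backing->b_released = 0;
	*bpp = backing;
	return (0);
}

/* Releasing a NULL buffer is a silent no-op. */
static void
brelse_sketch(struct buf_sketch *bp)
{
	if (bp == NULL)
		return;
	bp->b_released = 1;
}
```

Callers that release the buffer after an error and callers that do not both
behave correctly under this contract: the former pass NULL, the latter have
nothing left to leak.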

Reviewed by: kib
2016-01-27 21:23:01 +00:00
kib
70adc1e216 Recheck curthread->td_su after the VFS_SYNC() call, and re-sync if the
ast was rescheduled during VFS_SYNC().  It is possible that enough
parallel writes or slow/hung volume result in VFS_SYNC() deferring to
the ast flushing of workqueue.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-12-21 11:50:32 +00:00
kib
29c1a1655d Update ctime when atime or birthtime are updated.
Cleanup setting of ctime/mtime/birthtime: do not set IN_ACCESS or
IN_UPDATE, then clear them with ufs_itimes(), making transient
(possibly inconsistent) change to the times, and then copy
user-supplied times into the inode.  Instead, directly clear IN_ACCESS
or IN_UPDATE when user supplied the time, and copy the value into the
inode.

Minor inconsistency left is that the inode ctime is updated even when
birthtime update attempt is performed on a UFS1 volume.

Submitted by:	bde
MFC after:	2 weeks
2015-12-07 12:09:04 +00:00
mckusick
cb4ab786a1 For performance reasons, it is useful to have a single string used as
the name of a filesystem when setting it as the first parameter to the
getnewvnode() function. Most filesystems call getnewvnode from just one
place so can use a literal string as the first parameter. However, NFS
calls getnewvnode from two places, so we create a global constant string
that can be used by the two instances. This change also collapses two
instances of getnewvnode() in the UFS filesystem to a single call.

Reviewed by: kib
Tested by:   Peter Holm
2015-11-29 21:01:02 +00:00
kib
e768957c56 Do not perform read-ahead for BA_CLRBUF request when we are low on
memory or when dirty buffer queue is too large.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-10-27 13:44:13 +00:00
imp
fb9846240a Do not relocate extents to make them contiguous if the underlying drive can do
deletions. Ability to do deletions is a strong indication that this
optimization will not help performance. It will only generate extra write
traffic. These devices are typically flash based and have a limited number of
write cycles. In addition, making the file contiguous in LBA space doesn't
improve the access times from flash devices because they have no seek time.

Reviewed by: mckusick@
2015-10-16 03:06:02 +00:00
glebius
a2200eb70b In softdep_setup_freeblocks():
- Move the bread() to the beginning of function.
- Return if it fails, otherwise we will panic.

Submitted by:	mckusick
Sponsored by:	Netflix
2015-10-07 12:36:28 +00:00
kib
9a0916a6b3 Do not consume extra reference. This is a bug in r287479.
Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-09-05 12:28:18 +00:00
kib
4bfbbf8647 Declare the writes around the call to VFS_SYNC() in
softdep_ast_cleanup_proc().

Tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-09-05 08:48:24 +00:00
kib
d285ab0209 By doing file extension fast, it is possible to create excess supply
of the D_NEWBLK kinds of dependencies (i.e. D_ALLOCDIRECT and
D_ALLOCINDIR), which can exhaust kmem.

Handle excess of D_NEWBLK in the same way as excess of D_INODEDEP and
D_DIRREM, by scheduling ast to flush dependencies, after the thread,
which created new dep, left the VFS/FFS innards.  For D_NEWBLK, the
only way to get rid of them is to do full sync, since items are
attached to data blocks of arbitrary vnodes.  The check for D_NEWBLK
excess in softdep_ast_cleanup_proc() is unlocked.

For 32bit arches, reduce the total amount of allowed dependencies by
two.  Increasing the limit for 64 bit platforms with direct maps could
be considered.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-09-01 13:07:27 +00:00
jeff
44267026a0 - Make 'struct buf *buf' private to vfs_bio.c. Having a global variable
'buf' is inconvenient and has led me to some irritating-to-discover
   bugs over the years.  It also makes it more challenging to refactor
   the buf allocation system.
 - Move swbuf and declare it as an extern in vfs_bio.c.  This is still
   not perfect but better than it was before.
 - Eliminate the unused ffs function that relied on knowledge of the buf
   array.
 - Move the shutdown code that iterates over the buf array into vfs_bio.c.

Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
2015-07-29 02:26:57 +00:00
jeff
3fb666cfae Refactor unmapped buffer address handling.
- Use pointer assignment rather than a combination of pointers and
   flags to switch buffers between unmapped and mapped.  This eliminates
   multiple flags and generally simplifies the logic.
 - Eliminate b_saveaddr since it is only used with pager bufs which have
   their b_data re-initialized on each allocation.
 - Gather up some convenience routines in the buffer cache for
   manipulating buf space and buf malloc space.
 - Add an inline, buf_mapped(), to standardize checks around unmapped
   buffers.

In collaboration with: mlaier
Reviewed by:	kib
Tested by:	pho (many small revisions ago)
Sponsored by:	EMC / Isilon Storage Division
2015-07-23 19:13:41 +00:00
mjg
c71e9ab863 Move chdir/chroot-related fdp manipulation to kern_descrip.c
Prefix exported functions with pwd_.

Deduplicate some code by adding a helper for setting fd_cdir.

Reviewed by:	kib
2015-07-11 16:19:11 +00:00
markj
d19ba3f89d Check suspendability on the mountpoint returned by VOP_GETWRITEMOUNT.
This obviates the need for a MNTK_SUSPENDABLE flag, since passthrough
filesystems like nullfs and unionfs no longer need to inherit this
information from their lower layer(s). This change also restores the
pre-r273336 behaviour of using the presence of a susp_clean VFS method to
request suspension support.

Reviewed by:	kib, mjg
Differential Revision:	https://reviews.freebsd.org/D2937
2015-07-05 22:37:33 +00:00
markm
d586165577 Huge cleanup of random(4) code.
* GENERAL
- Update copyright.
- Make kernel options for RANDOM_YARROW and RANDOM_DUMMY. Set
  neither to ON, which means we want Fortuna
- If there is no 'device random' in the kernel, there will be NO
  random(4) device in the kernel, and the KERN_ARND sysctl will
  return nothing. With RANDOM_DUMMY there will be a random(4) that
  always blocks.
- Repair kern.arandom (KERN_ARND sysctl). The old version went
  through arc4random(9) and was a bit weird.
- Adjust arc4random stirring a bit - the existing code looks a little
  suspect.
- Fix the nasty pre- and post-read overloading by providing explicit
  functions to do these tasks.
- Redo read_random(9) so as to duplicate random(4)'s read internals.
  This makes it a first-class citizen rather than a hack.
- Move stuff out of locked regions when it does not need to be
  there.
- Trim RANDOM_DEBUG printfs. Some are excess to requirement, some
  behind boot verbose.
- Use SYSINIT to sequence the startup.
- Fix init/deinit sysctl stuff.
- Make relevant sysctls also tunables.
- Add different harvesting "styles" to allow for different requirements
  (direct, queue, fast).
- Add harvesting of FFS atime events. This needs to be checked for
  weighing down the FS code.
- Add harvesting of slab allocator events. This needs to be checked for
  weighing down the allocator code.
- Fix the random(9) manpage.
- Loadable modules are not present for now. These will be re-engineered
  when the dust settles.
- Use macros for locks.
- Fix comments.

* src/share/man/...
- Update the man pages.

* src/etc/...
- The startup/shutdown work is done in D2924.

* src/UPDATING
- Add UPDATING announcement.

* src/sys/dev/random/build.sh
- Add copyright.
- Add libz for unit tests.

* src/sys/dev/random/dummy.c
- Remove; no longer needed. Functionality incorporated into randomdev.*.

* live_entropy_sources.c live_entropy_sources.h
- Remove; content moved.
- move content to randomdev.[ch] and optimise.

* src/sys/dev/random/random_adaptors.c src/sys/dev/random/random_adaptors.h
- Remove; plugability is no longer used. Compile-time algorithm
  selection is the way to go.

* src/sys/dev/random/random_harvestq.c src/sys/dev/random/random_harvestq.h
- Add early (re)boot-time randomness caching.

* src/sys/dev/random/randomdev_soft.c src/sys/dev/random/randomdev_soft.h
- Remove; no longer needed.

* src/sys/dev/random/uint128.h
- Provide a fake uint128_t; if a real one ever arrived, we can use
  that instead. All that is needed here is N=0, N++, N==0, and some
  localised trickery is used to manufacture a 128-bit 0ULLL.

* src/sys/dev/random/unit_test.c src/sys/dev/random/unit_test.h
- Improve unit tests; previously the testing human needed clairvoyance;
  now the test will do a basic check of compressibility. Clairvoyant
  talent is still a good idea.
- This is still a long way off a proper unit test.

* src/sys/dev/random/fortuna.c src/sys/dev/random/fortuna.h
- Improve messy union to just uint128_t.
- Remove unneeded 'static struct fortuna_start_cache'.
- Tighten up arithmetic.
- Provide a method to allow external junk to be introduced; harden
  it against blatant by compress/hashing.
- Assert that locks are held correctly.
- Fix the nasty pre- and post-read overloading by providing explicit
  functions to do these tasks.
- Turn into self-sufficient module (no longer requires randomdev_soft.[ch])

* src/sys/dev/random/yarrow.c src/sys/dev/random/yarrow.h
- Improve messy union to just uint128_t.
- Remove unneeded 'static struct start_cache'.
- Tighten up arithmetic.
- Provide a method to allow external junk to be introduced; harden
  it against blatant by compress/hashing.
- Assert that locks are held correctly.
- Fix the nasty pre- and post-read overloading by providing explicit
  functions to do these tasks.
- Turn into self-sufficient module (no longer requires randomdev_soft.[ch])
- Fix some magic numbers elsewhere used as FAST and SLOW.
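
The uint128.h entry above asks only for N=0, N++ and N==0 on a 128-bit
counter. A minimal sketch of how such a counter could be built from two
64-bit words follows. The struct and function names are hypothetical and
are not taken from the actual header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical two-word counter; not the layout used by uint128.h. */
struct fake_uint128 {
	uint64_t lo;	/* low 64 bits */
	uint64_t hi;	/* high 64 bits */
};

/* N = 0 */
static void
fake_uint128_clear(struct fake_uint128 *n)
{
	assert(n != NULL);
	n->lo = 0;
	n->hi = 0;
}

/* N++, carrying into the high word when the low word wraps to zero. */
static void
fake_uint128_increment(struct fake_uint128 *n)
{
	assert(n != NULL);
	n->lo++;
	if (n->lo == 0)
		n->hi++;
}

/* N == 0 */
static bool
fake_uint128_is_zero(const struct fake_uint128 *n)
{
	assert(n != NULL);
	return (n->lo == 0 && n->hi == 0);
}
```

Keeping the three operations behind functions means a real 128-bit
integer type could later replace the struct without touching callers.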

Differential Revision: https://reviews.freebsd.org/D2025
Reviewed by: vsevolod,delphij,rwatson,trasz,jmg
Approved by: so (delphij)
2015-06-30 17:00:45 +00:00
kib
f07d3d4559 Simplify code, no need to test the flag before clearing it.
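
The simplification rests on the fact that clearing a bit that is already
clear changes nothing. A small sketch, using a hypothetical flag value
rather than the flag touched by this commit:

```c
#include <assert.h>

#define SK_FLAG 0x4u	/* hypothetical flag bit, for illustration only */

/* Old shape: test the flag, then clear it. */
static unsigned
clear_with_test(unsigned flags)
{
	if ((flags & SK_FLAG) != 0)
		flags &= ~SK_FLAG;
	return (flags);
}

/* New shape: clear unconditionally. */
static unsigned
clear_directly(unsigned flags)
{
	flags &= ~SK_FLAG;
	return (flags);
}

/* Self-check: both shapes agree for every 4-bit flag word. */
static void
check_same_result(void)
{
	unsigned f;

	for (f = 0; f < 16; f++)
		assert(clear_with_test(f) == clear_directly(f));
}
```
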
Submitted by:	ed
MFC after:	12 days
2015-06-29 13:06:24 +00:00
kib
9f65a2d8d9 Handle errors from background write of the cylinder group blocks.
First, on the write error, bufdone() call from ffs_backgroundwrite()
panics because pbrelvp() cleared bp->b_bufobj, while brelse() would
try to re-dirty the copy of the cg buffer.  Handle this by setting
B_INVAL for the case of BIO_ERROR.

Second, we must re-dirty the real buffer containing the cylinder group
block data when background write failed.  Real cg buffer was already
marked clean in ffs_bufwrite(). After the BV_BKGRDINPROG flag is
cleared on the real cg buffer in ffs_backgroundwrite(), buffer scan
may reuse the buffer at any moment. The result is lost write, and if
the write error was only transient, we get corrupted bitmaps.

We cannot re-dirty the original cg buffer in the
ffs_backgroundwritedone(), since the context is not sleepable,
preventing us from sleeping for origbp's lock.  Add BV_BKGDERR flag
(protected by the buffer object lock), which is converted into delayed
write by brelse(), bqrelse() and buffer scan.

In collaboration with:	Conrad Meyer <cse.cem@gmail.com>
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation (kib),
	  EMC/Isilon storage division (Conrad)
MFC after:	2 weeks
2015-06-27 09:44:14 +00:00
kib
becc575eec vfs_msync(), called from syncer vnode fsync VOP, only iterates over
the active vnode list for the given mount point, with the assumption
that vnodes with dirty pages are active.  This is enforced by
vinactive() doing vm_object_page_clean() pass over the vnode pages.

The issue is, if vinactive() cannot be called during vput() due to the
vnode being only shared-locked, we might end up with the dirty pages
for the vnode on the free list.  Such vnode is invisible to syncer,
and pages are only cleaned on the vnode reactivation.  In other words,
the race results in the broken guarantee that user data, written
through the mmap(2), is written to the disk not later than in 30
seconds after the write.

Fix this by keeping the vnode which is freed but still owing
inactivation, on the active list.  When syncer loops find such vnode,
it is deactivated and cleaned by the final vput() call.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-06-17 04:46:58 +00:00
mjg
1a3e7a935e Replace struct filedesc argument in getvnode with struct thread
This is a step towards removal of spurious arguments.
2015-06-16 13:09:18 +00:00
kib
5ad0e0e2f3 Syncing a directory vnode might drop the vnode lock in the
softdep_sync() similarly to the regular vnode sync.  Allow retry for
both vnode types.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-06-03 20:48:00 +00:00
kib
63f2992877 Remove unused variable.
When deallocate_dependencies() is performed,
softdep_journal_freeblocks() already called cancel_allocdirect() which
should have eliminated direct dependencies for all truncated full
blocks.  The indirect dependencies are allowed above, since second-
and third-level dependencies are only dealt with by the code which
frees indirect block, which happens after the inode write.

Discussed with:	mckusick, jeff
Reviewed by:	jeff
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-05-31 15:50:54 +00:00
kib
e2f56205b5 Remove several write-only variables, all reported by the gcc 4.9
buildkernel run.

Some of them were write-only under some kernel options, e.g. variables
keeping values only used by CTR() macros.  It costs nothing to the
code readability and correctness to eliminate the warnings in those
cases too by removing the local cached values used only for
single-access.

Review:	https://reviews.freebsd.org/D2665
Reviewed by:	rodrigc
Looked at by:	bjk
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-05-29 13:24:17 +00:00
kib
b5e9f683a3 After r283600, NODELAY flag to inodedep_lookup() function is unused.
Eliminate it, and simplify code by removing the local dflags variable
always initialized to DEPALLOC.

Noted by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-05-27 09:49:04 +00:00
kib
ff588ae9b0 Currently, softupdate code detects overstepping on the workitems
limits in the code which is deep in the call stack, and owns several
critical system resources, like vnode locks.  Attempt to wait while
the per-mount softupdate thread cleans up the backlog may deadlock,
because the thread might need to lock the same vnode which is owned by
the waiting thread.

Instead of synchronously waiting for the worker, perform the worker's
tickle and pause until the backlog is cleaned, at the safe point
during return from kernel to usermode.  A new ast request to call
softdep_ast_cleanup() is created, the SU code now only checks the size
of queue and schedules ast.

There is no ast delivery for the kernel threads, so they are exempted
from the mechanism, except NFS daemon threads.  NFS server loop
explicitly checks for the request, and informs the schedule_cleanup()
that it is capable of handling the requests by the process P2_AST_SU
flag.  This is needed because nfsd may be the sole cause of the SU
workqueue overflow.  But, to not cause nfsd to spawn additional
threads just because we slow down existing workers, only tickle su
threads, without waiting for the backlog cleanup.

Reviewed by:	jhb, mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-05-27 09:20:42 +00:00
mckusick
d1dcdd191e Limit the number of cylinder groups that will be searched when
trying to build a cluster. The limit is tunable using the sysctl
vfs.ffs.maxclustersearch. The current limit is 10 cylinder groups
per block allocation. It was previously limited to the number of
cylinder groups in the filesystem per block allocation. When there
were no clusters of the needed size left, it repeatedly searched
the whole filesystem for a non-existent cluster on every block
allocation. The result was very slow filesystem allocation with
100% CPU utilization. The old behavior can be had by setting
vfs.ffs.maxclustersearch to a huge number (1,000,000).

This change affects only the layout policy routines so is not able
to interfere with the integrity of the filesystem.
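
The effect of the limit can be sketched as a bounded scan over cylinder
groups. This is a simplified model, not the allocator's code: a plain
linear scan stands in for the real search order, and the function name,
the has_cluster array and the parameters are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model: has_cluster[cg] says whether cylinder group cg
 * holds a free cluster of the needed size.  At most maxsearch groups
 * (and never more than ncg) are examined, starting at startcg.
 * Returns the group that was found, or -1 if the search gave up.
 */
static int
search_for_cluster(const bool *has_cluster, int ncg, int startcg,
    int maxsearch)
{
	int cg, i, limit;

	assert(ncg > 0 && startcg >= 0 && startcg < ncg);
	limit = maxsearch < ncg ? maxsearch : ncg;
	for (i = 0; i < limit; i++) {
		cg = (startcg + i) % ncg;
		if (has_cluster[cg])
			return (cg);
	}
	return (-1);
}
```

With a limit of 10, a search that starts far from the only suitable
group gives up early instead of walking the whole filesystem. Passing a
huge limit falls back to examining every group, which matches the old
behavior described above.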

Reported by: Dmitry Sivachenko (demon@)
Tested by:   Dmitry Sivachenko (demon@)
MFC after:   2 weeks
2015-04-24 23:27:50 +00:00
rmacklem
ad77d0b1c1 File systems that do not use the buffer cache (such as ZFS) must
use VOP_FSYNC() to perform the NFS server's Commit operation.
This patch adds a mnt_kern_flag called MNTK_USES_BCACHE which
is set by file systems that use the buffer cache. If this flag
is not set, the NFS server always does a VOP_FSYNC().
This should be ok for old file system modules that do not set
MNTK_USES_BCACHE, since calling VOP_FSYNC() is correct, although
it might not be optimal for file systems that use the buffer cache.
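
The decision described here can be sketched as a single flag test. The
bit value and function names below are hypothetical; only the flag name
MNTK_USES_BCACHE comes from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit value; the kernel defines the real one elsewhere. */
#define SK_MNTK_USES_BCACHE 0x1u

/*
 * Returns true when the NFS server's Commit must go through
 * VOP_FSYNC().  A mount that never set the flag, such as an old file
 * system module, is treated as not using the buffer cache.
 */
static bool
commit_requires_fsync(unsigned mnt_kern_flag)
{
	return ((mnt_kern_flag & SK_MNTK_USES_BCACHE) == 0);
}

/* Self-check covering both cases. */
static void
check_commit_policy(void)
{
	assert(commit_requires_fsync(0));
	assert(!commit_requires_fsync(SK_MNTK_USES_BCACHE));
}
```
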

Reviewed by:	kib
MFC after:	2 weeks
2015-04-15 20:16:31 +00:00
kib
92897c6ace Fix build (with gcc).
Reported by:	bz, ian
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-03-27 15:49:21 +00:00
kib
871e29c970 Fix the hang after the immediate reboot when the following command
sequence is performed on UFS SU+J rootfs:
cp -Rp /sbin/init /sbin/init.old
mv -f /sbin/init.old /sbin/init

Hang occurs on the rootfs unmount.  There are two issues:

1. Removed init binary, which is still mapped, creates a reference to
the removed vnode. The inodeblock for such vnode must have active
inodedep, which is (eventually) linked through the unlinked list. This
means that ffs_sync(MNT_SUSPEND) cannot succeed, because number of
softdep workitems for the mp is always > 0.  FFS is suspended during
unmount, so unmount just hangs.

2. As noted above, the inodedep is linked eventually.  It is not
linked until the superblock is written.  But at the vfs_unmountall()
time, when the rootfs is unmounted, the call is made to
ffs_unmount()->ffs_sync() before vflush(), and ffs_sync() only calls
ffs_sbupdate() after all workitems are flushed.  It is masked for
normal system operations, because syncer works in parallel and
eventually flushes superblock.  Syncer is stopped when rootfs
unmounted, so ffs_sync() must do sb update on its own.

Correct the issues listed above. For MNT_SUSPEND, count the number of
linked unlinked inodedeps (this is not a typo) and subtract the count
of such workitems from the total. For the second issue, the
ffs_sbupdate() is called right after device sync in ffs_sync() loop.

There is a third problem, occurring with both SU and SU+J. The
softdep_waitidle() loop, which waits for softdep_flush() thread to
clear the worklist, only waits 20ms max. It seems that the 1 tick,
specified for msleep(9), was a typo.

Add fsync(devvp, MNT_WAIT) call to softdep_waitidle(), which seems to
significantly help the softdep thread, and change the MNT_LAZY update
at the reboot time to MNT_WAIT for similar reasons.  Note that
userspace cannot create more work while devvp is flushed, since the
mount point is always suspended before the call to softdep_waitidle()
in unmount or remount path.

PR:	195458
In collaboration with:	gjb, pho
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-03-27 13:55:56 +00:00
kib
37c9c38900 Partially revert r277922, avoid sleeping and do flush if we are awakened,
instead of waiting for the FLUSH_* flags.  Also, when requesting
flush, do the wakeups unconditionally even when FLUSH_CLEANUP flag was
already set.

Reported and tested by:	dim,
	"Lundberg, Johannes" <johannes@brilliantservice.co.jp>
Bisected by:	dim
MFC after:	2 weeks
2015-02-05 13:00:27 +00:00
kib
83723416a6 When mounting SU-enabled mount point, wait until the softdep_flush()
thread started and incremented the stat_flush_threads [1].

Unconditionally wakeup softdep_flush threads when needed, do not try
to check wchan, which is racy and breaks abstraction.

Reported by and discussed with:	glebius, neel
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-01-30 11:41:46 +00:00
kib
da0490b2e8 The sys_quotactl() contract demands that the mount point is
vfs_unbusy()ed when the cmd is Q_QUOTAON, regardless of other input
parameters or error return.

Submitted by:	Conrad Meyer
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:  https://reviews.freebsd.org/D1684
Tested by:	pho
MFC after:	1 week
2015-01-27 10:32:49 +00:00
kib
4e541c8756 Handle MAKEENTRY cnp flag in the VOP_CREATE(). Curiously, some
fs, e.g. smbfs, already did it.

Tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-12-21 13:29:33 +00:00
kib
77c9d3f4e8 The VOP_LOOKUP() implementations for CREATE op do not put the name
into namecache, to avoid cache thrashing when doing large operations.
E.g., tar archive extraction is not usually followed by access to many
of the files created.

Right now, each VOP_LOOKUP() implementation explicitly knows about
this quirk and tests for both MAKEENTRY flag presence and op != CREATE
to make the call to cache_enter().  Centralize the handling of the
quirk into VFS, by deciding to cache only by MAKEENTRY flag in VOP.
VFS now sets NOCACHE flag for CREATE namei() calls.

Note that the change in semantic is backward-compatible and could be
merged to the stable branch, and is compatible with non-changed
third-party filesystems which correctly handle MAKEENTRY.
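
The before and after shapes of the caching test can be sketched as
follows. This is a user-space model with hypothetical names and values.
It assumes that the VFS-level NOCACHE request for CREATE ends up
clearing MAKEENTRY before the filesystem sees the flags, which the
message implies but does not spell out.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative values only; the kernel's definitions differ. */
enum sk_nameiop { SK_LOOKUP, SK_CREATE, SK_DELETE, SK_RENAME };
#define SK_MAKEENTRY 0x1u

/* Old per-filesystem test: both the flag and the operation are checked. */
static bool
old_should_cache(unsigned flags, enum sk_nameiop op)
{
	return ((flags & SK_MAKEENTRY) != 0 && op != SK_CREATE);
}

/* VFS side after the change: drop the caching request for CREATE. */
static unsigned
vfs_prepare_flags(unsigned flags, enum sk_nameiop op)
{
	if (op == SK_CREATE)
		flags &= ~SK_MAKEENTRY;
	return (flags);
}

/* New per-filesystem test: the flag alone decides. */
static bool
new_should_cache(unsigned flags)
{
	return ((flags & SK_MAKEENTRY) != 0);
}

/* Self-check: for every operation the two schemes reach the same answer. */
static void
check_equivalence(void)
{
	int op;

	for (op = SK_LOOKUP; op <= SK_RENAME; op++)
		assert(new_should_cache(vfs_prepare_flags(SK_MAKEENTRY, op)) ==
		    old_should_cache(SK_MAKEENTRY, op));
}
```

A filesystem that still performs the old two-part test keeps working
under this model, since the extra check on the operation becomes
redundant rather than wrong.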

Suggested by:	Chris Torek <torek@pi-coral.com>
Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-12-18 10:01:12 +00:00
gleb
5c99f46b3b Adjust printf format specifiers for dev_t and ino_t in kernel.
ino_t and dev_t are about to become uint64_t.

Reviewed by:	kib, mckusick
2014-12-17 07:27:19 +00:00
glebius
b4ef8e602d Merge from projects/sendfile:
o Provide a new VOP_GETPAGES_ASYNC(), which works like VOP_GETPAGES(), but
  doesn't sleep. It returns immediately, and will execute the I/O done handler
  function that must be supplied as argument.
o Provide VOP_GETPAGES_ASYNC() for the FFS, which uses vnode_pager.
o Extend pagertab to support pgo_getpages_async method, and implement this
  method for vnode_pager.

Reviewed by:	kib
Tested by:	pho
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-11-23 12:01:52 +00:00
glebius
fb402568c4 buf.h is not needed here, and pollutes when ufsmount.h is included
from userland code.

Sponsored by:	Nginx, Inc.
2014-11-23 01:02:19 +00:00
glebius
69a603c777 Include required files directly instead of pollution via ufs/ufsmount.h.
Sponsored by:	Nginx, Inc.
2014-11-23 01:01:14 +00:00
davide
64ef011694 Use the correct variable name. 2014-11-22 00:42:30 +00:00
davide
12cc45e8da Make ufs_dirhashreclaimperc a percentage for real and
rename it to ufs_dirhashreclaimpercent, as suggested
by jhb@. As an added bonus this avoids divide-by-zero
errors.

Requested by:	jhb, markj
Reviewed by:	jhb, markj
2014-11-22 00:37:37 +00:00
kib
a6f1fd882b When non-forced unmount or remount rw->ro is performed, writes on UFS
are not suspended.  In particular, on the SU-enabled volumes, there is
no reason why, between the call to softdep_flushfiles() and
softdep_waitidle(), SU work items cannot be queued.

Correct the condition to trigger the panic by only checking when
forced operation is done.  Convert direct panic() call into KASSERT(),
there is no invalid on-disk data structures directly involved, so
follow the usual debugging vs. non-debugging approach.

Reported and tested by:	pho
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-11-02 13:14:55 +00:00
mjg
12e0034dd0 Provide vfs suspension support only for filesystems which need it, take
two.

nullfs and unionfs need to request suspension if underlying filesystem(s)
use it. Utilize mnt_kern_flag for this purpose.

This is a fixup for 273271.

No strong objections from: kib
Pointy hat to: mjg
MFC after:	2 weeks
2014-10-20 18:00:50 +00:00
mjg
cdeccc2a52 Use lockless quota checks in qsync and qsyncvp.
No strong objections from: kib, mckusick
MFC after:	1 week
2014-10-16 12:41:14 +00:00
kib
1b90da9ab8 Do not set IN_ACCESS flag for read-only mounts. The IN_ACCESS
survives remount in rw, also it is set for vnodes on rootfs before
noatime can be set or clock is adjusted.  All conditions result in
wrong atime for accessed vnodes.

Submitted by:	bde
MFC after:	1 week
2014-10-11 19:09:56 +00:00
imp
a2b4dd0675 Restore the backed-out change, using __offsetof instead. 2014-10-10 00:35:08 +00:00
bapt
05bd7a92d7 Backout r272825: every userland usage of ufs/ufs/dir.h is now broken with that change 2014-10-09 17:26:29 +00:00
bapt
c47600c4ec Use offsetof() from sys/types.h instead of a custom one
This fixes build with recent gcc versions
2014-10-09 15:26:22 +00:00
kib
ebd8a253bb Provide the unique implementation for the VOP_GETPAGES() method used
by ffs and ext2fs.  Remove duplicated call to vm_page_zero_invalid(),
done by VOP and by vm_pager_getpages().  Use vm_pager_free_nonreq().

Reviewed by:	alc (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	6 weeks (after r271596)
2014-09-15 12:28:29 +00:00
alc
e169525bb8 We don't need an exclusive object lock on the expected execution path
through {ext2,ffs}_getpages().

Reviewed by:	kib, pfg
MFC after:	6 weeks
Sponsored by:	EMC / Isilon Storage Division
2014-09-13 18:26:13 +00:00
kib
ece271f9c9 Direct access to the quota files, in particular, lookup, causes lock
conflict with the quota metadata access.  Mark quota vnode lock as
recursive and always exclusive to avoid the problem.

Reported by:	hrs
Tested by:	hrs, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-08-29 09:04:24 +00:00
davide
a8c92b02a8 Rather than using a hardcoded reclaim age, rely on an LRU-like approach
for dirhash cache, setting a target percent to reclaim (exposed via
SYSCTL). This always allows some amount of progress while keeping the maximum
reclaim age dynamic.

Tested by:      pho
Reviewed by:	jhb
2014-08-25 17:06:18 +00:00
kib
c3baa63a48 Do not busy the UFS mount point inside VOP_RENAME(). The
kern_renameat() already starts write on the mp, which prevents
parallel unmount from proceeding.  Busying mp after vn_start_write()
deadlocks the unmount.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-08-20 08:15:23 +00:00
kib
37cbc4993d Correct the test for condition to suspend UFS filesystem during
unmount.  There is no need to suspend read-only filesystem, while we
need suspension on modifiable mount point.

Reported by:	rwatson
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-08-20 08:13:03 +00:00
kib
c721b1c6ff Revision r269457 removed the Giant around mount and unmount code, but
r269533, which was tested before r269457 was committed, implicitly
relied on the Giant to protect the manipulations of the softdepmounts
list.  Use softdep global lock consistently to guarantee the list
structure now.

Insert the new struct mount_softdeps into the softdepmounts only after
it is sufficiently initialized, to prevent softdep_speedup() from
accessing bare memory.  Similarly, remove struct mount_softdeps for
the unmounted filesystem from the tailq before destroying structure
rwlock.

Reported and tested by:	pho
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-08-12 09:33:00 +00:00
mckusick
8565976fe4 The SUJ journal is only prepared to handle full-size block numbers, so we
have to adjust freeblk records to reflect the change to a full-size block.
For example, suppose we have a block made up of fragments 8-15 and
want to free its last two fragments. We are given a request that says:
    FREEBLK ino=5, blkno=14, lbn=0, frags=2, oldfrags=0
where frags are the number of fragments to free and oldfrags are the
number of fragments to keep. To block align it, we have to change it to
have a valid full-size blkno, so it becomes:
    FREEBLK ino=5, blkno=8, lbn=0, frags=2, oldfrags=6
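
The adjustment in the example above amounts to rounding blkno down to
the start of its block and counting the skipped fragments as kept. A
minimal sketch, with a hypothetical record struct and function name, and
assuming 8 fragments per block as in the example:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a FREEBLK record; not the kernel's struct. */
struct freeblk_rec {
	int64_t	blkno;		/* first fragment covered by the record */
	int	frags;		/* number of fragments to free */
	int	oldfrags;	/* number of fragments to keep */
};

/*
 * Round blkno down to a full-size block boundary.  The fragments
 * between the boundary and the old blkno are kept, so they are added
 * to oldfrags.  frags is left unchanged.
 */
static void
freeblk_block_align(struct freeblk_rec *rec, int frags_per_block)
{
	int offset;

	assert(frags_per_block > 0 && rec->blkno >= 0);
	offset = (int)(rec->blkno % frags_per_block);
	rec->blkno -= offset;
	rec->oldfrags += offset;
}
```

Applied to blkno=14, frags=2, oldfrags=0 with 8 fragments per block, the
offset is 6, which gives blkno=8, frags=2, oldfrags=6 as in the message.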

Submitted by: Mikihito Takehara
Tested by:    Mikihito Takehara
Reviewed by:  Jeff Roberson
MFC after:    1 week
2014-08-07 16:53:07 +00:00
mckusick
52fea1f7b2 Add support for multi-threading of soft updates.
Replace a single soft updates thread with a thread per FFS-filesystem
mount point. The threads are associated with the bufdaemon process.

Reviewed by:  kib
Tested by:    Peter Holm and Scott Long
MFC after:    2 weeks
Sponsored by: Netflix
2014-08-04 22:03:58 +00:00
imp
8570c379e9 Simplify comment to remove multiple negative and passive voice. 2014-07-23 16:18:54 +00:00
kib
32a7383c85 Check for the cross-device cross-link attempt in the VFS, instead of
forcing filesystem VOP_LINK() methods to repeat the code.  In
tmpfs_link(), remove redundant check for the type of the source,
already done by VFS.

Note that NFS server already performs this check before calling
VOP_LINK().

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-07-16 14:04:46 +00:00
kib
a698488f0f Extract the code to put a filesystem into the suspended state (at the
unmount time) in the helper vfs_write_suspend_umnt().  Use it instead
of two inline copies in FFS.

Fix the bug in the FFS unmount, when suspension failed, the ufs
extattrs were not reinitialized.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-07-14 09:10:00 +00:00
kib
f5cc3af6c8 In msdosfs_setattr(), add a check for result of the utimes(2)
permissions test, forgotten in r164033.

Refactor the permission checks for utimes(2) into vnode helper
function vn_utimes_perm(9), and simplify its code comparing with the
UFS origin, by writing the call to VOP_ACCESSX only once.  Use the
helper for UFS(5), tmpfs(5), devfs(5) and msdosfs(5).

Reported by:	bde
Reviewed by:	bde, trasz
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-06-17 07:11:00 +00:00
kib
321a65896d Initialize the pbuf counter for directio using SYSINIT, instead of
using a direct hook called from kern_vfs_bio_buffer_alloc().
Mark ffs_rawread.c as requiring both ffs and directio options to be
compiled into the kernel.  Add ffs_rawread.c to the list of ufs.ko
module's sources.

In addition to removing the layering violation, this also allows the
kernel to link when FFS is configured as a module and DIRECTIO is
enabled.

One consequence of the change is that ffs_rawread.o is always linked
into the module regardless of the DIRECTIO option.  This is similar to
the option QUOTA and ufs_quota.c.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-06-08 10:55:06 +00:00
jmg
43af3dde11 don't check fs_flags for _FLAGS_UPDATED as it is stored in fs_old_flags..
If you had a UFS2 FS that didn't have its superblock at SBLOCK_UFS2,
you'll end up corrupting your FS as the superblock is updated and written
to a different location...

makefs used to put the superblock at SBLOCK_UFS1 for UFS 2 FS's causing
this issue...

Reviewed by:	silence from mckusick
MFC after:	1 week
2014-06-03 21:46:13 +00:00
scottl
66aaa46593 Due to reasons unknown at this time, the system can be forced to write
a journal block even when there are no journal entries to be written.
Until the root cause is found, handle this case by ensuring that a
valid journal segment is always written.

Second, the data buffer used for writing journal entries was never
being scrubbed of old data.  Fix this.

Submitted by:	Takehara Mikihito
Obtained from:	Netflix, Inc.
MFC after:	3 days
2014-05-06 20:40:16 +00:00
mckusick
de9a58cc91 Update comment to explain search order reverted to historical order
in -r254996.

Suggested by: Pedro Giffuni <pfg@FreeBSD.org>
MFC:          3 days
2014-03-22 11:26:39 +00:00
rwatson
33fdc14c0c Update kernel inclusions of capability.h to use capsicum.h instead; some
further refinement is required as some device drivers intended to be
portable over FreeBSD versions rely on __FreeBSD_version to decide whether
to include capability.h.

MFC after:	3 weeks
2014-03-16 10:55:57 +00:00
jeff
9ee698e16a - If we fail to do a non-blocking acquire of a buf lock while doing a
waiting sync pass we need to do a blocking acquire and restart.
   Another thread, typically the buf daemon, may have this buf locked and
   if we don't wait we can fail to sync the file.  This led to a great
   variety of softdep panics because we rely on all dependencies being
   flushed before proceeding in several cases.

Reported by:	pho
Discussed with:	mckusick
Sponsored by:	EMC / Isilon Storage Division
MFC after:	2 weeks
2014-03-06 00:13:21 +00:00
jeff
f1c05bd42b - Gracefully handle truncation failures when trying to shrink directories.
This could cause dirhash panics since the dirhash state would be
   successfully truncated while the directory was not.

Reported by:	pho
Discussed with:	mckusick
Sponsored by:	EMC / Isilon Storage Division
MFC after:	2 weeks
2014-03-06 00:10:07 +00:00
pfg
dc2809ceff ufs: small formatting fixes.
Cleanup some extra space.
Use of tabs vs. spaces.
No functional change.

MFC after:	3 days
Reviewed by:	mckusick
2014-03-02 02:52:34 +00:00
mckusick
88a7200f04 Fine tune filesystem block allocations under low free-space
conditions (-r254995) based on further operational experience.

Submitted by:  Dmitry Sivachenko
Fix Tested by: Dmitry Sivachenko
MFC after: 2 weeks
2013-12-30 17:04:24 +00:00
mckusick
34936e35bb Properly handle unsigned comparison.
MFC after: 2 weeks
2013-12-30 06:19:42 +00:00
mckusick
ec81cfc872 We needlessly panic when trying to flush MKDIR_PARENT dependencies.
We had previously tried to flush all MKDIR_PARENT dependencies (and
all the NEWBLOCK pagedeps) by calling ffs_update(). However this will
only resolve these dependencies in direct blocks. So very large
directories with MKDIR_PARENT dependencies in indirect blocks had
not yet gotten flushed. As the directory is in the midst of doing a
complete sync, we simply defer the checking of the MKDIR_PARENT
dependencies until the indirect blocks have been sync'ed.

Reported by: Shawn Wallbridge of imaginaryforces.com
Tested by:   John-Mark Gurney <jmg@funkthat.com>
PR:          183424
MFC after:   2 weeks
2013-12-01 07:34:21 +00:00
jmg
7383d16d60 fix white space...
MFC after:	1 week
2013-11-20 21:21:29 +00:00
jmg
231f0a43ec fix a use after free, jsegdep_merge will free wk, avoid the next check...
CID:		1006098
Sponsored by:	Imaginary Forces
Reviewed by:	mckusick
MFC after:	1 week
2013-11-20 21:16:53 +00:00
pfg
9b0e32e06b UFS2: make di_extsize unsigned.
di_extsize is the EA size and as such it should be unsigned.
Adjust related types for consistency.

Reviewed by:	mckusick (previous version)
MFC after:	3 weeks
2013-10-24 00:33:29 +00:00
brooks
cbedeeb267 Allow kernels without options SOFTUPDATES to build. This should fix the
embedded tinderboxes.

Reviewed by:	emaste
2013-10-21 20:51:08 +00:00
mckusick
9f0cd2b759 Fix build problem on ARM (which defaults to building without soft updates).
Reported by:  Tinderbox
Sponsored by: Netflix
2013-10-21 13:09:09 +00:00
mckusick
62bcc54df0 Restructuring of the soft updates code to set it up so that the
single kernel-wide soft update lock can be replaced with a
per-filesystem soft-updates lock. This per-filesystem lock will
allow each filesystem to have its own soft-updates flushing thread
rather than being limited to a single soft-updates flushing thread
for the entire kernel.

Move soft update variables out of the ufsmount structure and into
their own mount_softdeps structure referenced by ufsmount field
um_softdep.  Eventually the per-filesystem lock will be in this
structure. For now there is simply a pointer to the kernel-wide
soft updates lock.

Change all instances of ACQUIRE_LOCK and FREE_LOCK to pass the lock
pointer in the mount_softdeps structure instead of a pointer to the
kernel-wide soft-updates lock.
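
A rough userland sketch of this indirection follows. It assumes POSIX mutexes instead of kernel locks; only the `um_softdep` field name comes from the text above, and every `sk_`/`SK_` name is hypothetical.

```c
#include <assert.h>
#include <pthread.h>

/* The single lock that every filesystem still shares for now. */
static pthread_mutex_t sk_global_lock = PTHREAD_MUTEX_INITIALIZER;

/* Per-filesystem soft-updates state, holding a pointer to its lock. */
struct sk_mount_softdeps {
	pthread_mutex_t	*sd_lock;	/* currently &sk_global_lock */
	int		 sd_ndeps;	/* illustrative dependency count */
};

struct sk_ufsmount {
	struct sk_mount_softdeps *um_softdep;
};

/* Lock macros take the mount, not a global, so callers never change. */
#define	SK_LOCK_PTR(ump)	((ump)->um_softdep->sd_lock)
#define	SK_ACQUIRE_LOCK(ump)	pthread_mutex_lock(SK_LOCK_PTR(ump))
#define	SK_FREE_LOCK(ump)	pthread_mutex_unlock(SK_LOCK_PTR(ump))

static void
sk_add_dep(struct sk_ufsmount *ump)
{
	SK_ACQUIRE_LOCK(ump);
	ump->um_softdep->sd_ndeps++;
	SK_FREE_LOCK(ump);
}
```

With this shape, moving to a per-filesystem lock would only require pointing `sd_lock` at a lock owned by each mount; the call sites stay the same.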

Replace the five hash tables used by soft updates with per-filesystem
copies of these tables allocated in the mount_softdeps structure.

Several functions that flush dependencies when too many are allocated
in the kernel used to operate across all filesystems. They are now
parameterized to flush dependencies from a specified filesystem.
For now, we stick with the round-robin flushing strategy when the
kernel as a whole has too many dependencies allocated.

While there are many lines of changes, there should be no functional
change in the operation of soft updates.

Tested by:    Peter Holm and Scott Long
Sponsored by: Netflix
2013-10-21 00:28:02 +00:00
mckusick
bd0c312c4a Fourth of several cleanups to soft dependency implementation.
Add KASSERTS that soft dependency functions only get called
for filesystems running with soft dependencies. Calling these
functions when soft updates are not compiled into the system
becomes a panic.

No functional change.

Tested by:    Peter Holm and Scott Long
Sponsored by: Netflix
2013-10-20 22:21:01 +00:00
mckusick
54a566ac25 Third of several cleanups to soft dependency implementation.
Ensure that softdep_unmount() and softdep_setup_sbupdate()
only get called for filesystems running with soft dependencies.

No functional change.

Tested by:    Peter Holm and Scott Long
Sponsored by: Netflix
2013-10-20 21:11:40 +00:00
mckusick
25cf7a1f19 Second of several cleanups to soft dependency implementation.
Delete two unused functions in ffs_softdep.c.

No functional change.

Tested by:    Peter Holm and Scott Long
Sponsored by: Netflix
2013-10-20 20:52:07 +00:00
mckusick
297f76c244 First of several cleanups to soft dependency implementation.
Convert three functions exported from ffs_softdep.c to static
functions as they are not used outside of ffs_softdep.c.

No functional change.

Tested by:    Peter Holm and Scott Long
Sponsored by: Netflix
2013-10-20 20:41:38 +00:00
pfg
7a70d69a08 Make di_blocks unsigned in UFS1 as is the case already for UFS2.
Most of the code between UFS1 and UFS2 is shared so this change
is pretty safe. Not only does this make UFS1 and UFS2 consistent, but it
also matches what NetBSD and Mac OS X have had for some years now.

Reviewed by:	mckusick
MFC after:	1 month
2013-10-14 18:17:09 +00:00
pjd
029a6f5d92 Change the cap_rights_t type from uint64_t to a structure that we can extend
in the future in a backward compatible (API and ABI) way.

The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides place for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.

The structure definition looks like this:

	struct cap_rights {
		uint64_t	cr_rights[CAP_RIGHTS_VERSION + 2];
	};

The initial CAP_RIGHTS_VERSION is 0.

The top two bits in the first element of the cr_rights[] array contain the
total number of elements in the array minus 2. This means if those two bits
are equal to 0, we have 2 array elements.

The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain the array index. Only one bit
is used and the bit position in this five-bit range defines the array index.
This means there can be at most five array elements in the future.
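
The bit layout described above can be sketched as follows. The `SK_` names are hypothetical, and the shift of 57 is derived from the description (64 bits minus 2 size bits minus 5 index bits) rather than quoted from the header; it leaves 57 usable bits per element, which agrees with the 114 and 285 figures given earlier.

```c
#include <assert.h>
#include <stdint.h>

/* 64 bits - 2 size bits - 5 index bits = 57 bits left for rights. */
#define	SK_INDEX_SHIFT		57
/* Build a right: one-hot index bit plus the right's own bit. */
#define	SK_RIGHT(idx, bit)	((1ULL << (SK_INDEX_SHIFT + (idx))) | (bit))
/* Top two bits: number of array elements minus 2. */
#define	SK_SIZE_FIELD(r)	((int)((uint64_t)(r) >> 62))
/* Next five bits: one-hot array index. */
#define	SK_INDEX_FIELD(r)	((int)(((uint64_t)(r) >> SK_INDEX_SHIFT) & 0x1f))

/* Convert the one-hot index field to an array index, or -1 if invalid. */
static int
sk_right_to_index(uint64_t right)
{
	int field = SK_INDEX_FIELD(right);
	int idx;

	for (idx = 0; idx < 5; idx++)
		if (field == (1 << idx))
			return (idx);
	return (-1);
}
```

For instance, `SK_RIGHT(1, 0x0000000000000800ULL)` sets bit 58 as the index bit, so `sk_right_to_index()` maps it back to element 1.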

To define a new right the CAPRIGHT() macro must be used. The macro takes two
arguments - an array index and a bit to set, eg.

	#define	CAP_PDKILL	CAPRIGHT(1, 0x0000000000000800ULL)

We still support aliases that combine a few rights, but the rights have to
belong to the same array element, eg:

	#define	CAP_LOOKUP	CAPRIGHT(0, 0x0000000000000400ULL)
	#define	CAP_FCHMOD	CAPRIGHT(0, 0x0000000000002000ULL)

	#define	CAP_FCHMODAT	(CAP_FCHMOD | CAP_LOOKUP)

There is a new API to manage the new cap_rights_t structure:

	cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
	void cap_rights_set(cap_rights_t *rights, ...);
	void cap_rights_clear(cap_rights_t *rights, ...);
	bool cap_rights_is_set(const cap_rights_t *rights, ...);

	bool cap_rights_is_valid(const cap_rights_t *rights);
	void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
	void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
	bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);

Capability rights to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions are provided by
separating them with commas, eg:

	cap_rights_t rights;

	cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);

There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:

	#define	cap_rights_set(rights, ...)				\
		__cap_rights_set((rights), __VA_ARGS__, 0ULL)
	void __cap_rights_set(cap_rights_t *rights, ...);
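
The same termination trick can be shown in a standalone form. This is a minimal sketch with hypothetical names (`sk_set`, `sk_set_impl`), using a plain 64-bit accumulator instead of cap_rights_t; it is not the Capsicum implementation.

```c
#include <assert.h>
#include <stdarg.h>

/*
 * OR a list of values into *acc.  The list must end with 0ULL and
 * every argument must really be an unsigned long long, because
 * va_arg() cannot check types.
 */
static void
sk_set_impl(unsigned long long *acc, ...)
{
	va_list ap;
	unsigned long long v;

	va_start(ap, acc);
	while ((v = va_arg(ap, unsigned long long)) != 0)
		*acc |= v;
	va_end(ap);
}

/* The wrapper macro appends the terminator so callers never have to. */
#define	sk_set(acc, ...)	sk_set_impl((acc), __VA_ARGS__, 0ULL)
```

A call such as `sk_set(&acc, 0x1ULL, 0x4ULL)` expands to `sk_set_impl((&acc), 0x1ULL, 0x4ULL, 0ULL)`. One consequence of this design is that 0 itself can never be passed as a value, since it would end the list early.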

Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:

	cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);
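
One way such a check could work is to test that the five-bit index field of the combined value has exactly one bit set. The sketch below assumes the index field starts at bit 57, as implied by the layout described earlier; the `SK_` and `sk_` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define	SK_INDEX_SHIFT		57
#define	SK_RIGHT(idx, bit)	((1ULL << (SK_INDEX_SHIFT + (idx))) | (bit))

/*
 * Return true if all rights OR-ed into `right` belong to one array
 * element, i.e. the one-hot index field has exactly one bit set.
 */
static bool
sk_single_element(uint64_t right)
{
	uint64_t field = (right >> SK_INDEX_SHIFT) & 0x1f;

	/* A power of two has exactly one bit set. */
	return (field != 0 && (field & (field - 1)) == 0);
}
```

OR-ing a right from element 0 with a right from element 1 sets two index bits, so the check fails, while two rights from the same element share a single index bit and pass.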

Providing several rights that belong to the same array element this way is
correct, but is not advised. It should only be used for alias definitions.

This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.

Sponsored by:	The FreeBSD Foundation
2013-09-05 00:09:56 +00:00
mckusick
2cba188246 In looking at block layouts as part of fixing filesystem block
allocations under low free-space conditions (-r254995), determined
that the old block-preference search order used before -r249782 worked
a bit better. This change reverts to that block-preference search order.

MFC after:	2 weeks
2013-08-28 17:46:32 +00:00
mckusick
2ecfc28415 A performance problem was reported in PR kern/181226:

    I have 25TB Dell PERC 6 RAID5 array. When it becomes almost
    full (10-20GB free), processes which write data to it start
    eating 100% CPU and write speed drops below 1MB/sec (normally
    it gives 400MB/sec). The revision at which it first became
    apparent was http://svnweb.freebsd.org/changeset/base/249782.

The offending change reserved an area in each cylinder group to
store metadata. The new algorithm attempts to save this area for
metadata and allows its use for non-metadata only after all the
data areas have been exhausted. The size of the reserved area
defaults to half of minfree, so the filesystem reports full before
the data area can completely fill. However, in this report, the
filesystem has had minfree reduced to 1% thus forcing the metadata
area to be used for data. As the filesystem approached full, it
had only metadata areas left to allocate. The result was that
every block allocation had to scan summary data for 30,000 cylinder
groups before falling back to searching up to 30,000 metadata areas.

The fix is to give up on saving the metadata areas once the free
space reserve drops below 2%. The effect of this change is to use
the old algorithm of just accepting the first available block that
we find. Since most filesystems use the default 5% minfree, this
will have no effect on their operation. For those that want to push
to the limit, they will get their crappy block placements quickly.
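
The policy can be illustrated with a toy allocator over a single free-block map. This is a sketch under stated assumptions, not the ffs_alloc.c code: it assumes the "free space reserve" is the configured minfree percentage, places the metadata area at the start of the map for simplicity, and uses hypothetical names throughout.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Pick a free block for file data from a map of nblocks entries whose
 * first meta_blocks entries form the reserved metadata area.
 * Returns the block index, or -1 if nothing is free.
 */
static int
sk_alloc_data_block(const bool *is_free, int nblocks, int meta_blocks,
    int minfree_pct)
{
	int i;

	if (minfree_pct < 2) {
		/* Reserve too small: accept the first available block. */
		for (i = 0; i < nblocks; i++)
			if (is_free[i])
				return (i);
		return (-1);
	}
	/* Otherwise search the data area first... */
	for (i = meta_blocks; i < nblocks; i++)
		if (is_free[i])
			return (i);
	/* ...and fall back to the metadata area only when it is full. */
	for (i = 0; i < meta_blocks; i++)
		if (is_free[i])
			return (i);
	return (-1);
}
```

On a nearly full filesystem the two-pass search does a complete failed scan before each fallback, which mirrors the cost described above; the single-pass branch avoids it at the price of worse block placement.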

Submitted by:  Dmitry Sivachenko
Fix Tested by: Dmitry Sivachenko
PR:            kern/181226
MFC after:     2 weeks
2013-08-28 17:38:05 +00:00
ivoras
ad95bec3c8 Take a very small step toward the Century of the Anchovy by increasing the
time dirhash entries stay in memory before being considered for eviction to
1 minute.
2013-08-28 10:06:20 +00:00