Commit Graph

2287 Commits

Author SHA1 Message Date
Kirk McKusick
6967c09c69 Expand DDB's set of printable soft dependency data structures. The
set of known soft dependency data structures now includes: sd_worklist,
sd_inodedep, sd_allocdirect, sd_allocindir, and sd_mkdir. DDB can
also print lists of sd_allinodedeps, sd_mkdir_list, and sd_workhead.
The sd_workhead script is useful for listing all the dependencies
associated with a buffer, e.g. bp->b_dep.

Prefix the soft dependency show names with sd_ so that they sort
together when listed by DDB's "show help" and to distinguish them
from other data structures printable by DDB.

Sponsored by: Netflix
2019-01-26 05:35:24 +00:00
Gleb Smirnoff
756a541279 Allocate pager bufs from UMA instead of 80-ish mutex protected linked list.
o In vm_pager_bufferinit() create pbuf_zone and start accounting on how many
  pbufs are we going to have set.
  In various subsystems that are going to utilize pbufs create private zones
  via call to pbuf_zsecond_create(). The latter calls uma_zsecond_create(),
  and sets a limit on created zone. After startup preallocate pbufs according
  to requirements of all pbuf zones.

  Subsystems that used to have a private limit with old allocator now have
  private pbuf zones: md(4), fusefs, NFS client, smbfs, VFS cluster, FFS,
  swap, vnode pager.

  The following subsystems use shared pbuf zone: cam(4), nvme(4), physio(9),
  aio(4). They should have their private limits, but changing that is out of
  scope of this commit.

o Fetch tunable value of kern.nswbuf from init_param2() and while here move
  NSWBUF_MIN to opt_param.h and eliminate opt_swap.h, that was holding only
  this option.
  Default values aren't touched by this commit, but they probably should be
  reviewed wrt to modern hardware.

This change removes a tight bottleneck from sendfile(2) operation, that
uses pbufs in vnode pager. Other pagers also would benefit from faster
allocation.

Together with:	gallatin
Tested by:	pho
2019-01-15 01:02:16 +00:00
Kirk McKusick
751ae98144 Move ASSERT_VOP_LOCKED to top of ufs_vinit() as it should be true
when the function is entered.

Suggested by: kib
2018-12-30 06:03:20 +00:00
Kirk McKusick
1c521f70d4 For consistency with FFS2's fifoops2 and both versions of FFS's
vnodeops make FFS1's fifoops1 use ffs_lock. Also delete ffs_reallocblks
from fifoops1 which is needed only for fifoops2 because of its
support for extended attributes that need to allocate blocks.

Suggested by: kib
2018-12-30 05:03:41 +00:00
Kirk McKusick
c0029546f8 When loading an inode from disk, verify that its mode is valid.
If invalid, return EINVAL. Note that inode check-hashes greatly
reduce the chance that these errors will go undetected.

Reported by:  Christopher Krah <krah@protonmail.com>
Reported as:  FS-5-UFS-2: Denial Of Service in nmount-3 (ffs_read)
Reviewed by:  kib
MFC after:    1 week
Sponsored by: Netflix

M    sys/fs/ext2fs/ext2_vnops.c
M    sys/kern/vfs_subr.c
M    sys/ufs/ffs/ffs_snapshot.c
M    sys/ufs/ufs/ufs_vnops.c
2018-12-27 07:18:53 +00:00
Konstantin Belousov
8690d4dea3 Allocate v_object for the new snapshot vnode.
The vnode is not opened, so it ends up with the malloced buffers otherwise.

Reported and tested by:	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2018-12-23 18:54:09 +00:00
Kirk McKusick
c8f55fc4b4 Ensure that the inode check-hash is not left zeroed out in the case where
the check-hash fails. Prior to the fix in -r342133 the inode with the
zeroed out check-hash was written back to disk causing further confusion.

Reported by:  Gary Jennejohn (gj)
Sponsored by: Netflix
2018-12-15 18:49:30 +00:00
Kirk McKusick
72d28f97be Reorder ffs_verify_dinode_ckhash() so that it checks the inode check-hash
before copying in the inode so that the mode and link-count are not set
if the check-hash fails. This change ensures that the vnode will be properly
unwound and recycled rather than being held in the cache.

Initialize the file mode is zero so that if the loading of the inode
fails (for example because of a check-hash failure), the vnode will be
properly unwound and recycled.

Reported by:  Gary Jennejohn (gj)
Sponsored by: Netflix
2018-12-15 18:35:46 +00:00
Kirk McKusick
6fa9bc995a Must set ip->i_effnlink = ip->i_nlink to avoid a soft updates
"panic: softdep_update_inodeblock: bad link count" when releasing
a partially initialized vnode after an inode check-hash failure.

Reported by:  Gary Jennejohn <gljennjohn@gmail.com>
Reported by:  Peter Holm (pho)
Sponsored by: Netflix
2018-12-15 17:58:42 +00:00
Kirk McKusick
8f829a5cf0 Continuing efforts to provide hardening of FFS. This change adds a
check hash to the filesystem inodes. Access attempts to files
associated with an inode with an invalid check hash will fail with
EINVAL (Invalid argument). Access is reestablished after an fsck
is run to find and validate the inodes with invalid check-hashes.
This check avoids a class of filesystem panics related to corrupted
inodes. The hash is done using crc32c.

Note this check-hash is for the inode itself and not any of its
indirect blocks. Check-hash validation may be extended to also
cover indirect block pointers, but that will be a separate (and
more costly) feature.

Check hashes are added only to UFS2 and not to UFS1 as UFS1 is
primarily used in embedded systems with small memories and low-powered
processors which need as light-weight a filesystem as possible.

Reviewed by:  kib
Tested by:    Peter Holm
Sponsored by: Netflix
2018-12-11 22:14:37 +00:00
Mateusz Guzik
cc426dd319 Remove unused argument to priv_check_cred.
Patch mostly generated with cocinnelle:

@@
expression E1,E2;
@@

- priv_check_cred(E1,E2,0)
+ priv_check_cred(E1,E2)

Sponsored by:	The FreeBSD Foundation
2018-12-11 19:32:16 +00:00
Kirk McKusick
bdd6b77e1f If the vfs.ffs.dotrimcons sysctl option is enabled while a file
deletion is active, specifically after a call to ffs_blkrelease_start()
but before the call to ffs_blkrelease_finish(), ffs_blkrelease_start()
will have handed out SINGLETON_KEY rather than starting a collection
sequence. Thus if we get a SINGLETON_KEY passed to ffs_blkrelease_finish(),
we just return rather than trying to finish the nonexistent sequence.

Reported by:  Warner Losh (imp@)
Sponsored by: Netflix
2018-12-06 01:04:56 +00:00
Kirk McKusick
fb14e73cb4 Normally when an attempt is made to mount a UFS/FFS filesystem whose
superblock has a check-hash error, an error message noting the
superblock check-hash failure is printed and the mount fails. The
administrator then runs fsck to repair the filesystem and when
successful, the filesystem can once again be mounted.

This approach fails if the filesystem in question is a root filesystem
from which you are trying to boot. Here, the loader fails when trying
to access the filesystem to get the kernel to boot. So it is necessary
to allow the loader to ignore the superblock check-hash error and make
a best effort to read the kernel. The filesystem may be suffiently
corrupted that the read attempt fails, but there is no harm in trying
since the loader makes no attempt to write to the filesystem.

Once the kernel is loaded and starts to run, it attempts to mount its
root filesystem. Once again, failure means that it breaks to its prompt
to ask where to get its root filesystem. Unless you have an alternate
root filesystem, you are stuck.

Since the root filesystem is initially mounted read-only, it is
safe to make an attempt to mount the root filesystem with the failed
superblock check-hash. Thus, when asked to mount a root filesystem
with a failed superblock check-hash, the kernel prints a warning
message that the root filesystem superblock check-hash needs repair,
but notes that it is ignoring the error and proceeding. It does
mark the filesystem as needing an fsck which prevents it from being
enabled for writing until fsck has been run on it. The net effect
is that the reboot fails to single user, but at least at that point
the administrator has the tools at hand to fix the problem.

Reported by:    Rick Macklem (rmacklem@)
Discussed with: Warner Losh (imp@)
Sponsored by:   Netflix
2018-12-06 00:09:39 +00:00
Kirk McKusick
a02bd3e38c Move the check for the filesystem having been run on a kernel that
predates metadata check hashes so that it is done before deciding
whether to compute a check-hash of the superblock.

Reported by:  Rick Macklem <rmacklem@uoguelph.ca>
Sponsored by: Netflix
2018-11-26 00:58:07 +00:00
Kirk McKusick
ade67b509c Calculate updated superblock check-hash before writing it into the snapshot.
This corrects a bug that prevented snapshots from being mounted due to a
superblock check-hash failure.

Reported by:  Brennan Vincent <brennan@umanwizard.com>
Tested by:    Peter Holm (pho@)
Sponsored by: Netflix
2018-11-25 18:01:15 +00:00
Mark Johnston
6d2e2df764 Ensure that directory entry padding bytes are zeroed.
Directory entries must be padded to maintain alignment; in many
filesystems the padding was not initialized, resulting in stack
memory being copied out to userspace.  With the ino64 work there
are also some explicit pad fields in struct dirent.  Add a subroutine
to clear these bytes and use it in the in-tree filesystems.  The
NFS client is omitted for now as it was fixed separately in r340787.

Reported by:	Thomas Barabosch, Fraunhofer FKIE
Reviewed by:	kib
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
2018-11-23 22:24:59 +00:00
Konstantin Belousov
1c4ca77890 Add d_off support for multiple filesystems.
The d_off field has been added to the dirent structure recently.
Currently filesystems don't support this feature.  Support has been
added and tested for zfs, ufs, ext2fs, fdescfs, msdosfs and unionfs.
A stub implementation is available for cd9660, nandfs, udf and
pseudofs but hasn't been tested.

Motivation for this feature: our usecase is for a userspace nfs server
(nfs-ganesha) with zfs.  At the moment we cache direntry offsets by
calling lseek once per entry, with this patch we can get the offset
directly from getdirentries(2) calls which provides a significant
speedup.

Submitted by:	Jack Halford <jack@gandi.net>
Reviewed by:	mckusick, pfg, rmacklem (previous versions)
Sponsored by:	Gandi.net
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D17917
2018-11-14 14:18:35 +00:00
Kirk McKusick
9fc5d538fc In preparation for adding inode check-hashes, clean up and
document the libufs interface for fetching and storing inodes.
The undocumented getino / putino interface has been replaced
with a new getinode / putinode interface.

Convert the utilities that had been using the undocumented
interface to use the new documented interface.

No functional change (as for now the libufs library does not
do inode check-hashes).

Reviewed by:  kib
Tested by:    Peter Holm
Sponsored by: Netflix
2018-11-13 21:40:56 +00:00
Brooks Davis
1493c2ee62 Make vop_symlink take a const target path.
This will enable callers to take const paths as part of syscall
decleration improvements.

Where doing so is easy and non-distruptive carry the const through
implementations. In UFS the value is passed to an interface that must
take non-const values. In ZFS, const poisoning would touch code shared
with upstream and it's not worth adding diffs.

Bump __FreeBSD_version for external API consumers.

Reviewed by:	kib (prior version)
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D17805
2018-11-02 14:42:36 +00:00
Konstantin Belousov
4f77f48884 Implement O_BENEATH and AT_BENEATH.
Flags prevent open(2) and *at(2) vfs syscalls name lookup from
escaping the starting directory.  Supposedly the interface is similar
to the same proposed Linux flags.

Reviewed by:	jilles (code, previous version of manpages), 0mp (manpages)
Discussed with:	allanjude, emaste, jonathan
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D17547
2018-10-25 22:16:34 +00:00
Kirk McKusick
ec888383cf Continuing efforts to provide hardening of FFS, this change adds a
check hash to the superblock. If a check hash fails when an attempt
is made to mount a filesystem, the mount fails with EINVAL (Invalid
argument). This avoids a class of filesystem panics related to
corrupted superblocks. The hash is done using crc32c.

Check hases are added only to UFS2 and not to UFS1 as UFS1 is primarily
used in embedded systems with small memories and low-powered processors
which need as light-weight a filesystem as possible.

Reviewed by:  kib
Tested by:    Peter Holm
Sponsored by: Netflix
2018-10-23 21:10:06 +00:00
Konstantin Belousov
b7befdf509 Correct panic messages.
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
Approved by:	re (rgrimes)
MFC after:	1 week
2018-09-22 17:05:49 +00:00
Konstantin Belousov
bf94d6c78b Fix state of dquot-less vnodes after failed quotaoff.
UFS quotaoff iterates over all mp vnodes, and derefences and clears
the pointers to corresponding dquots. If SU work items transiently
reference some of dquots,quotaoff() would eventually fail, but all
processed vnodes are already stripped from dquots.  The state is
problematic, since quotas are left enabled, but there is no dquots
where blocks and inodes can be accounted.  The result is assertion
failures and NULL pointer dereferences.

Fix it by suspending writes around quotaoff() call.  Since the
filesystem is synced, no dandling references to dquots from SU
workitems can left behind, which means that quotaoff succeeds.

The complication there is that quotaoff VFS op is performed with the
mount point busied, while to suspend, we need to start write on the
mp.  If vn_start_write() is called on busied mp, system might deadlock
against parallel unmount request.  Handle this by unbusy-ing mp before
starting write, which in turn requires changing the quotaoff()
interface to return with the mount point not busied, same as was done
for quotaon().

Reviewed by:	mckusick
Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
Approved by:	re (gjb)
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D17208
2018-09-19 14:36:57 +00:00
Konstantin Belousov
a408841593 Do not upgrade the vnode lock to call getinoquota().
Doing so can deadlock when the thread already owns another vnode lock,
e.g. during a rename, as was demonstrated by the reporter.  In fact,
there seems to be no need to force the call to getinoquota() always,
because vn_open() locks vnode exclusively, and this is the most
important case.  To add to the point, directories where the dirent is
added or removed, are locked exclusively as well.

Reported by:	bwidawsk
Tested by:	bwidawsk, pho (as part of the larger patch)
Sponsored by:	The FreeBSD Foundation
Approved by:	re (gjb)
MFC after:	1 week
2018-09-17 19:38:43 +00:00
Kirk McKusick
4b6a2c497e The Call For Testing had no reports of operational problems and
found that performance was no worse and usually better when running
with TRIM consolidation. Performance improvement was most noticable
when multiple large files are released in a short period of time.

Thus, TRIM consolidation is being enabled by default. Should
operational problems be found, it can be disabled using the command
`sysctl vfs.ffs.dotrimcons=0'. This variable can also be set as a
tunable if early disabling is necessary.

Approved by:  re (gjb)
Sponsored by: Netflix
2018-09-06 23:28:35 +00:00
Mark Murray
19fa89e938 Remove the Yarrow PRNG algorithm option in accordance with due notice
given in random(4).

This includes updating of the relevant man pages, and no-longer-used
harvesting parameters.

Ensure that the pseudo-unit-test still does something useful, now also
with the "other" algorithm instead of Yarrow.

PR:		230870
Reviewed by:	cem
Approved by:	so(delphij,gtetlow)
Approved by:	re(marius)
Differential Revision:	https://reviews.freebsd.org/D16898
2018-08-26 12:51:46 +00:00
Kirk McKusick
a9c2220f5c Proper spelling of consolidation.
Submitted by: Dimitry Andric
2018-08-23 22:35:14 +00:00
Kirk McKusick
d530fd484b TRIM consolodation is supposed to be off by default 2018-08-20 21:19:21 +00:00
Kirk McKusick
4de0d16b8c For traditional disks, the filesystem attempts to allocate the
blocks of a file as contiguously as possible. Since the filesystem
does not know how large a file will grow when it is first being
written, it initially places the file in a set of blocks in which
it currently fits. As it grows, it is relocated to areas with
larger contiguous blocks.  In this way it saves its large contiguous
sets of blocks for the files that need them and thus avoids
unnecessaily fragmenting its disk space.

We used to skip reallocating the blocks of a file into a contiguous
sequence if the underlying flash device requested BIO_DELETE
notifications, because devices that benefit from BIO_DELETE also
benefit from not moving the data. However, in the algorithm described
above that reallocates the blocks, the destination for the data is
usually moved before the data is written to the initially allocated
location. So we rarely suffer the penalty of extra writes.  With
the addition of the consolodation of contiguous blocks into single
BIO_DELETE operations, having fewer but larger contiguous blocks
reduces the number of (slow and expensive) BIO_DELETE operations.
So when doing BIO_DELETE consolodation, we do block reallocation.

Reviewed by:  kib
Tested by:    Peter Holm
Sponsored by: Netflix
2018-08-19 17:19:20 +00:00
Kirk McKusick
fc6e171535 Add consolodation of TRIM / BIO_DELETE commands to the UFS/FFS filesystem.
When deleting files on filesystems that are stored on flash-memory
(solid-state) disk drives, the filesystem notifies the underlying
disk of the blocks that it is no longer using. The notification
allows the drive to avoid saving these blocks when it needs to
flash (zero out) one of its flash pages. These notifications of
no-longer-being-used blocks are referred to as TRIM notifications.
In FreeBSD these TRIM notifications are sent from the filesystem
to the drive using the BIO_DELETE command.

Until now, the filesystem would send a separate message to the drive
for each block of the file that was deleted. Each Gigabyte of file
size resulted in over 3000 TRIM messages being sent to the drive.
This burst of messages can overwhelm the drive's task queue causing
multiple second delays for read and write requests.

This implementation collects runs of contiguous blocks in the file
and then consolodates them into a single BIO_DELETE command to the
drive. The BIO_DELETE command describes the run of blocks as a
single large block being deleted. Each Gigabyte of file size can
result in as few as two BIO_DELETE commands and is typically less
than ten.  Though these larger BIO_DELETE commands take longer to
run, they do not clog the drive task queue, so read and write
commands can intersperse effectively with them.

Though this new feature has been throughly reviewed and tested, it
is being added disabled by default so as to minimize the possibility
of disrupting the upcoming 12.0 release. It can be enabled by running
``sysctl vfs.ffs.dotrimcons=1''. Users are encouraged to test it.
If no problems arise, we will consider requesting that it be enabled
by default for 12.0.

Reviewed by:  kib
Tested by:    Peter Holm
Sponsored by: Netflix
2018-08-19 16:56:42 +00:00
Kirk McKusick
7e038bc257 Replace the TRIM consolodation framework originally added in -r337396
driven by problems found with the algorithms being tested for TRIM
consolodation.

Reported by:  Peter Holm
Suggested by: kib
Reviewed by:  kib
Sponsored by: Netflix
2018-08-18 22:21:59 +00:00
Kirk McKusick
cc91864c26 Revert -r337396. It is being replaced with a revised interface that
resulted from testing and further reviews.
2018-08-18 21:21:06 +00:00
Jamie Gritton
284001a222 Put jail(2) under COMPAT_FREEBSD11. It has been the "old" way of creating
jails since FreeBSD 7.

Along with the system call, put the various security.jail.allow_foo and
security.jail.foo_allowed sysctls partly under COMPAT_FREEBSD11 (or
BURN_BRIDGES).  These sysctls had two disparate uses: on the system side,
they were global permissions for jails created via jail(2) which lacked
fine-grained permission controls; inside a jail, they're read-only
descriptions of what the current jail is allowed to do.  The first use
is obsolete along with jail(2), but keep them for the second-read-only use.

Differential Revision:	D14791
2018-08-16 18:40:16 +00:00
Kirk McKusick
68c49bcc40 Put in place the framework for consolodating contiguous blocks into
a smaller number of larger TRIM requests. The hope had been to have
the full TRIM consolodation in place for 12.0, but the algorithms
are still under development and need further testing. With this
framework in place it will be possible to easily add TRIM consolodation
once the optimal strategy has been found.

The only functional change with this patch is the elimination of TRIM
requests for blocks that are freed before they have been likely to
have been written.

Reviewed by: kib
Discussed with: Warner Losh and Chuck Silvers
Sponsored by: Netflix
2018-08-06 21:09:11 +00:00
Konstantin Belousov
61b285ac57 Avoid assertion in /dev/ufssuspend when the suspend ioctl is
(incorrectly) called while another suspension is already active.

PR:	230220
Reported by:	dexuan
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-08-01 19:06:55 +00:00
Alan Somers
6040822c4e Make timespecadd(3) and friends public
The timespecadd(3) family of macros were imported from NetBSD back in
r35029. However, they were initially guarded by #ifdef _KERNEL. In the
meantime, we have grown at least 28 syscalls that use timespecs in some
way, leading many programs both inside and outside of the base system to
redefine those macros. It's better just to make the definitions public.

Our kernel currently defines two-argument versions of timespecadd and
timespecsub.  NetBSD, OpenBSD, and FreeDesktop.org's libbsd, however, define
three-argument versions.  Solaris also defines a three-argument version, but
only in its kernel.  This revision changes our definition to match the
common three-argument version.

Bump _FreeBSD_version due to the breaking KPI change.

Discussed with:	cem, jilles, ian, bde
Differential Revision:	https://reviews.freebsd.org/D14725
2018-07-30 15:46:40 +00:00
Kirk McKusick
15430057b3 Add needed locking for um_flags added in -r335808.
While here document required locking details in ufsmount structure.

Reported by: kib
Reviewed by: kib
2018-07-17 04:43:58 +00:00
Conrad Meyer
9c9c01e43b ffs_syncvnode: Remove unhelpful print
It can occur during ordinary use of softupdates, or perhaps if writes to the
underlying media fail (causing bufs to be redirtied).  Either way, it is not
particularly actionable.

Reviewed by:	imp, kib
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D16258
2018-07-14 15:45:11 +00:00
Kirk McKusick
e1c27cf7d6 Import commit from NetBSD with checkin message:
Avoid Undefined Behavior in ffs_clusteracct()

    Change the type of 'bit' variable from int to unsigned int and use unsigned
    values consistently.

    sys/ufs/ffs/ffs_subr.c:336:10, shift exponent -1 is negative

    Detected with Kernel Undefined Behavior Sanitizer.

    Reported by <Harry Pantazis>

Submitted by: Pedro Giffuni
2018-07-07 19:11:43 +00:00
Kirk McKusick
ab0bcb6032 Create um_flags in the ufsmount structure to hold flags for a UFS filesystem.
Convert integer structure flags to use um_flags:

	int	um_candelete;			/* devvp supports TRIM */
	int	um_writesuspended;		/* suspension in progress */

become:

#define UM_CANDELETE		0x00000001	/* devvp supports TRIM */
#define UM_WRITESUSPENDED	0x00000002	/* suspension in progress */

This is in preparation for adding other flags to indicate forcible
unmount in progress after a disk failure and possibly forcible
downgrade to read-only.

No functional change intended.

Sponsored by:	Netflix
2018-06-29 22:24:41 +00:00
Warner Losh
06753bd3f9 Use buf + strategy rather than bypassing geom_vfs layer
The reference counting that's done in the geom_vfs layer to prevent
delivery of requests to defunct devices only works if all requests go
through that layer. UFS was bypassing that layer for BIO_DELETE requests,
sending them to the geom_consumer directly with g_io_request. Allocate
a buf, fill it in, and call strategy on it instead.

Submitted by: Chuck Silvers
Reviewed by: scottl, imp, kirk
Sponsored by: Netflix
Differential: https://reviews.freebsd.org/D15456
2018-06-26 00:39:38 +00:00
Matt Macy
1ef3a74e6b ufs: remove cgbno variable where unused 2018-05-19 19:30:42 +00:00
Kirk McKusick
8ab507588b Fix warning found by Coverity.
CID 1009353:  Error handling issues  (CHECKED_RETURN)
2018-05-16 23:42:02 +00:00
Konstantin Belousov
2ebc882927 Detect and optimize reads from the hole on UFS.
- Create getblkx(9) variant of getblk(9) which can return error.
- Add GB_NOSPARSE flag for getblk()/getblkx() which requests that BMAP
  was performed before the buffer is created, and EJUSTRETURN returned
  in case the requested block does not exist.
- Make ffs_read() use GB_NOSPARSE to avoid instantiating buffer (and
  allocating the pages for it), copying from zero_region instead.

The end result is less page allocations and buffer recycling when a
hole is read, which is important for some benchmarks.

Requested and reviewed by:	jeff
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D14917
2018-05-13 09:47:28 +00:00
Kirk McKusick
f8ccf17383 Renumber soft-update types starting at 1 instead of 0 to avoid confusion
of zero'ed memory appearing to have a valid soft-update type.

Also correct some comments.

Reviewed by: kib
2018-04-05 00:32:01 +00:00
Ed Maste
d8ba45e213 Revert r313780 (UFS_ prefix) 2018-03-17 12:59:55 +00:00
Ed Maste
1e2b9afca9 Prefix UFS symbols with UFS_ to reduce namespace pollution
Followup to r313780.  Also prefix ext2's and nandfs's versions with
EXT2_ and NANDFS_.

Reported by:	kib
Reviewed by:	kib, mckusick
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D9623
2018-03-17 01:48:27 +00:00
Kirk McKusick
efbf396426 This change is some refactoring of Mark Johnston's changes in r329375
to fix the memory leak that I introduced in r328426. Instead of
trying to clear up the possible memory leak in all the clients, I
ensure that it gets cleaned up in the source (e.g., ffs_sbget ensures
that memory is always freed if it returns an error).

The original change in r328426 was a bit sparse in its description.
So I am expanding on its description here (thanks cem@ and rgrimes@
for your encouragement for my longer commit messages).

In preparation for adding check hashing to superblocks, r328426 is
a refactoring of the code to get the reading/writing of the superblock
into one place. Unlike the cylinder group reading/writing which
ends up in two places (ffs_getcg/ffs_geom_strategy in the kernel
and cgget/cgput in libufs), I have the core superblock functions
just in the kernel (ffs_sbfetch/ffs_sbput in ffs_subr.c which is
already imported into utilities like fsck_ffs as well as libufs to
implement sbget/sbput). The ffs_sbfetch and ffs_sbput functions
take a function pointer to do the actual I/O for which there are
four variants:

    ffs_use_bread / ffs_use_bwrite for the in-kernel filesystem

    g_use_g_read_data / g_use_g_write_data for kernel geom clients

    ufs_use_sa_read for the standalone code (stand/libsa/ufs.c
	but not stand/libsa/ufsread.c which is size constrained)

    use_pread / use_pwrite for libufs

Uses of these interfaces are in the UFS filesystem, geoms journal &
label, libsa changes, and libufs. They also permeate out into the
filesystem utilities fsck_ffs, newfs, growfs, clri, dump, quotacheck,
fsirand, fstyp, and quot. Some of these utilities should probably be
converted to directly use libufs (like dumpfs was for example), but
there does not seem to be much win in doing so.

Tested by: Peter Holm (pho@)
2018-03-02 04:34:53 +00:00
Conrad Meyer
d4e6557bae ffs: softdep_disk_write_complete: Quiesce spurious Coverity warning
Coverity cannot determine that handle_written_indirdep() does not access
uninitialized 'sbp' when flags argument is zero.

So, simply move the initialization slightly sooner to silence the warning.

No functional change.

Reported by:	Coverity
Sponsored by:	Dell EMC Isilon
2018-03-01 00:29:52 +00:00
Kirk McKusick
528833fae1 Use a more straight-forward approach to relaxing the location
restraints when validating one of the backup superblocks.
2018-02-26 00:34:56 +00:00
Kirk McKusick
4cbd996a84 Relax the location restraints when validating one of the
backup superblocks.
2018-02-24 03:33:46 +00:00
Kirk McKusick
f686b1710a Refactor fix in r329600 to do its check once in readsuper() rather
than in the two places that call readsuper().

No semantic change intended.

Reviewed by: kib
2018-02-21 19:56:19 +00:00
Konstantin Belousov
9f74642385 Do not free(9) uninitialized pointer.
Reported and tested by:	allanjude
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
2018-02-19 19:08:25 +00:00
Mark Johnston
16759360d4 Fix a memory leak introduced in r328426.
ffs_sbget() may return a superblock buffer even if it fails, so the
caller must be prepared to free it in this case. Moreover, when tasting
alternate superblock locations in a loop, ffs_sbget()'s readfunc
callback must free the previously allocated buffer.

Reported and tested by:	pho
Reviewed by:		kib (previous version)
Differential Revision:	https://reviews.freebsd.org/D14390
2018-02-16 15:41:03 +00:00
Kirk McKusick
068beacf21 The goal of this change is to prevent accidental foot shooting by
folks running filesystems created on check-hash enabled kernels
(which I will call "new") on a non-check-hash enabled kernels (which
I will call "old). The idea here is to detect when a filesystem is
run on an old kernel and flag the filesystem so that when it gets
moved back to a new kernel, it will not start getting a slew of
check-hash errors.

Back when the UFS version 2 filesystem was created, it added a file
flag FS_INDEXDIRS that was to be set on any filesystem that kept
some sort of on-disk indexing for directories. The idea was precisely
to solve the issue we have today. Specifically that a newer kernel
that supported indexing would be able to tell that the filesystem
had been run on an older non-indexing kernel and that the indexes
should not be used until they had been rebuilt. Since we have never
implemented on-disk directory indicies, the FS_INDEXDIRS flag is
cleared every time any UFS version 2 filesystem ever created is
mounted for writing.

This commit repurposes the FS_INDEXDIRS flag as the FS_METACKHASH
flag. Thus, the FS_METACKHASH is definitively known to have always
been cleared. The FS_INDEXDIRS flag has been moved to a new block
of flags that will always be cleared starting with this commit
(until they get used to implement some future feature which needs
to detect that the filesystem was mounted on a kernel that predates
the new feature).

If a filesystem with check-hashes enabled is mounted on an old
kernel the FS_METACKHASH flag is cleared. When that filesystem is
mounted on a new kernel it will see that the FS_METACKHASH has been
cleared and clears all of the fs_metackhash flags. To get them
re-enabled the user must run fsck (in interactive mode without the
-y flag) which will ask for each supported check hash whether it
should be rebuilt and enabled. When fsck is run in its default preen
mode, it will just ignore the check hashes so they will remain
disabled.

The kernel has always disabled any check hash functions that it
does not support, so as more types of check hashes are added, we
will get a non-surprising result. Specifically if filesystems get
moved to kernels supporting fewer of the check hashes, those that
are not supported will be disabled. If the filesystem is moved back
to a kernel with more of the check-hashes available and fsck is run
interactively to rebuild them, then their checking will resume.
Otherwise just the smaller subset will be checked.

A side effect of this commit is that filesystems running with
cylinder-group check hashes will stop having them checked until
fsck is run to re-enable them (since none of them currently have
the FS_METACKHASH flag set). So, if you want check hashes enabled
on your filesystems after booting a kernel with these changes, you
need to run fsck to enable them. Any newly created filesystems will
have check hashes enabled. If in doubt as to whether you have check
hashes emabled, run dumpfs and look at the list of enabled flags
at the end of the superblock details.
2018-02-08 23:06:58 +00:00
Pedro F. Giffuni
7cbd6d338e {ext2|ufs}_readdir: Avoid setting negative ncookies.
ncookies cannot be negative or the allocator will fail. This should only
happen if a caller is very broken but we can still try to survive the
event.

We should probably also verify for uio_resid > MAXPHYS but in that case
it is not clear that just clipping the ncookies value is an adequate
response.

MFC after:	2 weeks
2018-02-06 22:38:19 +00:00
Kirk McKusick
47806d1b93 Occasional cylinder-group check-hash errors were being reported on
systems running with a heavy filesystem load. Tracking down this
bug was elusive because there were actually two problems. Sometimes
the in-memory check hash was wrong and sometimes the check hash
computed when doing the read was wrong. The occurrence of either
error caused a check-hash mismatch to be reported.

The first error was that the check hash in the in-memory cylinder
group was incorrect. This error was caused by the following
sequence of events:

- We read a cylinder-group buffer and the check hash is valid.
- We update its cg_time and cg_old_time which makes the in-memory
  check-hash value invalid but we do not mark the cylinder group dirty.
- We do not make any other changes to the cylinder group, so we
  never mark it dirty, thus do not write it out, and hence never
  update the incorrect check hash for the in-memory buffer.
- Later, the buffer gets freed, but the page with the old incorrect
  check hash is still in the VM cache.
- Later, we read the cylinder group again, and the first page with
  the old check hash is still in the VM cache, but some other pages
  are not, so we have to do a read.
- The read does not actually get the first page from disk, but rather
  from the VM cache, resulting in the old check hash in the buffer.
- The value computed after doing the read does not match causing the
  error to be printed.

The fix for this problem is to only set cg_time and cg_old_time as
the cylinder group is being written to disk. This keeps the in-memory
check-hash valid unless the cylinder group has had other modifications
which will require it to be written with a new check hash calculated.
It also requires that the check hash be recalculated in the in-memory
cylinder group when it is marked clean after doing a background write.

The second problem was that the check hash computed at the end of the
read was incorrect because the calculation of the check hash on
completion of the read was being done too soon.

- When a read completes we had the following sequence:

  - bufdone()
  -- b_ckhashcalc (calculates check hash)
  -- bufdone_finish()
  --- vfs_vmio_iodone() (replaces bogus pages with the cached ones)

- When we are reading a buffer where one or more pages are already
  in memory (but not all pages, or we wouldn't be doing the read),
  the I/O is done with bogus_page mapped in for the pages that exist
  in the VM cache. This mapping is done to avoid corrupting the
  cached pages if there is any I/O overrun. The vfs_vmio_iodone()
  function is responsible for replacing the bogus_page(s) with the
  cached ones. But we were calculating the check hash before the
  bogus_page(s) were replaced. Hence, when we were calculating the
  check hash, we were partly reading from bogus_page, which means
  we calculated a bad check hash (e.g., because multiple pages have
  been mapped to bogus_page, so its contents are indeterminate).

The second fix is to move the check-hash calculation from bufdone()
to bufdone_finish() after the call to vfs_vmio_iodone() so that it
computes the check hash over the correct set of pages.

With these two changes, the occasional cylinder-group check-hash
errors are gone.

Submitted by: David Pfitzner <dpfitzner@netflix.com>
Reviewed by: kib
Tested by: David Pfitzner
2018-02-06 00:19:46 +00:00
Kirk McKusick
0d37a428f0 When reading a cylinder group, break out reporting of check hash errors
from other types of errors so that the error is correctly reported.
2018-01-31 23:13:37 +00:00
Pedro F. Giffuni
040fb18b60 Revert r328479:
{ext2|ufs}_readdir: Set limit on valid ncookies values.

We aren't allowed to set resid like this.

Pointed out by:	kib, imp
2018-01-27 16:34:00 +00:00
Pedro F. Giffuni
ee233ab975 {ext2|ufs}_readdir: Set limit on valid ncookies values.
Sanitize the values that will be assigned to ncookies so that we ensure
they are sane and we can handle them.

Let ncookies signed as it was before r328346. The valid range is such
that unsigned values are not required and we are not able to avoid at
least one cast anyways.

Hinted by:	bde
2018-01-27 15:33:52 +00:00
Kirk McKusick
dffce2150e Refactoring of reading and writing of the UFS/FFS superblock.
Specifically reading is done if ffs_sbget() and writing is done
in ffs_sbput(). These functions are exported to libufs via the
sbget() and sbput() functions which then used in the various
filesystem utilities. This work is in preparation for adding
subperblock check hashes.

No functional change intended.

Reviewed by: kib
2018-01-26 00:58:32 +00:00
Pedro F. Giffuni
a94a2945be ext2fs|ufs:Unsign some values related to allocation.
When allocating memory through malloc(9), we always expect the amount of
memory requested to be unsigned as a negative value would either stand for
an error or an overflow.
Unsign some values, found when considering the use of mallocarray(9), to
avoid unnecessary casting. Also consider that indexes should be of
at least the same size/type as the upper limit they pretend to index.

MFC after:	2 weeks
2018-01-24 17:58:48 +00:00
Pedro F. Giffuni
f9834d101a Revert r327781, r328093, r328056:
ufs|ext2fs: Revert uses of mallocarray(9).

These aren't really useful: drop them.
Variable unsigning will be brought again later.
2018-01-24 16:44:57 +00:00
Pedro F. Giffuni
90b618f35b ufs: use mallocarray(9).
Basic use of mallocarray to prevent overflows: static analyzers are also
likely to perform additional checks.

Since mallocarray expects unsigned parameters, unsign some
related variables to minimize sign conversions.

Reviewed by:	mckusick
2018-01-17 18:18:33 +00:00
Konstantin Belousov
147b0c1a3e Softlink inodes can own buffers with dependencies.
At least, softlinks longer than 120 bytes have data fragments.

Submitted by:	mckusick
MFC after:	5 days
2018-01-11 13:37:45 +00:00
Pedro F. Giffuni
802258f46b Use mallocarray(9) in dirhash.
Basic use of mallocarray to prevent overflows. Here allocation is done
with M_NOWAIT so the code is prepared for the possibility of returning
NULL values. Since mallocarray expects unsigned parameters, unsign some
related variables to minimize sign conversions.

Reviewed by:	mckusick
2018-01-10 19:45:38 +00:00
Konstantin Belousov
c999b43527 Generalize the fix from r322757 and apply it to several more places.
The code accesses bp->b_dep without owning the ufs mount softdep lock,
which makes it possible for the derefenced workitem to be freed in
parallel.  In particular, the deallocate_dependencies(),
softdep_disk_io_initiation() and softdep_disk_write_complete() are
affected.

Move the code to safely calculate ump from the buffer with
dependencies into the helper softdep_bp_to_mp() and use it for all
found cases.

Tested by:	pho (as part of the bigger patch)
Reviewed by:	mckusick (as part of the bigger patch)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-01-09 10:51:44 +00:00
Konstantin Belousov
e51e3c7e73 When handling write completion, take SU lock around calls to
handle_written_XXX() in case of processing the buffer with an error.

Tested by:	pho (as part of the bigger patch)
Reviewed by:	mckusick (as part of the bigger patch)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-01-09 10:44:17 +00:00
Konstantin Belousov
377f88fb08 Postpone the disassotiation of the background write buffer with devvp
so that buf_complete() sees fully constructed buffer.

This is a NOP right now, but will be needed by the forthcoming SU change.

Reported and tested by:	pho
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-01-09 10:33:11 +00:00
Mark Johnston
8a7a42bb17 Add missing newlines to a couple of error messages.
Keep error messages on a single line so that they're easier to grep for.

Reported by:	pho
MFC after:	1 week
2018-01-03 18:19:47 +00:00
Pedro F. Giffuni
51268c3852 SPDX: Complete License ID tags for UFS. 2017-12-27 19:13:50 +00:00
Eitan Adler
caa7e52f3f kernel: Fix several typos and minor errors
- duplicate words
- typos
- references to old versions of FreeBSD

Reviewed by:	imp, benno
2017-12-27 03:23:21 +00:00
Alexander Kabaev
151ba7933a Do pass removing some write-only variables from the kernel.
This reduces noise when kernel is compiled by newer GCC versions,
such as one used by external toolchain ports.

Reviewed by: kib, andrew(sys/arm and sys/arm64), emaste(partial), erj(partial)
Reviewed by: jhb (sys/dev/pci/* sys/kern/vfs_aio.c and sys/kern/kern_synch.c)
Differential Revision: https://reviews.freebsd.org/D10385
2017-12-25 04:48:39 +00:00
Alexander Kabaev
5f943cca65 Remove dead initialization of the inode pointer.
The pointer gets initialized again later in the code. This also
improves code style(9).
2017-12-23 16:24:02 +00:00
John Baldwin
b501cc5da6 Rework pathconf handling for FIFOs.
On the one hand, FIFOs should respect other variables not supported by
the fifofs vnode operation (such as _PC_NAME_MAX, _PC_LINK_MAX, etc.).
These values are fs-specific and must come from a fs-specific method.
On the other hand, filesystems that support FIFOs are required to
support _PC_PIPE_BUF on directory vnodes that can contain FIFOs.
Given this latter requirement, once the fs-specific VOP_PATHCONF
method supports _PC_PIPE_BUF for directories, it is also suitable for
FIFOs permitting a single VOP_PATHCONF method to be used for both
FIFOs and non-FIFOs.

To that end, retire all of the FIFO-specific pathconf methods from
filesystems and change FIFO-specific vnode operation switches to use
the existing fs-specific VOP_PATHCONF method.  For fifofs, set it's
VOP_PATHCONF to VOP_PANIC since it should no longer be used.

While here, move _PC_PIPE_BUF handling out of vop_stdpathconf() so that
only filesystems supporting FIFOs will report a value.  In addition,
only report a valid _PC_PIPE_BUF for directories and FIFOs.

Discussed with:	bde
Reviewed by:	kib (part of a larger patch)
MFC after:	1 month
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D12572
2017-12-19 22:39:05 +00:00
John Baldwin
599afe53a8 Move NAME_MAX, LINK_MAX, and CHOWN_RESTRICTED out of vop_stdpathconf().
Having all filesystems fall through to default values isn't always correct
and these values can vary for different filesystem implementations.  Most
of these changes just use the existing default values with a few exceptions:
- Don't report CHOWN_RESTRICTED for ZFS since it doesn't do the exact
  permissions check this claims for chown().
- Use NANDFS_NAME_LEN for NAME_MAX for nandfs.
- Don't report a LINK_MAX of 0 on smbfs.  Now fail with EINVAL to
  indicate hard links aren't supported.

Requested by:	bde (though perhaps not this exact implementation)
Reviewed by:	kib (earlier version)
MFC after:	1 month
Sponsored by:	Chelsio Communications
2017-12-19 19:51:36 +00:00
Mark Johnston
b6fbf003e1 Provide a sysctl to force synchronous initialization of inode blocks.
FFS performs asynchronous inode initialization, using a barrier write
to ensure that the inode block is written before the corresponding
cylinder group header update. Some GEOMs do not appear to handle
BIO_ORDERED correctly, meaning that the barrier write may not work as
intended. The sysctl allows one to work around this problem at the
cost of expensive file creation on new filesystems. The default
behaviour is unchanged.

Reviewed by:	kib, mckusick
MFC after:	1 weeks
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D13428
2017-12-09 15:44:30 +00:00
Konstantin Belousov
b6721e4a5c Fix livelock in ufsdirhash_create().
When more than one thread enters ufsdirhash_create() for the same
directory and the inode dirhash is instantiated, but the dirhash' hash
is not, all of them lock the dirhash shared and then try to upgrade.
Since there are several threads owning the lock shared, upgrade fails
and the same attempt is repeated, ad infinitum.

To break the lockstep, lock the dirhash in exclusive mode after the
failed try-upgrade.

Reported and tested by:	pho
Sponsored by:	Mellanox Technologies
MFC after:	1 week
2017-12-07 09:05:34 +00:00
Pedro F. Giffuni
fe267a5590 sys: general adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

No functional change intended.
2017-11-27 15:23:17 +00:00
Pedro F. Giffuni
51369649b0 sys: further adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
2017-11-20 19:43:44 +00:00
Konstantin Belousov
4e13cca54e Improve the message printed when the cylinder group checksum is wrong.
Mention the device path and mount point path, handle snapshots.

Tested by:	imp
Sponsored by:	The FreeBSD Foundation
2017-11-05 13:28:48 +00:00
Mark Johnston
4770655901 Remove a stale and incorrect comment.
MFC after:	1 week
Sponsored by:	Dell EMC Isilon
2017-10-28 02:51:27 +00:00
Mark Johnston
9cf7abcc1d Remove workqueue items after updating the workqueue tail pointer.
When QUEUE_MACRO_DEBUG_TRASH is configured, the queue linkage fields
are trashed upon removal of the item, so be sure to only read them before
removing the item.

No functional change intended.

MFC after:	1 week
Sponsored by:	Dell EMC Isilon
2017-10-28 02:48:37 +00:00
Mark Johnston
4c52a9993b Make drain_output() use bufobj_wwait().
No functional change intended.

Reviewed by:	kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D12790
2017-10-25 17:20:18 +00:00
John Baldwin
800c3e80de Don't defer wakeup()s for completed journal workitems.
Normally wakeups() are performed for completed softupdates work items
in workitem_free() before the underlying memory is free()'d.
complete_jseg() was clearing the "wakeup needed" flag in work items to
defer the wakeup until the end of each loop iteration.  However, this
resulted in the item being free'd before it's address was used with
wakeup().  As a result, another part of the kernel could allocate this
memory from malloc() and use it as a wait channel for a different
"event" with a different lock.  This triggered an assertion failure
when the lock passed to sleepq_add() did not match the existing lock
associated with the sleep queue.  Fix this by removing the code to
defer the wakeup in complete_jseg() allowing the wakeup to occur
slightly earlier in workitem_free() before free() is called.

The main reason I can think of for deferring a wakeup() would be to
avoid waking up a waiter while holding a lock that the waiter would
need.  However, no locks are dropped in between the wakeup() in
workitem_free() and the end of the loop in complete_jseg() as far as I
can tell.

In general I think it is not safe to do a wakeup() after free() as one
cannot control how other parts of the kernel that might reuse the
address for a different wait channel will handle spurious wakeups.

Reported by:	pho
Reviewed by:	kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D12494
2017-09-26 23:24:15 +00:00
Konstantin Belousov
55e5a5c1f4 Fix 32bit build.
Reported by:	emaste
Sponsored by:	The FreeBSD Foundation
2017-09-22 16:42:41 +00:00
Kirk McKusick
75e3597abb Continuing efforts to provide hardening of FFS, this change adds a
check hash to cylinder groups. If a check hash fails when a cylinder
group is read, no further allocations are attempted in that cylinder
group until it has been fixed by fsck. This avoids a class of
filesystem panics related to corrupted cylinder group maps. The
hash is done using crc32c.

Check hases are added only to UFS2 and not to UFS1 as UFS1 is primarily
used in embedded systems with small memories and low-powered processors
which need as light-weight a filesystem as possible.

Specifics of the changes:

sys/sys/buf.h:
    Add BX_FSPRIV to reserve a set of eight b_xflags that may be used
    by individual filesystems for their own purpose. Their specific
    definitions are found in the header files for each filesystem
    that uses them. Also add fields to struct buf as noted below.

sys/kern/vfs_bio.c:
    It is only necessary to compute a check hash for a cylinder
    group when it is actually read from disk. When calling bread,
    you do not know whether the buffer was found in the cache or
    read. So a new flag (GB_CKHASH) and a pointer to a function to
    perform the hash has been added to breadn_flags to say that the
    function should be called to calculate a hash if the data has
    been read. The check hash is placed in b_ckhash and the B_CKHASH
    flag is set to indicate that a read was done and a check hash
    calculated. Though a rather elaborate mechanism, it should
    also work for check hashing other metadata in the future. A
    kernel internal API change was to change breada into a static
    fucntion and add flags and a function pointer to a check-hash
    function.

sys/ufs/ffs/fs.h:
    Add flags for types of check hashes; stored in a new word in the
    superblock. Define corresponding BX_ flags for the different types
    of check hashes. Add a check hash word in the cylinder group.

sys/ufs/ffs/ffs_alloc.c:
    In ffs_getcg do the dance with breadn_flags to get a check hash and
    if one is provided, check it.

sys/ufs/ffs/ffs_vfsops.c:
    Copy across the BX_FFSTYPES flags in background writes.
    Update the check hash when writing out buffers that need them.

sys/ufs/ffs/ffs_snapshot.c:
    Recompute check hash when updating snapshot cylinder groups.

sys/libkern/crc32.c:
lib/libufs/Makefile:
lib/libufs/libufs.h:
lib/libufs/cgroup.c:
    Include libkern/crc32.c in libufs and use it to compute check
    hashes when updating cylinder groups.

Four utilities are affected:

sbin/newfs/mkfs.c:
    Add the check hashes when building the cylinder groups.

sbin/fsck_ffs/fsck.h:
sbin/fsck_ffs/fsutil.c:
    Verify and update check hashes when checking and writing cylinder groups.

sbin/fsck_ffs/pass5.c:
    Offer to add check hashes to existing filesystems.
    Precompute check hashes when rebuilding cylinder group
    (although this will be done when it is written in fsutil.c
    it is necessary to do it early before comparing with the old
    cylinder group)

sbin/dumpfs/dumpfs.c
    Print out the new check hash flag(s)

sbin/fsdb/Makefile:
    Needs to add libufs now used by pass5.c imported from fsck_ffs.

Reviewed by: kib
Tested by: Peter Holm (pho)
2017-09-22 12:45:15 +00:00
John Baldwin
4f45713ae2 Add UFS_LINK_MAX for the UFS-specific limit on link counts.
ino64 expanded nlink_t to 64 bits, but the on-disk format for UFS is still
limited to 16 bits.  This is a nop currently but will matter if LINK_MAX is
increased in the future.

Reviewed by:	kib
Sponsored by:	Chelsio Communications
2017-09-18 23:30:39 +00:00
Kirk McKusick
855662c611 The new fsck recovery information to enable it to find backup
superblocks created in revision 322297 only works on disks
with sector sizes up to 4K. This update allows the recovery
information to be created by newfs and used by fsck on disks
with sector sizes up to 64K. Note that FFS currently limits
filesystem to be mounted from disks with up to 8K sectors.
Expanding this limitation will be the subject of another
commit.

Reported by: Peter Holm
Reviewed with: kib
2017-09-04 20:19:36 +00:00
Konstantin Belousov
2f9d88c7ae Protect v_rdev dereference with the vnode interlock instead of the
vnode lock.

Caller of softdep_count_dependencies() may own a buffer lock, which
might conflict with the lock order.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	10 days
2017-08-25 09:51:22 +00:00
Konstantin Belousov
f0d5223230 Avoid dereferencing potentially freed workitem in
softdep_count_dependencies().

Buffer's b_dep list is protected by the SU mount lock.  Owning the
buffer lock is not enough to guarantee the stability of the list.

Calculation of the UFS mount owning the workitems from the buffer must
be much more careful to not dereference the work item which might be
freed meantime.  To get to ump, use the pointers chain which does not
involve workitems at all.

Reported and tested by:	pho
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2017-08-21 16:23:44 +00:00
Konstantin Belousov
b5f2560d09 Style.
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2017-08-21 16:16:02 +00:00
Kirk McKusick
77b63aa0fc Since the switch to GPT disk labels, fsck for UFS/FFS has been
unable to automatically find alternate superblocks. This checkin
places the information needed to find alternate superblocks to the
end of the area reserved for the boot block.

Filesystems created with a newfs of this vintage or later will
create the recovery information. If you have a filesystem created
prior to this change and wish to have a recovery block created for
your filesystem, you can do so by running fsck in forground mode
(i.e., do not use the -p or -y options). As it starts, fsck will
ask ``SAVE DATA TO FIND ALTERNATE SUPERBLOCKS'' to which you should
answer yes.

Discussed with: kib, imp
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D11589
2017-08-09 05:17:21 +00:00
Kirk McKusick
33bbdde01f Avoid reading a snapshot block when it is already in the cache.
Update the use of the B_CACHE flag (since the May 1999 commit
that made it the correct test here).

Reported by: Andreas Longwitz <longwitz@incore.de>
Reviewed by: kib
Tested by: Peter Holm
MFC after: 1 week
2017-07-31 20:41:45 +00:00
Konstantin Belousov
5cf14660ae Improve publication of the newly allocated snapdata.
For freshly allocated snapdata, Lock sn_lock in advance, so
si_snapdata readers see the locked snapdata and not race.

For existing snapdata, if the thread was put to sleep waiting for
sn_lock, re-read si_snapdata.  This either closes the race or makes
the reliance on LK_DRAIN less important.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2017-07-21 18:42:35 +00:00
Konstantin Belousov
c536471408 Unlock correct lock in ffs_snapblkfree().
It is possible for ffs_snapblkfree() to race and lock snaplock while
the devvp snapdata is instantiated, but no snapshots exist.  In this
case the loop over snapshots in ffs_snapblkfree() is not executed, and
the local variable vp is left initialized to NULL.

Unlock using &sn->sn_lock and not vp->v_vnlock.  For the inodes on the
snapshot list, the locks are same.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2017-07-21 18:36:17 +00:00
Konstantin Belousov
f2e6bf5c05 Account for lock recursion when transfering snaplock to the vnode lock
in ffs_snapremove().

Apparently ffs_snapremove() may be called with the snap lock recursed,
at least one trace demonstrated this when snapshot vnode was unlinked
while synced.  It was inactivated from the syncer thread.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2017-07-21 18:28:27 +00:00
Konstantin Belousov
35bf780921 Remove write-only variable.
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2017-07-16 07:12:04 +00:00
Konstantin Belousov
51a6a15f8c A followup to r320453, correct removal of the blocks from UFS snapshots.
Tested by:	pho
PR:    220693
Sponsored by:	The FreeBSD Foundation
2017-07-16 07:11:29 +00:00
John Baldwin
15a88f8158 Consistently use vop_stdpathconf() for default pathconf values.
Update filesystems not currently using vop_stdpathconf() in pathconf
VOPs to use vop_stdpathconf() for any configuration variables that do
not have filesystem-specific values.  vop_stdpathconf() is used for
variables that have system-wide settings as well as providing default
values for some values based on system limits.  Filesystems can still
explicitly override individual settings.

PR:		219851
Reported by:	cem
Reviewed by:	cem, kib, ngie
MFC after:	1 month
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D11541
2017-07-11 21:55:20 +00:00
Kirk McKusick
9c4f551e98 Create a new function ffs_getcg() to read in and verify a cylinder
group. Change all code points that open-coded this functionality
to use the new function. This commit is a refactoring with no
change in functionality.

In the future this change allows more robust checking of cylinder
group reads along the lines discussed in the hardening UFS session
at BSDCan (retry I/O, add checksums, etc). For more detail see the
session notes at https://wiki.freebsd.org/DevSummit/201706/HardeningUFS

Reviewed by: kib
2017-06-28 17:32:09 +00:00
Konstantin Belousov
698f05ab95 Mitigate several problems with the softdep_request_cleanup() on busy
host.

Problems start appearing when there are several threads all doing
operations on a UFS volume and the SU workqueue needs a cleanup.  It is
possible that each thread calling softdep_request_cleanup() owns the
lock for some dirty vnode (e.g. all of them are executing mkdir(2),
mknod(2), creat(2) etc) and all vnodes which must be flushed are locked
by corresponding thread. Then, we get all the threads simultaneously
entering softdep_request_cleanup().

There are two problems:
- Several threads execute MNT_VNODE_FOREACH_ALL() loops in parallel.  Due
  to the locking, they quickly start executing 'in phase' with the speed
  of the slowest thread.
- Since each thread already owns the lock for a dirty vnode, other threads
  non-blocking attempt to lock the vnode owned by other thread fail,
  and loops executing without making the progress.
Retry logic does not allow the situation to recover.  The result is
a livelock.

Fix these problems by making the following changes:
- Allow only one thread to enter MNT_VNODE_FOREACH_ALL() loop per mp.
  A new flag FLUSH_RC_ACTIVE guards the loop.
- If there were failed locking attempts during the loop, abort retry
  even if there are still work items on the mp work list.  An
  assumption is that the items will be cleaned when other thread
  either fsyncs its vnode, or unlock and allow yet another thread to
  make the progress.

It is possible now that some calls would get undeserved ENOSPC from
ffs_alloc(), because the cleanup is not aggressive enough. But I do
not see how can we reliably clean up workitems if calling
softdep_request_cleanup() while still owning the vnode lock. I thought
about scheme where ffs_alloc() returns ERESTART and saves the retry
counter somewhere in struct thread, to return to the top level, unlock
the vnode and retry.  But IMO the very rare (and unproven) spurious
ENOSPC is not worth the complications.

Reported and tested by:	pho
Style and comments by:	mckusick
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2017-06-03 16:18:50 +00:00
Konstantin Belousov
4cbc378c61 Clean possible td_su reference on the struct mount being unmounted as
the last step of ffs_unmount().

It is possible that the mount point is recorded for cleanup in AST
context while softdep flush is executed during unmount.  The workitems
are flushed by other means for the unmount, but the stray reference to
struct mount blocks destruction of mount.  Check for the situation and
manually call vfs_rel() before returning from ffs_unmount().

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-06-03 14:15:14 +00:00
Konstantin Belousov
215b29f62c Remove spl() calls from UFS code.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-05-07 14:59:45 +00:00
Ed Maste
3cf259c390 UFS fs.h: clear warning from use in makefs(1)
makefs(1) has a number of signedness warnings (when built with higher
WARNS), most of which can be addressed by careful application of casts
in makefs itself.

There is one case where a signedness warning arises from the blksize
macro, so must be addressed in the macro itself.

Reviewed by:	kib, mckusick
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D10589
2017-05-05 15:26:55 +00:00
Gleb Smirnoff
9ed01c32e0 All these files need sys/vmmeter.h, but now they got it implicitly
included via sys/pcpu.h.
2017-04-17 17:07:00 +00:00
Conrad Meyer
a96da1c3fb ufs: Export UFS_MAXNAMLEN to pathconf, statfs
Rather than the global NAME_MAX constant.  This change is required to
support systems with a NAME_MAX/MAXNAMLEN that differs from UFS_MAXNAMLEN.

This was missed in r313475 due to the alternative spelling ("NAME_MAX") of
MAXNAMLEN.  This change is also similar in spirit to r313780.

Reported by:	ngie@
Sponsored by:	Dell EMC Isilon
2017-04-05 01:44:03 +00:00
Warner Losh
fbbd9655e5 Renumber copyright clause 4
Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.

Submitted by:	Jan Schaumann <jschauma@stevens.edu>
Pull Request:	https://github.com/freebsd/freebsd/pull/96
2017-02-28 23:42:47 +00:00
Konstantin Belousov
aca4bb9112 Do not leak mount references for dying threads.
Thread might create a condition for delayed SU cleanup, which creates
a reference to the mount point in td_su, but exit without returning
through userret(), e.g. when terminating due to single-threading or
process exit.  In this case, td_su reference is not dropped and mount
point cannot be freed.

Handle the situation by clearing td_su also in the thread destructor
and in exit1().  softdep_ast_cleanup() has to receive the thread as
argument, since e.g. thread destructor is executed in different
context.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2017-02-25 10:38:18 +00:00
Ed Maste
1dc349ab95 prefix UFS symbols with UFS_ to reduce namespace pollution
Specifically:
  ROOTINO -> UFS_ROOTINO
  WINO -> UFS_WINO
  NXADDR -> UFS_NXADDR
  NDADDR -> UFS_NDADDR
  NIADDR -> UFS_NIADDR
  MAXSYMLINKLEN_UFS[12] -> UFS[12]_MAXSYMLINKLEN (for consistency)

Also prefix ext2's and nandfs's NDADDR and NIADDR with EXT2_ and NANDFS_

Reviewed by:	kib, mckusick
Obtained from:	NetBSD
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D9536
2017-02-15 19:50:26 +00:00
Conrad Meyer
0ecf59f68f ufs: Use UFS_MAXNAMLEN constant
(like NFS, EXT2FS, SVR4, IBCS2) instead of redefining the MAXNAMLEN
constant.

No functional change.

Reviewed by:	kib@, markj@
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D9500
2017-02-09 17:47:01 +00:00
Conrad Meyer
675c187cc4 ffs_vnops: Simplify extattr access
As suggested in r167010, use the structure type and macros to access and
modify UFS2 extended attributes.  Add assertions that pointers are
aligned in places where we now access the data through a structure
pointer, instead of character-by-character.

PR:		216127
Reported by:	dewayne at heuristicsystems.com.au
Reviewed by:	kib@
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D9225
2017-01-19 16:46:05 +00:00
Conrad Meyer
c9bf814804 restore(8): Handle extended attribute names correctly
UFS2 extended attribute names are not NUL-terminated.  Handle
appropriately.

Correct the EXTATTR_BASE_LENGTH() macro, which handled ea_namelength ==
one (mod eight) extended attributes incorrectly.

PR:		216127
Reported by:	dewayne at heuristicsystems.com.au
Reviewed by:	kib@
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D9208
2017-01-18 18:16:57 +00:00
Conrad Meyer
6e02fac9d9 ufs/extattr.h: Fix documentation of ea_name termination
The ea_name string is not nul-terminated.  Correct the documentation.

Because the subsequent field is padded to 8 bytes, and the padding is
zeroed, the ea_name string will appear to be nul-terminated whenever the
length isn't exactly one (mod eight).

This was introduced in r167010 (2007).

Additionally, mark the length fields as unsigned.  This particularly
matters for the single byte ea_namelength field, which can represent
extended attribute names up to 255 bytes long.

No functional change.

PR:		216127
Reported by:	dewayne at heuristicsystems.com.au
Reviewed by:	kib@
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D9206
2017-01-18 17:55:49 +00:00
Konstantin Belousov
1c32456953 Use type-independent formats for printing nlink_t and ino_t.
Extracted from:	ino64 work by gleb, mckusick
Discussed with:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-01-06 16:59:33 +00:00
Mark Johnston
99e6e1930c Release laundered vnode pages to the head of the inactive queue.
The swap pager enqueues laundered pages near the head of the inactive queue
to avoid another trip through LRU before reclamation. This change adds
support for this behaviour to the vnode pager and makes use of it in UFS and
ext2fs. Some ioflag handling is consolidated into a common subroutine so
that this support can be easily extended to other filesystems which make use
of the buffer cache. No changes are needed for ZFS since its putpages
routine always undirties the pages before returning, and the laundry
thread requeues the pages appropriately in this case.

Reviewed by:	alc, kib
Differential Revision:	https://reviews.freebsd.org/D8589
2016-11-23 17:53:07 +00:00
Konstantin Belousov
714b7df502 Provide simple mutual exclusion between mount point update and unmount.
Currently mount update keeps vfs_busy(9) reference on the mount point
during MNT_UPDATE VFS_MOUNT() vfsops call.  This already provides the
exclusion, but is problematic for filesystems which need to perform
namei(9) during VFS_MOUNT(MNT_UPDATE) operations, e.g. to refresh
mnt_from path, because namei(9) must not be called while the
vfs_busy(9) reference is owned.

Check for MNT_UPDATE flag before setting MNTK_UNMOUNT, and for
MNTK_UNMOUNT before entering innards of vfs_domount_update(), failing
syscalls with EBUSY if conflict is detected.  Keep vfs_busy(9)
reference around VFS_MOUNT(MNT_UPDATE) calls still to not change VFS
KPI.

In the update path in ffs_mount(), drop vfs_busy() reference around
namei(), which is now safe due to unmount never executing in parallel
with VFS_MOUNT(MNT_UPDATE), and which avoids the deadlock.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-11-13 21:49:51 +00:00
Ed Maste
15c377c3cc ANSIfy ffs_subr.c
Also renumber license clause to avoid skipping #3
2016-10-31 20:43:43 +00:00
Kirk McKusick
ad544726aa Avoid possible overflow when calclating malloc size for auxillary
data structure sizes when mounting and reloading UFS/FFS
filesystems by using a u_long rather than an int for the size.

Reported by: Mariusz Zaborski <oshogbo@>
MFC after:   1 week
2016-10-28 20:15:19 +00:00
Konstantin Belousov
c39baa7480 Generalize UFS buffer pager to allow it serving other filesystems
which also use buffer cache.

Most important addition to the code is the handling of filesystems
where the block size is less than the machine page size, which might
require reading several buffers to validate single page.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-10-28 11:43:59 +00:00
Kirk McKusick
9e7e59125c The UFS/FFS filesystem checks directory link counts when doing
directory create and delete operations. If it ever finds a directory
with a link count less than 2, it panics. Thus, an rm -rf that
encounters a directory with a link count below 2 causes a kernel
panic. The proposed fix is to return the error EINVAL rather than
panicing. The effect is that the requested operation is not done,
but the system continues to run. At a more convenient later time,
the filesystem can be unmounted and cleaned (with fsck or journal
run). Once cleaned, the operation can be rerun to successful
completion.

This fix takes that approach. The panic message has been converted
into a uprintf(9) to provide the user with the inode number and
filesystem mount point of the offending directory and EINVAL is
returned for the operation.

The long (three year) delay in fixing this problem occurred because
the bug was misclassified when originally assigned and only this week
was found during a sweep of old unresolved bug reports.

PR:          180894
Reviewed by: kib
MFC after:   2 weeks
2016-10-26 20:28:23 +00:00
Marcel Moolenaar
d67ec19c3e Include <sys/types.h> explicitly instead of depending on that
header being included by <sys/param.h>. When compiled as part
of makefs(8) and on macOS or Linux, <sys/param.h> is not our
own.
2016-10-24 18:12:57 +00:00
Konstantin Belousov
895219834e Add FFS pager, which uses buffer cache read operation to validate pages.
See the comments for more detailed description of the algorithm.

The pager is used unconditionally when the block size of the
underlying device is larger than the machine page size, since local
vnode pager cannot handle the configuration [1].  Otherwise, the
vfs.ffs.use_buf_pager sysctl allows to switch to the local pager.

Measurements demonstrated no regression in the ever-important
buildworld benchmark, and small (~5%) throughput improvements in the
special microbenchmark configuration for dbench over swap-backed
md(4).

Code can be generalized and reused for other filesystems which use
buffer cache.

Reported by:	Anton Yuzhaninov <citrin@citrin.ru> [1]
Tested by:	pho
Benchmarked by:	mjg, pho
Reviewed by:	alc, markj, mckusick (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D8198
2016-10-19 11:09:29 +00:00
Mateusz Guzik
8660b707ff vfs: remove the __bo_vnode field from struct vnode
The pointer can be obtained using __containerof instead.

Reviewed by:	kib
2016-09-30 17:11:03 +00:00
Konstantin Belousov
bf9c87c813 Be more strict when selecting between snapshot/regular mount.
Reclaimed vnode type is VBAD, so succesful comparision like
devvp->v_type != VREG does not imply that the devvp references
snapshot, it might be due to a reclaimed vnode.  Explicitely check the
vnode type.

In the the most important case of ffs_blkfree(), the devfs vnode is
locked and its type is stable.  In other cases, if the vnode is
reclaimed right after the check, hopefully the buffer methods return
right error codes.

Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-09-19 15:58:33 +00:00
Konstantin Belousov
6bd8ddcf0c Fix libprocstat build after r305902.
- Use _Bool to not require userspace to include stdbool.h.
- Make extattr.h usable without vnode_if.h.
- Follow i_ump to get cdev pointer.

Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-09-17 18:14:31 +00:00
Konstantin Belousov
e1db68971e Reduce size of ufs inode.
Remove redunand i_dev and i_fs pointers, which are available as
ip->i_ump->um_dev and ip->i_ump->um_fs, and reorder members by size to
reduce padding.  To compensate added derefences, the most often i_ump
access to differentiate between UFS1 and UFS2 dinode layout is
removed, by addition of the new i_flag IN_UFS2.  Overall, this
actually reduces the amount of memory dereferences.

On 64bit machine, original struct inode size is 176, reduced to 152
bytes with the change.

Tested by:	pho (previous version)
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-09-17 16:47:34 +00:00
Bruce Evans
0c01bcb9ff Sprinkle DOINGASYNC() checks so as to do delayed writes for async
mounts in almost all cases instead of in most cases.  Don't override
DOINGASYNC() by any condition except IO_SYNC.

Fix previous sprinking of DOINGASYNC() checks.  Don't override IO_SYNC
by DOINGASYNC().  In ffs_write() and ffs_extwrite(), there were
intentional overrides that just broke O_SYNC of data.  In
ffs_truncate(), there are 5 calls to ffs_update(), 4 with
apparently-unintentional overrides and 1 without; this had no effect
due to the main async mount hack descibed below.

Fix 1 place in ffs_truncate() where the caller's IO_ASYNC was overridden
for the soft updates case too (to do a delayed write instead of a sync
write).  This is supposed to be the only change that affects anything
except async mounts.

In ffs_update(), remove the 19 year old efficiency hack of ignoring
the waitfor flag for async mounts, so that fsync() almost works for
async mounts.  All callers are supposed to be fixed to not ask for a
sync update unless they are for fsync() or [I]O_SYNC operations.
fsync() now almost works for async mounts.  It used to sync the data
but not the most important metdata (the inode).  It still doesn't sync
associated directories.

This gave 10-20% fewer writes for my makeworld benchmark with async
mounted tmp and obj directories from an already small number.

Style fixes:
- in ffs_balloc.c, remove rotted quadruplicated comments about the
  simplest part of the DOING*() decisions and rearrange the nearly-
  quadruplicated code to be more nearly so.
- in ufs_vnops.c, use a consistent style with less negative logic and
  no manual "optimization" of || to | in DOING*() expressions.

Reviewed by:	kib (previous version)
2016-09-08 17:40:40 +00:00
Konstantin Belousov
63876b3ba2 On rename, do not perform truncation of dirhash if the vnode
truncation failed.

Doing so resulted in inconsistent state of the ufs dirhash with regard
to the actual directory inode state, and could lead to spurious ENOENT
errors for lookups of existing files in production kernels, or
assertion failures in the debugging kernels.

Change the logic of calling ufsdirhash_dirtrunc() to be same as in
ufs_direnter().  Execute UFS_TRUNCATE() first, log error, and only do
dirtrunc() if UFS_TRUNCATE() succeeded.

Note that the problem was exacerbated by the bug in the
flush_newblk_dep() function (see r305599), which caused in the spurios
errors from ffs_sync() and then ffs_truncate().

In collaboration with:	pho
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-09-08 12:09:34 +00:00
Konstantin Belousov
7b05b8a29c Do not leak transient ENOLCK error from flush_newblk_dep() loop.
The buffer lock is retried on failed LK_SLEEPFAIL attempt, and error
from the failed attempt is irrelevant.  But since there is path after
retry which does not clear error, it is possible to return spurious
error from the function.

The issue resulted in a spurious failure of softdep_sync_buf(),
causing further spurious failure of ffs_sync().

In collaboration with:	pho
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-09-08 12:08:54 +00:00
Konstantin Belousov
76db05eb14 When logging unlikely UFS_TRUNCATE() failure in ufs_direnter(),
include error code.

Reported and tested by:	pho
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-09-08 12:08:08 +00:00
Konstantin Belousov
ea16af59a1 When externding directory inode in ufs_direnter(), adjust i_endoff.
This change is formally not needed, since i_endoff not used in all
code paths after the call to ufs_direnter(), and i_endoff is
recalculated by the next lookup.  But having the value correct makes
the reasoning about code simpler.

Reported and tested by:	pho
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-09-08 12:07:25 +00:00
Konstantin Belousov
e599d951e3 In dqsync(), when called from quotactl(), um_quotas entry might appear
cleared since nothing prevents completion of the parallel quotaoff.
There is nothing to sync in this case, and no reason to panic.

Reported and tested by:	pho
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-09-08 12:06:43 +00:00
Konstantin Belousov
60f1c000f3 In softdep_prealloc(), return early not only for snapshots, but for
the quota files as well.

Reported and tested by:	pho
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-09-08 12:05:13 +00:00
Konstantin Belousov
ccb19123e5 There is no need to upgrade the last dvp lock on lookups for modifying
operations.  Instead of upgrading, assert that the lock is exclusive.
Explain the cause in comments.

This effectively reverts r209367.

Tested by:	pho
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-09-08 12:04:45 +00:00
Konstantin Belousov
df4265774e Partially lift suspension when ffs_reload() finished with cgs and
going to re-read inodes.

Secondary write initiators, e.g. ufs_inactive(), might need to start a
write while owning the vnode lock.  Since the suspended state
established by /dev/ufssuspend prevents them from entering
vn_start_secondary_write(), we get deadlock otherwise.

Note that it is arguably not very useful to re-read inodes after
/dev/ufssuspend suspension, because the suspension does not block
readers, and other threads might read existing files in parallel with
suspension owner (for now, only growfs(8)) operations.  This
effectively means that suspension owner cannot safely modify existing
inodes, and then there is no sense in re-reading.  But keep the code
enabled for now.

Reported and tested by:	pho
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-09-08 12:01:28 +00:00
Warner Losh
949b56ba66 Renumber the advertising clause. 2016-09-06 15:17:35 +00:00
Kirk McKusick
988fd417a0 Bug 211013 reports that a write error to a UFS filesystem running
with softupdates panics the kernel. The problem that has been pointed
out is that when there is a transient write error on certain metadata
blocks, specifically directory blocks (PAGEDEP), inode blocks
(INODEDEP), indirect pointer blocks (INDIRDEPS), and cylinder group
(BMSAFEMAP, but only when journaling is enabled), we get a panic
in one of the routines called by softdep_disk_io_initiation that
the I/O is "already started" when we retry the write.

These dependency types potentially need to do roll-backs when called
by softdep_disk_io_initiation before doing a write and then a
roll-forward when called by softdep_disk_write_complete after the
I/O completes.  The panic happens when there is a transient error.
At the top of softdep_disk_write_complete we check to see if the
write had an error and if an error occurred we just return.  This
return is correct most of the time because the main role of the routines
called by softdep_disk_write_complete is to process the now-completed
dependencies so that the next I/O steps can happen.

But for the four types listed above, they do not get to do their
rollback operations. This causes the panic when softdep_disk_io_initiation
gets called on the second attempt to do the write and the roll-back
routines find that the roll-backs have already been done. As an
aside I note that there is also the problem that the buffer will
have been unlocked and thus made visible to the filesystem and to
user applications with the roll-backs in place.

The way to resolve the problem is to add a flag to the routines called
by softdep_disk_write_complete for the four dependency types noted
that indicates whether the write was successful (WRITESUCCEEDED).
If the write does not succeed, they do just the roll-backs and then
return. If the write was successful they also do their usual
processing of the now-completed dependencies.

The fix was tested by selectively injecting write errors for buffers
holding dependencies of each of the four types noted above and then
verifying that the kernel no longer paniced and that following the
successful retry of the write that the filesystem could be unmounted
and successfully checked cleanly.

PR: 211013
Reviewed by: kib
2016-08-16 21:02:30 +00:00
Konstantin Belousov
948137db10 In UFS_BALLOC(), invalidate pages of indirect buffers on failed block
allocation unwinding.

Dandling buffers are released on UFS_BALLOC() failure to ensure that
later attempt to allocate blocks in close range do not find the blocks
with invalid content, since possible partial block allocations are
unwound.  As such, it is not enough to just release the buffers, the
pages must also invalidated and removed from the vnode vm_object
queue.  Otherwise the pages might be found later and used to
reconstruct indirect buffers when doing allocations at offset close to
the failure point, and their stale content compromise the filesystem
integrity.

Note that just marking the buffer as B_INVAL is not enough, B_NOCACHE
is required.  To be sure, clear the B_CACHE flag as well.  This
complements the r174973, which started releasing buffers.

Reported and tested by:	pho
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-08-16 17:30:58 +00:00
Konstantin Belousov
2da4bcb82b On unwind after failed block allocation in ffs_balloc_ufs{1,2}, assert
that recorded allocated blocks numbers match the physical block
numbers of dandling buffers which are released.

When finally freeing the blocks during unwind, assert that dandling
buffers where not re-allocated.  They shouldn't, because the vnode lock
is owned exclusive.

Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-08-16 17:18:38 +00:00
Konstantin Belousov
39f7cbe9ab When looking up dandling buffers for unwing after failing block
allocation in UFS_BALLOC(), there is no need to map them.

Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-08-16 17:05:15 +00:00
Konstantin Belousov
554248587e When block allocation fails in UFS_BALLOC(), and the volume does not
have SU enabled, there is no point in calling softdep_request_cleanup().

The call cannot produce free blocks, but we unecessarily lock ufsmount
and do inode block write.  Usual point of not doing optimizations for
the corner case of the full volume is not applicable there, the work
is easily avoidable, and the addition CPU and disk io load do not lead
to succeeding retry.

Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-08-16 16:50:48 +00:00
Konstantin Belousov
f1503c5a49 In ffs_balloc_ufs{1,2} routines, assert that unwind records do not
overflow local arrays.  This is not immediately obvious from the
static code inspection, due to retry logic.

Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-08-16 16:49:56 +00:00
Konstantin Belousov
2f514f92cf Implement VOP_FDATASYNC() for UFS.
Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D7471
2016-08-15 19:22:23 +00:00
Edward Tomasz Napierala
411455a8fb Replace all remaining calls to vprint(9) with vn_printf(9), and remove
the old macro.

MFC after:	1 month
2016-08-10 16:12:31 +00:00
Konstantin Belousov
e73b16ae7a Ensure that the UFS directory vnode' vm_object is properly sized
before UFS_BALLOC() is called.  I do not believe that this caused any
real issue on FreeBSD because the exclusive vnode lock is held over
the balloc/resize, the change is to make formally correct KPI use.

Based on:	the Matthew Dillon' patch from DragonFly BSD
PR:	93942
Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-07-20 14:40:56 +00:00
Kevin Lo
57d2ac2f90 arc4random() returns 0 to (2**32)−1, use an alternative to initialize
i_gen if it's zero rather than a divide by 2.

With inputs from  delphij, mckusick, rmacklem

Reviewed by:	mckusick
2016-05-22 14:31:20 +00:00
Konstantin Belousov
5f36cc5bfa Stop dropping and reacquiring Giant around geom calls in UFS.
Sponsored by:	The FreeBSD Foundation
2016-05-21 10:13:25 +00:00
Konstantin Belousov
c70b3cd28d Improve handling of rdev->si_mountpt on mount and unmount of FFS
volumes.  Treat the field as a semaphore protecting availability of
the device for mounting.  Do no access devvp->v_rdev without the vnode
lock owned.

Protect change of the devvp->v_bufobj bo_ops vector with the vnode
lock.

Reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-05-21 09:49:35 +00:00
Konstantin Belousov
3f7ca894de Ensure that ftruncate(2) is performed synchronously when file is
opened in O_SYNC mode, at least for UFS.  This also handles
truncation, done due to the O_SYNC | O_TRUNC flags combination to
open(2), in synchronous way.

Noted by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-05-18 12:03:57 +00:00
Konstantin Belousov
bfe89a8ee1 Do enable io accounting for read-only mounts and mounts which are
remounted to writeable after initial read-only.  Assign to
dev->si_mountpt earlier to account the accesses done at the mount
time.

Based on submission by:	bde
MFC after:	1 week
2016-05-17 21:35:35 +00:00
Konstantin Belousov
f30ddc49e6 If IO_SYNC was passed to ffs_truncate(), request synchronous inode
update from the final ffs_update().

Noted by:	bde
MFC after:	1 week
2016-05-17 21:30:58 +00:00
Konstantin Belousov
3d0691acdc For async UFS mounts, shrink the directory asynchronously, at least do
not pass IO_SYNC to ffs_truncate() unneccessary.

Submitted by:	bde
MFC after:	1 week
2016-05-17 21:28:28 +00:00
Konstantin Belousov
98cbffd7bd Fix comments.
Submitted by:	bde
MFC after:	1 week
2016-05-17 08:24:27 +00:00
Konstantin Belousov
5b6526413f Fix typo in the message.
Submitted by:	bde
MFC after:	1 week
2016-05-17 08:19:20 +00:00
Pedro F. Giffuni
08e941833d UFS: spelling fixes on comments.
No functional change.
2016-04-29 20:43:51 +00:00
Enji Cooper
028d33c96a Add FEATURE knob for testing for UFS extended attribute kernel support
Support can be verified via `feature_present("ufs_extattr")`, etc.

Differential Revision: https://reviews.freebsd.org/D6053
MFC after: 2 weeks
Relnotes: yes
Reviewed by: asomers, kib
Sponsored by: EMC / Isilon Storage Division
2016-04-22 08:09:27 +00:00
Pedro F. Giffuni
d9c9c81c08 sys: use our roundup2/rounddown2() macros when param.h is available.
rounddown2 tends to produce longer lines than the original code
and when the code has a high indentation level it was not really
advantageous to do the replacement.

This tries to strike a balance between readability using the macros
and flexibility of having the expressions, so not everything is
converted.
2016-04-21 19:57:40 +00:00
Pedro F. Giffuni
abafa4db03 ufs: replace 0 with NULL for pointers.
While here also do late initialization of the variables we are
changing.

Found with devel/coccinelle.

Reviewed by:	mckusick
MFC after:	2 weeks
2016-04-10 21:48:11 +00:00
Edward Tomasz Napierala
ae34b6ff96 Add four new RCTL resources - readbps, readiops, writebps and writeiops,
for limiting disk (actually filesystem) IO.

Note that in some cases these limits are not quite precise. It's ok,
as long as it's within some reasonable bounds.

Testing - and review of the code, in particular the VFS and VM parts - is
very welcome.

MFC after:	1 month
Relnotes:	yes
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D5080
2016-04-07 04:23:25 +00:00
Edward Tomasz Napierala
35030a5dd4 Remove some NULL checks for M_WAITOK allocations.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-03-29 13:56:59 +00:00
Konstantin Belousov
c79dff0fbe Split the global taskqueue used to process all UFS trim completions,
into per-mount taskqueue with the private taskqueue processing thread.
This allows to drain the taskqueue on unmount, to ensure that all
TRIMs are finished before mount structures are freed.

But just draining the taskqueue where TRIM biodone geom-up completions
are processed is not enough, since ffs_blkfree(), called by the task,
might result in more writes.  Count inflight delayed blkfree's and
pause() unmount until the counter drains as well.

Reported by:	Nick Evans <nevans@talkpoint.com>
Tested by:	Nick Evans <nevans@talkpoint.com>, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-03-27 08:21:17 +00:00
Konstantin Belousov
d016708959 Style: wrap long lines.
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-03-27 08:07:12 +00:00
Konstantin Belousov
6336aefc1d Fix locking mistake in softdep_waitidle(). The surrounding code
expects that the loop is always exited with the SU lock owned, even on
error.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2016-03-23 09:58:51 +00:00
Kirk McKusick
24b6748c42 The UFS filesystem requires that the last block of a file always be
allocated. When shortening the length of a file in which the new end
of the file contains a hole, the hole must have a block allocated.

Reported by: Maxim Sobolev
Reviewed by: kib
Tested by:   Peter Holm
2016-02-24 01:58:40 +00:00
Edward Tomasz Napierala
50102a6342 Remove ffs_mountroot() prototype; seems to be long gone.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-01-28 12:21:23 +00:00
Kirk McKusick
abe53f7e91 This fixes a bug in UFS2 exported NFS volumes. An NFS client can
crash a server that has exported UFS2 by presenting a filehandle
with an inode number that references an uninitialized inode in a
cylinder group. The problem is that UFS2 only initializes blocks
of inodes as they are first allocated and ffs_fhtovp() does not
validate that the inode is in a range of inodes that have been
initialized. Attempting to read an uninitialized inode gets random
data from the disk. When the kernel tries to interpret it as an
inode, panics often arise.

Reported by: Christos Zoulas (from NetBSD)
Reviewed by: kib
2016-01-27 21:27:05 +00:00
Kirk McKusick
d2a28cb080 The bread() function was inconsistent about whether it would return
a buffer pointer in the event of an error (for some errors it would
return a buffer pointer and for other errors it would not return a
buffer pointer). The cluster_read() function was similarly inconsistent.

Clients of these functions were inconsistent in handling errors.
Some would assume that no buffer was returned after an error and
would thus lose buffers under certain error conditions. Others would
assume that brelse() should always be called after an error and
would thus panic the system under certain error conditions.

To correct both of these problems with minimal code churn, bread()
and cluster_write() now always free the buffer when returning an
error thus ensuring that buffers will never be lost. The brelse()
routine checks for being passed a NULL buffer pointer and silently
returns to avoid panics. Thus both approaches to handling error
returns from bread() and cluster_read() will work correctly.

Future code should be written assuming that bread() and cluster_read()
will never return a buffer with an error, so should not attempt to
brelse() the buffer when an error is returned.

Reviewed by: kib
2016-01-27 21:23:01 +00:00
Konstantin Belousov
44de1ba141 Recheck curthread->td_su after the VFS_SYNC() call, and re-sync if the
ast was rescheduled during VFS_SYNC().  It is possible that enough
parallel writes or slow/hung volume result in VFS_SYNC() deferring to
the ast flushing of workqueue.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-12-21 11:50:32 +00:00
Konstantin Belousov
dc15d94fd9 Update ctime when atime or birthtime are updated.
Cleanup setting of ctime/mtime/birthtime: do not set IN_ACCESS or
IN_UPDATE, then clear them with ufs_itimes(), making transient
(possibly inconsistent) change to the times, and then copy
user-supplied times into the inode.  Instead, directly clear IN_ACCESS
or IN_UPDATE when user supplied the time, and copy the value into the
inode.

Minor inconsistency left is that the inode ctime is updated even when
birthtime update attempt is performed on a UFS1 volume.

Submitted by:	bde
MFC after:	2 weeks
2015-12-07 12:09:04 +00:00
Kirk McKusick
43a993bb7d For performance reasons, it is useful to have a single string used as
the name of a filesystem when setting it as the first parameter to the
getnewvnode() function. Most filesystems call getnewvnode from just one
place so can use a literal string as the first parameter. However, NFS
calls getnewvnode from two places, so we create a global constant string
that can be used by the two instances. This change also collapses two
instances of getnewvnode() in the UFS filesystem to a single call.

Reviewed by: kib
Tested by:   Peter Holm
2015-11-29 21:01:02 +00:00
Konstantin Belousov
fe920528c1 Do not perform read-ahead for BA_CLRBUF request when we are low on
memory or when dirty buffer queue is too large.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-10-27 13:44:13 +00:00
Warner Losh
476dfdc3bf Do not relocate extents to make them contiguous if the underlying drive can do
deletions. Ability to do deletions is a strong indication that this
optimization will not help performance. It will only generate extra write
traffic. These devices are typically flash based and have a limited number of
write cycles. In addition, making the file contiguous in LBA space doesn't
improve the access times from flash devices because they have no seek time.

Reviewed by: mckusick@
2015-10-16 03:06:02 +00:00
Gleb Smirnoff
20e8b32db4 In softdep_setup_freeblocks():
- Move the bread() to the beginning of function.
- Return if it fails, otherwise we will panic.

Submitted by:	mckusick
Sponsored by:	Netflix
2015-10-07 12:36:28 +00:00
Konstantin Belousov
7a82f35c9d Do not consume extra reference. This is a bug in r287479.
Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-09-05 12:28:18 +00:00
Konstantin Belousov
c65f759837 Declare the writes around the call to VFS_SYNC() in
softdep_ast_cleanup_proc().

Tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-09-05 08:48:24 +00:00
Konstantin Belousov
76cae067b8 By doing file extension fast, it is possible to create excess supply
of the D_NEWBLK kinds of dependencies (i.e. D_ALLOCDIRECT and
D_ALLOCINDIR), which can exhaust kmem.

Handle excess of D_NEWBLK in the same way as excess of D_INODEDEP and
D_DIRREM, by scheduling ast to flush dependencies, after the thread,
which created new dep, left the VFS/FFS innards.  For D_NEWBLK, the
only way to get rid of them is to do full sync, since items are
attached to data blocks of arbitrary vnodes.  The check for D_NEWBLK
excess in softdep_ast_cleanup_proc() is unlocked.

For 32bit arches, reduce the total amount of allowed dependencies by
two.  It could be considered increasing the limit for 64 bit platforms
with direct maps.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-09-01 13:07:27 +00:00
Jeff Roberson
98082691bb - Make 'struct buf *buf' private to vfs_bio.c. Having a global variable
'buf' is inconvenient and has lead me to some irritating to discover
   bugs over the years.  It also makes it more challenging to refactor
   the buf allocation system.
 - Move swbuf and declare it as an extern in vfs_bio.c.  This is still
   not perfect but better than it was before.
 - Eliminate the unused ffs function that relied on knowledge of the buf
   array.
 - Move the shutdown code that iterates over the buf array into vfs_bio.c.

Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
2015-07-29 02:26:57 +00:00
Jeff Roberson
fade8dd714 Refactor unmapped buffer address handling.
- Use pointer assignment rather than a combination of pointers and
   flags to switch buffers between unmapped and mapped.  This eliminates
   multiple flags and generally simplifies the logic.
 - Eliminate b_saveaddr since it is only used with pager bufs which have
   their b_data re-initialized on each allocation.
 - Gather up some convenience routines in the buffer cache for
   manipulating buf space and buf malloc space.
 - Add an inline, buf_mapped(), to standardize checks around unmapped
   buffers.

In collaboration with: mlaier
Reviewed by:	kib
Tested by:	pho (many small revisions ago)
Sponsored by:	EMC / Isilon Storage Division
2015-07-23 19:13:41 +00:00
Mateusz Guzik
f0725a8e1e Move chdir/chroot-related fdp manipulation to kern_descrip.c
Prefix exported functions with pwd_.

Deduplicate some code by adding a helper for setting fd_cdir.

Reviewed by:	kib
2015-07-11 16:19:11 +00:00
Mark Johnston
5f34e93c58 Check suspendability on the mountpoint returned by VOP_GETWRITEMOUNT.
This obviates the need for a MNTK_SUSPENDABLE flag, since passthrough
filesystems like nullfs and unionfs no longer need to inherit this
information from their lower layer(s). This change also restores the
pre-r273336 behaviour of using the presence of a susp_clean VFS method to
request suspension support.

Reviewed by:	kib, mjg
Differential Revision:	https://reviews.freebsd.org/D2937
2015-07-05 22:37:33 +00:00
Mark Murray
d1b06863fb Huge cleanup of random(4) code.
* GENERAL
- Update copyright.
- Make kernel options for RANDOM_YARROW and RANDOM_DUMMY. Set
  neither to ON, which means we want Fortuna
- If there is no 'device random' in the kernel, there will be NO
  random(4) device in the kernel, and the KERN_ARND sysctl will
  return nothing. With RANDOM_DUMMY there will be a random(4) that
  always blocks.
- Repair kern.arandom (KERN_ARND sysctl). The old version went
  through arc4random(9) and was a bit weird.
- Adjust arc4random stirring a bit - the existing code looks a little
  suspect.
- Fix the nasty pre- and post-read overloading by providing explictit
  functions to do these tasks.
- Redo read_random(9) so as to duplicate random(4)'s read internals.
  This makes it a first-class citizen rather than a hack.
- Move stuff out of locked regions when it does not need to be
  there.
- Trim RANDOM_DEBUG printfs. Some are excess to requirement, some
  behind boot verbose.
- Use SYSINIT to sequence the startup.
- Fix init/deinit sysctl stuff.
- Make relevant sysctls also tunables.
- Add different harvesting "styles" to allow for different requirements
  (direct, queue, fast).
- Add harvesting of FFS atime events. This needs to be checked for
  weighing down the FS code.
- Add harvesting of slab allocator events. This needs to be checked for
  weighing down the allocator code.
- Fix the random(9) manpage.
- Loadable modules are not present for now. These will be re-engineered
  when the dust settles.
- Use macros for locks.
- Fix comments.

* src/share/man/...
- Update the man pages.

* src/etc/...
- The startup/shutdown work is done in D2924.

* src/UPDATING
- Add UPDATING announcement.

* src/sys/dev/random/build.sh
- Add copyright.
- Add libz for unit tests.

* src/sys/dev/random/dummy.c
- Remove; no longer needed. Functionality incorporated into randomdev.*.

* live_entropy_sources.c live_entropy_sources.h
- Remove; content moved.
- move content to randomdev.[ch] and optimise.

* src/sys/dev/random/random_adaptors.c src/sys/dev/random/random_adaptors.h
- Remove; plugability is no longer used. Compile-time algorithm
  selection is the way to go.

* src/sys/dev/random/random_harvestq.c src/sys/dev/random/random_harvestq.h
- Add early (re)boot-time randomness caching.

* src/sys/dev/random/randomdev_soft.c src/sys/dev/random/randomdev_soft.h
- Remove; no longer needed.

* src/sys/dev/random/uint128.h
- Provide a fake uint128_t; if a real one ever arrived, we can use
  that instead. All that is needed here is N=0, N++, N==0, and some
  localised trickery is used to manufacture a 128-bit 0ULLL.

* src/sys/dev/random/unit_test.c src/sys/dev/random/unit_test.h
- Improve unit tests; previously the testing human needed clairvoyance;
  now the test will do a basic check of compressibility. Clairvoyant
  talent is still a good idea.
- This is still a long way off a proper unit test.

* src/sys/dev/random/fortuna.c src/sys/dev/random/fortuna.h
- Improve messy union to just uint128_t.
- Remove unneeded 'static struct fortuna_start_cache'.
- Tighten up up arithmetic.
- Provide a method to allow eternal junk to be introduced; harden
  it against blatant by compress/hashing.
- Assert that locks are held correctly.
- Fix the nasty pre- and post-read overloading by providing explictit
  functions to do these tasks.
- Turn into self-sufficient module (no longer requires randomdev_soft.[ch])

* src/sys/dev/random/yarrow.c src/sys/dev/random/yarrow.h
- Improve messy union to just uint128_t.
- Remove unneeded 'staic struct start_cache'.
- Tighten up up arithmetic.
- Provide a method to allow eternal junk to be introduced; harden
  it against blatant by compress/hashing.
- Assert that locks are held correctly.
- Fix the nasty pre- and post-read overloading by providing explictit
  functions to do these tasks.
- Turn into self-sufficient module (no longer requires randomdev_soft.[ch])
- Fix some magic numbers elsewhere used as FAST and SLOW.

Differential Revision: https://reviews.freebsd.org/D2025
Reviewed by: vsevolod,delphij,rwatson,trasz,jmg
Approved by: so (delphij)
2015-06-30 17:00:45 +00:00
Konstantin Belousov
521987c3e5 Simplify code, no need to test the flag before clearing it.
Submitted by:	ed
MFC after:	12 days
2015-06-29 13:06:24 +00:00
Konstantin Belousov
b2c3df842b Handle errors from background write of the cylinder group blocks.
First, on the write error, bufdone() call from ffs_backgroundwrite()
panics because pbrelvp() cleared bp->b_bufobj, while brelse() would
try to re-dirty the copy of the cg buffer.  Handle this by setting
B_INVAL for the case of BIO_ERROR.

Second, we must re-dirty the real buffer containing the cylinder group
block data when background write failed.  Real cg buffer was already
marked clean in ffs_bufwrite(). After the BV_BKGRDINPROG flag is
cleared on the real cg buffer in ffs_backgroundwrite(), buffer scan
may reuse the buffer at any moment. The result is lost write, and if
the write error was only transient, we get corrupted bitmaps.

We cannot re-dirty the original cg buffer in the
ffs_backgroundwritedone(), since the context is not sleepable,
preventing us from sleeping for origbp' lock.  Add BV_BKGDERR flag
(protected by the buffer object lock), which is converted into delayed
write by brelse(), bqrelse() and buffer scan.

In collaboration with:	Conrad Meyer <cse.cem@gmail.com>
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation (kib),
	  EMC/Isilon storage division (Conrad)
MFC after:	2 weeks
2015-06-27 09:44:14 +00:00
Konstantin Belousov
1eabd96728 vfs_msync(), called from syncer vnode fsync VOP, only iterates over
the active vnode list for the given mount point, with the assumption
that vnodes with dirty pages are active.  This is enforced by
vinactive() doing vm_object_page_clean() pass over the vnode pages.

The issue is, if vinactive() cannot be called during vput() due to the
vnode being only shared-locked, we might end up with the dirty pages
for the vnode on the free list.  Such vnode is invisible to syncer,
and pages are only cleaned on the vnode reactivation.  In other words,
the race results in the broken guarantee that user data, written
through the mmap(2), is written to the disk not later than in 30
seconds after the write.

Fix this by keeping the vnode which is freed but still owing
inactivation, on the active list.  When syncer loops find such vnode,
it is deactivated and cleaned by the final vput() call.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-06-17 04:46:58 +00:00
Mateusz Guzik
4da8456f0a Replace struct filedesc argument in getvnode with struct thread
This is is a step towards removal of spurious arguments.
2015-06-16 13:09:18 +00:00
Konstantin Belousov
46c3d3ac99 Syncing a directory vnode might drop the vnode lock in the
softdep_sync() similarly to the regular vnode sync.  Allow retry for
both vnode types.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-06-03 20:48:00 +00:00
Konstantin Belousov
aef373ce38 Remove unused variable.
When deallocate_dependencies() is performed,
softdep_journal_freeblocks() already called cancel_allocdirect() which
should have eliminated direct dependencies for all truncated full
blocks.  The indirect dependencies are allowed above, since second-
and third-level dependencies are only dealt with by the code which
frees indirect block, which happens after the inode write.

Discussed with:	mckusick, jeff
Reviewed by:	jeff
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-05-31 15:50:54 +00:00
Konstantin Belousov
69baeadc31 Remove several write-only variables, all reported by the gcc 4.9
buildkernel run.

Some of them were write-only under some kernel options, e.g. variables
keeping values only used by CTR() macros.  It costs nothing to the
code readability and correctness to eliminate the warnings in those
cases too by removing the local cached values used only for
single-access.

Review:	https://reviews.freebsd.org/D2665
Reviewed by:	rodrigc
Looked at by:	bjk
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-05-29 13:24:17 +00:00
Konstantin Belousov
0bc4fe10ad After r283600, NODELAY flag to inodedep_lookup() function is unused.
Eliminate it, and simplify code by removing the local dflags variable
always initialized to DEPALLOC.

Noted by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-05-27 09:49:04 +00:00
Konstantin Belousov
1bc93bb7b9 Currently, softupdate code detects overstepping on the workitems
limits in the code which is deep in the call stack, and owns several
critical system resources, like vnode locks.  Attempt to wait while
the per-mount softupdate thread cleans up the backlog may deadlock,
because the thread might need to lock the same vnode which is owned by
the waiting thread.

Instead of synchronously waiting for the worker, perform the worker'
tickle and pause until the backlog is cleaned, at the safe point
during return from kernel to usermode.  A new ast request to call
softdep_ast_cleanup() is created, the SU code now only checks the size
of queue and schedules ast.

There is no ast delivery for the kernel threads, so they are exempted
from the mechanism, except NFS daemon threads.  NFS server loop
explicitely checks for the request, and informs the schedule_cleanup()
that it is capable of handling the requests by the process P2_AST_SU
flag.  This is needed because nfsd may be the sole cause of the SU
workqueue overflow.  But, to not cause nsfd to spawn additional
threads just because we slow down existing workers, only tickle su
threads, without waiting for the backlog cleanup.

Reviewed by:	jhb, mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-05-27 09:20:42 +00:00
Kirk McKusick
74a87c3804 Limit the number of cylinder groups that will be searched when
trying to build a cluster. The limit is tunable using the sysctl
vfs.ffs.maxclustersearch. The current limit is 10 cylinder groups
per block allocation. It was previously limited to the number of
cylinder groups in the filesystem per block allocation. When there
were no clusters of the needed size left, it repeatedly searched
the whole filesystem for a non-existent cluster on every block
allocation. The result was very slow filesystem allocation with
100% CPU utilization. The old behavior can be had by setting
vfs.ffs.maxclustersearch to a huge number (1,000,000).

This change affects only the layout policy routines so is not able
to interfere with the integrity of the filesystem.

Reported by: Dmitry Sivachenko (demon@)
Tested by:   Dmitry Sivachenko (demon@)
MFC after:   2 weeks
2015-04-24 23:27:50 +00:00
Rick Macklem
dda11d4ab9 File systems that do not use the buffer cache (such as ZFS) must
use VOP_FSYNC() to perform the NFS server's Commit operation.
This patch adds a mnt_kern_flag called MNTK_USES_BCACHE which
is set by file systems that use the buffer cache. If this flag
is not set, the NFS server always does a VOP_FSYNC().
This should be ok for old file system modules that do not set
MNTK_USES_BCACHE, since calling VOP_FSYNC() is correct, although
it might not be optimal for file systems that use the buffer cache.

Reviewed by:	kib
MFC after:	2 weeks
2015-04-15 20:16:31 +00:00
Konstantin Belousov
9928931186 Fix build (with gcc).
Reported by:	bz, ian
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-03-27 15:49:21 +00:00
Konstantin Belousov
4af9f77ecb Fix the hand after the immediate reboot when the following command
sequence is performed on UFS SU+J rootfs:
cp -Rp /sbin/init /sbin/init.old
mv -f /sbin/init.old /sbin/init

Hang occurs on the rootfs unmount.  There are two issues:

1. Removed init binary, which is still mapped, creates a reference to
the removed vnode. The inodeblock for such vnode must have active
inodedep, which is (eventually) linked through the unlinked list. This
means that ffs_sync(MNT_SUSPEND) cannot succeed, because number of
softdep workitems for the mp is always > 0.  FFS is suspended during
unmount, so unmount just hangs.

2. As noted above, the inodedep is linked eventually.  It is not
linked until the superblock is written.  But at the vfs_unmountall()
time, when the rootfs is unmounted, the call is made to
ffs_unmount()->ffs_sync() before vflush(), and ffs_sync() only calls
ffs_sbupdate() after all workitems are flushed.  It is masked for
normal system operations, because syncer works in parallel and
eventually flushes superblock.  Syncer is stopped when rootfs
unmounted, so ffs_sync() must do sb update on its own.

Correct the issues listed above. For MNT_SUSPEND, count the number of
linked unlinked inodedeps (this is not a typo) and substract the count
of such workitems from the total. For the second issue, the
ffs_sbupdate() is called right after device sync in ffs_sync() loop.

There is third problem, occuring with both SU and SU+J. The
softdep_waitidle() loop, which waits for softdep_flush() thread to
clear the worklist, only waits 20ms max. It seems that the 1 tick,
specified for msleep(9), was a typo.

Add fsync(devvp, MNT_WAIT) call to softdep_waitidle(), which seems to
significantly help the softdep thread, and change the MNT_LAZY update
at the reboot time to MNT_WAIT for similar reasons.  Note that
userspace cannot create more work while devvp is flushed, since the
mount point is always suspended before the call to softdep_waitidle()
in unmount or remount path.

PR:	195458
In collaboration with:	gjb, pho
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-03-27 13:55:56 +00:00
Konstantin Belousov
f6f76e1747 Partially revert r277922, avoid sleeping and do flush if we a awaken,
instead of waiting for the FLUSH_* flags.  Also, when requesting
flush, do the wakeups unconditionally even when FLUSH_CLEANUP flag was
already set.

Reported and tested by:	dim,
	"Lundberg, Johannes" <johannes@brilliantservice.co.jp>
Bisected by:	dim
MFC after:	2 weeks
2015-02-05 13:00:27 +00:00
Konstantin Belousov
1c9b585695 When mounting SU-enabled mount point, wait until the softdep_flush()
thread started and incremented the stat_flush_threads [1].

Unconditionally wakeup softdep_flush threads when needed, do not try
to check wchan, which is racy and breaks abstraction.

Reported by and discussed with:	glebius, neel
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-01-30 11:41:46 +00:00
Konstantin Belousov
cda374fe00 The sys_quotactl() contract demands that the mount point is
vfs_unbusy()ed when the cmd is Q_QUOTAON, regardless of other input
parameters or error return.

Submitted by:	Conrad Meyer
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:  https://reviews.freebsd.org/D1684
Tested by:	pho
MFC after:	1 week
2015-01-27 10:32:49 +00:00
Konstantin Belousov
789bdfdbc6 Handle MAKEENTRY cnp flag in the VOP_CREATE(). Curiously, some
fs, e.g. smbfs, already did it.

Tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-12-21 13:29:33 +00:00
Konstantin Belousov
6c21f6edb8 The VOP_LOOKUP() implementations for CREATE op do not put the name
into namecache, to avoid cache trashing when doing large operations.
E.g., tar archive extraction is not usually followed by access to many
of the files created.

Right now, each VOP_LOOKUP() implementation explicitely knowns about
this quirk and tests for both MAKEENTRY flag presence and op != CREATE
to make the call to cache_enter().  Centralize the handling of the
quirk into VFS, by deciding to cache only by MAKEENTRY flag in VOP.
VFS now sets NOCACHE flag for CREATE namei() calls.

Note that the change in semantic is backward-compatible and could be
merged to the stable branch, and is compatible with non-changed
third-party filesystems which correctly handle MAKEENTRY.

Suggested by:	Chris Torek <torek@pi-coral.com>
Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-12-18 10:01:12 +00:00