Commit Graph

3928 Commits

Author SHA1 Message Date
Fedor Uporov
4ff6603ab3 Do not panic if inode bitmap is corrupted.
admbug:         804
Reported by:    Ilja Van Sprundel <ivansprundel@ioactive.com>
Reviewed by:    pfg
MFC after:      1 week

Differential Revision:    https://reviews.freebsd.org/D19325
2019-03-04 11:12:19 +00:00
Fedor Uporov
80a4a9716b Validate block bitmaps.
Reviewed by:    pfg
MFC after:      1 week

Differential Revision:    https://reviews.freebsd.org/D19324
2019-03-04 11:01:23 +00:00
Fedor Uporov
daa2d62da2 Add additional on-disk inode checks.
Reviewed by:    pfg
MFC after:      1 week

Differential Revision:    https://reviews.freebsd.org/D19323
2019-03-04 10:55:01 +00:00
Fedor Uporov
6e38bf94e5 Make superblock reading logic more strict.
Add more on-disk superblock consistency checks to ext2_compute_sb_data() function.
It should decrease the probability of mounting filesystems with corrupted superblock data.

Reviewed by:    pfg
MFC after:      1 week

Differential Revision:    https://reviews.freebsd.org/D19322
2019-03-04 10:42:25 +00:00
Alan Somers
c02ccc7e44 Fix typos from r344664
Sponsored by:	The FreeBSD Foundation
2019-03-01 15:49:11 +00:00
Alan Somers
cf16949867 fuse(4): convert debug printfs into dtrace probes
fuse(4) was heavily instrumented with debug printf statements that could
only be enabled with compile-time flags.  They fell into three basic groups:

1) Totally redundant with dtrace FBT probes.  These I deleted.
2) Print textual information, usually error messages.  These I converted to
   SDT probes of the form fuse:fuse:FILE:trace.  They work just like the old
   printf statements except they can be enabled at runtime with dtrace.
   They can be filtered by FILE and/or by priority.
3) More complicated probes that print detailed information.  These I
   converted into ad-hoc SDT probes.

Sponsored by:	The FreeBSD Foundation
2019-02-28 19:27:54 +00:00
Conrad Meyer
f6ebb68395 fuse: Fix a regression introduced in r337165
On systems with non-default DFLTPHYS and/or MAXBSIZE, FUSE would attempt to
use a buf cache block size in excess of permitted size.  This did not affect
most configurations, since DFLTPHYS and MAXBSIZE both default to 64kB.
The issue was discovered and reported using a custom kernel with a DFLTPHYS
of 512kB.

PR:		230260 (comment #9)
Reported by:	ken@
MFC after:	π/𝑒 weeks
2019-02-21 02:41:57 +00:00
Matt Macy
81167243b4 PFS: Bump NAMELEN and don't require clients to be sleepable
- debugfs consumers expect to be able to export names more than 48 characters

- debugfs consumers expect to be able to hold locks across calls and are able
  to handle allocation failures

Reviewed by:	hps@
MFC after:	1 week
Sponsored by:	iX Systems
Differential Revision:	https://reviews.freebsd.org/D19256
2019-02-20 20:55:02 +00:00
Conrad Meyer
02295caf43 Fuse: whitespace and style(9) cleanup
Take a pass through fixing some of the most egregious whitespace issues in
fs/fuse.  Also fix some style(9) warts while here.  Not 100% cleaned up, but
somewhat less painful to look at and edit.

No functional change.
2019-02-20 02:49:26 +00:00
Conrad Meyer
bd4cb2a46d fuse: add descriptions for remaining sysctls
(Except reclaim revoked; I don't know what that goal of that one is.)
2019-02-20 02:48:59 +00:00
Edward Tomasz Napierala
c9172fb4f1 Work around the "nfscl: bad open cnt on server" assertion
that can happen when rerooting into NFSv4 rootfs with kernel
built with INVARIANTS.

I've talked to rmacklem@ (back in 2017), and while the root cause
is still unknown, the case guarded by assertion (nfscl_doclose()
being called from VOP_INACTIVE) is believed to be safe, and the
whole thing seems to run just fine.

Obtained from:	CheriBSD
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
2019-02-19 12:45:37 +00:00
Conrad Meyer
3c324b9465 FUSE: Refresh cached file size when it changes (lookup)
The cached fvdat->filesize is indepedent of the (mostly unused)
cached_attrs, and we failed to update it when a cached (but perhaps
inactive) vnode was found during VOP_LOOKUP to have a different size than
cached.

As noted in the code comment, this can occur in distributed filesystems or
with other kinds of irregular file behavior (anything is possible in FUSE).

We do something similar in fuse_vnop_getattr already.

PR:		230258 (as reported in description; other issues explored in
			comments are not all resolved)
Reported by:	MooseFS FreeBSD Team <freebsd AT moosefs.com>
Submitted by:	Jakub Kruszona-Zawadzki <acid AT moosefs.com> (earlier version)
2019-02-15 22:55:13 +00:00
Conrad Meyer
c4af8b173a FUSE: The FUSE design expects writethrough caching
At least prior to 7.23 (which adds FUSE_WRITEBACK_CACHE), the FUSE protocol
specifies only clean data to be cached.

Prior to this change, we implement and default to writeback caching.  This
is ok enough for local only filesystems without hardlinks, but violates the
general design contract with FUSE and breaks distributed filesystems or
concurrent access to hardlinks of the same inode.

In this change, add cache mode as an extension of cache enable/disable.  The
new modes are UC (was: cache disabled), WT (default), and WB (was: cache
enabled).

For now, WT caching is implemented as write-around, which meets the goal of
only caching clean data.  WT can be better than WA for workloads that
frequently read data that was recently written, but WA is trivial to
implement.  Note that this has no effect on O_WRONLY-opened files, which
were already coerced to write-around.

Refs:
  * https://sourceforge.net/p/fuse/mailman/message/8902254/
  * https://github.com/vgough/encfs/issues/315

PR:		230258 (inspired by)
2019-02-15 22:52:49 +00:00
Conrad Meyer
194e691aaf FUSE: Only "dirty" cached file size when data is dirty
Most users of fuse_vnode_setsize() set the cached fvdat->filesize and update
the buf cache bounds as a result of either a read from the underlying FUSE
filesystem, or as part of a write-through type operation (like truncate =>
VOP_SETATTR).  In these cases, do not set the FN_SIZECHANGE flag, which
indicates that an inode's data is dirty (in particular, that the local buf
cache and fvdat->filesize have dirty extended data).

PR:		230258 (related)
2019-02-15 22:51:09 +00:00
Conrad Meyer
09176f096b FUSE: Respect userspace FS "do-not-cache" of path components
The FUSE protocol demands that kernel implementations cache user filesystem
path components (lookup/cnp data) for a maximum period of time in the range
of [0, ULONG_MAX] seconds.  In practice, typical requests are for 0, 1, or
10 seconds; or "a long time" to represent indefinite caching.

Historically, FreeBSD FUSE has ignored this client directive entirely.  This
works fine for local-only filesystems, but causes consistency issues with
multi-writer network filesystems.

For now, respect 0 second cache TTLs and do not cache such metadata.
Non-zero metadata caching TTLs in the range [0.000000001, ULONG_MAX] seconds
are still cached indefinitely, because it is unclear how a userspace
filesystem could do anything sensible with those semantics even if
implemented.

Pass fuse_entry_out to fuse_vnode_get when available and only cache lookup
if the user filesystem did not set a zero second TTL.

PR:		230258 (inspired by; does not fix)
2019-02-15 22:50:31 +00:00
Conrad Meyer
78a7722fbc FUSE: Respect userspace FS "do-not-cache" of file attributes
The FUSE protocol demands that kernel implementations cache user filesystem
file attributes (vattr data) for a maximum period of time in the range of
[0, ULONG_MAX] seconds.  In practice, typical requests are for 0, 1, or 10
seconds; or "a long time" to represent indefinite caching.

Historically, FreeBSD FUSE has ignored this client directive entirely.  This
works fine for local-only filesystems, but causes consistency issues with
multi-writer network filesystems.

For now, respect 0 second cache TTLs and do not cache such metadata.
Non-zero metadata caching TTLs in the range [0.000000001, ULONG_MAX] seconds
are still cached indefinitely, because it is unclear how a userspace
filesystem could do anything sensible with those semantics even if
implemented.

In the future, as an optimization, we should implement notify_inval_entry,
etc, which provide userspace filesystems a way of evicting the kernel cache.

One potentially bogus access to invalid cached attribute data was left in
fuse_io_strategy.  It is restricted behind the undocumented and non-default
"vfs.fuse.fix_broken_io" sysctl or "brokenio" mount option; maybe these are
deadcode and can be eliminated?

Some minor APIs changed to facilitate this:

1. Attribute cache validity is tracked in FUSE inodes ("fuse_vnode_data").

2. cache_attrs() respects the provided TTL and only caches in the FUSE
inode if TTL > 0.  It also grows an "out" argument, which, if non-NULL,
stores the translated fuse_attr (even if not suitable for caching).

3. FUSE VTOVA(vp) returns NULL if the vnode's cache is invalid, to help
avoid programming mistakes.

4. A VOP_LINK check for potential nlink overflow prior to invoking the FUSE
link op was weakened (only performed when we have a valid attr cache).  The
check is racy in a multi-writer network filesystem anyway -- classic TOCTOU.
We have to trust any userspace filesystem that rejects local caching to
account for it correctly.

PR:		230258 (inspired by; does not fix)
2019-02-15 22:49:15 +00:00
Konstantin Belousov
b9662886ef Un null_vptocnp(), cache vp->v_mount and use it for null_nodeget() call.
The vp vnode is unlocked during the execution of the VOP method and
can be reclaimed, zeroing vp->v_data.  Caching allows to use the
correct mount point.

Reported and tested by:	pho
PR: 235549
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-02-08 08:20:18 +00:00
Konstantin Belousov
25728e8411 Before using VTONULL(), check that the covered vnode belongs to nullfs.
Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-02-08 08:17:31 +00:00
Konstantin Belousov
930cc2dbef Some style for nullfs_mount(). Also use bool type for isvnunlocked.
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-02-08 08:15:29 +00:00
Pedro F. Giffuni
771ec59bb7 ext2fs: Add some extra consistency checks for the superblock.
Maliciously formed, or badly corrupted, filesystems can cause kernel
panics.  In general, such acts of foot-shooting can only be accomplished
by root, but in a world with VM images that is  moving towards automated
mounts it is important to have some form of prevention.

Reported by: Christopher Krah, Thomas Barabosch, and Jan-Niclas Hilgert
of Fraunhofer FKIE.
Incidentaly this should also fix a memory corruption issue reported by
Dr Silvio Cesare of InfoSect.

Huge thanks to all reseachers for making us aware of the issue.

admbug:		872, 891
Reviewed by:	fsu
Obtained from:	NetBSD (with minor changes)
MFC after:	3 days
2019-01-25 22:22:29 +00:00
Mark Johnston
d9463dd4f3 nfs: Zero the buffers exported by NFSSVC_DUMPCLIENTS and DUMPLOCKS.
Note that these interfaces are available only to root.

admbugs:	765
Reported by:	Vlad Tsyrklevich <vlad@tsyrklevich.net>
Reviewed by:	rmacklem
MFC after:	1 day
Security:	Kernel memory disclosure
Sponsored by:	The FreeBSD Foundation
2019-01-21 23:54:33 +00:00
Oleksandr Tymoshenko
52b2c8e242 [smbfs] Allow semicolon in mounts that support long names
Semicolon is a legal character in long names but not in 8.3 format.
Move it to respective character set.

PR:		140068
Submitted by:	tom@uffner.com
MFC after:	3 weeks
2019-01-20 05:52:16 +00:00
Gleb Smirnoff
756a541279 Allocate pager bufs from UMA instead of 80-ish mutex protected linked list.
o In vm_pager_bufferinit() create pbuf_zone and start accounting on how many
  pbufs are we going to have set.
  In various subsystems that are going to utilize pbufs create private zones
  via call to pbuf_zsecond_create(). The latter calls uma_zsecond_create(),
  and sets a limit on created zone. After startup preallocate pbufs according
  to requirements of all pbuf zones.

  Subsystems that used to have a private limit with old allocator now have
  private pbuf zones: md(4), fusefs, NFS client, smbfs, VFS cluster, FFS,
  swap, vnode pager.

  The following subsystems use shared pbuf zone: cam(4), nvme(4), physio(9),
  aio(4). They should have their private limits, but changing that is out of
  scope of this commit.

o Fetch tunable value of kern.nswbuf from init_param2() and while here move
  NSWBUF_MIN to opt_param.h and eliminate opt_swap.h, that was holding only
  this option.
  Default values aren't touched by this commit, but they probably should be
  reviewed wrt to modern hardware.

This change removes a tight bottleneck from sendfile(2) operation, that
uses pbufs in vnode pager. Other pagers also would benefit from faster
allocation.

Together with:	gallatin
Tested by:	pho
2019-01-15 01:02:16 +00:00
Kirk McKusick
c0029546f8 When loading an inode from disk, verify that its mode is valid.
If invalid, return EINVAL. Note that inode check-hashes greatly
reduce the chance that these errors will go undetected.

Reported by:  Christopher Krah <krah@protonmail.com>
Reported as:  FS-5-UFS-2: Denial Of Service in nmount-3 (ffs_read)
Reviewed by:  kib
MFC after:    1 week
Sponsored by: Netflix

M    sys/fs/ext2fs/ext2_vnops.c
M    sys/kern/vfs_subr.c
M    sys/ufs/ffs/ffs_snapshot.c
M    sys/ufs/ufs/ufs_vnops.c
2018-12-27 07:18:53 +00:00
Bruce Evans
416e232cc6 Fix clobbering of the fatchain cache for clustered i/o's when full
clustering is not done.  The bug caused extreme slowness for large
files in some cases.

There is no way to tell VOP_BMAP() how many blocks are wanted, so for
all file systems it has to waste time in some cases by searching for
more contiguous blocks than will be accessed.  For msdosfs, it also
clobbered the fatchain cache in these cases by advancing the cache to
point to the chain entry for block that won't be read.  This makes
the cache useless for the next sequential i/o (or VOP_BMAP()), so the
fat chain is searched from the beginning.  The cache only has 1 relevant
entry, so it is similarly useless for random i/o.

Fix this by only advancing the cache to point to the chain entry for
the first block that will be read.  Clustering uses results from
VOP_BMAP(), so when more than 1 block is read by clustering, the cache
is not advanced as optimally as before, but it is at most 1 cluster
size behind and searching the chain through the blocks for this cluster
doesn't take too long.
2018-12-21 21:17:45 +00:00
Bruce Evans
8ec22c4d65 Quick fix for initialization of mnt_iosize_max. (This limit controls
mainly clustering and read-ahead.)  Copy the initialization from ffs,
and also copy a couple of lines of ffs's nearby style for initialization
order and whitespace.

A correct fix would de-duplicate the initialization and fix bitrot in it
instead of adding another instance of the duplication.  Complications to
use the size preferred by the device have been reduced to hard-coding
slightly pessimal and/or inconsistent defaults, using large code that was
almost needed to support the complications.

For msdosfs, the result was that mnt_iosize_max was DFTLPHYS (64K) but is
now MAXPHYS (128K).
2018-12-21 20:12:43 +00:00
Rick Macklem
23114c6c2a Fix the NFSv4 server to obey vfs.nfsd.nfs_privport.
When the NFSv4 server was coded, I believed that the specification authors
did not want NFSv4 servers to require a client to use a reserved port#.
However, recently it has been noted that the Linux NFSv4 server does support
a check for a reserved port#.
Since both the FreeBSD and Linux NFSv4 clients use a reserved port# by
default, enabling vfs.nfsd.nfs_privport to require a reserved port# for
NFSv4 the same as it does for NFSv2, 3 seems reasonable.
The only case where this could cause a POLA violation is a FreeBSD NFSv4
server with vfs.nfsd.nfs_privport set, but with NFSv4 clients doing mounts
without using a reserved port# (< 1024).

Tested by:	chaz.newton58@gmail.com
PR:		234106
MFC after:	1 week
2018-12-20 22:21:41 +00:00
Mateusz Guzik
cc426dd319 Remove unused argument to priv_check_cred.
Patch mostly generated with cocinnelle:

@@
expression E1,E2;
@@

- priv_check_cred(E1,E2,0)
+ priv_check_cred(E1,E2)

Sponsored by:	The FreeBSD Foundation
2018-12-11 19:32:16 +00:00
Mark Johnston
352aaa5122 Plug memory disclosures via ptrace(2).
On some architectures, the structures returned by PT_GET*REGS were not
fully populated and could contain uninitialized stack memory.  The same
issue existed with the register files in procfs.

Reported by:	Thomas Barabosch, Fraunhofer FKIE
Reviewed by:	kib
MFC after:	3 days
Security:	kernel stack memory disclosure
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18421
2018-12-03 20:54:17 +00:00
Mark Johnston
fee65dfc37 Ensure the dirent remains initialized when dirent.d_fileno is unset.
Reported by:	rmacklem
MFC with:	r340856
Sponsored by:	The FreeBSD Foundation
2018-11-23 23:07:49 +00:00
Mark Johnston
6d2e2df764 Ensure that directory entry padding bytes are zeroed.
Directory entries must be padded to maintain alignment; in many
filesystems the padding was not initialized, resulting in stack
memory being copied out to userspace.  With the ino64 work there
are also some explicit pad fields in struct dirent.  Add a subroutine
to clear these bytes and use it in the in-tree filesystems.  The
NFS client is omitted for now as it was fixed separately in r340787.

Reported by:	Thomas Barabosch, Fraunhofer FKIE
Reviewed by:	kib
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
2018-11-23 22:24:59 +00:00
Rick Macklem
f86bce1770 Make sure the NFS readdir client fills in all "struct dirent" data.
The NFS client code (nfsrpc_readdir() and nfsrpc_readdirplus()) wasn't
filling in parts of the readdir reply, such as d_pad[01] and the bytes
at the end of d_name within d_reclen. As such, data left in a buffer cache
block could be leaked to userland in the readdir reply.
This patch makes sure all of the data is filled in.

Reported by:	Thomas Barabosch, Fraunhofer FKIE
Reviewed by:	kib, markj
MFC after:	2 weeks
2018-11-23 00:17:47 +00:00
Mateusz Guzik
53011553fa proc: convert pfind & friends to use pidhash locks and other cleanup
pfind_locked is retired as it relied on allproc which unnecessarily
restricts locking of the hash.

Sponsored by:	The FreeBSD Foundation
2018-11-21 20:15:56 +00:00
Mateusz Guzik
30e0cf499f tmpfs: use unr64 for inode numbers
Sponsored by:	The FreeBSD Foundation
2018-11-20 15:14:30 +00:00
Rick Macklem
75772b69f2 Improve sanity checking for the dircount hint argument to
NFSv3's ReaddirPlus and NFSv4's Readdir operations. The code
checked for a zero argument, but did not check for a very large value.
This patch clips dircount at the server's maximum data size.

MFC after:	1 week
2018-11-20 01:59:57 +00:00
Rick Macklem
778f29833b nfsm_advance() would panic() when the offs argument was negative.
The code assumed that this would indicate a corrupted mbuf chain, but
it could simply be caused by bogus RPC message data.
This patch replaces the panic() with a printf() plus error return.

MFC after:	1 week
2018-11-20 01:56:34 +00:00
Rick Macklem
1d171e7971 r304026 added code that started statistics gathering for an operation
before the operation number (the variable called "op") was sanity checked.
This patch moves the code down to below the range sanity check for "op".
2018-11-20 01:52:45 +00:00
Mark Johnston
3d2a0fe762 Remove comments made obsolete by the ino64 work.
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
2018-11-19 17:33:44 +00:00
Konstantin Belousov
1c4ca77890 Add d_off support for multiple filesystems.
The d_off field has been added to the dirent structure recently.
Currently filesystems don't support this feature.  Support has been
added and tested for zfs, ufs, ext2fs, fdescfs, msdosfs and unionfs.
A stub implementation is available for cd9660, nandfs, udf and
pseudofs but hasn't been tested.

Motivation for this feature: our usecase is for a userspace nfs server
(nfs-ganesha) with zfs.  At the moment we cache direntry offsets by
calling lseek once per entry, with this patch we can get the offset
directly from getdirentries(2) calls which provides a significant
speedup.

Submitted by:	Jack Halford <jack@gandi.net>
Reviewed by:	mckusick, pfg, rmacklem (previous versions)
Sponsored by:	Gandi.net
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D17917
2018-11-14 14:18:35 +00:00
Rick Macklem
6ad8a6eaa4 Change nfs_advlock() so that the NFSVOPUNLOCK() is mostly done at the end.
Prior to this patch, nfs_advlock() did NFSVOPUNLOCK(); return (error);
in many places. This patch replaces these code sequenences with a "goto out;"
and does the NFSVOPUNLOCK(); return (error); at the end of the function
in order to make the vnode locking simpler.
This patch does not change the semantics of nfs_advlock().

Suggested by:	kib
Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D17853
2018-11-06 22:50:50 +00:00
Brooks Davis
318f0d7720 Use declared types for caddr_t arguments.
Leave ptrace(2) alone for the moment as it's defined to take a caddr_t.

Reviewed by:	kib
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D17852
2018-11-06 18:46:38 +00:00
Brooks Davis
1493c2ee62 Make vop_symlink take a const target path.
This will enable callers to take const paths as part of syscall
decleration improvements.

Where doing so is easy and non-distruptive carry the const through
implementations. In UFS the value is passed to an interface that must
take non-const values. In ZFS, const poisoning would touch code shared
with upstream and it's not worth adding diffs.

Bump __FreeBSD_version for external API consumers.

Reviewed by:	kib (prior version)
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D17805
2018-11-02 14:42:36 +00:00
Rick Macklem
881a9516a2 Fix NFS client vnode locking to avoid a crash during forced dismount.
A crash was reported where the crash occurred in nfs_advlock() when the
NFS_ISV4(vp) macro was being executed. This was caused by the vnode
being VI_DOOMED due to a forced dismount in progress.
This patch fixes the problem by locking the vnode before executing the
NFS_ISV4() macro.

Tested by:	rlibby
PR:		232673
Reviewed by:	kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D17757
2018-11-01 15:27:22 +00:00
Brooks Davis
ed34a7fcf2 Move 32-bit compat support for FIODGNAME to the right place.
ioctl(2) commands only have meaning in the context of a file descriptor
so translating them in the syscall layer is incorrect.

The new handler users an accessor to retrieve/construct a pointer from
the last member of the passed structure and relies on type punning to
access the other member which requires no translation.

Unlike r339174 this change supports both places FIODGNAME is handled.

Reviewed by:	kib
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D17475
2018-10-26 17:59:25 +00:00
Konstantin Belousov
8ff7fad1d7 Only call sigdeferstop() for NFS.
Use bypass to catch any NFS VOP dispatch and route it through the
wrapper which does sigdeferstop() and then dispatches original
VOP. NFS does not need a bypass below it, which is not supported.

The vop offset in the vop_vector is added since otherwise it is
impossible to get vop_op_t from the internal table, and I did not
wanted to create the layered fs only to wrap NFS VOPs.

VFS_OP()s wrap is straightforward.

Requested and reviewed by:	mjg (previous version)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D17658
2018-10-23 21:43:41 +00:00
Andriy Gapon
ca8f3d1ca2 nfsrvd_readdirplus: for some errors, do not fail the entire request
Instead, a failing entry is skipped.
This change consist of two logical changes.

A failure to vget or lookup an entry is considered to be a result of a
concurrent removal, which is the only reasonable explanation given that
the filesystem is busied.  So, the entry would be silently skipped.

In the case of a failure to get attributes of an entry for an NFSv3
request, the entry would be silently skipped.  There can be legitimate
reasons for the failure, but NFSv3 does not provide any means to report
the error, so we have two options: either fail the whole request or
ignore the failed entry.  Traditionally, the old NFS server used the
latter option, so the code is reverted to it.  Making the whole
directory unreadable because of a single entry seems to be unpractical.

Additionally, some bits of code are slightly re-arranged to account for
the new control flow and to honor style(9).

Reviewed by:	rmacklem
Sponsored by:	Panzura
Differential Revision: https://reviews.freebsd.org/D15424
2018-10-22 15:33:05 +00:00
Rick Macklem
910ccc7727 Fix the pNFS server's reporting of disk space usage for the "#<path>" case.
The pNFS server would report the total disk space used and free for all
of the DSs, even when certain DSs are assigned to the file system via
the "#<path>" suffix used in the "nfsd -p" option argument.
This patch fixes this case. It only reports usage for the file system
that the argument vnode resides on. This is consistent with the non-pNFS
NFSv4 server. In NFSv4 it is possible to have subtrees on other file
systems, but these are not included in the usage information for NFSv4.

Approved by:	re (gjb)
2018-10-09 01:10:50 +00:00
Brooks Davis
9bc603bd20 Revert r339174: Move 32-bit compat support for FIODGNAME to the right place.
A case was missed in this commit which breaks sshing into a 32-bit sshd
on a 64-bit system.

Approved by:	re (gjb)
2018-10-04 23:55:03 +00:00
Brooks Davis
23f2e22802 Move 32-bit compat support for FIODGNAME to the right place.
ioctl(2) commands only have meaning in the context of a file descriptor
so translating them in the syscall layer is incorrect.

The new handler users an accessor to retrieve/construct a pointer from
the last member of the passed structure and relies on type punning to
access the other member which requires no translation.

Reviewed by:	kib
Approved by:	re (rgrimes, gjb)
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Review:	https://reviews.freebsd.org/D17388
2018-10-03 20:39:48 +00:00
Mark Murray
19fa89e938 Remove the Yarrow PRNG algorithm option in accordance with due notice
given in random(4).

This includes updating of the relevant man pages, and no-longer-used
harvesting parameters.

Ensure that the pseudo-unit-test still does something useful, now also
with the "other" algorithm instead of Yarrow.

PR:		230870
Reviewed by:	cem
Approved by:	so(delphij,gtetlow)
Approved by:	re(marius)
Differential Revision:	https://reviews.freebsd.org/D16898
2018-08-26 12:51:46 +00:00
Fedor Uporov
28f4f62303 FUSE extattrs: fix issue when neither uio nor size were not passed to VOP_* (cosmetic only).
Reviewed by:    cem, pfg
MFC after:      2 weeks

Differential Revision:	https://reviews.freebsd.org/D13737
2018-08-21 18:50:29 +00:00
Fedor Uporov
493b4a8ccd FUSE extattrs: fix issue when neither uio nor size were not passed to VOP_*.
The requested size was returned incorrectly in case uio == NULL from listextattr because the
nameprefix/name conversion was not applied.
Also, make a_size/uio returning logic more unified with other filesystems.

Reviewed by:    cem, pfg
MFC after:      2 weeks

Differential Revision:	https://reviews.freebsd.org/D13528
2018-08-21 18:39:47 +00:00
Fedor Uporov
4c1e1d2bcc Change unused inodes counters behavior in the cylinder groups.
Make it more close to native ext4 implementation to avoid fsck errors.
2018-08-21 18:39:29 +00:00
Fedor Uporov
e49d64a7a7 Fix directory blocks checksum updating logic.
Count dirent tail in the searchslot logic in case of directory block search.
Add htree root csum update function call in case of rename.
2018-08-21 18:39:02 +00:00
Rick Macklem
fdab4d3b29 Fix LORs between vn_start_write() and vn_lock() in nfsrv_copymr().
When coding the pNFS server, I added vn_start_write() calls in nfsrv_copymr()
done while the vnodes were locked, not realizing I had introduced LORs and
possible deadlock when an exported file system on the MDS is suspended.
This patch fixes the LORs by moving the vn_start_write() calls up to before
where the vnodes are locked. For "tvp", the vn_start_write() probaby isn't
necessary, because NFS mounts can't be suspended. However, I think doing
so is harmless.
Thanks go to kib@ for letting me know that I had introduced these LORs.
This patch only affects the behaviour of the pNFS server when pnfsdscopymr(8)
is used to recover a mirrored DS.
2018-08-18 19:14:06 +00:00
Rick Macklem
3e5ba2e187 Fix LORs between vn_start_write() and vn_lock() in the pNFS server.
When coding the pNFS server, I added several vn_start_write() calls done
while the vnode was locked, not realizing I had introduced LORs and
possible deadlock when an exported file system on the MDS is suspended.
This patch fixes this by removing the added vn_start_write() calls and
modifying the code so that the extant vn_start_write() call before the
NFS RPC/operation is done when needed by the pNFS server.
Flags are changed so that LayoutCommit and LayoutReturn now get a
vn_start_write() done for them.
When the pNFS server is enabled, the code now also changes the flags for
Getattr, so that the vn_start_write() is done for Getattr, since it may
need to do a vn_set_extattr(). The nfs_writerpc flag array was made global
to the NFS server and renamed nfsrv_writerpc, which is consistent naming
for globals in the NFS server.
Thanks go to kib@ for reporting that doing vn_start_write() while the vnode is
locked results in a LOR.
This patch only affects the behaviour of the pNFS server.
2018-08-17 21:12:16 +00:00
Rick Macklem
9fbb0faf4f Don't set a file's size for the MDS file of a pNFS service.
When a pNFS service is running, the size of the files created on the MDS
are normally 0, since the data is written to the data files on the DS(s).
However, without this patch, if a Setattr with a non-zero size was done by
a client, the MDS file was set to that size.  This was thought to be benign,
but it turns out that files with a non-zero size plus extended attributes
can cause a "ffs_truncate3" panic in UFS. Although the exact cause of this
panic() has not been isolated, this patch avoids the panic() and leaves
the MDS files in a consistent state of always having a size == 0.
Note that these MDS files never store data. The patch also includes an
unnecessary initialization of savsize in case some compiler or static
analyser complains it might not be initialized.
This patch only affects the NFS server when pNFS is enabled via the "-p"
command line option on nfsd.
2018-08-17 12:32:38 +00:00
Jamie Gritton
284001a222 Put jail(2) under COMPAT_FREEBSD11. It has been the "old" way of creating
jails since FreeBSD 7.

Along with the system call, put the various security.jail.allow_foo and
security.jail.foo_allowed sysctls partly under COMPAT_FREEBSD11 (or
BURN_BRIDGES).  These sysctls had two disparate uses: on the system side,
they were global permissions for jails created via jail(2) which lacked
fine-grained permission controls; inside a jail, they're read-only
descriptions of what the current jail is allowed to do.  The first use
is obsolete along with jail(2), but keep them for the second-read-only use.

Differential Revision:	D14791
2018-08-16 18:40:16 +00:00
Conrad Meyer
5cb27f0813 FUSE: Document global sysctl knobs
So that I don't have to keep grepping around the codebase to remember what each
one does.  And maybe it saves someone else some time.

Fix a trivial whitespace issue while here.

No functional change.

Sponsored by:	Dell EMC Isilon
2018-08-15 17:41:19 +00:00
Toomas Soome
527d337fdb cd9660 pointer sign issues and missing __packed attribute
The isonum_* functions are defined to take unsigend char* as an argument,
but the structure fields are defined as char. Change to u_char where needed.

Probably the full structure should be changed, but I'm not sure about the
side affects.

While there, add __packed attribute.

Differential Revision:	https://reviews.freebsd.org/D16564
2018-08-15 06:42:31 +00:00
Rick Macklem
41df1b5b47 Assorted fixes to handling of LayoutRecall callbacks, mostly error handling.
After a re-read of the appropriate section of RFC5661, I decided that a
few things should be changed related to LayoutRecall callback handling.
Here are the things fixed by this patch.
- For two of the three cases that LayoutRecall is done, I now think
  setting the clora_changed argument false is correct.
- All errors other than NFSERR_DELAY returned by LayoutRecall appear
  permanent, so don't retry for any of them. (NFSERR_DELAY is retried by
  newnfs_request(), so it is not affected by this patch.)
- Instead of waiting "forever" (actually until the process is SIGTERM'd)
  for Layouts to be returned during a mirror copy, fail and return
  ENXIO after about 1minute.
  Waiting for a <ctrl>C made sense when pnfsdscopymr() was done by itself,
  but did not make sense when done via find(1).
This patch only affects the pNFS server.
2018-08-08 20:21:45 +00:00
Pedro F. Giffuni
c820acbf0a msdosfs: fixes for Undefined Behavior.
These were found by the Undefined Behaviour GsoC project at NetBSD:

Do not change signedness bit with left shift.
While there avoid signed integer overflow.
Address both issues with using unsigned type.

msdosfs_fat.c:512:42, left shift of 1 by 31 places cannot be represented
in type 'int'
msdosfs_fat.c:521:44, left shift of 1 by 31 places cannot be represented
in type 'int'
msdosfs_fat.c:744:14, left shift of 1 by 31 places cannot be represented
in type 'int'
msdosfs_fat.c:744:24, signed integer overflow: -2147483648 - 1 cannot be
represented in type 'int [20]'
msdosfs_fat.c:840:13, left shift of 1 by 31 places cannot be represented
in type 'int'
msdosfs_fat.c:840:36, signed integer overflow: -2147483648 - 1 cannot be
represented in type 'int [20]'

Detected with micro-UBSan in the user mode.

Hinted from:	NetBSD (CVS 1.33)
MFC after:	2 weeks
Differenctial Revision:	https://reviews.freebsd.org/D16615
2018-08-08 15:08:22 +00:00
Fedor Uporov
53288b712d Split the dir_index and dir_nlink features.
Do not allow to create more that EXT4_LINK_MAX links to directory in case
if the dir_nlink is not set, like it is done in the fresh e2fsprogs updates.

MFC after:      3 months
2018-08-08 12:08:46 +00:00
Fedor Uporov
17c7b27f55 Fix directory blocks checksum updating logic.
The checksum updating functions were not called in case of dir index inode splitting
and in case of dir entry removing, when the entry was first in the block.
Fix and move the dir entry adding logic when i_count == 0 to new function.

MFC after:      3 months
2018-08-08 12:07:45 +00:00
Conrad Meyer
3dc1c7d6bc FUSE: Remove some set-but-not-used variables
No functional change.
2018-08-08 04:46:03 +00:00
Rick Macklem
93df87f208 Allow newnfs_request() to retry all callback RPCs with an NFSERR_DELAY reply.
The code in newnfs_request() retries RPCs that get a reply of NFSERR_DELAY,
but exempts certain NFSv4 operations. However, for callback RPCs, there
should not be any exemptions at this time. The code would have erroneously
exempted the CBRECALL callback, since it has the same operation number as
the CLOSE operation.
This patch fixes this by checking for a callback RPC (indicated by clp != NULL)
and not checking for exempt operations for callbacks.
This would have only affected the NFSv4 server when delegations are enabled
(they are not enabled by default) and the client replies to CBRECALL with
NFSERR_DELAY. This may never actually happen.
Spotted during code inspection.

MFC after:	2 weeks
2018-08-07 21:29:14 +00:00
Rick Macklem
25705dd5d0 Copy all bits of a file handle in case there is padding in the structure.
At least on x86, fhandle_t is a packed structure, so I believe an
assignment will copy all the bits. However, for some current/future
architectures, there might be padding in the structure that doesn't get
copied via an assignment.
Since NFS assumes a file handle is an opaque blob of bits that can be
compared via memcmp()/bcmp(), all the bits including any padding must be
copied.
This patch replaces the assignments with a call to a byte copy function.
Spotted during code inspection.
2018-08-05 19:21:50 +00:00
Rick Macklem
ac0d649588 Silence newer gcc warnings.
Newer versions of gcc generate "might not be initialized" warnings for
several variables in nfsrpc_doiods().  I have checked and all of these
variables are assigned values before they are used.
In the one case of "tdrpc", it could have passed garbage as an argument
to nfscl_dofflayoutio() when mirrorcnt is one. However nfscl_dofflayoutio() only
uses the argument when mirrorcnt > 1, so it wasn't actually broken.
This patch initializes "tdrpc" to avoid confusion and initializes the rest
to make the compiler happy.

Requested by:	mmacy
2018-08-02 20:10:59 +00:00
Conrad Meyer
dab6195cd3 FUSE: Bump maximum IO size to enable more performant operation
Various components restrict size of IO passed up to the userspace filesystem
based on the mount's f_iosize value.  The previous default of PAGE_SIZE
is anemic, even for normal filesystems, but especially considering every
FUSE operation involves a kernel <-> userspace IPC upcall.

Bump to DFLTPHYS (currently 64kB) to match other FUSE implementations.

Anecdotally, Jakub reports IO read performance increased from 600 MB/s ->
2700 MB/s with a basic RAM-backed FUSE filesystem.

PR:		230260
Reported by:	Peter (MooseFS) <freebsd AT moosefs.com>
Tested by:	Jakub Kruszona-Zawadzki <acid AT moosefs.com>
MFC after:	3 days
2018-08-02 19:25:43 +00:00
Ed Maste
195e6c50d3 msdosfs: trim EOL whitespace 2018-07-31 12:44:28 +00:00
Ed Maste
a6274b81d5 cd9660: replace bcopy/bzero with C standard equivalents
To reduce diffs against NetBSD.
2018-07-31 12:36:46 +00:00
Ed Maste
22e56aea3f msdosfs: use same max filesize #define as NetBSD and move to header
For use by makefs msdosfs support.

Obtained from:	NetBSD denode.h 1.6
Sponsored by:	The FreeBSD Foundation
2018-07-30 20:36:51 +00:00
Rick Macklem
743d528198 Silence newer gcc warnings.
Newer versions of gcc generate "set, but not used" warnings.
Add __unused macros to silence these warnings.
Although the variables are not being used, they are values parsed from
arguments to callback RPCs that might be needed in the future.

Requested by:	mmacy
2018-07-30 20:25:32 +00:00
Rick Macklem
8014c97147 Silence newer gcc warnings.
Newer versions of gcc generate "set, but not used" warnings in the NFS server.
Add __unused macros to silence these warnings.

Requested by:	mmacy
2018-07-29 21:51:17 +00:00
Rick Macklem
a3e709cd33 Modify the NFSv4.1 server so that it allows ReclaimComplete as done by ESXi 6.7.
I believe that a ReclaimComplete with rca_one_fs == TRUE is only
to be used after a file system has been transferred to a different
file server.  However, RFC5661 is somewhat vague w.r.t. this and
the ESXi 6.7 client does both a ReclaimComplete with rca_one_fs == TRUE
and one with ReclaimComplete with rca_one_fs == FALSE.
Therefore, just ignore the rca_one_fs == TRUE operation and return
NFS_OK without doing anything instead of replying NFS4ERR_NOTSUPP.
This allows the ESXi 6.7 NFSv4.1 client to do a mount.
After discussion on the NFSv4 IETF working group mailing list, doing this
along with setting a flag to note that a ReclaimComplete with rca_one_fs TRUE
was an appropriate way to handle this.
The flag that indicates that a ReclaimComplete with rca_one_fs == TRUE was
done may be used to disable replies of NFS4ERR_GRACE for non-reclaim
state operations in a future commit.

This patch along with r332790, r334492 and r336357 allow ESXi 6.7 NFSv4.1 mounts
work ok. ESX 6.5 NFSv4.1 mounts do not work well, due to what I believe are
violations of RFC-5661 and should not be used.

Reported by:	andreas.nagy@frequentis.com
Tested by:	andreas.nagy@frequentis.com, daniel@ftml.net (earlier version)
MFC after:	2 weeks
Relnotes:	yes
2018-07-28 20:21:04 +00:00
Eitan Adler
33f4bccaa6 Use https over http for FreeBSD pages 2018-07-27 10:40:48 +00:00
Ed Maste
6ae00e306f Revert msdosfs MAKEFS #ifdef changes from r319870
These changes are not needed for current msdosfs makefs WIP.

Submitted by:	Siva Mahadevan
Sponsored by:	The FreeBSD Foundation
2018-07-24 21:10:17 +00:00
Rick Macklem
cecf6c6e9c Set CLSET_TIMEOUT on TCP connections to pNFS DSs.
Use CLSET_TIMEOUT to set the timeout for connections to DSs instead of
specifying a timeout on each RPC. This is done so that SO_SNDTIMEO
is set on the TCP socket as well as specifying a time limit when
waiting for an RPC reply.  Useful if the send queue for the TCP
connection has become constipated, due to a failed DS.
The choice of lease_duration / 4 is fairly arbitrary, but seems to work
ok, with a lower bound of 10sec.
For client connections to a DS, set the retry limit to vfs.nfsd.dsretries,
which is 2 by default.
This patch should only affect pNFS connections to DSs.
This patch requires r336542.

MFC after:	2 weeks
2018-07-21 01:33:07 +00:00
Alan Somers
5717aa2d2a Allow mounting FUSE filesystems in jails
Reviewed by:	jamie
MFC after:	2 weeks
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D16371
2018-07-20 21:35:31 +00:00
Rick Macklem
5d54f186bb Modify the reasons for not issuing a delegation in the NFSv4.1 server.
The ESXi NFSv4.1 client will generate warning messages when the reason for
not issuing a delegation is two. Two refers to a resource limit and I do
not see why it would be considered invalid. However it probably was not the
best choice of reason for not issuing a delegation.
This patch changes the reasons used to ones that the ESXi client doesn't
complain about. This change does not affect the FreeBSD client and does
not appear to affect behaviour of the Linux NFSv4.1 client.
RFC5661 defines these "reasons" but does not give any guidance w.r.t. which
ones are more appropriate to return to a client.

Tested by:	andreas.nagy@frequentis.com
PR:		226650
MFC after:	2 weeks
2018-07-16 21:32:50 +00:00
Rick Macklem
5da3882447 Shut down the TCP connection to a DS in the pNFS client when Renew fails.
When a NFSv4.1 client mount using pNFS detects a failure trying to do a
Renew (actually just a Sequence operation), the code would simply try
again and again and again every 30sec.
This would tie up the "nfscl" thread, which should also be doing other
things like Renews on other DSs and the MDS.
This patch adds code which closes down the TCP connection and marks it
defunct when Renew detects an failure to communicate with the DS, so
further Renews will not be attempted until a new working TCP connection to
the DS is established.
It also makes the call to nfscl_cancelreqs() unconditional, since
nfscl_cancelreqs() checks the NFSCLDS_SAMECONN flag and does so while holding
the lock.
This fix only applies to the NFSv4.1 client whne using pNFS and without it
the only effect would have been an "nfscl" thread busy doing Renew attempts
on an unresponsive DS.

MFC after:	2 weeks
2018-07-15 18:54:44 +00:00
Rick Macklem
89c64a3a4f Fix the pNFS client when mirrors aren't on the same machine.
Without this patch, the client side NFSv4.1 pNFS code erroneously did writes
and commits to both DS mirrors using the TCP connection of the first one.
For my test setup this worked, since I have both DSs running on the same
machine, but it would have failed when the DSs are on separate machines.
This patch fixes the code to use the correct TCP connection for each DS.
This patch should only affect the NFSv4.1 client when using "pnfs" mounts
to mirrored DSs.

MFC after:	2 weeks
2018-07-14 19:51:44 +00:00
Rick Macklem
0e7bd20bb2 Close down the TCP connection to a pNFS DS when it is disabled.
So long as the TCP connection to a pNFS DS isn't shared with other DSs,
it can be closed down when the DS is being disabled in the pNFS client.
This causes any RPCs in progress to fail.
This patch only affects the NFSv4.1 pNFS client when errors occur
while doing I/O on a DS.

MFC after:	2 weeks
2018-07-13 20:03:05 +00:00
Rick Macklem
83f526de6a Change the pNFS client so that it does not report an NFSERR_STALE from
an I/O attempt on a DS to the server via LayoutReturn.

The current FreeBSD client can generate these errors for an operational
DS while doing a recovery of a mirror after a mirrored DS has been repaired.
I am not sure why these errors occur, but my best current guess is a race
between the Layout Recall issued by the kernel code run from pnfsdscopymr(8)
and a Read operation on the DS for the file bing copied.
The errrors are not fatal, since the client falls back on doing I/O through
the MDS, which can do the I/O successfully as a proxy. (The fact that the
MDS can do this indicates that the file does still exist on the functioning
DS.)
This patch only affects behaviour of the pNFS client and only when using
Flexible File layouts.

MFC after:	2 weeks
2018-07-13 12:39:27 +00:00
Rick Macklem
a6fed5f514 Modify the NFSv4.1 pNFS client to use separate TCP connections for DSs.
Without this patch, the NFSv4.1 pNFS client shared a single TCP connection
for all DSs that resided on the same machine. This made disabling one of
the DSs impossible. Although unlikely, it is possible that the storage
subsystem has failed in such a way that the storage for one DS on a machine
is no longer functioning correctly, but the storage used by another DS on
the same machine is still ok. For this case, it would be nice if a system
can fail one of the DSs without failing them all.
This patch changes the default behaviour to use separate TCP connections
for each DS even if they reside on the same machine.
I do not believe that this will be a problem for extant pNFS servers, but
a sysctl can be set to restore the old behaviour if this change causes a
problem for an extant pNFS server.
This patch only affects the NFSv4.1 pNFS client.

MFC after:	2 weeks
2018-07-12 20:46:22 +00:00
Rick Macklem
8361de2544 Ignore the cookie verifier for NFSv4.1 when the cookie is 0.
RFC5661 states that the cookie verifier should be 0 when the cookie is 0.
However, the wording is somewhat unclear and a recent discussion on the
nfsv4@ietf.org mailing list indicated that the NFSv4 server should ignore
the cookie verifier's value when the dirctory offset cookie is 0.
This patch deletes the check for this that would return NFSERR_BAD_COOKIE
when the verifier was not 0.
This was found during testing of the ESXi client against the NFSv4.1 server.

Reported by:	daniel@ftml.net (via packet trace)
MFC after:	2 weeks
2018-07-11 23:23:29 +00:00
Rick Macklem
de9a1a70ab Add support for a "forced" pnfsdskill to the pNFS server kernel code.
The pnfsdskill(8) command will normally fail if there is no valid mirror
for the DS to be disabled. However, a system administrator may need to
disable a DS which does not have a valid mirror so that the nfsd threads
can be terminated. This patch adds the kernel code needed by pnfsdskill(8)
to implement this "forced" case of disabling a DS.
This patch only affects the pNFS server.
2018-07-09 19:58:01 +00:00
Rick Macklem
acc6e58def Fix the kernel part of pnfsdscopymr() to handle holes in the file being copied.
If a mirrored DS is being recovered that has a lot of large sparse files,
pnfsdscopymr(8) would use a lot of space on the recovered mirror since it
would write the "holes" in the file being mirrored.
This patch adds code to check for a "hole" and skip doing the write.
The check is done on a "per PNFSDS_COPYSIZ size block", which is currently 64K.
I think that most file server file systems will be using a blocksize at
least this large. If the file server is using a smaller blocksize and
smaller holes need to be preserved, PNFSDS_COPYSIZ could be decreased.
The block of 0s is malloc()d, since pnfsdcopymr(8) should be an infrequent
occurrence.
2018-07-08 18:15:55 +00:00
Rick Macklem
ed66a76bca Fix handling of the hybrid DS case for a pNFS server.
After the addition of the "#mds_path" suffix for a DS specification on the
"-p" nfsd option, it is possible to have a mix of DSs assigned to an MDS
file system and DSs that store files for all DSs. This is what I referred
to as "hybrid" above.
At first, I didn't think this hybrid case would be useful, but I now believe
that some system administrators may fine it useful.
This patch modifies the file storage assignment algorithm so that it
makes the "#mds_path" DSs take priority and the all file systems DSs
are now only used for MDS file systems with no "#mds_path" DS servers.
This only affects the pNFS server for this "hybrid" case.
2018-07-07 19:27:49 +00:00
Rick Macklem
5b500ea949 Change the pNFS server so that it does not disable a mirrored DS for
an NFSERR_STALE error reported via a LayoutReturn.

The current FreeBSD client can generate these errors for an operational
DS while doing a recovery of a mirror after a mirrored DS has been repaired.
I am not sure why these errors occur, but my best current guess is a race
between the Layout Recall issued by the kernel code run from pnfsdscopymr(8)
and a Read operation on the DS for the file bing copied.
The errors are not fatal, since the client falls back on doing I/O through
the MDS, which can do the I/O successfully as a proxy. (The fact that the
MDS can do this indicates that the file does still exist on the functioning
DS.)
This change only affects the pNFS server and only when a client does a
LayoutReturn with the NFSERR_STALE error report.
2018-07-06 19:18:45 +00:00
Rick Macklem
ff3b992f38 Fix the pNFS server so that it handles the "#mds_path" check for mirrors.
The recently added feature of the pNFS server will set an fsid for the
MDS file system to define the file system a DS should store files for.
For a case where a DS handling all file systems has failed, it was possible
for the code to check for a mirror with a specified fs, even though
nfsdev_mdsisset was 0, possibly causing a false successful check for a mirror.
This patch adds a check for nfsdev_mdsisset != 0 to avoid this.
It only affects the pNFS server for a rare case. Found via code inspection.
2018-07-04 19:46:26 +00:00
Rick Macklem
2f32675c83 Add an optional feature to the pNFS server.
Without this patch, the pNFS server distributes the data storage files across
all of the specified DSs.
A tester noted that it would be nice if a system administrator could control
which DSs are used to store the file data for a given exported MDS file system.
This patch adds the kernel support to do this. It also makes a slight semantic
change to nfsv4_findmirror(), since some uses of it no longer require that
the DS being searched for have a current mirror.
A patch that will be committed in a few minutes will modify the nfsd daemon
to support this feature.
The patch should only affect sites using the pNFS server (specified via the
"-p" command line option for nfsd.

Suggested by:	james.rose@framestore.com
2018-07-02 19:21:33 +00:00
Rick Macklem
1aabf3fd5e Fix the pNFS server for a case where mirror level equals number of DSs.
If a pNFS service was set up where the number of DSs equals the mirror level
and then a DS was disabled, the service would create files with duplicate
entries for the same DS. This bug occurred because I didn't realize that
TAILQ_FOREACH_FROM() would start at the beginning of the list when the
inital value of the variable was NULL.
This patch also changes the pNFS server DS file creation code so that it
creates entrie(s) with 0.0.0.0 IP address when it cannot create mirror level
files due to lack of DSs.
The patch only affects the pNFS service and only when it was created with
a number of DSs equal to the mirror level and mirroring is enabled.
2018-06-29 12:41:36 +00:00
Rick Macklem
9f4c522e6b Set the slotid and ND_HASSLOTID flag for NFSv4.1 sequenced operations.
Most NFSv4.1 compound RPCs start with a Sequence operation. For these
cases, save the slotid and note that it is saved by setting ND_HASSLOTID.
This is used by r335568 to free up the session slot and disable it.

MFC after:	2 weeks
2018-06-23 00:48:45 +00:00
Rick Macklem
b18130d330 Define ND_HASSLOTID needed by r335568.
r335568 uses a flag called ND_HASSLOTID to indicate that the slotid is set,
so it can free and invalidate it.
This flag needs to be set, which will be done in a subsequent commit.

MFC after:	2 weeks
2018-06-23 00:37:15 +00:00
Rick Macklem
ba6cce3aea Fix the handling of NFSv4.1 sessions for "soft" mounts.
When a "soft" mount is used for NFSv4.1, an RPC that fails without completing
will leave a slot in the NFSv4.1 session in an indeterminate state.
As such, all that can be done is free up the slot while making is no longer
usable.
A "soft" NFSv4.1 mount is not recommended in general, since it will leave
Open/Lock state in an indeterminate state. An exception is a pNFS mount of
a DS, since there are no Opens/Locks done for them except file creates
where loss of the Open state does not matter.
The patch also makes connections to DSs soft, so that they will fail when
a DS is non-functional or network partitioned, allowing the pNFS MDS to disable
the DS for a mirrored configuration.
This patch should not affect normal "hard" NFSv4.1 mounts.

MFC after:	2 weeks
2018-06-22 21:37:20 +00:00
Rick Macklem
2e35b8fe24 Change the NFSv4.1 pNFS client so that it returns the DS error in layoutreturn.
When the NFSv4.1 pNFS client gets an error for a DS I/O operation using a
Flexible File layout, it returns the layout with an error.
This patch changes the code slightly, so that it returns the layout for all
errors except EACCES and lets the MDS decide what to do based on the error.
It also makes a couple of changes to nfscl_layoutrecall() to ensure that
the first layoutreturn(s) will have the error in the reply.
Plus, the patch adds a wakeup() so that the "nfscl" thread won't wait 1sec
before doing the LayoutReturn.
Tested against the pNFS service.
This patch should not affect non-pNFS use of the client.
The unused "dsp" argument will be used by a future patch that disables the
connection to the DS when possible.

MFC after:	2 weeks
2018-06-22 21:25:27 +00:00
Rick Macklem
c16f407e31 Add a counter to limit the number of disabled DSs for a mirrored pNFS MDS.
This patch adds a counter that limits the number of disabled mirrored DSs
to mirror level - 1.  It also makes a small change that keeps a Write that
has failed with EACCES when attempted by a client to a DS from disabling
the DS.
This patch only affects the pNFS server.
2018-06-22 00:55:39 +00:00
Rick Macklem
755e4b7936 Revert r335263, since it can cause crashes in unusual circumstances.
This needs to be fixed in a different way.
2018-06-17 23:08:54 +00:00
Rick Macklem
2bad64241c Make the pNFS NFSv4.1 client return a Flexible File layout upon error.
The Flexible File layout LayoutReturn operation has argument fields where
an I/O error encountered when attempting I/O on a DS can be reported back
to the MDS.
This patch adds code to the client to do this for the Flexible File layout
mirrored case.
This patch should only affect mounts using the "pnfs" option against servers
that support the Flexible File layout.

MFC after:	2 weeks
2018-06-17 16:30:06 +00:00