r345742 replaced fusefs's fufh array with a fufh list. But it left a few
array idioms in place. This commit replaces those idioms with more
efficient list idioms. One location is in fuse_filehandle_close, which now
takes a pointer argument. Three other locations are places that had to loop
over all of a vnode's fuse filehandles.
Sponsored by: The FreeBSD Foundation
The FUSE protocol allows each open file descriptor to have a unique file
handle. On FreeBSD, these file handles must all be stored in the vnode.
The old method (also used by OSX and OpenBSD) is to store them all in a
small array. But that limits the total number that can be stored. This
commit replaces the array with a linked list (a technique also used by
Illumos). There is not yet any change in functionality, but this is the
first step to fixing several bugs.
PR: 236329, 236340, 236381, 236560, 236844
Discussed with: cem
Sponsored by: The FreeBSD Foundation
Previously fusefs would treat any file opened O_WRONLY as though the
FOPEN_DIRECT_IO flag were set, in an attempt to avoid issuing reads as part
of a RMW write operation on a cached part of the file. However, the FUSE
protocol explicitly allows reads of write-only files for precisely that
reason.
Sponsored by: The FreeBSD Foundation
fuse(4) was heavily instrumented with debug printf statements that could
only be enabled with compile-time flags. They fell into three basic groups:
1. Totally redundant with dtrace FBT probes. These I deleted.
2. Print textual information, usually error messages. These I converted to
SDT probes of the form fuse:fuse:FILE:trace. They work just like the old
printf statements except they can be enabled at runtime with dtrace. They
can be filtered by FILE and/or by priority.
3. More complicated probes that print detailed information. These I
converted into ad-hoc SDT probes.
Also, de-inline fuse_internal_cache_attrs. It's big enough to be a regular
function, and this way it gets a dtrace FBT probe.
This commit is a merge of r345304, r344914, r344703, and r344664 from
projects/fuse2.
Reviewed by: cem
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19667
The dead code in question was a broken and incomplete attempt to support the
default_permissions mount option during VOP_SETATTR. There wasn't anything
there worth saving; I'll have to rewrite it later.
Reported by: Coverity
Coverity CID: 1008668
Sponsored by: The FreeBSD Foundation
fuse_vnop_create must close the newly created file if it can't allocate a
vnode. When it does so, it must use the same file flags for FUSE_RELEASE as
it used for FUSE_OPEN or FUSE_CREATE.
Reported by: Coverity
Coverity CID: 1066204
Sponsored by: The FreeBSD Foundation
This change also inlines several previously #define'd symbols that didn't
really have the meanings indicated by the comments.
Sponsored by: The FreeBSD Foundation
TMPFS_PAGES_MINRESERVED controls how much memory is reserved for the system
and not used by tmpfs.
On very small memory systems, the default value may be too high and this
prevents these small memory systems from using reroot, which is required
for them to install firmware updates.
Submitted by: Hiroki Mori <yamori813@yahoo.co.jp>
Reviewed by: mizhka
Differential Revision: https://reviews.freebsd.org/D13583
If a FUSE filesystem returns ENOSYS for FUSE_CREATE, then fallback to
FUSE_MKNOD/FUSE_OPEN.
Also, fix a memory leak in the error path of fuse_vnop_create. And do a
little cleanup in fuse_vnop_open.
PR: 199934
Reported by: samm@os2.kiev.ua
Sponsored by: The FreeBSD Foundation
The FUSE protocol allows for LOOKUP to return a cacheable negative response,
which means that the file doesn't exist and the kernel can cache its
nonexistence. As of this commit fusefs doesn't cache the nonexistence, but
it does correctly handle such responses. Prior to this commit attempting to
create a file, even with O_CREAT would fail with ENOENT if the daemon
returned a cacheable negative response.
PR: 236231
Sponsored by: The FreeBSD Foundation
For an unknown reason, fusefs was _always_ sending the fdatasync operation
instead of fsync. Now it correctly sends one or the other.
Also, remove the Fsync.fsync_metadata_only test, along with the recently
removed Fsync.nop. They should never have been added. The kernel shouldn't
keep track of which files have dirty data; that's the daemon's job.
PR: 236473
Sponsored by: The FreeBSD Foundation
I committed too hastily in r345390. There are cases, not directly reachable
from userland, where VOP_FSYNC ought to be asynchronous. This commit fixes
fusefs to handle VOP_FSYNC synchronously if and only if the VFS requests it.
PR: 236474
X-MFC-With: 345390
Sponsored by: The FreeBSD Foundation
If vflush() did not completely flushed the mount vnodes queue, either
retry for forced unmounts, or give up for non-forced. This situation
can occur when new vnodes are instantiated while vflush() worked.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
This makes it more consistent with other filesystems, which all end in "fs",
and more consistent with its mount helper, which is already named
"mount_fusefs".
Reviewed by: cem, rgrimes
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19649
I missed these in r344664. They're basically useless because they can only
be controlled at compile-time. Also, de-inline fuse_internal_cache_attrs.
It's big enough to be a regular function, and this way it gets a dtrace FBT
probe.
Sponsored by: The FreeBSD Foundation
The ext2_nodealloccg() function unlocks the mount point
in case of successful node allocation.
The additional unlocks are not required and should be removed.
PR: 236452
Reported by: pho
MFC after: 3 days
On GENERIC kernels with empty loader.conf, there is no functional change.
DFLTPHYS and MAXBSIZE are both 64kB at the moment. This change allows
larger bufcache block sizes to be used when either MAXBSIZE (custom kernel)
or the loader.conf tunable vfs.maxbcachebuf (GENERIC) is adjusted higher
than the default.
Suggested by: ken@
When open(2) was invoked against a FUSE filesystem with an unexpected flags
value (no O_RDONLY / O_RDWR / O_WRONLY), an assertion fired, causing panic.
For now, prevent the panic by rejecting such VOP_OPENs with EINVAL.
This is not considered the correct long term fix, but does prevent an
unprivileged denial-of-service.
PR: 236329
Reported by: asomers
Reviewed by: asomers
Sponsored by: Dell EMC Isilon
of it being explicitly passed as an argument. No functional changes.
The big picture here is that I want to get rid of the 'td' argument
being passed everywhere, and this is the first piece that affects
the NFS server.
Reviewed by: rmacklem
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D19417
Add more on-disk superblock consistency checks to ext2_compute_sb_data() function.
It should decrease the probability of mounting filesystems with corrupted superblock data.
Reviewed by: pfg
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D19322
fuse(4) was heavily instrumented with debug printf statements that could
only be enabled with compile-time flags. They fell into three basic groups:
1) Totally redundant with dtrace FBT probes. These I deleted.
2) Print textual information, usually error messages. These I converted to
SDT probes of the form fuse:fuse:FILE:trace. They work just like the old
printf statements except they can be enabled at runtime with dtrace.
They can be filtered by FILE and/or by priority.
3) More complicated probes that print detailed information. These I
converted into ad-hoc SDT probes.
Sponsored by: The FreeBSD Foundation
On systems with non-default DFLTPHYS and/or MAXBSIZE, FUSE would attempt to
use a buf cache block size in excess of permitted size. This did not affect
most configurations, since DFLTPHYS and MAXBSIZE both default to 64kB.
The issue was discovered and reported using a custom kernel with a DFLTPHYS
of 512kB.
PR: 230260 (comment #9)
Reported by: ken@
MFC after: π/𝑒 weeks
- debugfs consumers expect to be able to export names more than 48 characters
- debugfs consumers expect to be able to hold locks across calls and are able
to handle allocation failures
Reviewed by: hps@
MFC after: 1 week
Sponsored by: iX Systems
Differential Revision: https://reviews.freebsd.org/D19256
Take a pass through fixing some of the most egregious whitespace issues in
fs/fuse. Also fix some style(9) warts while here. Not 100% cleaned up, but
somewhat less painful to look at and edit.
No functional change.
that can happen when rerooting into NFSv4 rootfs with kernel
built with INVARIANTS.
I've talked to rmacklem@ (back in 2017), and while the root cause
is still unknown, the case guarded by assertion (nfscl_doclose()
being called from VOP_INACTIVE) is believed to be safe, and the
whole thing seems to run just fine.
Obtained from: CheriBSD
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
The cached fvdat->filesize is indepedent of the (mostly unused)
cached_attrs, and we failed to update it when a cached (but perhaps
inactive) vnode was found during VOP_LOOKUP to have a different size than
cached.
As noted in the code comment, this can occur in distributed filesystems or
with other kinds of irregular file behavior (anything is possible in FUSE).
We do something similar in fuse_vnop_getattr already.
PR: 230258 (as reported in description; other issues explored in
comments are not all resolved)
Reported by: MooseFS FreeBSD Team <freebsd AT moosefs.com>
Submitted by: Jakub Kruszona-Zawadzki <acid AT moosefs.com> (earlier version)
At least prior to 7.23 (which adds FUSE_WRITEBACK_CACHE), the FUSE protocol
specifies only clean data to be cached.
Prior to this change, we implement and default to writeback caching. This
is ok enough for local only filesystems without hardlinks, but violates the
general design contract with FUSE and breaks distributed filesystems or
concurrent access to hardlinks of the same inode.
In this change, add cache mode as an extension of cache enable/disable. The
new modes are UC (was: cache disabled), WT (default), and WB (was: cache
enabled).
For now, WT caching is implemented as write-around, which meets the goal of
only caching clean data. WT can be better than WA for workloads that
frequently read data that was recently written, but WA is trivial to
implement. Note that this has no effect on O_WRONLY-opened files, which
were already coerced to write-around.
Refs:
* https://sourceforge.net/p/fuse/mailman/message/8902254/
* https://github.com/vgough/encfs/issues/315
PR: 230258 (inspired by)
Most users of fuse_vnode_setsize() set the cached fvdat->filesize and update
the buf cache bounds as a result of either a read from the underlying FUSE
filesystem, or as part of a write-through type operation (like truncate =>
VOP_SETATTR). In these cases, do not set the FN_SIZECHANGE flag, which
indicates that an inode's data is dirty (in particular, that the local buf
cache and fvdat->filesize have dirty extended data).
PR: 230258 (related)
The FUSE protocol demands that kernel implementations cache user filesystem
path components (lookup/cnp data) for a maximum period of time in the range
of [0, ULONG_MAX] seconds. In practice, typical requests are for 0, 1, or
10 seconds; or "a long time" to represent indefinite caching.
Historically, FreeBSD FUSE has ignored this client directive entirely. This
works fine for local-only filesystems, but causes consistency issues with
multi-writer network filesystems.
For now, respect 0 second cache TTLs and do not cache such metadata.
Non-zero metadata caching TTLs in the range [0.000000001, ULONG_MAX] seconds
are still cached indefinitely, because it is unclear how a userspace
filesystem could do anything sensible with those semantics even if
implemented.
Pass fuse_entry_out to fuse_vnode_get when available and only cache lookup
if the user filesystem did not set a zero second TTL.
PR: 230258 (inspired by; does not fix)
The FUSE protocol demands that kernel implementations cache user filesystem
file attributes (vattr data) for a maximum period of time in the range of
[0, ULONG_MAX] seconds. In practice, typical requests are for 0, 1, or 10
seconds; or "a long time" to represent indefinite caching.
Historically, FreeBSD FUSE has ignored this client directive entirely. This
works fine for local-only filesystems, but causes consistency issues with
multi-writer network filesystems.
For now, respect 0 second cache TTLs and do not cache such metadata.
Non-zero metadata caching TTLs in the range [0.000000001, ULONG_MAX] seconds
are still cached indefinitely, because it is unclear how a userspace
filesystem could do anything sensible with those semantics even if
implemented.
In the future, as an optimization, we should implement notify_inval_entry,
etc, which provide userspace filesystems a way of evicting the kernel cache.
One potentially bogus access to invalid cached attribute data was left in
fuse_io_strategy. It is restricted behind the undocumented and non-default
"vfs.fuse.fix_broken_io" sysctl or "brokenio" mount option; maybe these are
deadcode and can be eliminated?
Some minor APIs changed to facilitate this:
1. Attribute cache validity is tracked in FUSE inodes ("fuse_vnode_data").
2. cache_attrs() respects the provided TTL and only caches in the FUSE
inode if TTL > 0. It also grows an "out" argument, which, if non-NULL,
stores the translated fuse_attr (even if not suitable for caching).
3. FUSE VTOVA(vp) returns NULL if the vnode's cache is invalid, to help
avoid programming mistakes.
4. A VOP_LINK check for potential nlink overflow prior to invoking the FUSE
link op was weakened (only performed when we have a valid attr cache). The
check is racy in a multi-writer network filesystem anyway -- classic TOCTOU.
We have to trust any userspace filesystem that rejects local caching to
account for it correctly.
PR: 230258 (inspired by; does not fix)
The vp vnode is unlocked during the execution of the VOP method and
can be reclaimed, zeroing vp->v_data. Caching allows to use the
correct mount point.
Reported and tested by: pho
PR: 235549
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Maliciously formed, or badly corrupted, filesystems can cause kernel
panics. In general, such acts of foot-shooting can only be accomplished
by root, but in a world with VM images that is moving towards automated
mounts it is important to have some form of prevention.
Reported by: Christopher Krah, Thomas Barabosch, and Jan-Niclas Hilgert
of Fraunhofer FKIE.
Incidentaly this should also fix a memory corruption issue reported by
Dr Silvio Cesare of InfoSect.
Huge thanks to all reseachers for making us aware of the issue.
admbug: 872, 891
Reviewed by: fsu
Obtained from: NetBSD (with minor changes)
MFC after: 3 days
Note that these interfaces are available only to root.
admbugs: 765
Reported by: Vlad Tsyrklevich <vlad@tsyrklevich.net>
Reviewed by: rmacklem
MFC after: 1 day
Security: Kernel memory disclosure
Sponsored by: The FreeBSD Foundation
Semicolon is a legal character in long names but not in 8.3 format.
Move it to respective character set.
PR: 140068
Submitted by: tom@uffner.com
MFC after: 3 weeks
o In vm_pager_bufferinit() create pbuf_zone and start accounting on how many
pbufs are we going to have set.
In various subsystems that are going to utilize pbufs create private zones
via call to pbuf_zsecond_create(). The latter calls uma_zsecond_create(),
and sets a limit on created zone. After startup preallocate pbufs according
to requirements of all pbuf zones.
Subsystems that used to have a private limit with old allocator now have
private pbuf zones: md(4), fusefs, NFS client, smbfs, VFS cluster, FFS,
swap, vnode pager.
The following subsystems use shared pbuf zone: cam(4), nvme(4), physio(9),
aio(4). They should have their private limits, but changing that is out of
scope of this commit.
o Fetch tunable value of kern.nswbuf from init_param2() and while here move
NSWBUF_MIN to opt_param.h and eliminate opt_swap.h, that was holding only
this option.
Default values aren't touched by this commit, but they probably should be
reviewed wrt to modern hardware.
This change removes a tight bottleneck from sendfile(2) operation, that
uses pbufs in vnode pager. Other pagers also would benefit from faster
allocation.
Together with: gallatin
Tested by: pho
If invalid, return EINVAL. Note that inode check-hashes greatly
reduce the chance that these errors will go undetected.
Reported by: Christopher Krah <krah@protonmail.com>
Reported as: FS-5-UFS-2: Denial Of Service in nmount-3 (ffs_read)
Reviewed by: kib
MFC after: 1 week
Sponsored by: Netflix
M sys/fs/ext2fs/ext2_vnops.c
M sys/kern/vfs_subr.c
M sys/ufs/ffs/ffs_snapshot.c
M sys/ufs/ufs/ufs_vnops.c
clustering is not done. The bug caused extreme slowness for large
files in some cases.
There is no way to tell VOP_BMAP() how many blocks are wanted, so for
all file systems it has to waste time in some cases by searching for
more contiguous blocks than will be accessed. For msdosfs, it also
clobbered the fatchain cache in these cases by advancing the cache to
point to the chain entry for block that won't be read. This makes
the cache useless for the next sequential i/o (or VOP_BMAP()), so the
fat chain is searched from the beginning. The cache only has 1 relevant
entry, so it is similarly useless for random i/o.
Fix this by only advancing the cache to point to the chain entry for
the first block that will be read. Clustering uses results from
VOP_BMAP(), so when more than 1 block is read by clustering, the cache
is not advanced as optimally as before, but it is at most 1 cluster
size behind and searching the chain through the blocks for this cluster
doesn't take too long.
mainly clustering and read-ahead.) Copy the initialization from ffs,
and also copy a couple of lines of ffs's nearby style for initialization
order and whitespace.
A correct fix would de-duplicate the initialization and fix bitrot in it
instead of adding another instance of the duplication. Complications to
use the size preferred by the device have been reduced to hard-coding
slightly pessimal and/or inconsistent defaults, using large code that was
almost needed to support the complications.
For msdosfs, the result was that mnt_iosize_max was DFTLPHYS (64K) but is
now MAXPHYS (128K).
When the NFSv4 server was coded, I believed that the specification authors
did not want NFSv4 servers to require a client to use a reserved port#.
However, recently it has been noted that the Linux NFSv4 server does support
a check for a reserved port#.
Since both the FreeBSD and Linux NFSv4 clients use a reserved port# by
default, enabling vfs.nfsd.nfs_privport to require a reserved port# for
NFSv4 the same as it does for NFSv2, 3 seems reasonable.
The only case where this could cause a POLA violation is a FreeBSD NFSv4
server with vfs.nfsd.nfs_privport set, but with NFSv4 clients doing mounts
without using a reserved port# (< 1024).
Tested by: chaz.newton58@gmail.com
PR: 234106
MFC after: 1 week
On some architectures, the structures returned by PT_GET*REGS were not
fully populated and could contain uninitialized stack memory. The same
issue existed with the register files in procfs.
Reported by: Thomas Barabosch, Fraunhofer FKIE
Reviewed by: kib
MFC after: 3 days
Security: kernel stack memory disclosure
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18421
Directory entries must be padded to maintain alignment; in many
filesystems the padding was not initialized, resulting in stack
memory being copied out to userspace. With the ino64 work there
are also some explicit pad fields in struct dirent. Add a subroutine
to clear these bytes and use it in the in-tree filesystems. The
NFS client is omitted for now as it was fixed separately in r340787.
Reported by: Thomas Barabosch, Fraunhofer FKIE
Reviewed by: kib
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
The NFS client code (nfsrpc_readdir() and nfsrpc_readdirplus()) wasn't
filling in parts of the readdir reply, such as d_pad[01] and the bytes
at the end of d_name within d_reclen. As such, data left in a buffer cache
block could be leaked to userland in the readdir reply.
This patch makes sure all of the data is filled in.
Reported by: Thomas Barabosch, Fraunhofer FKIE
Reviewed by: kib, markj
MFC after: 2 weeks
NFSv3's ReaddirPlus and NFSv4's Readdir operations. The code
checked for a zero argument, but did not check for a very large value.
This patch clips dircount at the server's maximum data size.
MFC after: 1 week
The code assumed that this would indicate a corrupted mbuf chain, but
it could simply be caused by bogus RPC message data.
This patch replaces the panic() with a printf() plus error return.
MFC after: 1 week
The d_off field has been added to the dirent structure recently.
Currently filesystems don't support this feature. Support has been
added and tested for zfs, ufs, ext2fs, fdescfs, msdosfs and unionfs.
A stub implementation is available for cd9660, nandfs, udf and
pseudofs but hasn't been tested.
Motivation for this feature: our usecase is for a userspace nfs server
(nfs-ganesha) with zfs. At the moment we cache direntry offsets by
calling lseek once per entry, with this patch we can get the offset
directly from getdirentries(2) calls which provides a significant
speedup.
Submitted by: Jack Halford <jack@gandi.net>
Reviewed by: mckusick, pfg, rmacklem (previous versions)
Sponsored by: Gandi.net
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D17917
Prior to this patch, nfs_advlock() did NFSVOPUNLOCK(); return (error);
in many places. This patch replaces these code sequenences with a "goto out;"
and does the NFSVOPUNLOCK(); return (error); at the end of the function
in order to make the vnode locking simpler.
This patch does not change the semantics of nfs_advlock().
Suggested by: kib
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D17853
Leave ptrace(2) alone for the moment as it's defined to take a caddr_t.
Reviewed by: kib
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D17852
This will enable callers to take const paths as part of syscall
decleration improvements.
Where doing so is easy and non-distruptive carry the const through
implementations. In UFS the value is passed to an interface that must
take non-const values. In ZFS, const poisoning would touch code shared
with upstream and it's not worth adding diffs.
Bump __FreeBSD_version for external API consumers.
Reviewed by: kib (prior version)
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D17805
A crash was reported where the crash occurred in nfs_advlock() when the
NFS_ISV4(vp) macro was being executed. This was caused by the vnode
being VI_DOOMED due to a forced dismount in progress.
This patch fixes the problem by locking the vnode before executing the
NFS_ISV4() macro.
Tested by: rlibby
PR: 232673
Reviewed by: kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D17757
ioctl(2) commands only have meaning in the context of a file descriptor
so translating them in the syscall layer is incorrect.
The new handler users an accessor to retrieve/construct a pointer from
the last member of the passed structure and relies on type punning to
access the other member which requires no translation.
Unlike r339174 this change supports both places FIODGNAME is handled.
Reviewed by: kib
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D17475
Use bypass to catch any NFS VOP dispatch and route it through the
wrapper which does sigdeferstop() and then dispatches original
VOP. NFS does not need a bypass below it, which is not supported.
The vop offset in the vop_vector is added since otherwise it is
impossible to get vop_op_t from the internal table, and I did not
wanted to create the layered fs only to wrap NFS VOPs.
VFS_OP()s wrap is straightforward.
Requested and reviewed by: mjg (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D17658
Instead, a failing entry is skipped.
This change consist of two logical changes.
A failure to vget or lookup an entry is considered to be a result of a
concurrent removal, which is the only reasonable explanation given that
the filesystem is busied. So, the entry would be silently skipped.
In the case of a failure to get attributes of an entry for an NFSv3
request, the entry would be silently skipped. There can be legitimate
reasons for the failure, but NFSv3 does not provide any means to report
the error, so we have two options: either fail the whole request or
ignore the failed entry. Traditionally, the old NFS server used the
latter option, so the code is reverted to it. Making the whole
directory unreadable because of a single entry seems to be unpractical.
Additionally, some bits of code are slightly re-arranged to account for
the new control flow and to honor style(9).
Reviewed by: rmacklem
Sponsored by: Panzura
Differential Revision: https://reviews.freebsd.org/D15424
The pNFS server would report the total disk space used and free for all
of the DSs, even when certain DSs are assigned to the file system via
the "#<path>" suffix used in the "nfsd -p" option argument.
This patch fixes this case. It only reports usage for the file system
that the argument vnode resides on. This is consistent with the non-pNFS
NFSv4 server. In NFSv4 it is possible to have subtrees on other file
systems, but these are not included in the usage information for NFSv4.
Approved by: re (gjb)
ioctl(2) commands only have meaning in the context of a file descriptor
so translating them in the syscall layer is incorrect.
The new handler users an accessor to retrieve/construct a pointer from
the last member of the passed structure and relies on type punning to
access the other member which requires no translation.
Reviewed by: kib
Approved by: re (rgrimes, gjb)
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Review: https://reviews.freebsd.org/D17388
given in random(4).
This includes updating of the relevant man pages, and no-longer-used
harvesting parameters.
Ensure that the pseudo-unit-test still does something useful, now also
with the "other" algorithm instead of Yarrow.
PR: 230870
Reviewed by: cem
Approved by: so(delphij,gtetlow)
Approved by: re(marius)
Differential Revision: https://reviews.freebsd.org/D16898
The requested size was returned incorrectly in case uio == NULL from listextattr because the
nameprefix/name conversion was not applied.
Also, make a_size/uio returning logic more unified with other filesystems.
Reviewed by: cem, pfg
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D13528
When coding the pNFS server, I added vn_start_write() calls in nfsrv_copymr()
done while the vnodes were locked, not realizing I had introduced LORs and
possible deadlock when an exported file system on the MDS is suspended.
This patch fixes the LORs by moving the vn_start_write() calls up to before
where the vnodes are locked. For "tvp", the vn_start_write() probaby isn't
necessary, because NFS mounts can't be suspended. However, I think doing
so is harmless.
Thanks go to kib@ for letting me know that I had introduced these LORs.
This patch only affects the behaviour of the pNFS server when pnfsdscopymr(8)
is used to recover a mirrored DS.
When coding the pNFS server, I added several vn_start_write() calls done
while the vnode was locked, not realizing I had introduced LORs and
possible deadlock when an exported file system on the MDS is suspended.
This patch fixes this by removing the added vn_start_write() calls and
modifying the code so that the extant vn_start_write() call before the
NFS RPC/operation is done when needed by the pNFS server.
Flags are changed so that LayoutCommit and LayoutReturn now get a
vn_start_write() done for them.
When the pNFS server is enabled, the code now also changes the flags for
Getattr, so that the vn_start_write() is done for Getattr, since it may
need to do a vn_set_extattr(). The nfs_writerpc flag array was made global
to the NFS server and renamed nfsrv_writerpc, which is consistent naming
for globals in the NFS server.
Thanks go to kib@ for reporting that doing vn_start_write() while the vnode is
locked results in a LOR.
This patch only affects the behaviour of the pNFS server.
When a pNFS service is running, the size of the files created on the MDS
are normally 0, since the data is written to the data files on the DS(s).
However, without this patch, if a Setattr with a non-zero size was done by
a client, the MDS file was set to that size. This was thought to be benign,
but it turns out that files with a non-zero size plus extended attributes
can cause a "ffs_truncate3" panic in UFS. Although the exact cause of this
panic() has not been isolated, this patch avoids the panic() and leaves
the MDS files in a consistent state of always having a size == 0.
Note that these MDS files never store data. The patch also includes an
unnecessary initialization of savsize in case some compiler or static
analyser complains it might not be initialized.
This patch only affects the NFS server when pNFS is enabled via the "-p"
command line option on nfsd.
jails since FreeBSD 7.
Along with the system call, put the various security.jail.allow_foo and
security.jail.foo_allowed sysctls partly under COMPAT_FREEBSD11 (or
BURN_BRIDGES). These sysctls had two disparate uses: on the system side,
they were global permissions for jails created via jail(2) which lacked
fine-grained permission controls; inside a jail, they're read-only
descriptions of what the current jail is allowed to do. The first use
is obsolete along with jail(2), but keep them for the second-read-only use.
Differential Revision: D14791
So that I don't have to keep grepping around the codebase to remember what each
one does. And maybe it saves someone else some time.
Fix a trivial whitespace issue while here.
No functional change.
Sponsored by: Dell EMC Isilon
The isonum_* functions are defined to take unsigend char* as an argument,
but the structure fields are defined as char. Change to u_char where needed.
Probably the full structure should be changed, but I'm not sure about the
side affects.
While there, add __packed attribute.
Differential Revision: https://reviews.freebsd.org/D16564
After a re-read of the appropriate section of RFC5661, I decided that a
few things should be changed related to LayoutRecall callback handling.
Here are the things fixed by this patch.
- For two of the three cases that LayoutRecall is done, I now think
setting the clora_changed argument false is correct.
- All errors other than NFSERR_DELAY returned by LayoutRecall appear
permanent, so don't retry for any of them. (NFSERR_DELAY is retried by
newnfs_request(), so it is not affected by this patch.)
- Instead of waiting "forever" (actually until the process is SIGTERM'd)
for Layouts to be returned during a mirror copy, fail and return
ENXIO after about 1minute.
Waiting for a <ctrl>C made sense when pnfsdscopymr() was done by itself,
but did not make sense when done via find(1).
This patch only affects the pNFS server.
These were found by the Undefined Behaviour GsoC project at NetBSD:
Do not change signedness bit with left shift.
While there avoid signed integer overflow.
Address both issues with using unsigned type.
msdosfs_fat.c:512:42, left shift of 1 by 31 places cannot be represented
in type 'int'
msdosfs_fat.c:521:44, left shift of 1 by 31 places cannot be represented
in type 'int'
msdosfs_fat.c:744:14, left shift of 1 by 31 places cannot be represented
in type 'int'
msdosfs_fat.c:744:24, signed integer overflow: -2147483648 - 1 cannot be
represented in type 'int [20]'
msdosfs_fat.c:840:13, left shift of 1 by 31 places cannot be represented
in type 'int'
msdosfs_fat.c:840:36, signed integer overflow: -2147483648 - 1 cannot be
represented in type 'int [20]'
Detected with micro-UBSan in the user mode.
Hinted from: NetBSD (CVS 1.33)
MFC after: 2 weeks
Differenctial Revision: https://reviews.freebsd.org/D16615
Do not allow to create more that EXT4_LINK_MAX links to directory in case
if the dir_nlink is not set, like it is done in the fresh e2fsprogs updates.
MFC after: 3 months
The checksum updating functions were not called in case of dir index inode splitting
and in case of dir entry removing, when the entry was first in the block.
Fix and move the dir entry adding logic when i_count == 0 to new function.
MFC after: 3 months
The code in newnfs_request() retries RPCs that get a reply of NFSERR_DELAY,
but exempts certain NFSv4 operations. However, for callback RPCs, there
should not be any exemptions at this time. The code would have erroneously
exempted the CBRECALL callback, since it has the same operation number as
the CLOSE operation.
This patch fixes this by checking for a callback RPC (indicated by clp != NULL)
and not checking for exempt operations for callbacks.
This would have only affected the NFSv4 server when delegations are enabled
(they are not enabled by default) and the client replies to CBRECALL with
NFSERR_DELAY. This may never actually happen.
Spotted during code inspection.
MFC after: 2 weeks
At least on x86, fhandle_t is a packed structure, so I believe an
assignment will copy all the bits. However, for some current/future
architectures, there might be padding in the structure that doesn't get
copied via an assignment.
Since NFS assumes a file handle is an opaque blob of bits that can be
compared via memcmp()/bcmp(), all the bits including any padding must be
copied.
This patch replaces the assignments with a call to a byte copy function.
Spotted during code inspection.
Newer versions of gcc generate "might not be initialized" warnings for
several variables in nfsrpc_doiods(). I have checked and all of these
variables are assigned values before they are used.
In the one case of "tdrpc", it could have passed garbage as an argument
to nfscl_dofflayoutio() when mirrorcnt is one. However nfscl_dofflayoutio() only
uses the argument when mirrorcnt > 1, so it wasn't actually broken.
This patch initializes "tdrpc" to avoid confusion and initializes the rest
to make the compiler happy.
Requested by: mmacy
Various components restrict size of IO passed up to the userspace filesystem
based on the mount's f_iosize value. The previous default of PAGE_SIZE
is anemic, even for normal filesystems, but especially considering every
FUSE operation involves a kernel <-> userspace IPC upcall.
Bump to DFLTPHYS (currently 64kB) to match other FUSE implementations.
Anecdotally, Jakub reports IO read performance increased from 600 MB/s ->
2700 MB/s with a basic RAM-backed FUSE filesystem.
PR: 230260
Reported by: Peter (MooseFS) <freebsd AT moosefs.com>
Tested by: Jakub Kruszona-Zawadzki <acid AT moosefs.com>
MFC after: 3 days
Newer versions of gcc generate "set, but not used" warnings.
Add __unused macros to silence these warnings.
Although the variables are not being used, they are values parsed from
arguments to callback RPCs that might be needed in the future.
Requested by: mmacy
I believe that a ReclaimComplete with rca_one_fs == TRUE is only
to be used after a file system has been transferred to a different
file server. However, RFC5661 is somewhat vague w.r.t. this and
the ESXi 6.7 client does both a ReclaimComplete with rca_one_fs == TRUE
and one with ReclaimComplete with rca_one_fs == FALSE.
Therefore, just ignore the rca_one_fs == TRUE operation and return
NFS_OK without doing anything instead of replying NFS4ERR_NOTSUPP.
This allows the ESXi 6.7 NFSv4.1 client to do a mount.
After discussion on the NFSv4 IETF working group mailing list, doing this
along with setting a flag to note that a ReclaimComplete with rca_one_fs TRUE
was an appropriate way to handle this.
The flag that indicates that a ReclaimComplete with rca_one_fs == TRUE was
done may be used to disable replies of NFS4ERR_GRACE for non-reclaim
state operations in a future commit.
This patch along with r332790, r334492 and r336357 allow ESXi 6.7 NFSv4.1 mounts
work ok. ESX 6.5 NFSv4.1 mounts do not work well, due to what I believe are
violations of RFC-5661 and should not be used.
Reported by: andreas.nagy@frequentis.com
Tested by: andreas.nagy@frequentis.com, daniel@ftml.net (earlier version)
MFC after: 2 weeks
Relnotes: yes
Use CLSET_TIMEOUT to set the timeout for connections to DSs instead of
specifying a timeout on each RPC. This is done so that SO_SNDTIMEO
is set on the TCP socket as well as specifying a time limit when
waiting for an RPC reply. Useful if the send queue for the TCP
connection has become constipated, due to a failed DS.
The choice of lease_duration / 4 is fairly arbitrary, but seems to work
ok, with a lower bound of 10sec.
For client connections to a DS, set the retry limit to vfs.nfsd.dsretries,
which is 2 by default.
This patch should only affect pNFS connections to DSs.
This patch requires r336542.
MFC after: 2 weeks
The ESXi NFSv4.1 client will generate warning messages when the reason for
not issuing a delegation is two. Two refers to a resource limit and I do
not see why it would be considered invalid. However it probably was not the
best choice of reason for not issuing a delegation.
This patch changes the reasons used to ones that the ESXi client doesn't
complain about. This change does not affect the FreeBSD client and does
not appear to affect behaviour of the Linux NFSv4.1 client.
RFC5661 defines these "reasons" but does not give any guidance w.r.t. which
ones are more appropriate to return to a client.
Tested by: andreas.nagy@frequentis.com
PR: 226650
MFC after: 2 weeks
When a NFSv4.1 client mount using pNFS detects a failure trying to do a
Renew (actually just a Sequence operation), the code would simply try
again and again and again every 30sec.
This would tie up the "nfscl" thread, which should also be doing other
things like Renews on other DSs and the MDS.
This patch adds code which closes down the TCP connection and marks it
defunct when Renew detects an failure to communicate with the DS, so
further Renews will not be attempted until a new working TCP connection to
the DS is established.
It also makes the call to nfscl_cancelreqs() unconditional, since
nfscl_cancelreqs() checks the NFSCLDS_SAMECONN flag and does so while holding
the lock.
This fix only applies to the NFSv4.1 client whne using pNFS and without it
the only effect would have been an "nfscl" thread busy doing Renew attempts
on an unresponsive DS.
MFC after: 2 weeks
Without this patch, the client side NFSv4.1 pNFS code erroneously did writes
and commits to both DS mirrors using the TCP connection of the first one.
For my test setup this worked, since I have both DSs running on the same
machine, but it would have failed when the DSs are on separate machines.
This patch fixes the code to use the correct TCP connection for each DS.
This patch should only affect the NFSv4.1 client when using "pnfs" mounts
to mirrored DSs.
MFC after: 2 weeks
So long as the TCP connection to a pNFS DS isn't shared with other DSs,
it can be closed down when the DS is being disabled in the pNFS client.
This causes any RPCs in progress to fail.
This patch only affects the NFSv4.1 pNFS client when errors occur
while doing I/O on a DS.
MFC after: 2 weeks
an I/O attempt on a DS to the server via LayoutReturn.
The current FreeBSD client can generate these errors for an operational
DS while doing a recovery of a mirror after a mirrored DS has been repaired.
I am not sure why these errors occur, but my best current guess is a race
between the Layout Recall issued by the kernel code run from pnfsdscopymr(8)
and a Read operation on the DS for the file bing copied.
The errrors are not fatal, since the client falls back on doing I/O through
the MDS, which can do the I/O successfully as a proxy. (The fact that the
MDS can do this indicates that the file does still exist on the functioning
DS.)
This patch only affects behaviour of the pNFS client and only when using
Flexible File layouts.
MFC after: 2 weeks
Without this patch, the NFSv4.1 pNFS client shared a single TCP connection
for all DSs that resided on the same machine. This made disabling one of
the DSs impossible. Although unlikely, it is possible that the storage
subsystem has failed in such a way that the storage for one DS on a machine
is no longer functioning correctly, but the storage used by another DS on
the same machine is still ok. For this case, it would be nice if a system
can fail one of the DSs without failing them all.
This patch changes the default behaviour to use separate TCP connections
for each DS even if they reside on the same machine.
I do not believe that this will be a problem for extant pNFS servers, but
a sysctl can be set to restore the old behaviour if this change causes a
problem for an extant pNFS server.
This patch only affects the NFSv4.1 pNFS client.
MFC after: 2 weeks
RFC5661 states that the cookie verifier should be 0 when the cookie is 0.
However, the wording is somewhat unclear and a recent discussion on the
nfsv4@ietf.org mailing list indicated that the NFSv4 server should ignore
the cookie verifier's value when the dirctory offset cookie is 0.
This patch deletes the check for this that would return NFSERR_BAD_COOKIE
when the verifier was not 0.
This was found during testing of the ESXi client against the NFSv4.1 server.
Reported by: daniel@ftml.net (via packet trace)
MFC after: 2 weeks
The pnfsdskill(8) command will normally fail if there is no valid mirror
for the DS to be disabled. However, a system administrator may need to
disable a DS which does not have a valid mirror so that the nfsd threads
can be terminated. This patch adds the kernel code needed by pnfsdskill(8)
to implement this "forced" case of disabling a DS.
This patch only affects the pNFS server.
If a mirrored DS is being recovered that has a lot of large sparse files,
pnfsdscopymr(8) would use a lot of space on the recovered mirror since it
would write the "holes" in the file being mirrored.
This patch adds code to check for a "hole" and skip doing the write.
The check is done on a "per PNFSDS_COPYSIZ size block", which is currently 64K.
I think that most file server file systems will be using a blocksize at
least this large. If the file server is using a smaller blocksize and
smaller holes need to be preserved, PNFSDS_COPYSIZ could be decreased.
The block of 0s is malloc()d, since pnfsdcopymr(8) should be an infrequent
occurrence.
After the addition of the "#mds_path" suffix for a DS specification on the
"-p" nfsd option, it is possible to have a mix of DSs assigned to an MDS
file system and DSs that store files for all DSs. This is what I referred
to as "hybrid" above.
At first, I didn't think this hybrid case would be useful, but I now believe
that some system administrators may fine it useful.
This patch modifies the file storage assignment algorithm so that it
makes the "#mds_path" DSs take priority and the all file systems DSs
are now only used for MDS file systems with no "#mds_path" DS servers.
This only affects the pNFS server for this "hybrid" case.
an NFSERR_STALE error reported via a LayoutReturn.
The current FreeBSD client can generate these errors for an operational
DS while doing a recovery of a mirror after a mirrored DS has been repaired.
I am not sure why these errors occur, but my best current guess is a race
between the Layout Recall issued by the kernel code run from pnfsdscopymr(8)
and a Read operation on the DS for the file bing copied.
The errors are not fatal, since the client falls back on doing I/O through
the MDS, which can do the I/O successfully as a proxy. (The fact that the
MDS can do this indicates that the file does still exist on the functioning
DS.)
This change only affects the pNFS server and only when a client does a
LayoutReturn with the NFSERR_STALE error report.
The recently added feature of the pNFS server will set an fsid for the
MDS file system to define the file system a DS should store files for.
For a case where a DS handling all file systems has failed, it was possible
for the code to check for a mirror with a specified fs, even though
nfsdev_mdsisset was 0, possibly causing a false successful check for a mirror.
This patch adds a check for nfsdev_mdsisset != 0 to avoid this.
It only affects the pNFS server for a rare case. Found via code inspection.
Without this patch, the pNFS server distributes the data storage files across
all of the specified DSs.
A tester noted that it would be nice if a system administrator could control
which DSs are used to store the file data for a given exported MDS file system.
This patch adds the kernel support to do this. It also makes a slight semantic
change to nfsv4_findmirror(), since some uses of it no longer require that
the DS being searched for have a current mirror.
A patch that will be committed in a few minutes will modify the nfsd daemon
to support this feature.
The patch should only affect sites using the pNFS server (specified via the
"-p" command line option for nfsd.
Suggested by: james.rose@framestore.com
If a pNFS service was set up where the number of DSs equals the mirror level
and then a DS was disabled, the service would create files with duplicate
entries for the same DS. This bug occurred because I didn't realize that
TAILQ_FOREACH_FROM() would start at the beginning of the list when the
inital value of the variable was NULL.
This patch also changes the pNFS server DS file creation code so that it
creates entrie(s) with 0.0.0.0 IP address when it cannot create mirror level
files due to lack of DSs.
The patch only affects the pNFS service and only when it was created with
a number of DSs equal to the mirror level and mirroring is enabled.
Most NFSv4.1 compound RPCs start with a Sequence operation. For these
cases, save the slotid and note that it is saved by setting ND_HASSLOTID.
This is used by r335568 to free up the session slot and disable it.
MFC after: 2 weeks
r335568 uses a flag called ND_HASSLOTID to indicate that the slotid is set,
so it can free and invalidate it.
This flag needs to be set, which will be done in a subsequent commit.
MFC after: 2 weeks
When a "soft" mount is used for NFSv4.1, an RPC that fails without completing
will leave a slot in the NFSv4.1 session in an indeterminate state.
As such, all that can be done is free up the slot while making is no longer
usable.
A "soft" NFSv4.1 mount is not recommended in general, since it will leave
Open/Lock state in an indeterminate state. An exception is a pNFS mount of
a DS, since there are no Opens/Locks done for them except file creates
where loss of the Open state does not matter.
The patch also makes connections to DSs soft, so that they will fail when
a DS is non-functional or network partitioned, allowing the pNFS MDS to disable
the DS for a mirrored configuration.
This patch should not affect normal "hard" NFSv4.1 mounts.
MFC after: 2 weeks
When the NFSv4.1 pNFS client gets an error for a DS I/O operation using a
Flexible File layout, it returns the layout with an error.
This patch changes the code slightly, so that it returns the layout for all
errors except EACCES and lets the MDS decide what to do based on the error.
It also makes a couple of changes to nfscl_layoutrecall() to ensure that
the first layoutreturn(s) will have the error in the reply.
Plus, the patch adds a wakeup() so that the "nfscl" thread won't wait 1sec
before doing the LayoutReturn.
Tested against the pNFS service.
This patch should not affect non-pNFS use of the client.
The unused "dsp" argument will be used by a future patch that disables the
connection to the DS when possible.
MFC after: 2 weeks
This patch adds a counter that limits the number of disabled mirrored DSs
to mirror level - 1. It also makes a small change that keeps a Write that
has failed with EACCES when attempted by a client to a DS from disabling
the DS.
This patch only affects the pNFS server.
The Flexible File layout LayoutReturn operation has argument fields where
an I/O error encountered when attempting I/O on a DS can be reported back
to the MDS.
This patch adds code to the client to do this for the Flexible File layout
mirrored case.
This patch should only affect mounts using the "pnfs" option against servers
that support the Flexible File layout.
MFC after: 2 weeks
Normally "soft,retrans=2" cannot be safely used on NFSv4 mounts, since
the RPC can fail and leave the open/lock state in an undefined state.
Doing I/O on a pNFS DS is an exception to this, since no open/lock state
is maintained on the DS server.
It is useful to do "soft,retrans=2" connections to a DS when it is mirrored,
so that the client can detect failure of the DS. As such, mounts from the MDS
to the DSs should use these mount options when mirroring is enabled.
However, the NFSv4.1 client still leaves the session in an undefined state
when this happens.
This patch fixes the problem by setting the session defunct, so it will
no longer be used.
The patch also sets "retries=2" on the connections done by the client to
a DS, which is the internal equivalent of "soft,retrans=2".
The client does not know if the server implements mirroring at connection
time, but always doing this should be safe, since it will fall back on doing
I/O via the MDS as a proxy when there is a failure doing an I/O RPC to the DS.
This patch should not affect non-pNFS client mounts.
MFC after: 2 weeks
Four functions nfscl_reqstart(), nfscl_fillsattr(), nfsm_stateidtom()
and nfsmnt_mdssession() are now called from within the nfsd.
As such, they needed to be moved from nfscl.ko to nfscommon.ko so that
nfsd.ko would load when nfscl.ko wasn't loaded.
Reported by: herbert@gojira.at
the old encodings for the lower 16 and 32 bits and only using the
higher 32 bits for unusually large major and minor numbers. This
change breaks compatibility with the previous encoding (which was only
used in -current).
Fix truncation to (essentially) 16-bit dev_t in newnfs v3.
Any encoding of device numbers gives an ABI, so it can't be changed
without translations for compatibility. Extra bits give the much
larger complication that the translations need to compress into fewer
bits. Fortunately, more than 32 bits are rarely needed, so
compression is rarely needed except for 16-bit linux dev_t where it
was always needed but never done.
The previous encoding moved the major number into the top 32 bits.
Almost no translation code handled this, so the major number was blindly
truncated away in most 32-bit encodings. E.g., for ffs, mknod(8) with
major = 1 and minor = 2 gave dev_t = 0x10000002; ffs cannot represent
this and blindly truncated it to 2. But if this mknod was run on any
released version of FreeBSD, it gives dev_t = 0x102. ffs can represent
this, but in the previous encoding it was not decoded, giving major = 0,
minor = 0x102.
The presence of bugs was most obvious for exporting dev_t's from an
old system to -current, since bugs in newnfs augment them. I fixed
oldnfs to support 32-bit dev_t in 1996 (r16634), but this regressed
to 16-bit dev_t in newnfs, first to the old 16-bit encoding and then
further in -current. E.g., old ad0 with major = 234, minor = 0x10002
had the correct (major, minor) number on the wire, but newnfs truncated
this to (234, 2) and then the previous encoding shifted the major
number into oblivion as seen by ffs or old applications.
I first tried to fix this by translating on every ABI/API boundary, but
there are too many boundaries and too many sloppy translations by blind
truncation. So use the old encoding for the low 32 bits so that sloppy
translations work no worse than before provided the high 32 bits are
not set. Add some error checking for when bits are lost. Keep not
doing any error checking for translations for almost everything in
compat/linux.
compat/freebsd32/freebsd32_misc.c:
Optionally check for losing bits after possibly-truncating assignments as
before.
compat/linux/linux_stats.c:
Depend on the representation being compatible with Linux's (or just with
itself for local use) and spell some of the translations as assignments in
a macro that hides the details.
fs/nfsclient/nfs_clcomsubs.c:
Essentially the same fix as in 1996, except there is now no possible
truncation in makedev() itself. Also fix nearby style bugs.
kern/vfs_syscalls.c:
As for freebsd32. Also update the sysctl description to include file
numbers, and change it to describe device ids as device numbers.
sys/types.h:
Use inline functions (wrapped by macros) since the expressions are now
a bit too complicated for plain macros. Describe the encoding and
some of the reasons for it. 16-bit compatibility didn't leave many
reasonable choices for the 32-bit encoding, and 32-bit compatibility
doesn't leave many reasonable choices for the 64-bit encoding. My
choice is to put the 8 new minor bits in the low 8 bits of the top 32
bits. This minimizes discontiguities.
Reviewed by: kib (except for rewrite of the comment in linux_stats.c)
This code merge adds a pNFS service to the NFSv4.1 server. Although it is
a large commit it should not affect behaviour for a non-pNFS NFS server.
Some documentation on how this works can be found at:
http://people.freebsd.org/~rmacklem/pnfs-planb-setup.txt
and will hopefully be turned into a proper document soon.
This is a merge of the kernel code. Userland and man page changes will
come soon, once the dust settles on this merge.
It has passed a "make universe", so I hope it will not cause build problems.
It also adds NFSv4.1 server support for the "current stateid".
Here is a brief overview of the pNFS service:
A pNFS service separates the Read/Write oeprations from all the other NFSv4.1
Metadata operations. It is hoped that this separation allows a pNFS service
to be configured that exceeds the limits of a single NFS server for either
storage capacity and/or I/O bandwidth.
It is possible to configure mirroring within the data servers (DSs) so that
the data storage file for an MDS file will be mirrored on two or more of
the DSs.
When this is used, failure of a DS will not stop the pNFS service and a
failed DS can be recovered once repaired while the pNFS service continues
to operate. Although two way mirroring would be the norm, it is possible
to set a mirroring level of up to four or the number of DSs, whichever is
less.
The Metadata server will always be a single point of failure,
just as a single NFS server is.
A Plan B pNFS service consists of a single MetaData Server (MDS) and K
Data Servers (DS), all of which are recent FreeBSD systems.
Clients will mount the MDS as they would a single NFS server.
When files are created, the MDS creates a file tree identical to what a
single NFS server creates, except that all the regular (VREG) files will
be empty. As such, if you look at the exported tree on the MDS directly
on the MDS server (not via an NFS mount), the files will all be of size 0.
Each of these files will also have two extended attributes in the system
attribute name space:
pnfsd.dsfile - This extended attrbute stores the information that
the MDS needs to find the data storage file(s) on DS(s) for this file.
pnfsd.dsattr - This extended attribute stores the Size, AccessTime, ModifyTime
and Change attributes for the file, so that the MDS doesn't need to
acquire the attributes from the DS for every Getattr operation.
For each regular (VREG) file, the MDS creates a data storage file on one
(or more if mirroring is enabled) of the DSs in one of the "dsNN"
subdirectories. The name of this file is the file handle
of the file on the MDS in hexadecimal so that the name is unique.
The DSs use subdirectories named "ds0" to "dsN" so that no one directory
gets too large. The value of "N" is set via the sysctl vfs.nfsd.dsdirsize
on the MDS, with the default being 20.
For production servers that will store a lot of files, this value should
probably be much larger.
It can be increased when the "nfsd" daemon is not running on the MDS,
once the "dsK" directories are created.
For pNFS aware NFSv4.1 clients, the FreeBSD server will return two pieces
of information to the client that allows it to do I/O directly to the DS.
DeviceInfo - This is relatively static information that defines what a DS
is. The critical bits of information returned by the FreeBSD
server is the IP address of the DS and, for the Flexible
File layout, that NFSv4.1 is to be used and that it is
"tightly coupled".
There is a "deviceid" which identifies the DeviceInfo.
Layout - This is per file and can be recalled by the server when it
is no longer valid. For the FreeBSD server, there is support
for two types of layout, call File and Flexible File layout.
Both allow the client to do I/O on the DS via NFSv4.1 I/O
operations. The Flexible File layout is a more recent variant
that allows specification of mirrors, where the client is
expected to do writes to all mirrors to maintain them in a
consistent state. The Flexible File layout also allows the
client to report I/O errors for a DS back to the MDS.
The Flexible File layout supports two variants referred to as
"tightly coupled" vs "loosely coupled". The FreeBSD server always
uses the "tightly coupled" variant where the client uses the
same credentials to do I/O on the DS as it would on the MDS.
For the "loosely coupled" variant, the layout specifies a
synthetic user/group that the client uses to do I/O on the DS.
The FreeBSD server does not do striping and always returns
layouts for the entire file. The critical information in a layout
is Read vs Read/Writea and DeviceID(s) that identify which
DS(s) the data is stored on.
At this time, the MDS generates File Layout layouts to NFSv4.1 clients
that know how to do pNFS for the non-mirrored DS case unless the sysctl
vfs.nfsd.default_flexfile is set non-zero, in which case Flexible File
layouts are generated.
The mirrored DS configuration always generates Flexible File layouts.
For NFS clients that do not support NFSv4.1 pNFS, all I/O operations
are done against the MDS which acts as a proxy for the appropriate DS(s).
When the MDS receives an I/O RPC, it will do the RPC on the DS as a proxy.
If the DS is on the same machine, the MDS/DS will do the RPC on the DS as
a proxy and so on, until the machine runs out of some resource, such as
session slots or mbufs.
As such, DSs must be separate systems from the MDS.
Tested by: james.rose@framestore.com
Relnotes: yes
There were a couple of cases in newnfs_request() that it assumed that it
was an NFSv4.1 mount with a session. This should always be the case when
a Sequence operation is in the reply or the server replies NFSERR_BADSESSION.
However, if a server was broken and sent an erroneous reply, these safety
belt checks should avoid trouble.
The one check required a small tweak to nfsmnt_mdssession() so that it
returns NULL when there is no session instead of the offset of the field
in the structure (0x8 for i386).
This patch should have no effect on normal operation of the client.
Found by inspection during pNFS server development.
MFC after: 2 weeks
The Flexible File layout case wasn't handled by LayoutRecall callbacks
because it just checked for File layout and returned NFSERR_NOMATCHLAYOUT
otherwise. This patch adds the Flexible File layout handling.
Found during testing of the pNFS server.
MFC after: 2 weeks
These macros were added because they were used by the pNFS server last
year. However, they are no longer used by the pNFS server code and
might as well be deleted.
This is a partial reversion of r326735.
NFSDEV_MIRRORSTR was defined for the pNFS server, but has not been used,
so this patch deletes it. It also cleans up the comment and hopefully
makes it more readable.
gcc8 warns that "verf" was set but not used. This was because the code
that uses it is disabled via a "#if 0".
This patch adds a "#if 0" to the variable's declaration and assignment
to get rid of the warning.
This way the code could be re-enabled without difficulty.
Requested by: mmacy
MFC after: 2 weeks
The intent was that the default would be based on number of CPUs, but the
code disabled using taskqueue() by default.
This code is only executed when mounting a NFSv4.1 server that supports the
Flexible File layout for pNFS and, since such servers are rare, this change
shouldn't result in a POLA violation.
(The FreeBSD pNFS server is still a project and the only other one that
uses Flexible File layout is being developed by Primary Data and I don't
know if they have even shipped any to customers yet.)
Found while testing the pNFS server.
Under some fairly unusual circumstances, the Linux NFSv4.1 client is
doing a BindConnectiontoSession operation for TCP connections.
It is also used by the ESXi6.5 NFSv4.1 client.
This patch adds this operation to the NFSv4.1 server.
Reported by: andreas.nagy@frequentis.com
Tested by: andreas.nagy@frequentis.com
MFC after: 2 weeks
If a client did a DestroySession on a session while it was still in use,
the server might try to use the session structure after it is free'd.
I think the client has violated RFC5661 if it does this, but this patch
makes DestroySession block all other nfsd threads so no thread could
be using the session when it is free'd. After the DestroySession, nfsd
threads will not be able to find the session. The patch also adds a check
for nd_sessionid being set, although if that was not the case it would have
been all 0s and unlikely to have a false match.
This might fix the crashes described in PR#228497 for the FreeNAS server.
PR: 228497
MFC after: 1 week
The sleep for I/O completion during an NFSv4.1 pNFS layout recall used
the wrong event value and could result in the "[nfscl]" thread hung
for the mount.
This patch fixes the event to be the correct.
This bug will only affect NFSv4.1 pnfs mounts and only when the server
does a layout recall callback, so it won't affect many. Without the patch,
a mount without the "pnfs" option will avoid the problem.
Found during testing of the pNFS server.
MFC after: 1 week
Since NFSv4.1 clients normally create a single session which supports
both fore and back channels, it is unlikely that a callback will fail
due to a lack of a back channel.
However, if this failure occurred, the session wasn't being dereferenced
and would never be free'd.
Found by inspection during pNFS server development.
Tested by: andreas.nagy@frequentis.com
MFC after: 2 months
(https://svnweb.freebsd.org/base?view=revision&revision=304232)
converting clrbuf() (which clears the entire buffer) to vfs_bio_clrbuf()
(which clears only the new pages that have been added to the buffer).
Failure to properly remove pages from the buffer cache can make
pages that appear not to need clearing to actually have bad random
data in them. See for example base r304232
(https://svnweb.freebsd.org/base?view=revision&revision=304232)
which noted the need to set B_INVAL and B_NOCACHE as well as clear
the B_CACHE flag before calling brelse() to release the buffer.
Rather than trying to find all the incomplete brelse() calls, it
is simpler, though more slightly expensive, to simply clear the
entire buffer when it is newly allocated.
PR: 213507
Submitted by: Damjan Jovanovic
Reviewed by: kib
The NFSv4 protocol requires that the server only allow reclaim of state
and not issue any new open/lock state for a grace period after booting.
The NFSv4.0 protocol required this grace period to be greater than the
lease duration (over 2minutes). For NFSv4.1, the client tells the server
that it has done reclaiming state by doing a ReclaimComplete operation.
If all NFSv4 clients are NFSv4.1, the grace period can end once all the
clients have done ReclaimComplete, shortening the time period considerably.
This patch does this. If there are any NFSv4.0 mounts, the grace period
will still be over 2minutes.
This change is only an optimization and does not affect correct operation.
Tested by: andreas.nagy@frequentis.com
MFC after: 2 months
In the reply to an ExchangeID operation, the NFSv4.1 server returns a
"scope" value (eir_server_scope). If this value is the same, it indicates
that two servers share state, which is never the case for FreeBSD servers.
As such, the value needs to be unique and it was without this patch.
However, I just found out that it is not supposed to change when the
server reboots and without this patch, it did change.
This patch fixes eir_server_scope so that it does not change when the
server is rebooted.
The only affect not having this patch has is that Linux clients don't
reclaim opens and locks after a server reboot, which meant they lost
any byte range locks held before the server rebooted.
It only affects NFSv4.1 mounts and the FreeBSD NFSv4.1 client was not
affected by this bug.
MFC after: 1 week
For a fairly rare case of a client doing an ExchangeID after a hard reboot,
the old confirmed clientid still exists, but some clients use a new
co_verifier. For this case, the server was not freeing up the sessions on
the old confirmed clientid.
This patch fixes this case. It also adds two LIST_INIT() macros, which are
actually no-ops, since the structure is malloc()d with M_ZERO so the pointer
is already set to NULL.
It should have minimal impact, since the only way I could exercise this
code path was by doing a hard power cycle (pulling the plus) on a machine
running Linux with a NFSv4.1 mount on the server.
Originally spotted during testing of the ESXi 6.5 client.
Tested by: andreas.nagy@frequentis.com
MFC after: 2 months
When an NFSv4.1 session is busy due to a callback being in progress,
nfsrv_freesession() should return NFSERR_BACKCHANBUSY instead of NFS_OK.
The only effect this has is that the DestroySession operation will report
the failure for this case and this probably has little or no effect on a
client. Spotted by inspection and no failures related to this have been
reported.
MFC after: 2 months
The Linux client now uses the TestStateID operation, so this patch adds
support for it to the NFSv4.1 server. The FreeBSD client never uses this
operation, so it should not be affected.
MFC after: 2 months
- Add macros to allow preinitialization of cap_rights_t.
- Convert most commonly used code paths to use preinitialized cap_rights_t.
A 3.6% speedup in fstat was measured with this change.
Reported by: mjg
Reviewed by: oshogbo
Approved by: sbruno
MFC after: 1 month
Most filesystems, with the notable exceptions of msdosfs and autofs use
only vfs_timestamp() to read the current time. This has the benefit of
configurable granularity (using the vfs.timestamp_precision sysctl).
For convenience, use it on msdosfs too.
Submitted by: Damjan Jovanovic
Differential Revision: https://reviews.freebsd.org/D15297
by doing most of the work in a new function prison_add_vfs in kern_jail.c
Now a jail-enabled filesystem need only mark itself with VFCF_JAIL, and
the rest is taken care of. This includes adding a jail parameter like
allow.mount.foofs, and a sysctl like security.jail.mount_foofs_allowed.
Both of these used to be a static list of known filesystems, with
predefined permission bits.
Reviewed by: kib
Differential Revision: D14681
This fixes a regression that happened in r120492 (2003) where libkiconv
was introduced and we went from checking unlen to checking for '\0'.
PR: 111843
Patch by: Damjan Jovanovic
MFC after: 1 week
This patch adds two missing LIST_INIT()s. Found by inspection.
In practice, these are currently no-ops, since the structure they are
in is malloc'd with M_ZERO and all LIST_INIT does is set the pointer
in the list head to NULL. (In other words, the M_ZERO has already
correctly initialized it.)
MFC after: 2 months
Using a pointer after setting it NULL is probably not a good plan.
Spotted by inspection during changes for Flexible File Layout Ioerr handling.
This code path obviously isn't normally executed.
MFC after: 1 week
The NFSv4.1 RFC specifies that the OPEN_SHARE_ACCESS_WANT bits can be set
in the OpenDowngrade share_access argument and are basically ignored.
I do not know of a extant NFSv4.1 client that does this, but this little
patch fixes it just in case.
It also changes the error from NFSERR_BADXDR to NFSERR_INVAL since the NFSv4.1
RFC specifies this as the error to be returned if bogus bits are set.
(The NFSv4.0 RFC didn't specify any error for this, so the error reply can
be changed for NFSv4.0 as well.)
Found by inspection while looking at a problem with OpenDowngrade reported
for the ESXi 6.5 NFSv4.1 client.
Reported by: andreas.nagy@frequentis.com
PR: 227214
MFC after: 1 week
opt_compat.h is mentioned in nearly 180 files. In-progress network
driver compabibility improvements may add over 100 more so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.
Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c. A fake _COMPAT_LINUX option ensure opt_compat.h
is created on all architectures.
Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.
Reviewed by: kib, cem, jhb, jtl
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14941
This is part of a project for adding the ability to create hybrid CD/USB boot
images. In the BIOS case when booting from something that isn't a CD we need
some extra boot code to actually find our next stage (loader) within an
ISO9660 filesystem. This code will reside in a GPT partition (similar to
gptboot(8) from which it is derived) and looks for /boot/loader in an
ISO9660 filesystem on the image.
Reviewed by: imp
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D14914
Followup to r313780. Also prefix ext2's and nandfs's versions with
EXT2_ and NANDFS_.
Reported by: kib
Reviewed by: kib, mckusick
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D9623
ncookies cannot be negative or the allocator will fail. This should only
happen if a caller is very broken but we can still try to survive the
event.
We should probably also verify for uio_resid > MAXPHYS but in that case
it is not clear that just clipping the ncookies value is an adequate
response.
MFC after: 2 weeks
global to per-domain state. Protect reservations with the free lock
from the domain that they belong to. Refactor to make vm domains more
of a first class object.
Reviewed by: markj, kib, gallatin
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14000
This was a hack to be able to mount ext4 filesystems read-only while not
supporting all the features. We now support all those features so it
doesn't make sense to keep the undocumented hack.
Discussed with: fsu
Delay the initialization of variables until the are needed.
In the case of ext4_ext_rm_leaf(), make sure 'error' value is not
undefined.
Reported by: Clang's static analyzer
Differential Revision: https://reviews.freebsd.org/D14193
Sanitize the values that will be assigned to ncookies so that we ensure
they are sane and we can handle them.
Let ncookies signed as it was before r328346. The valid range is such
that unsigned values are not required and we are not able to avoid at
least one cast anyways.
Hinted by: bde
Mechanically replace uses of MALLOC/FREE with appropriate invocations of
malloc(9) / free(9) (a series of sed expressions). Something like:
* MALLOC(a, b, ... -> a = malloc(...
* FREE( -> free(
* free((caddr_t) -> free(
No functional change.
For now, punt on modifying contrib ipfilter code, leaving a definition of
the macro in its KMALLOC().
Reported by: jhb
Reviewed by: cy, imp, markj, rmacklem
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14035