Just like /dev/devctl, /dev/fuse will now report the number of operations
available for immediate read in the kevent.data field during kevent(2).
Sponsored by: The FreeBSD Foundation
/dev/fuse was already pollable with poll and select. Add support for
kqueue, too. And add tests for polling with poll, select, and kqueue.
Sponsored by: The FreeBSD Foundation
If the daemon dies, return ENOTCONN for all operations that have already
been sent to the daemon, as well as any new ones.
Sponsored by: The FreeBSD Foundation
If the daemon is known to ignore FUSE_INTERRUPT, then we may as well block
all signals while waiting for a response.
Sponsored by: The FreeBSD Foundation
When a FUSE daemon dies or closes /dev/fuse, all of that daemon's pending
requests must be terminated. Previously that was done in /dev/fuse's
.d_close method. However, d_close only gets called on the *last* close of
the device. That means that if multiple daemons were running concurrently,
all but the last daemon to close would leave their I/O hanging around. The
problem was easily visible just by running "kyua -v parallelism=2 test" in
fusefs's test directory.
Fix this bug by terminating a daemon's pending I/O during /dev/fuse's
cdvpriv dtor method instead. That method runs on every close of a file.
Also, fix some potential races in the tests:
* Clear SA_RESTART when registering the daemon's signal handler so read(2)
will return EINTR.
* Wait for the daemon to die before unmounting the mountpoint, so we won't
see an unwanted FUSE_DESTROY operation in the mock file system.
Sponsored by: The FreeBSD Foundation
libfuse expects sockets to be created with FUSE_MKNOD, not FUSE_CREATE,
because that's how Linux does it. My first attempt at creating sockets
(r346894) used FUSE_CREATE because FreeBSD uses VOP_CREATE for this purpose.
There are no backwards-compatibility concerns with this change, because
socket support hasn't yet been merged to head.
Sponsored by: The FreeBSD Foundation
Any change to a directory's contents should cause its mtime and ctime to be
updated by the FUSE daemon. Clear its attribute cache so we'll get the new
attributs the next time that they're needed. This affects the following
VOPs: VOP_CREATE, VOP_LINK, VOP_MKDIR, VOP_MKNOD, VOP_REMOVE, VOP_RMDIR, and
VOP_SYMLINK
Reported by: pjdfstest
Sponsored by: The FreeBSD Foundation
If the file to be renamed is a directory and it's going to get a new parent,
then the user must have write permissions to that directory, because the
".." dirent must be changed.
Reported by: pjdfstest
Sponsored by: The FreeBSD Foundation
FUSE_LINK returns a new set of attributes. fusefs should cache them just
like it does during other VOPs. This is not only a matter of performance
but of correctness too; without caching the new attributes the vnode's nlink
value would be out-of-date.
Reported by: pjdfstest
Sponsored by: The FreeBSD Foundation
Even an unprivileged user should be able to chown a file to its current
owner, or chgrp it to its current group. Those are no-ops.
Reported by: pjdfstest
Sponsored by: The FreeBSD Foundation
fuse file systems have far too much variability for the standard
posix_fallocate implementation to work. A future protocol revision (7.19)
adds a FUSE_FALLOCATE operation, but we don't support that yet. Better to
simply return EINVAL until then.
Reported by: pjdfstest
Sponsored by: The FreeBSD Foundation
ftruncate should succeed as long as the file descriptor is writable, even if
the file doesn't have write permission. This is important when combined
with O_CREAT.
Reported by: pjdfstest
Sponsored by: The FreeBSD Foundation
Don't allow unprivileged users to set SGID on files to whose group they
don't belong. This is slightly different than what POSIX says we should do
(clear sgid on return from a successful chmod), but it matches what UFS
currently does.
Reported by: pjdfstest
Sponsored by: The FreeBSD Foundation
The readonly mount check had a special case allowing the sizes of files to
be changed if they weren't regular files. I don't know why. Neither UFS,
ZFS, nor ext2 have such a special case, and I don't know when you would ever
change the size of a non-regular file anyway.
Sponsored by: The FreeBSD Foundation
These panics all lie in the error path. The only one I've hit is caused by
a buggy FUSE server unexpectedly changing the type of a vnode.
Sponsored by: The FreeBSD Foundation
When mounted with -o default_permissions fusefs is supposed to validate all
permissions in the kernel, not the file system. This commit fixes two
permissions that I had previously overlooked.
* Only root may chown a file
* Non-root users may only chgrp a file to a group to which they belong
PR: 216391
Sponsored by: The FreeBSD Foundation
As of r346162 fuse now invalidates the cache during writes. But it can't do
that when writing from VOP_PUTPAGES, because the write is coming _from_ the
cache. Trying to invalidate the cache in that situation causes a deadlock
in vm_object_page_remove, because the pages in question have already been
busied by the same thread.
PR: 235774
Sponsored by: The FreeBSD Foundation
Though it's not documented, Linux will interpret a FUSE_INTERRUPT response
of ENOSYS as "the file system does not support FUSE_INTERRUPT".
Subsequently it will never send FUSE_INTERRUPT again to the same mount
point. This change matches Linux's behavior.
PR: 346357
Sponsored by: The FreeBSD Foundation
* Block stop signals in fticket_wait_answer
* Hold ps_mtx while checking signal disposition
* style(9) changes
PR: 346357
Reported by: kib
Sponsored by: The FreeBSD Foundation
The main difference is to replace some custom logic with bread. No
functional change at this point, but this is one step towards adding
readahead.
Sponsored by: The FreeBSD Foundation
When interrupting a FUSE operation, send the FUSE_INTERRUPT op to the daemon
ASAP, ahead of other unrelated operations.
PR: 236530
Sponsored by: The FreeBSD Foundation
fusefs's VOP_SETEXTATTR calls uiomove(9) before blocking, so it can't be
restarted. It must be interrupted instead.
PR: 236530
Sponsored by: The FreeBSD Foundation
If a pending FUSE operation hasn't yet been sent to the daemon, then there's
no reason to inform the daemon that it's been interrupted. Instead, simply
remove it from the fuse message queue and set its status to EINTR or
ERESTART as appropriate.
PR: 346357
Sponsored by: The FreeBSD Foundation
* If a process receives a fatal signal while blocked on a fuse operation,
return ASAP without waiting for the operation to complete. But still send
the FUSE_INTERRUPT op to the daemon.
* Plug memory leaks from r346339
Interruptibility is now fully functional, but it could be better:
* Operations that haven't been sent to the server yet should be aborted
without sending FUSE_INTERRUPT.
* It would be great if write operations could be made restartable.
That would require delaying uiomove until the last possible moment, which
would be sometime during fuse_device_read.
* It would be nice if we didn't have to guess which EAGAIN responses were
for FUSE_INTERRUPT operations.
PR: 236530
Sponsored by: The FreeBSD Foundation
The fuse protocol includes a FUSE_INTERRUPT operation that the client can
send to the server to indicate that it wants to abort an in-progress
operation. It's required to interrupt any syscall that is blocking on a
fuse operation.
This commit adds basic FUSE_INTERRUPT support. If a process receives any
signal while it's blocking on a FUSE operation, it will send a
FUSE_INTERRUPT and wait for the original operation to complete. But there
is still much to do:
* The current code will leak memory if the server ignores FUSE_INTERRUPT,
which many do. It will also leak memory if the server completes the
original operation before it receives the FUSE_INTERRUPT.
* An interrupted read(2) will incorrectly appear to be successful.
* fusefs should return immediately for fatal signals.
* Operations that haven't been sent to the server yet should be aborted
without sending FUSE_INTERRUPT.
* Test coverage should be better.
* It would be great if write operations could be made restartable.
That would require delaying uiomove until the last possible moment, which
would be sometime during fuse_device_read.
PR: 236530
Sponsored by: The FreeBSD Foundation
fusefs's default cache mode is "writethrough", although it currently works
more like "write-around"; writes bypass the cache completely. Since writes
bypass the cache, they were leaving stale previously-read data in the cache.
This commit invalidates that stale data. It also adds a new global
v_inval_buf_range method, like vtruncbuf but for a range of a file.
PR: 235774
Reported by: cem
Sponsored by: The FreeBSD Foundation
For many FUSE opcodes, an error of ENOSYS has special meaning. fusefs
already handled some of those; this commit adds handling for the remainder:
* FUSE_FSYNC, FUSE_FSYNCDIR: ENOSYS means "success, and automatically return
success without calling the daemon from now on"
* All extattr operations: ENOSYS means "fail EOPNOTSUPP, and automatically
do it without calling the daemon from now on"
PR: 236557
Sponsored by: The FreeBSD Foundation
The fuse protocol is designed with security in mind. It prevents users from
spying on each others' activities. And it doesn't grant users any
privileges that they didn't already have. So it's appropriate to make it
available to everyone. Plus, it's necessary in order for kyua to run tests
as an unprivileged user.
Sponsored by: The FreeBSD Foundation
fusefs tracks each vnode's parent. The rename code was already correctly
updating it. Delete a comment that said otherwise, and add a regression
test for it.
Sponsored by: The FreeBSD Foundation
Don't panic if the server changes the file type of a file without us first
deleting it. That could indicate a buggy server, but it could also be the
result of one of several race conditions. Return EAGAIN as we do elsewhere.
Sponsored by: The FreeBSD Foundation
When the entry cache expires, it's only necessary to purge the cache.
Disappearing a vnode also purges the attribute cache, which is unnecessary,
and invalidates the data cache, which could be harmful.
Sponsored by: The FreeBSD Foundation
I got most of -o default_permissions working in r346088. This commit adds
sticky bit checks. One downside is that sometimes there will be an extra
FUSE_GETATTR call for the parent directory during unlink or rename. But in
actual use I think those attributes will almost always be cached.
PR: 216391
Sponsored by: The FreeBSD Foundation
fuse_vnop_lookup was using a FUSE_GETATTR operation when looking up "." and
"..", even though the only information it needed was the file type and file
size. "." and ".." are obviously always going to be directories; there's no
need to double check.
Sponsored by: The FreeBSD Foundation
fuse_vnop_lookup contained an awkward hack meant to reduce daemon activity
during long lookup chains. However, the hack is no longer necessary now
that we properly cache file attributes. Also, I'm 99% certain that it
could've bypassed permission checks when using openat to open a file
relative to a directory that lacks execute permission.
Sponsored by: The FreeBSD Foundation
* Eliminate fuse_access_param. Whatever it was supposed to do, it seems
like it was never complete. The only real function it ever seems to have
had was a minor performance optimization, which I've already eliminated.
* Make extended attribute operations obey the allow_other mount option.
* Allow unprivileged access to the SYSTEM extattr namespace when
-o default_permissions is not in use.
* Disallow setextattr and deleteextattr on read-only mounts.
* Add tests for a few more error cases.
Sponsored by: The FreeBSD Foundation
Normally all permission checking is done in the fuse server. But when -o
default_permissions is used, it should be done in the kernel instead. This
commit adds appropriate permission checks through fusefs when -o
default_permissions is used. However, sticky bit checks aren't working yet.
I'll handle those in a follow-up commit.
There are no checks for file flags, because those aren't supported by our
version of the FUSE protocol. Nor is there any support for ACLs, though
that could be added if there were any demand.
PR: 216391
Reported by: hiyorin@gmail.com
Sponsored by: The FreeBSD Foundation
The FUSE protocol includes a way for a server to tell the client that a
negative lookup response is cacheable for a certain amount of time.
PR: 236226
Sponsored by: The FreeBSD Foundation
Provide a convenience function to avoid the hack with filling fake
struct vop_fsync_args and then calling vop_stdfsync().
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Follow-up to r346046. These two commits implement fuse cache timeouts for
both entries and attributes. They also remove the vfs.fusefs.lookup_cache
enable sysctl, which is no longer needed now that cache timeouts are
honored.
PR: 235773
Sponsored by: The FreeBSD Foundation
The FUSE protocol allows the server to specify the timeout period for the
client's attribute and entry caches. This commit implements the timeout
period for the attribute cache. The entry cache's timeout period is
currently disabled because it panics, and is guarded by the
vfs.fusefs.lookup_cache_expire sysctl.
PR: 235773
Reported by: cem
Sponsored by: The FreeBSD Foundation
FUSE_LOOKUP, FUSE_GETATTR, FUSE_SETATTR, FUSE_MKDIR, FUSE_LINK,
FUSE_SYMLINK, FUSE_MKNOD, and FUSE_CREATE all return file attributes with a
cache validity period. fusefs will now cache the attributes, if the server
returns a non-zero cache validity period.
This change does _not_ implement finite attr cache timeouts. That will
follow as part of PR 235773.
PR: 235775
Reported by: cem
Sponsored by: The FreeBSD Foundation
Don't page fault if the file descriptor provided with "-o fd" is invalid.
This is a merge of r345419 from the projects/fuse2 branch.
Reviewed by: ngie
Tested by: Marek Zarychta <zarychtam@plan-b.pwste.edu.pl>
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19836
VOP_ACCESS was never fully implemented in fusefs. This change:
* Removes the FACCESS_DO_ACCESS flag, which pretty much disabled the whole
vop.
* Removes a quixotic special case for VEXEC on regular files. I don't know
why that was in there.
* Removes another confusing special case for VADMIN.
* Removes the FACCESS_NOCHECKSPY flag. It seemed to be a performance
optimization, but I'm unconvinced that it was a net positive.
* Updates test cases.
This change does NOT implement -o default_permissions. That will be handled
separately.
PR: 236291
Sponsored by: The FreeBSD Foundation
When -o allow_other is not in use, fusefs is supposed to prevent access to
the filesystem by any user other than the one who owns the daemon. Our
fusefs implementation was only enforcing that restriction at the mountpoint
itself. That was usually good enough because lookup usually descends from
the mountpoint. However, there are cases when it doesn't, such as when
using openat relative to a file beneath the mountpoint.
PR: 237052
Sponsored by: The FreeBSD Foundation
If a fuse file system returne FOPEN_KEEP_CACHE in the open or create
response, then the client is supposed to _not_ clear its caches for that
file. I don't know why clearing the caches would be the default given that
there's a separate flag to bypass the cache altogether, but that's the way
it is. fusefs(5) will now honor this flag.
Our behavior is slightly different than Linux's because we reuse file
handles. That means that open(2) wont't clear the cache if there's a
reusable file handle, even if the file server wouldn't have sent
FOPEN_KEEP_CACHE had we opened a new file handle like Linux does.
PR: 236560
Sponsored by: The FreeBSD Foundation
This bug was long present, but was exacerbated by r345876.
The problem is that fiov_refresh was bzero()ing a buffer _before_ it
reallocated that buffer. That's obviously the wrong order. I fixed the
order in r345876, which exposed the main problem. Previously, the first 160
bytes of the buffer were getting bzero()ed when it was first allocated in
fiov_init. Subsequently, as that buffer got recycled between callers, the
portion used by the _previous_ caller was getting bzero()ed by the current
caller in fiov_refresh. The problem was never visible simply because no
caller was trying to use more than 160 bytes.
Now the buffer gets properly bzero()ed both at initialization time and any
time it gets enlarged or reallocated.
Sponsored by: The FreeBSD Foundation
If a FUSE daemon returns FOPEN_DIRECT_IO when a file is opened, then it's
allowed to write less data than was requested during a FUSE_WRITE operation
on that file handle. fusefs should simply return a short write to userland.
The old code attempted to resend the unsent data. Not only was that
incorrect behavior, but it did it in an ineffective way, by attempting to
"rewind" the uio and uiomove the unsent data again.
This commit correctly handles short writes by returning directly to
userland if FOPEN_DIRECT_IO was set. If it wasn't set (making the short
write technically a protocol violation), then we resend the unsent data.
But instead of rewinding the uio, just resend the data that's already in the
kernel.
That necessitated a few changes to fuse_ipc.c to reduce the amount of bzero
activity. fusefs may be marginally faster as a result.
PR: 236381
Sponsored by: The FreeBSD Foundation
The original fusefs import, r238402, contained a bug in fuse_vnop_close that
could close a directory's file handle while there were still other open file
descriptors. The code looks deliberate, but there is no explanation for it.
This necessitated a workaround in fuse_vnop_readdir that would open a new
file handle if, "for some mysterious reason", that vnode didn't have any
open file handles. r345781 had the effect of causing the workaround to
panic, making the problem more visible.
This commit removes the workaround and the original bug, which also fixes
the panic.
Sponsored by: The FreeBSD Foundation
The FUSE protocol says that FUSE_FLUSH should be send every time a file
descriptor is closed. That's not quite possible in FreeBSD because multiple
file descriptors can share a single struct file, and closef doesn't call
fo_close until the last close. However, we can still send FUSE_FLUSH on
every VOP_CLOSE, which is probably good enough.
There are two purposes for FUSE_FLUSH. One is to allow file systems to
return EIO if they have an error when writing data that's cached
server-side. The other is to release POSIX file locks (which fusefs(5) does
not yet support).
PR: 236405, 236327
Sponsored by: The FreeBSD Foundation
During truncate, fusefs was discarding entire cached blocks, but it wasn't
zeroing out the unused portion of a final partial block. This resulted in
reads returning stale data.
PR: 233783
Reported by: fsx
Sponsored by: The FreeBSD Foundation
This commit cleans up after recent commits, especially 345766, 345768, and
345781. There is no functional change. The most important change is to add
comments documenting why we can't send flags like O_APPEND in
FUSE_WRITE_OPEN.
PR: 236340
Sponsored by: The FreeBSD Foundation
By default, FUSE performs authorization in the server. That means that it's
insecure for the client to reuse FUSE file handles between different users,
groups, or processes. Linux handles this problem by creating a different
FUSE file handle for every file descriptor. FreeBSD can't, due to
differences in our VFS design.
This commit adds credential information to each fuse_filehandle. During
open(2), fusefs will now only reuse a file handle if it matches the exact
same access mode, pid, uid, and gid of the calling process.
PR: 236844
Sponsored by: The FreeBSD Foundation
O_EXEC is useful for fexecve(2) and fchdir(2). Treat it as another fufh
type alongside the existing RDONLY, WRONLY, and RDWR. Prior to r345742 this
would've caused a memory and performance penalty.
PR: 236329
Sponsored by: The FreeBSD Foundation
r345742 replaced fusefs's fufh array with a fufh list. But it left a few
array idioms in place. This commit replaces those idioms with more
efficient list idioms. One location is in fuse_filehandle_close, which now
takes a pointer argument. Three other locations are places that had to loop
over all of a vnode's fuse filehandles.
Sponsored by: The FreeBSD Foundation
The FUSE protocol allows each open file descriptor to have a unique file
handle. On FreeBSD, these file handles must all be stored in the vnode.
The old method (also used by OSX and OpenBSD) is to store them all in a
small array. But that limits the total number that can be stored. This
commit replaces the array with a linked list (a technique also used by
Illumos). There is not yet any change in functionality, but this is the
first step to fixing several bugs.
PR: 236329, 236340, 236381, 236560, 236844
Discussed with: cem
Sponsored by: The FreeBSD Foundation
Previously fusefs would treat any file opened O_WRONLY as though the
FOPEN_DIRECT_IO flag were set, in an attempt to avoid issuing reads as part
of a RMW write operation on a cached part of the file. However, the FUSE
protocol explicitly allows reads of write-only files for precisely that
reason.
Sponsored by: The FreeBSD Foundation
fuse(4) was heavily instrumented with debug printf statements that could
only be enabled with compile-time flags. They fell into three basic groups:
1. Totally redundant with dtrace FBT probes. These I deleted.
2. Print textual information, usually error messages. These I converted to
SDT probes of the form fuse:fuse:FILE:trace. They work just like the old
printf statements except they can be enabled at runtime with dtrace. They
can be filtered by FILE and/or by priority.
3. More complicated probes that print detailed information. These I
converted into ad-hoc SDT probes.
Also, de-inline fuse_internal_cache_attrs. It's big enough to be a regular
function, and this way it gets a dtrace FBT probe.
This commit is a merge of r345304, r344914, r344703, and r344664 from
projects/fuse2.
Reviewed by: cem
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19667
The dead code in question was a broken and incomplete attempt to support the
default_permissions mount option during VOP_SETATTR. There wasn't anything
there worth saving; I'll have to rewrite it later.
Reported by: Coverity
Coverity CID: 1008668
Sponsored by: The FreeBSD Foundation
fuse_vnop_create must close the newly created file if it can't allocate a
vnode. When it does so, it must use the same file flags for FUSE_RELEASE as
it used for FUSE_OPEN or FUSE_CREATE.
Reported by: Coverity
Coverity CID: 1066204
Sponsored by: The FreeBSD Foundation
This change also inlines several previously #define'd symbols that didn't
really have the meanings indicated by the comments.
Sponsored by: The FreeBSD Foundation
If a FUSE filesystem returns ENOSYS for FUSE_CREATE, then fallback to
FUSE_MKNOD/FUSE_OPEN.
Also, fix a memory leak in the error path of fuse_vnop_create. And do a
little cleanup in fuse_vnop_open.
PR: 199934
Reported by: samm@os2.kiev.ua
Sponsored by: The FreeBSD Foundation
The FUSE protocol allows for LOOKUP to return a cacheable negative response,
which means that the file doesn't exist and the kernel can cache its
nonexistence. As of this commit fusefs doesn't cache the nonexistence, but
it does correctly handle such responses. Prior to this commit attempting to
create a file, even with O_CREAT would fail with ENOENT if the daemon
returned a cacheable negative response.
PR: 236231
Sponsored by: The FreeBSD Foundation
For an unknown reason, fusefs was _always_ sending the fdatasync operation
instead of fsync. Now it correctly sends one or the other.
Also, remove the Fsync.fsync_metadata_only test, along with the recently
removed Fsync.nop. They should never have been added. The kernel shouldn't
keep track of which files have dirty data; that's the daemon's job.
PR: 236473
Sponsored by: The FreeBSD Foundation
I committed too hastily in r345390. There are cases, not directly reachable
from userland, where VOP_FSYNC ought to be asynchronous. This commit fixes
fusefs to handle VOP_FSYNC synchronously if and only if the VFS requests it.
PR: 236474
X-MFC-With: 345390
Sponsored by: The FreeBSD Foundation
This makes it more consistent with other filesystems, which all end in "fs",
and more consistent with its mount helper, which is already named
"mount_fusefs".
Reviewed by: cem, rgrimes
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19649
I missed these in r344664. They're basically useless because they can only
be controlled at compile-time. Also, de-inline fuse_internal_cache_attrs.
It's big enough to be a regular function, and this way it gets a dtrace FBT
probe.
Sponsored by: The FreeBSD Foundation
On GENERIC kernels with empty loader.conf, there is no functional change.
DFLTPHYS and MAXBSIZE are both 64kB at the moment. This change allows
larger bufcache block sizes to be used when either MAXBSIZE (custom kernel)
or the loader.conf tunable vfs.maxbcachebuf (GENERIC) is adjusted higher
than the default.
Suggested by: ken@
When open(2) was invoked against a FUSE filesystem with an unexpected flags
value (no O_RDONLY / O_RDWR / O_WRONLY), an assertion fired, causing panic.
For now, prevent the panic by rejecting such VOP_OPENs with EINVAL.
This is not considered the correct long term fix, but does prevent an
unprivileged denial-of-service.
PR: 236329
Reported by: asomers
Reviewed by: asomers
Sponsored by: Dell EMC Isilon
fuse(4) was heavily instrumented with debug printf statements that could
only be enabled with compile-time flags. They fell into three basic groups:
1) Totally redundant with dtrace FBT probes. These I deleted.
2) Print textual information, usually error messages. These I converted to
SDT probes of the form fuse:fuse:FILE:trace. They work just like the old
printf statements except they can be enabled at runtime with dtrace.
They can be filtered by FILE and/or by priority.
3) More complicated probes that print detailed information. These I
converted into ad-hoc SDT probes.
Sponsored by: The FreeBSD Foundation
On systems with non-default DFLTPHYS and/or MAXBSIZE, FUSE would attempt to
use a buf cache block size in excess of permitted size. This did not affect
most configurations, since DFLTPHYS and MAXBSIZE both default to 64kB.
The issue was discovered and reported using a custom kernel with a DFLTPHYS
of 512kB.
PR: 230260 (comment #9)
Reported by: ken@
MFC after: π/𝑒 weeks
Take a pass through fixing some of the most egregious whitespace issues in
fs/fuse. Also fix some style(9) warts while here. Not 100% cleaned up, but
somewhat less painful to look at and edit.
No functional change.
The cached fvdat->filesize is indepedent of the (mostly unused)
cached_attrs, and we failed to update it when a cached (but perhaps
inactive) vnode was found during VOP_LOOKUP to have a different size than
cached.
As noted in the code comment, this can occur in distributed filesystems or
with other kinds of irregular file behavior (anything is possible in FUSE).
We do something similar in fuse_vnop_getattr already.
PR: 230258 (as reported in description; other issues explored in
comments are not all resolved)
Reported by: MooseFS FreeBSD Team <freebsd AT moosefs.com>
Submitted by: Jakub Kruszona-Zawadzki <acid AT moosefs.com> (earlier version)
At least prior to 7.23 (which adds FUSE_WRITEBACK_CACHE), the FUSE protocol
specifies only clean data to be cached.
Prior to this change, we implement and default to writeback caching. This
is ok enough for local only filesystems without hardlinks, but violates the
general design contract with FUSE and breaks distributed filesystems or
concurrent access to hardlinks of the same inode.
In this change, add cache mode as an extension of cache enable/disable. The
new modes are UC (was: cache disabled), WT (default), and WB (was: cache
enabled).
For now, WT caching is implemented as write-around, which meets the goal of
only caching clean data. WT can be better than WA for workloads that
frequently read data that was recently written, but WA is trivial to
implement. Note that this has no effect on O_WRONLY-opened files, which
were already coerced to write-around.
Refs:
* https://sourceforge.net/p/fuse/mailman/message/8902254/
* https://github.com/vgough/encfs/issues/315
PR: 230258 (inspired by)
Most users of fuse_vnode_setsize() set the cached fvdat->filesize and update
the buf cache bounds as a result of either a read from the underlying FUSE
filesystem, or as part of a write-through type operation (like truncate =>
VOP_SETATTR). In these cases, do not set the FN_SIZECHANGE flag, which
indicates that an inode's data is dirty (in particular, that the local buf
cache and fvdat->filesize have dirty extended data).
PR: 230258 (related)
The FUSE protocol demands that kernel implementations cache user filesystem
path components (lookup/cnp data) for a maximum period of time in the range
of [0, ULONG_MAX] seconds. In practice, typical requests are for 0, 1, or
10 seconds; or "a long time" to represent indefinite caching.
Historically, FreeBSD FUSE has ignored this client directive entirely. This
works fine for local-only filesystems, but causes consistency issues with
multi-writer network filesystems.
For now, respect 0 second cache TTLs and do not cache such metadata.
Non-zero metadata caching TTLs in the range [0.000000001, ULONG_MAX] seconds
are still cached indefinitely, because it is unclear how a userspace
filesystem could do anything sensible with those semantics even if
implemented.
Pass fuse_entry_out to fuse_vnode_get when available and only cache lookup
if the user filesystem did not set a zero second TTL.
PR: 230258 (inspired by; does not fix)
The FUSE protocol demands that kernel implementations cache user filesystem
file attributes (vattr data) for a maximum period of time in the range of
[0, ULONG_MAX] seconds. In practice, typical requests are for 0, 1, or 10
seconds; or "a long time" to represent indefinite caching.
Historically, FreeBSD FUSE has ignored this client directive entirely. This
works fine for local-only filesystems, but causes consistency issues with
multi-writer network filesystems.
For now, respect 0 second cache TTLs and do not cache such metadata.
Non-zero metadata caching TTLs in the range [0.000000001, ULONG_MAX] seconds
are still cached indefinitely, because it is unclear how a userspace
filesystem could do anything sensible with those semantics even if
implemented.
In the future, as an optimization, we should implement notify_inval_entry,
etc, which provide userspace filesystems a way of evicting the kernel cache.
One potentially bogus access to invalid cached attribute data was left in
fuse_io_strategy. It is restricted behind the undocumented and non-default
"vfs.fuse.fix_broken_io" sysctl or "brokenio" mount option; maybe these are
deadcode and can be eliminated?
Some minor APIs changed to facilitate this:
1. Attribute cache validity is tracked in FUSE inodes ("fuse_vnode_data").
2. cache_attrs() respects the provided TTL and only caches in the FUSE
inode if TTL > 0. It also grows an "out" argument, which, if non-NULL,
stores the translated fuse_attr (even if not suitable for caching).
3. FUSE VTOVA(vp) returns NULL if the vnode's cache is invalid, to help
avoid programming mistakes.
4. A VOP_LINK check for potential nlink overflow prior to invoking the FUSE
link op was weakened (only performed when we have a valid attr cache). The
check is racy in a multi-writer network filesystem anyway -- classic TOCTOU.
We have to trust any userspace filesystem that rejects local caching to
account for it correctly.
PR: 230258 (inspired by; does not fix)
o In vm_pager_bufferinit() create pbuf_zone and start accounting on how many
pbufs are we going to have set.
In various subsystems that are going to utilize pbufs create private zones
via call to pbuf_zsecond_create(). The latter calls uma_zsecond_create(),
and sets a limit on created zone. After startup preallocate pbufs according
to requirements of all pbuf zones.
Subsystems that used to have a private limit with old allocator now have
private pbuf zones: md(4), fusefs, NFS client, smbfs, VFS cluster, FFS,
swap, vnode pager.
The following subsystems use shared pbuf zone: cam(4), nvme(4), physio(9),
aio(4). They should have their private limits, but changing that is out of
scope of this commit.
o Fetch tunable value of kern.nswbuf from init_param2() and while here move
NSWBUF_MIN to opt_param.h and eliminate opt_swap.h, that was holding only
this option.
Default values aren't touched by this commit, but they probably should be
reviewed wrt to modern hardware.
This change removes a tight bottleneck from sendfile(2) operation, that
uses pbufs in vnode pager. Other pagers also would benefit from faster
allocation.
Together with: gallatin
Tested by: pho
Directory entries must be padded to maintain alignment; in many
filesystems the padding was not initialized, resulting in stack
memory being copied out to userspace. With the ino64 work there
are also some explicit pad fields in struct dirent. Add a subroutine
to clear these bytes and use it in the in-tree filesystems. The
NFS client is omitted for now as it was fixed separately in r340787.
Reported by: Thomas Barabosch, Fraunhofer FKIE
Reviewed by: kib
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
This will enable callers to take const paths as part of syscall
decleration improvements.
Where doing so is easy and non-distruptive carry the const through
implementations. In UFS the value is passed to an interface that must
take non-const values. In ZFS, const poisoning would touch code shared
with upstream and it's not worth adding diffs.
Bump __FreeBSD_version for external API consumers.
Reviewed by: kib (prior version)
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D17805
The requested size was returned incorrectly in case uio == NULL from listextattr because the
nameprefix/name conversion was not applied.
Also, make a_size/uio returning logic more unified with other filesystems.
Reviewed by: cem, pfg
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D13528
So that I don't have to keep grepping around the codebase to remember what each
one does. And maybe it saves someone else some time.
Fix a trivial whitespace issue while here.
No functional change.
Sponsored by: Dell EMC Isilon
Various components restrict size of IO passed up to the userspace filesystem
based on the mount's f_iosize value. The previous default of PAGE_SIZE
is anemic, even for normal filesystems, but especially considering every
FUSE operation involves a kernel <-> userspace IPC upcall.
Bump to DFLTPHYS (currently 64kB) to match other FUSE implementations.
Anecdotally, Jakub reports IO read performance increased from 600 MB/s ->
2700 MB/s with a basic RAM-backed FUSE filesystem.
PR: 230260
Reported by: Peter (MooseFS) <freebsd AT moosefs.com>
Tested by: Jakub Kruszona-Zawadzki <acid AT moosefs.com>
MFC after: 3 days
- Add macros to allow preinitialization of cap_rights_t.
- Convert most commonly used code paths to use preinitialized cap_rights_t.
A 3.6% speedup in fstat was measured with this change.
Reported by: mjg
Reviewed by: oshogbo
Approved by: sbruno
MFC after: 1 month
Set FUSE_LINK_MAX to UINT32_MAX instead of LINK_MAX to match the maximum
link count possible in the 'nlink' field of 'struct fuse_attr'.
Sponsored by: Chelsio Communications
Having all filesystems fall through to default values isn't always correct
and these values can vary for different filesystem implementations. Most
of these changes just use the existing default values with a few exceptions:
- Don't report CHOWN_RESTRICTED for ZFS since it doesn't do the exact
permissions check this claims for chown().
- Use NANDFS_NAME_LEN for NAME_MAX for nandfs.
- Don't report a LINK_MAX of 0 on smbfs. Now fail with EINVAL to
indicate hard links aren't supported.
Requested by: bde (though perhaps not this exact implementation)
Reviewed by: kib (earlier version)
MFC after: 1 month
Sponsored by: Chelsio Communications
This method handles _PC_FILESIZEBITS, _PC_SYMLINK_MAX, and _PC_NO_TRUNC.
For other values it defers to vop_stdpathconf().
MFC after: 1 month
Sponsored by: Chelsio Communications
Mainly focus on files that use BSD 3-Clause license.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
This finishes what r245164 started and makes open(..., O_APPEND) work again
after r299753.
- Pass ioflags, incl. IO_APPEND, down to the direct write backend (r245164
added it to only the bio backend).
- (r299753 changed the WRONLY backend from bio to direct.)
PR: 220185
Reported by: Ben RUBSON <ben.rubson at gmail.com>
Reviewed by: bapt@, rmacklem@
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D11348
Implement FUSE open flag FOPEN_KEEP_CACHE. Without this flag, cached file
contents should be invalidated on open. Apparently, fusefs-encfs relies
upon this behavior.
PR: 218636
Submitted by: Ben RUBSON <ben.rubson at gmail.com>
in place. To do per-cpu stats, convert all fields that previously were
maintained in the vmmeters that sit in pcpus to counter(9).
- Since some vmmeter stats may be touched at very early stages of boot,
before we have set up UMA and we can do counter_u64_alloc(), provide an
early counter mechanism:
o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter.
o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter,
so that at early stages of boot, before counters are allocated we already
point to a counter that can be safely written to.
o For sparc64 that required a whole dummy pcpu[MAXCPU] array.
Further related changes:
- Don't include vmmeter.h into pcpu.h.
- vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit,
to match kernel representation.
- struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exclusion.
This is based on benno@'s 4-year old patch:
https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html
Reviewed by: kib, gallatin, marius, lidl
Differential Revision: https://reviews.freebsd.org/D10156
on a fuse mounted file system, it will crash. Although it may be
possible to make this work correctly, this patch avoids the crash
in the meantime.
I removed the MPASS(), since panicing for the FIFO case didn't make
a lot of sense when it returns an error for the others.
PR: 195000
Submitted by: henry.hu.sh@gmail.com (earlier version)
MFC after: 2 weeks
When "cp" of a file with read-only (mode 0444) to a fuse mounted
file system was attempted it would fail with EACCES. This was because
fuse would attempt to open the file WRONLY and the open would fail.
This patch changes the fuse_vnop_open() to test for an extant read-write
open and use that, if it is available.
This makes the "cp" of a read-only file to the fuse mounted file system
work ok.
There are simpler ways to fix this than adding the fuse_filehandle_validrw()
function, but this function is useful for future patches related to
exporting a fuse filesystem via NFS.
MFC after: 2 weeks
When I/O on a file under fuse is switched from buffered to DIRECT_IO,
it was possible to read stale (before a recent modification) data from
the buffer cache. This patch invalidates the buffer cache for the
file to fix this.
PR: 194293
MFC after: 2 weeks
When a file is opened write-only and a partial block was written,
buffered I/O would try and read the whole block in. This would
result in a hung thread, since there was no open (fuse filehandle)
that allowed reading. This patch avoids the problem by forcing
DIRECT_IO for this case.
It also sets DIRECT_IO when the file system specifies the FN_DIRECTIO
flag in its reply to the open.
Tested by: nishida@asusa.net, freebsd@moosefs.com
PR: 194293, 206238
MFC after: 2 weeks
With the new VOP_GETPAGES() KPI the "count" argument counts pages already,
and doesn't need to be translated from bytes to pages.
While here make it consistent that *rbehind and *rahead are updated only
if we doesn't return error.
Pointy hat to: glebius
o With new KPI consumers can request contiguous ranges of pages, and
unlike before, all pages will be kept busied on return, like it was
done before with the 'reqpage' only. Now the reqpage goes away. With
new interface it is easier to implement code protected from race
conditions.
Such arrayed requests for now should be preceeded by a call to
vm_pager_haspage() to make sure that request is possible. This
could be improved later, making vm_pager_haspage() obsolete.
Strenghtening the promises on the business of the array of pages
allows us to remove such hacks as swp_pager_free_nrpage() and
vm_pager_free_nonreq().
o New KPI accepts two integer pointers that may optionally point at
values for read ahead and read behind, that a pager may do, if it
can. These pages are completely owned by pager, and not controlled
by the caller.
This shifts the UFS-specific readahead logic from vm_fault.c, which
should be file system agnostic, into vnode_pager.c. It also removes
one VOP_BMAP() request per hard fault.
Discussed with: kib, alc, jeff, scottl
Sponsored by: Nginx, Inc.
Sponsored by: Netflix
use VOP_FSYNC() to perform the NFS server's Commit operation.
This patch adds a mnt_kern_flag called MNTK_USES_BCACHE which
is set by file systems that use the buffer cache. If this flag
is not set, the NFS server always does a VOP_FSYNC().
This should be ok for old file system modules that do not set
MNTK_USES_BCACHE, since calling VOP_FSYNC() is correct, although
it might not be optimal for file systems that use the buffer cache.
Reviewed by: kib
MFC after: 2 weeks
into namecache, to avoid cache trashing when doing large operations.
E.g., tar archive extraction is not usually followed by access to many
of the files created.
Right now, each VOP_LOOKUP() implementation explicitely knowns about
this quirk and tests for both MAKEENTRY flag presence and op != CREATE
to make the call to cache_enter(). Centralize the handling of the
quirk into VFS, by deciding to cache only by MAKEENTRY flag in VOP.
VFS now sets NOCACHE flag for CREATE namei() calls.
Note that the change in semantic is backward-compatible and could be
merged to the stable branch, and is compatible with non-changed
third-party filesystems which correctly handle MAKEENTRY.
Suggested by: Chris Torek <torek@pi-coral.com>
Reviewed by: mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
- Wrong integer type was specified.
- Wrong or missing "access" specifier. The "access" specifier
sometimes included the SYSCTL type, which it should not, except for
procedural SYSCTL nodes.
- Logical OR where binary OR was expected.
- Properly assert the "access" argument passed to all SYSCTL macros,
using the CTASSERT macro. This applies to both static- and dynamically
created SYSCTLs.
- Properly assert the the data type for both static and dynamic
SYSCTLs. In the case of static SYSCTLs we only assert that the data
pointed to by the SYSCTL data pointer has the correct size, hence
there is no easy way to assert types in the C language outside a
C-function.
- Rewrote some code which doesn't pass a constant "access" specifier
when creating dynamic SYSCTL nodes, which is now a requirement.
- Updated "EXAMPLES" section in SYSCTL manual page.
MFC after: 3 days
Sponsored by: Mellanox Technologies
further refinement is required as some device drivers intended to be
portable over FreeBSD versions rely on __FreeBSD_version to decide whether
to include capability.h.
MFC after: 3 weeks
in the future in a backward compatible (API and ABI) way.
The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides place for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.
The structure definition looks like this:
struct cap_rights {
uint64_t cr_rights[CAP_RIGHTS_VERSION + 2];
};
The initial CAP_RIGHTS_VERSION is 0.
The top two bits in the first element of the cr_rights[] array contain total
number of elements in the array - 2. This means if those two bits are equal to
0, we have 2 array elements.
The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain array index. Only one bit is
used and bit position in this five-bits range defines array index. This means
there can be at most five array elements in the future.
To define new right the CAPRIGHT() macro must be used. The macro takes two
arguments - an array index and a bit to set, eg.
#define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL)
We still support aliases that combine few rights, but the rights have to belong
to the same array element, eg:
#define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL)
#define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL)
#define CAP_FCHMODAT (CAP_FCHMOD | CAP_LOOKUP)
There is new API to manage the new cap_rights_t structure:
cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
void cap_rights_set(cap_rights_t *rights, ...);
void cap_rights_clear(cap_rights_t *rights, ...);
bool cap_rights_is_set(const cap_rights_t *rights, ...);
bool cap_rights_is_valid(const cap_rights_t *rights);
void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);
Capability rights to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions are provided by
separating them with commas, eg:
cap_rights_t rights;
cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);
There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:
#define cap_rights_set(rights, ...) \
__cap_rights_set((rights), __VA_ARGS__, 0ULL)
void __cap_rights_set(cap_rights_t *rights, ...);
Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:
cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);
Providing several rights that belongs to the same array's element this way is
correct, but is not advised. It should only be used for aliases definition.
This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.
Sponsored by: The FreeBSD Foundation
Unify the 2 concept into a real, minimal, sxlock where the shared
acquisition represent the soft busy and the exclusive acquisition
represent the hard busy.
The old VPO_WANTED mechanism becames the hard-path for this new lock
and it becomes per-page rather than per-object.
The vm_object lock becames an interlock for this functionality:
it can be held in both read or write mode.
However, if the vm_object lock is held in read mode while acquiring
or releasing the busy state, the thread owner cannot make any
assumption on the busy state unless it is also busying it.
Also:
- Add a new flag to directly shared busy pages while vm_page_alloc
and vm_page_grab are being executed. This will be very helpful
once these functions happen under a read object lock.
- Move the swapping sleep into its own per-object flag
The KPI is heavilly changed this is why the version is bumped.
It is very likely that some VM ports users will need to change
their own code.
Sponsored by: EMC / Isilon storage division
Discussed with: alc
Reviewed by: jeff, kib
Tested by: gavin, bapt (older version)
Tested by: pho, scottl