Don't page fault if the file descriptor provided with "-o fd" is invalid.
This is a merge of r345419 from the projects/fuse2 branch.
Reviewed by: ngie
Tested by: Marek Zarychta <zarychtam@plan-b.pwste.edu.pl>
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19836
fuse(4) was heavily instrumented with debug printf statements that could
only be enabled with compile-time flags. They fell into three basic groups:
1. Totally redundant with dtrace FBT probes. These I deleted.
2. Print textual information, usually error messages. These I converted to
SDT probes of the form fuse:fuse:FILE:trace. They work just like the old
printf statements except they can be enabled at runtime with dtrace. They
can be filtered by FILE and/or by priority.
3. More complicated probes that print detailed information. These I
converted into ad-hoc SDT probes.
Also, de-inline fuse_internal_cache_attrs. It's big enough to be a regular
function, and this way it gets a dtrace FBT probe.
This commit is a merge of r345304, r344914, r344703, and r344664 from
projects/fuse2.
Reviewed by: cem
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19667
This makes it more consistent with other filesystems, which all end in "fs",
and more consistent with its mount helper, which is already named
"mount_fusefs".
Reviewed by: cem, rgrimes
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19649
I missed these in r344664. They're basically useless because they can only
be controlled at compile-time. Also, de-inline fuse_internal_cache_attrs.
It's big enough to be a regular function, and this way it gets a dtrace FBT
probe.
Sponsored by: The FreeBSD Foundation
On GENERIC kernels with empty loader.conf, there is no functional change.
DFLTPHYS and MAXBSIZE are both 64kB at the moment. This change allows
larger bufcache block sizes to be used when either MAXBSIZE (custom kernel)
or the loader.conf tunable vfs.maxbcachebuf (GENERIC) is adjusted higher
than the default.
Suggested by: ken@
When open(2) was invoked against a FUSE filesystem with an unexpected flags
value (no O_RDONLY / O_RDWR / O_WRONLY), an assertion fired, causing panic.
For now, prevent the panic by rejecting such VOP_OPENs with EINVAL.
This is not considered the correct long term fix, but does prevent an
unprivileged denial-of-service.
PR: 236329
Reported by: asomers
Reviewed by: asomers
Sponsored by: Dell EMC Isilon
fuse(4) was heavily instrumented with debug printf statements that could
only be enabled with compile-time flags. They fell into three basic groups:
1) Totally redundant with dtrace FBT probes. These I deleted.
2) Print textual information, usually error messages. These I converted to
SDT probes of the form fuse:fuse:FILE:trace. They work just like the old
printf statements except they can be enabled at runtime with dtrace.
They can be filtered by FILE and/or by priority.
3) More complicated probes that print detailed information. These I
converted into ad-hoc SDT probes.
Sponsored by: The FreeBSD Foundation
On systems with non-default DFLTPHYS and/or MAXBSIZE, FUSE would attempt to
use a buf cache block size in excess of permitted size. This did not affect
most configurations, since DFLTPHYS and MAXBSIZE both default to 64kB.
The issue was discovered and reported using a custom kernel with a DFLTPHYS
of 512kB.
PR: 230260 (comment #9)
Reported by: ken@
MFC after: π/𝑒 weeks
Take a pass through fixing some of the most egregious whitespace issues in
fs/fuse. Also fix some style(9) warts while here. Not 100% cleaned up, but
somewhat less painful to look at and edit.
No functional change.
The cached fvdat->filesize is indepedent of the (mostly unused)
cached_attrs, and we failed to update it when a cached (but perhaps
inactive) vnode was found during VOP_LOOKUP to have a different size than
cached.
As noted in the code comment, this can occur in distributed filesystems or
with other kinds of irregular file behavior (anything is possible in FUSE).
We do something similar in fuse_vnop_getattr already.
PR: 230258 (as reported in description; other issues explored in
comments are not all resolved)
Reported by: MooseFS FreeBSD Team <freebsd AT moosefs.com>
Submitted by: Jakub Kruszona-Zawadzki <acid AT moosefs.com> (earlier version)
At least prior to 7.23 (which adds FUSE_WRITEBACK_CACHE), the FUSE protocol
specifies only clean data to be cached.
Prior to this change, we implement and default to writeback caching. This
is ok enough for local only filesystems without hardlinks, but violates the
general design contract with FUSE and breaks distributed filesystems or
concurrent access to hardlinks of the same inode.
In this change, add cache mode as an extension of cache enable/disable. The
new modes are UC (was: cache disabled), WT (default), and WB (was: cache
enabled).
For now, WT caching is implemented as write-around, which meets the goal of
only caching clean data. WT can be better than WA for workloads that
frequently read data that was recently written, but WA is trivial to
implement. Note that this has no effect on O_WRONLY-opened files, which
were already coerced to write-around.
Refs:
* https://sourceforge.net/p/fuse/mailman/message/8902254/
* https://github.com/vgough/encfs/issues/315
PR: 230258 (inspired by)
Most users of fuse_vnode_setsize() set the cached fvdat->filesize and update
the buf cache bounds as a result of either a read from the underlying FUSE
filesystem, or as part of a write-through type operation (like truncate =>
VOP_SETATTR). In these cases, do not set the FN_SIZECHANGE flag, which
indicates that an inode's data is dirty (in particular, that the local buf
cache and fvdat->filesize have dirty extended data).
PR: 230258 (related)
The FUSE protocol demands that kernel implementations cache user filesystem
path components (lookup/cnp data) for a maximum period of time in the range
of [0, ULONG_MAX] seconds. In practice, typical requests are for 0, 1, or
10 seconds; or "a long time" to represent indefinite caching.
Historically, FreeBSD FUSE has ignored this client directive entirely. This
works fine for local-only filesystems, but causes consistency issues with
multi-writer network filesystems.
For now, respect 0 second cache TTLs and do not cache such metadata.
Non-zero metadata caching TTLs in the range [0.000000001, ULONG_MAX] seconds
are still cached indefinitely, because it is unclear how a userspace
filesystem could do anything sensible with those semantics even if
implemented.
Pass fuse_entry_out to fuse_vnode_get when available and only cache lookup
if the user filesystem did not set a zero second TTL.
PR: 230258 (inspired by; does not fix)
The FUSE protocol demands that kernel implementations cache user filesystem
file attributes (vattr data) for a maximum period of time in the range of
[0, ULONG_MAX] seconds. In practice, typical requests are for 0, 1, or 10
seconds; or "a long time" to represent indefinite caching.
Historically, FreeBSD FUSE has ignored this client directive entirely. This
works fine for local-only filesystems, but causes consistency issues with
multi-writer network filesystems.
For now, respect 0 second cache TTLs and do not cache such metadata.
Non-zero metadata caching TTLs in the range [0.000000001, ULONG_MAX] seconds
are still cached indefinitely, because it is unclear how a userspace
filesystem could do anything sensible with those semantics even if
implemented.
In the future, as an optimization, we should implement notify_inval_entry,
etc, which provide userspace filesystems a way of evicting the kernel cache.
One potentially bogus access to invalid cached attribute data was left in
fuse_io_strategy. It is restricted behind the undocumented and non-default
"vfs.fuse.fix_broken_io" sysctl or "brokenio" mount option; maybe these are
deadcode and can be eliminated?
Some minor APIs changed to facilitate this:
1. Attribute cache validity is tracked in FUSE inodes ("fuse_vnode_data").
2. cache_attrs() respects the provided TTL and only caches in the FUSE
inode if TTL > 0. It also grows an "out" argument, which, if non-NULL,
stores the translated fuse_attr (even if not suitable for caching).
3. FUSE VTOVA(vp) returns NULL if the vnode's cache is invalid, to help
avoid programming mistakes.
4. A VOP_LINK check for potential nlink overflow prior to invoking the FUSE
link op was weakened (only performed when we have a valid attr cache). The
check is racy in a multi-writer network filesystem anyway -- classic TOCTOU.
We have to trust any userspace filesystem that rejects local caching to
account for it correctly.
PR: 230258 (inspired by; does not fix)
o In vm_pager_bufferinit() create pbuf_zone and start accounting on how many
pbufs are we going to have set.
In various subsystems that are going to utilize pbufs create private zones
via call to pbuf_zsecond_create(). The latter calls uma_zsecond_create(),
and sets a limit on created zone. After startup preallocate pbufs according
to requirements of all pbuf zones.
Subsystems that used to have a private limit with old allocator now have
private pbuf zones: md(4), fusefs, NFS client, smbfs, VFS cluster, FFS,
swap, vnode pager.
The following subsystems use shared pbuf zone: cam(4), nvme(4), physio(9),
aio(4). They should have their private limits, but changing that is out of
scope of this commit.
o Fetch tunable value of kern.nswbuf from init_param2() and while here move
NSWBUF_MIN to opt_param.h and eliminate opt_swap.h, that was holding only
this option.
Default values aren't touched by this commit, but they probably should be
reviewed wrt to modern hardware.
This change removes a tight bottleneck from sendfile(2) operation, that
uses pbufs in vnode pager. Other pagers also would benefit from faster
allocation.
Together with: gallatin
Tested by: pho
Directory entries must be padded to maintain alignment; in many
filesystems the padding was not initialized, resulting in stack
memory being copied out to userspace. With the ino64 work there
are also some explicit pad fields in struct dirent. Add a subroutine
to clear these bytes and use it in the in-tree filesystems. The
NFS client is omitted for now as it was fixed separately in r340787.
Reported by: Thomas Barabosch, Fraunhofer FKIE
Reviewed by: kib
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
This will enable callers to take const paths as part of syscall
decleration improvements.
Where doing so is easy and non-distruptive carry the const through
implementations. In UFS the value is passed to an interface that must
take non-const values. In ZFS, const poisoning would touch code shared
with upstream and it's not worth adding diffs.
Bump __FreeBSD_version for external API consumers.
Reviewed by: kib (prior version)
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D17805
The requested size was returned incorrectly in case uio == NULL from listextattr because the
nameprefix/name conversion was not applied.
Also, make a_size/uio returning logic more unified with other filesystems.
Reviewed by: cem, pfg
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D13528
So that I don't have to keep grepping around the codebase to remember what each
one does. And maybe it saves someone else some time.
Fix a trivial whitespace issue while here.
No functional change.
Sponsored by: Dell EMC Isilon
Various components restrict size of IO passed up to the userspace filesystem
based on the mount's f_iosize value. The previous default of PAGE_SIZE
is anemic, even for normal filesystems, but especially considering every
FUSE operation involves a kernel <-> userspace IPC upcall.
Bump to DFLTPHYS (currently 64kB) to match other FUSE implementations.
Anecdotally, Jakub reports IO read performance increased from 600 MB/s ->
2700 MB/s with a basic RAM-backed FUSE filesystem.
PR: 230260
Reported by: Peter (MooseFS) <freebsd AT moosefs.com>
Tested by: Jakub Kruszona-Zawadzki <acid AT moosefs.com>
MFC after: 3 days
- Add macros to allow preinitialization of cap_rights_t.
- Convert most commonly used code paths to use preinitialized cap_rights_t.
A 3.6% speedup in fstat was measured with this change.
Reported by: mjg
Reviewed by: oshogbo
Approved by: sbruno
MFC after: 1 month
Set FUSE_LINK_MAX to UINT32_MAX instead of LINK_MAX to match the maximum
link count possible in the 'nlink' field of 'struct fuse_attr'.
Sponsored by: Chelsio Communications
Having all filesystems fall through to default values isn't always correct
and these values can vary for different filesystem implementations. Most
of these changes just use the existing default values with a few exceptions:
- Don't report CHOWN_RESTRICTED for ZFS since it doesn't do the exact
permissions check this claims for chown().
- Use NANDFS_NAME_LEN for NAME_MAX for nandfs.
- Don't report a LINK_MAX of 0 on smbfs. Now fail with EINVAL to
indicate hard links aren't supported.
Requested by: bde (though perhaps not this exact implementation)
Reviewed by: kib (earlier version)
MFC after: 1 month
Sponsored by: Chelsio Communications
This method handles _PC_FILESIZEBITS, _PC_SYMLINK_MAX, and _PC_NO_TRUNC.
For other values it defers to vop_stdpathconf().
MFC after: 1 month
Sponsored by: Chelsio Communications
Mainly focus on files that use BSD 3-Clause license.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
This finishes what r245164 started and makes open(..., O_APPEND) work again
after r299753.
- Pass ioflags, incl. IO_APPEND, down to the direct write backend (r245164
added it to only the bio backend).
- (r299753 changed the WRONLY backend from bio to direct.)
PR: 220185
Reported by: Ben RUBSON <ben.rubson at gmail.com>
Reviewed by: bapt@, rmacklem@
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D11348
Implement FUSE open flag FOPEN_KEEP_CACHE. Without this flag, cached file
contents should be invalidated on open. Apparently, fusefs-encfs relies
upon this behavior.
PR: 218636
Submitted by: Ben RUBSON <ben.rubson at gmail.com>
in place. To do per-cpu stats, convert all fields that previously were
maintained in the vmmeters that sit in pcpus to counter(9).
- Since some vmmeter stats may be touched at very early stages of boot,
before we have set up UMA and we can do counter_u64_alloc(), provide an
early counter mechanism:
o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter.
o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter,
so that at early stages of boot, before counters are allocated we already
point to a counter that can be safely written to.
o For sparc64 that required a whole dummy pcpu[MAXCPU] array.
Further related changes:
- Don't include vmmeter.h into pcpu.h.
- vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit,
to match kernel representation.
- struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exclusion.
This is based on benno@'s 4-year old patch:
https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html
Reviewed by: kib, gallatin, marius, lidl
Differential Revision: https://reviews.freebsd.org/D10156
on a fuse mounted file system, it will crash. Although it may be
possible to make this work correctly, this patch avoids the crash
in the meantime.
I removed the MPASS(), since panicing for the FIFO case didn't make
a lot of sense when it returns an error for the others.
PR: 195000
Submitted by: henry.hu.sh@gmail.com (earlier version)
MFC after: 2 weeks
When "cp" of a file with read-only (mode 0444) to a fuse mounted
file system was attempted it would fail with EACCES. This was because
fuse would attempt to open the file WRONLY and the open would fail.
This patch changes the fuse_vnop_open() to test for an extant read-write
open and use that, if it is available.
This makes the "cp" of a read-only file to the fuse mounted file system
work ok.
There are simpler ways to fix this than adding the fuse_filehandle_validrw()
function, but this function is useful for future patches related to
exporting a fuse filesystem via NFS.
MFC after: 2 weeks
When I/O on a file under fuse is switched from buffered to DIRECT_IO,
it was possible to read stale (before a recent modification) data from
the buffer cache. This patch invalidates the buffer cache for the
file to fix this.
PR: 194293
MFC after: 2 weeks
When a file is opened write-only and a partial block was written,
buffered I/O would try and read the whole block in. This would
result in a hung thread, since there was no open (fuse filehandle)
that allowed reading. This patch avoids the problem by forcing
DIRECT_IO for this case.
It also sets DIRECT_IO when the file system specifies the FN_DIRECTIO
flag in its reply to the open.
Tested by: nishida@asusa.net, freebsd@moosefs.com
PR: 194293, 206238
MFC after: 2 weeks
With the new VOP_GETPAGES() KPI the "count" argument counts pages already,
and doesn't need to be translated from bytes to pages.
While here make it consistent that *rbehind and *rahead are updated only
if we doesn't return error.
Pointy hat to: glebius
o With new KPI consumers can request contiguous ranges of pages, and
unlike before, all pages will be kept busied on return, like it was
done before with the 'reqpage' only. Now the reqpage goes away. With
new interface it is easier to implement code protected from race
conditions.
Such arrayed requests for now should be preceeded by a call to
vm_pager_haspage() to make sure that request is possible. This
could be improved later, making vm_pager_haspage() obsolete.
Strenghtening the promises on the business of the array of pages
allows us to remove such hacks as swp_pager_free_nrpage() and
vm_pager_free_nonreq().
o New KPI accepts two integer pointers that may optionally point at
values for read ahead and read behind, that a pager may do, if it
can. These pages are completely owned by pager, and not controlled
by the caller.
This shifts the UFS-specific readahead logic from vm_fault.c, which
should be file system agnostic, into vnode_pager.c. It also removes
one VOP_BMAP() request per hard fault.
Discussed with: kib, alc, jeff, scottl
Sponsored by: Nginx, Inc.
Sponsored by: Netflix