Specifically, this allows (via "-V vhostname") telling nfsd what principal
to use, instead of the hostname. This is used at iXsystems for fail-over in
HA systems.
Reviewed by: macklem
Sponsored by: iXsystems Inc.
Differential Revision: https://reviews.freebsd.org/D19191
back to the lever before r343030. For 64-bit machines reduce it slightly,
too. Together with r343030 I bumped the limit up to the value we use at
Netflix to serve 100 Gbit/s of sendfile traffic, and it probably isn't a
good default.
Provide a loader tunable to change vnode pager pbufs count. Document it.
The cached fvdat->filesize is indepedent of the (mostly unused)
cached_attrs, and we failed to update it when a cached (but perhaps
inactive) vnode was found during VOP_LOOKUP to have a different size than
cached.
As noted in the code comment, this can occur in distributed filesystems or
with other kinds of irregular file behavior (anything is possible in FUSE).
We do something similar in fuse_vnop_getattr already.
PR: 230258 (as reported in description; other issues explored in
comments are not all resolved)
Reported by: MooseFS FreeBSD Team <freebsd AT moosefs.com>
Submitted by: Jakub Kruszona-Zawadzki <acid AT moosefs.com> (earlier version)
At least prior to 7.23 (which adds FUSE_WRITEBACK_CACHE), the FUSE protocol
specifies only clean data to be cached.
Prior to this change, we implement and default to writeback caching. This
is ok enough for local only filesystems without hardlinks, but violates the
general design contract with FUSE and breaks distributed filesystems or
concurrent access to hardlinks of the same inode.
In this change, add cache mode as an extension of cache enable/disable. The
new modes are UC (was: cache disabled), WT (default), and WB (was: cache
enabled).
For now, WT caching is implemented as write-around, which meets the goal of
only caching clean data. WT can be better than WA for workloads that
frequently read data that was recently written, but WA is trivial to
implement. Note that this has no effect on O_WRONLY-opened files, which
were already coerced to write-around.
Refs:
* https://sourceforge.net/p/fuse/mailman/message/8902254/
* https://github.com/vgough/encfs/issues/315
PR: 230258 (inspired by)
Most users of fuse_vnode_setsize() set the cached fvdat->filesize and update
the buf cache bounds as a result of either a read from the underlying FUSE
filesystem, or as part of a write-through type operation (like truncate =>
VOP_SETATTR). In these cases, do not set the FN_SIZECHANGE flag, which
indicates that an inode's data is dirty (in particular, that the local buf
cache and fvdat->filesize have dirty extended data).
PR: 230258 (related)
The FUSE protocol demands that kernel implementations cache user filesystem
path components (lookup/cnp data) for a maximum period of time in the range
of [0, ULONG_MAX] seconds. In practice, typical requests are for 0, 1, or
10 seconds; or "a long time" to represent indefinite caching.
Historically, FreeBSD FUSE has ignored this client directive entirely. This
works fine for local-only filesystems, but causes consistency issues with
multi-writer network filesystems.
For now, respect 0 second cache TTLs and do not cache such metadata.
Non-zero metadata caching TTLs in the range [0.000000001, ULONG_MAX] seconds
are still cached indefinitely, because it is unclear how a userspace
filesystem could do anything sensible with those semantics even if
implemented.
Pass fuse_entry_out to fuse_vnode_get when available and only cache lookup
if the user filesystem did not set a zero second TTL.
PR: 230258 (inspired by; does not fix)
The FUSE protocol demands that kernel implementations cache user filesystem
file attributes (vattr data) for a maximum period of time in the range of
[0, ULONG_MAX] seconds. In practice, typical requests are for 0, 1, or 10
seconds; or "a long time" to represent indefinite caching.
Historically, FreeBSD FUSE has ignored this client directive entirely. This
works fine for local-only filesystems, but causes consistency issues with
multi-writer network filesystems.
For now, respect 0 second cache TTLs and do not cache such metadata.
Non-zero metadata caching TTLs in the range [0.000000001, ULONG_MAX] seconds
are still cached indefinitely, because it is unclear how a userspace
filesystem could do anything sensible with those semantics even if
implemented.
In the future, as an optimization, we should implement notify_inval_entry,
etc, which provide userspace filesystems a way of evicting the kernel cache.
One potentially bogus access to invalid cached attribute data was left in
fuse_io_strategy. It is restricted behind the undocumented and non-default
"vfs.fuse.fix_broken_io" sysctl or "brokenio" mount option; maybe these are
deadcode and can be eliminated?
Some minor APIs changed to facilitate this:
1. Attribute cache validity is tracked in FUSE inodes ("fuse_vnode_data").
2. cache_attrs() respects the provided TTL and only caches in the FUSE
inode if TTL > 0. It also grows an "out" argument, which, if non-NULL,
stores the translated fuse_attr (even if not suitable for caching).
3. FUSE VTOVA(vp) returns NULL if the vnode's cache is invalid, to help
avoid programming mistakes.
4. A VOP_LINK check for potential nlink overflow prior to invoking the FUSE
link op was weakened (only performed when we have a valid attr cache). The
check is racy in a multi-writer network filesystem anyway -- classic TOCTOU.
We have to trust any userspace filesystem that rejects local caching to
account for it correctly.
PR: 230258 (inspired by; does not fix)
Building binaries as PIE allows the executable itself to be loaded at a
random address when ASLR is enabled (not just its shared libraries).
With this change PIE objects have a .pieo extension and INTERNALLIB
libraries libXXX_pie.a.
MK_PIE is disabled for some kerberos5 tools, Clang, and Subversion, as
they explicitly reference .a libraries in their Makefiles. These can
be addressed on an individual basis later. MK_PIE is also disabled for
rtld-elf because it is already position-independent using bespoke
Makefile rules.
Currently only dynamically linked binaries will be built as PIE.
Discussed with: dim
Reviewed by: kib
MFC after: 1 month
Relnotes: Yes
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18423
net_open previously casted the first vararg to a char * and this was
half-OK: at first, it is passed to netif_open, which would cast it back to
the struct devdesc * that it really is and use it properly. It is then
strdup()d and used as the netdev_name, which is objectively wrong.
Correct it so that the first vararg is properly casted to a struct devdesc *
and the netdev_name gets set properly to make it more clear at a glance that
it's not doing something horribly wrong.
Reported by: mmel
Reviewed by: imp, mmel, tsoome
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D19206
was incorrectly implemented leading to a possible double free.
It is possible for both the conditional free,
and the unconditional free added in r340044 to be done,
fix that by initializing uopt to NULL,
removing the conditional free,
and only using the unconditional free at the end.
Reported by: Patrick Mooney (patrick.mooney@joyent.com)
Reviewed by: jhb (maintainer), Patrick Mooney (joyent/illumos)
Approved by: bde (mentor)
CID: 1357336
MFC after: 3 days
MFC with: 340044
Differential Revision: https://reviews.freebsd.org/D19202
nopt is the only allocated space,
xopt and cp are aliases into that allocated space.
Remove the 2 unneeded free's
Reported by: Patrick Mooney (@pmooney_pfmooney.com)
Reviewed by: jhb (maintainer), Patrick Mooney (joyent/illumos)
Approved by: bde (mentor)
CID: 1305412
MFC after: 3 days
MFC with: 340042
Differential Revision: https://reviews.freebsd.org/D19200
In out of order mode Rx buffer are accesses by req_id.
Accessing and validating mbuf using ntc is causing false error.
Increase driver revision after latest RX OOO completion fixes.
Submitted by: Rafal Kozik <rk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
MFC after: 1 week
Requested ID should be validated when the packet is received and not
when the driver is repopulating the mbufs.
Submitted by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
MFC after: 1 week
This commit essentially has three parts:
* Add the AES-CCM encryption hooks. This is in and of itself fairly small,
as there is only a small difference between CCM and the other ICM-based
algorithms.
* Hook the code into the OpenCrypto framework. This is the bulk of the
changes, as the algorithm type has to be checked for, and the differences
between it and GCM dealt with.
* Update the cryptocheck tool to be aware of it. This is invaluable for
confirming that the code works.
This is a software-only implementation, meaning that the performance is very
low.
Sponsored by: iXsystems Inc.
Differential Revision: https://reviews.freebsd.org/D19090
This adds the CBC-MAC code to the kernel, but does not hook it up to
anything (that comes in the next commit).
https://tools.ietf.org/html/rfc3610 describes the algorithm.
Note that this is a software-only implementation, which means it is
fairly slow.
Sponsored by: iXsystems Inc
Differential Revision: https://reviews.freebsd.org/D18592
The previous fix was unnecessarily very slow up to 105 hours where the
simple formula used previously worked, and unnecessarily slow by a factor
of about 5/3 up to 388 days, and didn't work above 388 days. 388 days is
not a long time, since it is a reasonable uptime, and for processes the
times being calculated are aggregated over all threads, so with N CPUs
running the same thread a runtime of 388 days is reachable after only
388 / N physical days.
The PRs document overflow at 388 days, but don't try to fix it.
Use the simple formula up to 76 hours. Then use a complicated general
method that reduces to the simple formula up to a bit less than 105
hours, then reduces to the previous method without its extra work up
to almost 388 days, then does more complicated reductions, usually
many bits at a time so that this is not slow. This works up to half
of maximum representable time (292271 years), with accumulated rounding
errors of at most 32 usec.
amd64 can do all this with no avoidable rounding errors in an inline
asm with 2 instructions, but this is too special to use. __uint128_t
can do the same with 100's of instructions on 64-bit arches. Long
doubles with at least 64 bits of precision are the easiest method to
use on i386 userland, but are hard to use in the kernel.
PR: 76972 and duplicates
Reviewed by: kib
Don't use a struct if_irq for IFLIB_INTR_IOV type interrupts since that results
in get_core_offset() being called on them, and get_core_offset() doesn't
handle IFLIB_INTR_IOV type interrupts, which results in an assert() being triggered
in iflib_irq_set_affinity().
PR: 235730
Reported by: Jeffrey Pieper <jeffrey.e.pieper@intel.com>
MFC after: 1 day
Sponsored by: Intel Corporation
Make the clustering enabling knob more fine-grained by providing a
setting where the allocation with hint is not clustered. This is aimed
to be somewhat more compatible with e.g. go 1.4 which expects that
hinted mmap without MAP_FIXED does not change the allocation address.
Now the vm.cluster_anon can be set to 1 to only cluster when no hints,
and to 2 to always cluster. Default value is 1.
Requested by: peter
Reviewed by: emaste, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D19194
When sigreturn() restored a thread's context, SRR1 was being restored
to its previous value, but pcb_flags was not being touched.
This could cause a mismatch between the thread's MSR and its pcb_flags.
For instance, when the thread used the FPU for the first time inside
the signal handler, sigreturn() would clear SRR1, but not pcb_flags.
Then, the thread would return with the FPU bit cleared in MSR and,
the next time it tried to use the FPU, it would fail on a KASSERT
that checked if the FPU was disabled.
This change clears the FPU bit in both pcb_flags and frame->srr1,
as the code that restores the context expects to use the FPU trap
to re-enable it.
PR: 234539
Reported by: sbruno
Reviewed by: jhibbits, sbruno
Differential Revision: https://reviews.freebsd.org/D19166
In particular, use ifuncs for __getcontextx_size(), also calculate the
size of the extended save area in resolver. Same for __fillcontextx2().
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Some older compilers, when generating PIC code, cannot handle inline
asm that clobbers %ebx (because %ebx is used as the GOT offset
register). Userspace versions avoid clobbering %ebx by saving it to
stack before executing the CPUID instruction.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
[MC] Make symbol version errors non-fatal
We stil don't have a source location, which is pretty lame, but at
least we won't tell the user to file a clang bug report anymore.
Fixes PR40712
This will make errors for symbols with @@ versions that are not defined
non-fatal. For example:
void f(void)
{
__asm__(".symver foo,bar@@baz");
}
will now result in:
error: versioned symbol bar@@baz must be defined
instead of clang crashing with a diagnostic report.
PR: 234671
Upstream PR: https://bugs.llvm.org/show_bug.cgi?id=40712
MFC after: 3 days
This reduces the overhead of TLB invalidations by ensuring that we
only interrupt CPUs which are using the given pmap. Tracking is
performed in pmap_activate(), which gets called during context switches:
from cpu_throw(), if a thread is exiting or an AP is starting, or
cpu_switch() for a regular context switch.
For now, pmap_sync_icache() still must interrupt all CPUs.
Reviewed by: kib (earlier version), jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18874
This includes support for pmap_enter(..., psind=1) as described in the
commit log message for r321378.
The changes are largely modelled after amd64. arm64 has more stringent
requirements around superpage creation to avoid the possibility of TLB
conflict aborts, and these requirements do not apply to RISC-V, which
like amd64 permits simultaneous caching of 4KB and 2MB translations for
a given page. RISC-V's PTE format includes only two software bits, and
as these are already consumed we do not have an analogue for amd64's
PG_PROMOTED. Instead, pmap_remove_l2() always invalidates the entire
2MB address range.
pmap_ts_referenced() is modified to clear PTE_A, now that we support
both hardware- and software-managed reference and dirty bits. Also
fix pmap_fault_fixup() so that it does not set PTE_A or PTE_D on kernel
mappings.
Reviewed by: kib (earlier version)
Discussed with: jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18863
Differential Revision: https://reviews.freebsd.org/D18864
Differential Revision: https://reviews.freebsd.org/D18865
Differential Revision: https://reviews.freebsd.org/D18866
Differential Revision: https://reviews.freebsd.org/D18867
Differential Revision: https://reviews.freebsd.org/D18868
But ipsec_delete_pcbpolicy() uses some VNET-virtualized variables,
and thus it needs VNET context, that is missing during gtaskqueue
executing. Use inp_vnet context to set curvnet in in_pcbfree_deferred().
PR: 235684
MFC after: 1 week
ratelimiting code. The two modules (lagg and vlan) did have
allocation routines, and even though they are indirect (and
vector down to the underlying interfaces) they both need to
have a free routine (that also vectors down to the actual interface).
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D19032
be_destroy is documented to recursively destroy a boot environment. In the
case of snapshots, one would take this to mean that these are also
recursively destroyed. However, this was previously not the case.
be_destroy would descend into the be_destroy callback and attempt to
zfs_iter_children on the top-level snapshot, which is bogus.
Our alternative approach is to take note of the snapshot name and iterate
through all of fs children of the BE to try destruction in the children.
The -o option is also fixed to work properly with deep BEs. If the BE was
created with `bectl create -e otherDeepBE newDeepBE`, for instance, then a
recursive snapshot of otherDeepBE would have been taken for construction of
newDeepBE but a subsequent destroy with BE_DESTROY_ORIGIN set would only
clean up the snapshot at the root of otherDeepBE: ${BEROOT}/otherDeepBE@...
The most recent iteration instead pretends not to know how these things
work, verifies that the origin is another BE and then passes that back
through be_destroy to DTRT when snapshots and deep BEs may be in play.
MFC after: 1 week
Newer cores have the 'tlbilx' instruction, which doesn't broadcast over
CoreNet. This is significantly faster than walking the TLB to invalidate
the PID mappings. tlbilx with the arguments given takes 131 clock cycles to
complete, as opposed to 512 iterations through the loop plus tlbre/tlbwe at
each iteration.
MFC after: 3 weeks
The panic message lead people to believe some userland CAM request had
caused a problem when in reallity it was for a kernel request (eg the
USER bit was cleared). Reword message. Also, improve a couple of
comments to reflect that the periph shouldn't be completely torn down
before we get here (so the path and sim pointers should be valid, but
aren't and the code is designed to be robust enough in the face of
that to give a specific panic message).
Set up zpools with a more unique name, stash the zpool name away in a file pointed
to by `$ZPOOL_NAME_FILE` (which is relative to a per-testcase generated temporary
directory), then remove the file based on `$ZPOOL_NAME_FILE` in the cleanup
routines.
This is a more concurrency-safe solution and will allow the testcases to be safely
executed in parallel.
Reviewed by: kevans, jtl
Approved by: jtl (mentor)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D19024