If an exception or NMI occurs before the CPU has switched to a pmap different
from vmspace0, the PCPU kcr3 is left zero for the PTI config, which causes
a triple fault in the handler.
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
reassembly of inbound TCP segments. The old algorithm just blindly
dropped segments in without coalescing. This meant that every
segment could take up more and more room on the linked list
of segments. That list is now subject to a tighter limit (100
segments), which in a high-BDP situation makes us a lot less
efficient, since we drop segments we receive beyond 100 entries.
What this restructure does is make the reassembly buffer
coalesce segments, putting an emphasis on the two common cases
(which avoid walking the list of segments): adding to the back
of the queue of segments and adding to the front. The reassembly
buffer also supports a couple of debug options (black-box logging
as well as counters for code coverage). These are compiled out by
default but can be enabled by uncommenting the defines.
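As a minimal userspace sketch of the two fast paths (hypothetical names and
simplified bookkeeping, not the actual tcp_reass() code), assuming a
TAILQ-backed queue of segments:

#include <sys/queue.h>
#include <stdint.h>
#include <stdlib.h>

struct seg {
    TAILQ_ENTRY(seg) link;
    uint32_t start;             /* first sequence number held */
    uint32_t len;               /* bytes held */
};
TAILQ_HEAD(seghead, seg);

/*
 * Insert a segment, trying the two common cases first: it extends
 * the tail (in-order arrival) or it extends the head.  Both cases
 * coalesce in place instead of growing the list.
 */
static void
reass_insert(struct seghead *q, uint32_t start, uint32_t len)
{
    struct seg *tail = TAILQ_LAST(q, seghead);
    struct seg *head = TAILQ_FIRST(q);
    struct seg *s;

    if (tail != NULL && start == tail->start + tail->len) {
        tail->len += len;       /* common case 1: append and merge */
        return;
    }
    if (head != NULL && start + len == head->start) {
        head->start = start;    /* common case 2: prepend and merge */
        head->len += len;
        return;
    }
    /* Slow path (walking the list for overlaps) omitted. */
    s = calloc(1, sizeof(*s));
    s->start = start;
    s->len = len;
    if (tail == NULL || start > tail->start)
        TAILQ_INSERT_TAIL(q, s, link);
    else
        TAILQ_INSERT_HEAD(q, s, link);
}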
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D16626
has SMP enabled, lines can get intermixed with other console output,
making these lines hard to read.
Reviewed by: manu
Differential Revision: https://reviews.freebsd.org/D16689
This is an amalgam of a patch by Doug Ambrisko to
generalize uart_acpi_find_device, imp's moving of the
ACPI table to uart_dev_ns8250.c, and advice from jhb
to work around a bug in the EPYC 3151 BIOS
(the BIOS incorrectly marks the serial ports as
disabled).
Reviewed by: imp
MFC after: 8 weeks
Differential Revision: https://reviews.freebsd.org/D16432
This is effectively a merge from amd64 of r312888, r323235, and r333486.
I've been running this on my POWER9 Talos for some time now with no ill
effects.
Suggested by: mjg
Recent DTS files use the syscon for the emac controller.
We support this, but since U-Boot still uses old DTS files it was never
necessary for us to add that support; it is a problem when using recent
upstream DTS files and will be one when U-Boot catches up.
While here, add a new compatible string to the aw_syscon driver, as Linux
changed it.
Current mitigation for L1TF in bhyve flushes L1D either by an explicit
WRMSR command, or by software reading enough uninteresting data to
fully populate all lines of L1D. If an NMI occurs after either
method completes, but before VM entry, L1D becomes polluted with
the cache lines touched by the NMI handlers. There is no interesting
data which the NMI accesses, but something sensitive might be co-located
on the same cache line, and then L1TF exposes that to a rogue guest.
Use the VM entry MSR load list to ensure atomicity of the L1D flush and
VM entry if updated microcode was loaded. If only the software flush
method is available, try to help the bhyve software flusher by also
flushing L1D on NMI exit to kernel mode.
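For reference, a hedged sketch of the hardware flush primitive; the MSR
number and bit follow Intel's documented IA32_FLUSH_CMD interface, while
the helper name is illustrative:

#include <sys/cdefs.h>
#include <stdint.h>

#define MSR_IA32_FLUSH_CMD  0x10b   /* Intel-documented MSR */
#define IA32_FLUSH_CMD_L1D  0x1     /* bit 0: writeback+invalidate L1D */

static inline void
flush_l1d_hw(void)
{
    uint64_t v = IA32_FLUSH_CMD_L1D;

    /* wrmsr: EDX:EAX is written to the MSR selected by ECX. */
    __asm __volatile("wrmsr" : : "c"((uint32_t)MSR_IA32_FLUSH_CMD),
        "a"((uint32_t)v), "d"((uint32_t)(v >> 32)));
}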
Suggested by and discussed with: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D16790
ObsoleteFiles.inc:
Remove manual pages for arc4random_addrandom(3) and
arc4random_stir(3).
contrib/ntp/lib/isc/random.c:
contrib/ntp/sntp/libevent/evutil_rand.c:
Eliminate in-tree usage of arc4random_addrandom().
crypto/heimdal/lib/roken/rand.c:
crypto/openssh/config.h:
Eliminate in-tree usage of arc4random_stir().
include/stdlib.h:
Remove arc4random_stir() and arc4random_addrandom() prototypes,
provide temporary shims for the transition period.
lib/libc/gen/Makefile.inc:
Hook arc4random-compat.c to build, add hint for Chacha20 source for
kernel, and remove arc4random_addrandom(3) and arc4random_stir(3)
links.
lib/libc/gen/arc4random.c:
Adopt OpenBSD arc4random.c,v 1.54 with bare minimum changes, use the
sys/crypto/chacha20 implementation of keystream.
lib/libc/gen/Symbol.map:
Remove arc4random_stir and arc4random_addrandom interfaces.
lib/libc/gen/arc4random.h:
Adopt OpenBSD arc4random.h,v 1.4 but provide _ARC4_LOCK of our own.
lib/libc/gen/arc4random.3:
Adopt OpenBSD arc4random.3,v 1.35 but keep FreeBSD r114444 and
r118247.
lib/libc/gen/arc4random-compat.c:
Compatibility shims for the arc4random_stir and arc4random_addrandom
functions to preserve ABI. Log once when called but do nothing
otherwise (see the sketch after this file list).
lib/libc/gen/getentropy.c:
lib/libc/include/libc_private.h:
Fold __arc4_sysctl into getentropy.c (renamed to arnd_sysctl).
Remove from libc_private.h as a result.
sys/crypto/chacha20/chacha.c:
sys/crypto/chacha20/chacha.h:
Make it possible to use the kernel implementation in libc.
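As a rough sketch of the arc4random-compat.c shims described above (the
warning mechanics here are illustrative; the real shims may log
differently):

#include <stdio.h>

/* ABI-preserving no-op shims; the symbols remain, the behavior is gone. */
void
arc4random_stir(void)
{
    static int warned;

    if (!warned)
        fprintf(stderr, "arc4random_stir: deprecated no-op\n");
    warned = 1;
}

void
arc4random_addrandom(unsigned char *dat, int datlen)
{
    static int warned;

    (void)dat;
    (void)datlen;
    if (!warned)
        fprintf(stderr, "arc4random_addrandom: deprecated no-op\n");
    warned = 1;
}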
PR: 182610
Reviewed by: cem, markm
Obtained from: OpenBSD
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D16760
The MPTable probe code was using PMAP_MAP_LOW as the PA -> VA offset
when searching for the table signature but still using KERNBASE once
it had found the table. As a result, the mpfps table pointed into a
random part of the kernel text instead of the actual MP Table.
Rather than adding more #ifdef's, use BIOS_PADDRTOVADDR from
<machine/pc/bios.h> which already uses PMAP_MAP_LOW on i386 and KERNBASE
on amd64.
Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D16802
blocks of a file as contiguously as possible. Since the filesystem
does not know how large a file will grow when it is first being
written, it initially places the file in a set of blocks in which
it currently fits. As it grows, it is relocated to areas with
larger contiguous blocks. In this way it saves its large contiguous
sets of blocks for the files that need them and thus avoids
unnecessarily fragmenting its disk space.
We used to skip reallocating the blocks of a file into a contiguous
sequence if the underlying flash device requested BIO_DELETE
notifications, because devices that benefit from BIO_DELETE also
benefit from not moving the data. However, in the algorithm described
above that reallocates the blocks, the destination for the data is
usually moved before the data is written to the initially allocated
location. So we rarely suffer the penalty of extra writes. With
the addition of the consolidation of contiguous blocks into single
BIO_DELETE operations, having fewer but larger contiguous blocks
reduces the number of (slow and expensive) BIO_DELETE operations.
So when doing BIO_DELETE consolidation, we do block reallocation.
Reviewed by: kib
Tested by: Peter Holm
Sponsored by: Netflix
When deleting files on filesystems that are stored on flash-memory
(solid-state) disk drives, the filesystem notifies the underlying
disk of the blocks that it is no longer using. The notification
allows the drive to avoid saving these blocks when it needs to
flash (zero out) one of its flash pages. These notifications of
no-longer-being-used blocks are referred to as TRIM notifications.
In FreeBSD these TRIM notifications are sent from the filesystem
to the drive using the BIO_DELETE command.
Until now, the filesystem would send a separate message to the drive
for each block of the file that was deleted. Each Gigabyte of file
size resulted in over 3000 TRIM messages being sent to the drive.
This burst of messages can overwhelm the drive's task queue causing
multiple second delays for read and write requests.
This implementation collects runs of contiguous blocks in the file
and then consolidates them into a single BIO_DELETE command to the
drive. The BIO_DELETE command describes the run of blocks as a
single large block being deleted. Each Gigabyte of file size can
result in as few as two BIO_DELETE commands, and typically fewer
than ten. Though these larger BIO_DELETE commands take longer to
run, they do not clog the drive task queue, so read and write
commands can intersperse effectively with them.
Though this new feature has been thoroughly reviewed and tested, it
is being added disabled by default so as to minimize the possibility
of disrupting the upcoming 12.0 release. It can be enabled by running
``sysctl vfs.ffs.dotrimcons=1''. Users are encouraged to test it.
If no problems arise, we will consider requesting that it be enabled
by default for 12.0.
Reviewed by: kib
Tested by: Peter Holm
Sponsored by: Netflix
The support for lazy pmap invalidations on i386 was removed in r281707.
This removes the constant for the IPI and stops accounting for it when
sizing the interrupt count arrays.
Reviewed by: kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D16801
The TCP client side, or the TCP server side when not using SYN cookies,
used the uptime as the TCP timestamp value. This patch uses an offset in
all cases, which is the result of a keyed hash function taking
the source and destination addresses and port numbers into account.
The keyed hash function is the same as the one used for the initial TSN.
Reviewed by: rrs@
MFC after: 1 month
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D16636
becomes -1, except these are unsigned integers, so they become very large
numbers and thus are always larger than the maximum bucket; the hash table
insertion fails, causing NAT to fail.
This commit ensures that if the index is already zero it is not reduced
prior to insertion into the hash table.
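A minimal sketch of the guard (the variable and table names are
hypothetical stand-ins for the libalias internals):

static unsigned int
pick_bucket(unsigned int idx, unsigned int nbuckets)
{
    /*
     * idx is unsigned: decrementing 0 would wrap around to a huge
     * value that can never be a valid bucket index, so only
     * decrement while it is still positive.
     */
    if (idx != 0)
        idx--;
    return (idx % nbuckets);
}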
PR: 208566
are situated next to error counters and, in one instance, prior to the
-1 return from various functions. This was useful in the diagnosis of
PR/208566 and will be handy in the future when diagnosing NAT failures.
PR: 208566
MFC after: 3 days
Without this patch, some CS4614 cards require users to reload the driver
manually or the hardware won't be initialised properly. Something like:
# kldload snd_csa
# kldunload snd_csa
# kldload snd_csa
Tested with: Terratec SiXPack 5.1+
I was not aware Warner was making or planning to make forward progress in
this area and have since been informed of that.
It's easy to apply/reapply when churn dies down.
Inspired by r338025, just remove the element size parameter to the
MODULE_PNP_INFO macro entirely. The 'table' parameter is now required to
have correct pointer (or array) type. Since all invocations of the macro
already had this property and the emitted PNP data continues to include the
element size, there is no functional change.
Mostly done with the coccinelle 'spatch' tool:
$ cat modpnpsize0.cocci
@normaltables@
identifier b,c;
expression a,d,e;
declarer MODULE_PNP_INFO;
@@
MODULE_PNP_INFO(a,b,c,d,
-sizeof(d[0]),
e);
@singletons@
identifier b,c,d;
expression a;
declarer MODULE_PNP_INFO;
@@
MODULE_PNP_INFO(a,b,c,&d,
-sizeof(d),
1);
$ rg -l MODULE_PNP_INFO -- sys | \
xargs spatch --in-place --sp-file modpnpsize0.cocci
(Note that coccinelle invokes diff(1) via a PATH search and expects diff to
tolerate the -B flag, which BSD diff does not. So I had to link gdiff into
PATH as diff to use spatch.)
Tinderbox'd (-DMAKE_JUST_KERNELS).
driven by problems found with the algorithms being tested for TRIM
consolidation.
Reported by: Peter Holm
Suggested by: kib
Reviewed by: kib
Sponsored by: Netflix
instead of the size of the whole bge_devs array.
This should stop kldxref searching beyond the end of .rodata when it
processes relocations, and emitting "unhandled relocation type" errors,
at least on i386.
This is very primitive code to inspect the PCI error state and AER
error state, dump the log and clear errors, from ddb.
pci_print_faulted_dev() is made external to allow calling it from
other places. It was called from NMI handler but this chunk is not
included.
Also there is a tunable-controlled code to clear AER on device attach,
disabled by default.
All this code was useful to me when I debugged ACPI_DMAR failures (not
faults) a long time ago.
Reviewed by: cem, imp (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D7813
- In configurations with a pseudo devices section, move 'device crypto'
into that section.
- Use a consistent comment. Note that other things common in kernel
configs such as GELI also require 'device crypto', not just IPSEC.
Reviewed by: rgrimes, cem, imp
Differential Revision: https://reviews.freebsd.org/D16775
The fallback logic was broken if hints were found in multiple environments.
If we found a hint in either the loader environment or the static
environment, fallback would be incremented excessively when we returned to
the environment-selection bits. These checks should have also been guarded
by the fbacklvl checks. As a result, fbacklvl could quickly get to a point
where we skip either the static environment and/or the static hints
depending on which environments contained valid hints.
The impact of this bug is minimal, mostly affecting mips boards that use
static hints and may have hints in either the loader environment or the
static environment.
There may be better ways to express the searchable environments and
describing their characteristics (immutable, already searched, etc.) but
this may be revisited after 12 branches.
Reported by: Dan Nelson <dnelson_1901@yahoo.com>
Triaged by: Dan Nelson <dnelson_1901@yahoo.com>
MFC after: 3 days
When coding the pNFS server, I added vn_start_write() calls in nfsrv_copymr()
done while the vnodes were locked, not realizing I had introduced LORs and
possible deadlock when an exported file system on the MDS is suspended.
This patch fixes the LORs by moving the vn_start_write() calls up to before
where the vnodes are locked. For "tvp", the vn_start_write() probably isn't
necessary, because NFS mounts can't be suspended. However, I think doing
so is harmless.
Thanks go to kib@ for letting me know that I had introduced these LORs.
This patch only affects the behaviour of the pNFS server when pnfsdscopymr(8)
is used to recover a mirrored DS.
The domain and flags parameters suffice. In fact, the related functions
kmem_alloc_{attr,contig}_domain() don't have an arena parameter.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D16713
When coding the pNFS server, I added several vn_start_write() calls done
while the vnode was locked, not realizing I had introduced LORs and
possible deadlock when an exported file system on the MDS is suspended.
This patch fixes this by removing the added vn_start_write() calls and
modifying the code so that the extant vn_start_write() call before the
NFS RPC/operation is done when needed by the pNFS server.
Flags are changed so that LayoutCommit and LayoutReturn now get a
vn_start_write() done for them.
When the pNFS server is enabled, the code now also changes the flags for
Getattr, so that the vn_start_write() is done for Getattr, since it may
need to do a vn_set_extattr(). The nfs_writerpc flag array was made global
to the NFS server and renamed nfsrv_writerpc, which is consistent naming
for globals in the NFS server.
Thanks go to kib@ for reporting that doing vn_start_write() while the vnode is
locked results in a LOR.
This patch only affects the behaviour of the pNFS server.
Some internal KASSERTs access the v_iflag field without the vnode
interlock held after such a refcount update. The fences are needed for
the assertions to be correct in the face of store reordering.
Reported and tested by: jhibbits
Reviewed by: kib, mjg
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D16756
Relax allocation throttling for ditto blocks. Due to random imbalances
in allocation it tends to push block copies to one vdev that looks
slightly better at the moment. A slightly less strict policy allows us to
improve both data security and, surprisingly, write performance, since we
don't need to touch extra metaslabs on each vdev to respect the min distance.
Sponsored by: iXsystems, Inc.
Use METASLAB_WEIGHT_CLAIM weight to allocate tertiary blocks.
Previous use of METASLAB_WEIGHT_SECONDARY for that caused errors
later on metaslab_activate_allocator() call, leading to massive
load of unneeded metaslabs and write freezes.
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Similar to the network stack issue fixed in r337782, pf did not limit the
number of fragments per packet, which could be exploited to generate high
CPU loads with a crafted series of packets.
Limit each packet to no more than 64 fragments. This should be sufficient on
typical networks to allow maximum-sized IP frames.
This addresses the issue for both IPv4 and IPv6.
MFC after: 3 days
Security: CVE-2018-5391
Sponsored by: Klara Systems
In the past, Capsicum allowed changing the process title.
This was broken by r335939.
PR: 230584
Submitted by: Yuichiro NAITO <naito.yuichiro@gmail.com>
Reported by: ian@niw.com.au
MFC after: 1 week
When a pNFS service is running, the sizes of the files created on the MDS
are normally 0, since the data is written to the data files on the DS(s).
However, without this patch, if a Setattr with a non-zero size was done by
a client, the MDS file was set to that size. This was thought to be benign,
but it turns out that files with a non-zero size plus extended attributes
can cause a "ffs_truncate3" panic in UFS. Although the exact cause of this
panic() has not been isolated, this patch avoids the panic() and leaves
the MDS files in a consistent state of always having a size == 0.
Note that these MDS files never store data. The patch also includes an
unnecessary initialization of savsize in case some compiler or static
analyser complains it might not be initialized.
This patch only affects the NFS server when pNFS is enabled via the "-p"
command line option on nfsd.
Fix a regression introduced in r336439.
Rather than allowing any linked list of algorithms, allow at most two
(typically, some combination of encrypt and/or MAC). Removes a WAITOK
malloc in an unsleepable context (classic LOR) by placing both software
algorithm contexts within the OCF-managed session object.
Tested with 'cryptocheck -a all -d cryptosoft0', which includes some
encrypt-and-MAC modes.
PR: 230304
Reported by: sef@
Summary:
PowerISA 3.0 adds a 'darn' instruction to "deliver a random number". This
driver was modeled after (rather, copied and gutted of) the Ivy Bridge
rdrand driver.
This uses the "Conditional Random Number" behavior to remove input bias.
From the ISA reference the 'darn' instruction, and the random number
generator backing it, conforms to the NIST SP800-90B and SP800-90C
standards, compliant to the extent possible at the time the hardware was
designed, and guarantees a minimum 0.5 bits of entropy per bit returned.
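A hedged sketch of reading the conditioned form of the instruction
(operand encoding per the PowerISA 3.0 description: L=1 selects the 64-bit
conditioned result, and a result of all ones signals a transient failure;
the helper name is illustrative):

#include <sys/cdefs.h>
#include <stdint.h>

static inline int
darn_read(uint64_t *out)
{
    uint64_t rnd;

    __asm __volatile("darn %0, 1" : "=r"(rnd));
    if (rnd == UINT64_MAX)
        return (-1);    /* transient failure; caller should retry */
    *out = rnd;
    return (0);
}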
Reviewed By: markm, secteam (delphij)
Approved by: secteam (delphij)
Differential Revision: https://reviews.freebsd.org/D16552
This change allows one to set kern.boot_tag="" and not get a blank line
preceding other boot messages. While this isn't super critical (blank
lines are easy to filter out, both mentally and when processing dmesg
later), it allows for a mode of operation that matches previous behavior.
I intend to MFC this whole series to stable/11 by the end of the month with
boot_tag empty by default to make this effectively a nop in the stable
branch.
Missed in r337940.
(It's not like there are any crypto files IPsec doesn't pull in, so it is
unclear what not defining the crypto option was supposed to achieve.)
Reported by: np@
The wrapper is a thin shim around libsodium's Poly1305 implementation. For
now, we just use the C algorithm and do not attempt to build the
SSE-optimized variant for x86 processors.
The algorithm support has not yet been plumbed through cryptodev, or added
to cryptosoft.
The idea is untouched upstream sources live in sys/contrib/libsodium.
sys/crypto/libsodium are support routines or compatibility headers to allow
building unmodified upstream code.
This is not yet integrated into the build system, so no functional change.
Bring in https://github.com/jedisct1/libsodium at
461ac93b260b91db8ad957f5a576860e3e9c88a1 (August 7, 2018), unmodified.
libsodium is derived from Daniel J. Bernstein et al.'s 2011 NaCl
("Networking and Cryptography Library," pronounced "salt") software library.
At the risk of oversimplifying, libsodium primarily exists to make it easier
to use NaCl. NaCl and libsodium provide high quality implementations of a
number of useful cryptographic concepts (as well as the underlying
primitives) seeing some adoption in newer network protocols.
I considered but dismissed cleaning up the directory hierarchy and
discarding artifacts of other build systems in favor of remaining close to
upstream (and easing future updates).
Nothing is integrated into the build system yet, so in that sense, no
functional change.
toe_l2_resolve to fill up the complete vtag and not just the vid.
Reviewed by: kib@
MFC after: 1 week
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D16752
jails since FreeBSD 7.
Along with the system call, put the various security.jail.allow_foo and
security.jail.foo_allowed sysctls partly under COMPAT_FREEBSD11 (or
BURN_BRIDGES). These sysctls had two disparate uses: on the system side,
they were global permissions for jails created via jail(2) which lacked
fine-grained permission controls; inside a jail, they're read-only
descriptions of what the current jail is allowed to do. The first use
is obsolete along with jail(2), but keep them for the second, read-only use.
Differential Revision: D14791
ZoL made the same mistake, and fixed it with separate commit 863522b1f9:
dsl_scan_scrub_cb: don't double-account non-embedded blocks
We were doing count_block() twice inside this function, once
unconditionally at the beginning (intended to catch the embedded block
case) and once near the end after processing the block.
The double-accounting caused the "zpool scrub" progress statistics in
"zpool status" to climb from 0% to 200% instead of 0% to 100%, and
showed double the I/O rate it was actually seeing.
This was apparently a regression introduced in commit 00c405b4b5e8,
which was an incorrect port of this OpenZFS commit:
https://github.com/openzfs/openzfs/commit/d8a447a7
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Steven Noonan <steven@uplinklabs.net>
Closes #7720
Closes #7738
Reported by: sef
This is actually several different bugs:
- The code is not designed to handle inpcb deletion after interface deletion
- add reference for inpcb membership
- The multicast address has to be removed from interface lists when the refcount
goes to zero OR when the interface goes away
- decouple list disconnect from refcount (v6 only for now)
- ifmultiaddr can exist past being on interface lists
- add flag for tracking whether or not it's enqueued
- deferring freeing moptions makes the inpcb cleanup code simpler but opens the
door wider still to races
- call inp_gcmoptions synchronously after dropping the inpcb lock
Fundamentally multicast needs a rewrite - but keep applying band-aids for now.
Tested by: kp
Reported by: novel, kp, lwhsu
created and before exec.start is called. [1]
- Bump __FreeBSD_version.
This allows attaching ZFS datasets and various other things to be
done before any command/service/rc-script is started in the new
jail.
PR: 228066 [1]
Reviewed by: jamie [1]
Submitted by: Stefan Grönke <stefan@gronke.net> [1]
Differential Revision: https://reviews.freebsd.org/D15330 [1]
So that I don't have to keep grepping around the codebase to remember what each
one does. And maybe it saves someone else some time.
Fix a trivial whitespace issue while here.
No functional change.
Sponsored by: Dell EMC Isilon
Ensure that the valid PCID state is created for the proc0 pmap, since it
might be used by efirt enter() before the first context switch on the BSP.
Sponsored by: The FreeBSD Foundation
MFC after: 6 days
The isonum_* functions are defined to take unsigned char * as an argument,
but the structure fields are defined as char. Change to u_char where needed.
Probably the full structure should be changed, but I'm not sure about the
side effects.
While there, add the __packed attribute.
Differential Revision: https://reviews.freebsd.org/D16564
hashfilters. Two because the driver needs to look up a hashfilter by
its 4-tuple or tid.
A couple of fixes while here:
- Reject attempts to add duplicate hashfilters.
- Do not assume that any part of the 4-tuple that isn't specified is 0.
This makes it consistent with all other mandatory parameters that
already require explicit user input.
MFC after: 2 weeks
Sponsored by: Chelsio Communications
Remove the non-INTRNG code.
Remove leftover cut-and-paste code from the lpc code that was the starting point for the port.
Set KERNPHYSADDR and KERNVIRTADDR
Tested on Buffalo_WZR2-G300N
Differential Revision: https://reviews.freebsd.org/D16622
The libkqueue tests have several places that leak memory by using an
idiom like:
puts(kevent_to_str(kevp));
Rework to save the pointer returned from kevent_to_str() and then
free() it after it has been used.
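The reworked idiom, sketched (kevent_to_str() is the test suite's own
helper; the surrounding code is illustrative):

char *str;

str = kevent_to_str(kevp);
puts(str);
free(str);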
Reported by: asomers (pointer to Coverity), Coverity
CID: 1296063, 1296064, 1296065, 1296066, 1296067, 1350287, 1394960
Sponsored by: Dell EMC
Currently, the limits are quite high. On machines with millions of
mbuf clusters, the reassembly queue limits can also run into
the millions. Lower these values.
Also, try to ensure that no bucket will have a reassembly
queue larger than approximately 100 items. This limits the cost to
find the correct reassembly queue when processing an incoming
fragment.
Due to the low limits on each bucket's length, increase the size of
the hash table from 64 to 1024.
Reviewed by: jhb
Security: FreeBSD-SA-18:10.ip
Security: CVE-2018-6923
In particular, try to ensure that no bucket will have a reassembly
queue larger than approximately 100 items. This limits the cost to
find the correct reassembly queue when processing an incoming
fragment.
Due to the low limits on each bucket's length, increase the size of
the hash table from 64 to 1024.
Reviewed by: jhb
Security: FreeBSD-SA-18:10.ip
Security: CVE-2018-6923
On guest entry in bhyve, flush the L1 data cache, using either the L1D
flush command MSR if available, or by reading enough uninteresting
data to fill the whole cache.
Flush is automatically enabled on CPUs which do not report RDCL_NO,
and can be disabled with the hw.vmm.l1d_flush tunable/kenv.
Security: CVE-2018-3646
Reviewed by: emaste, jhb, Tony Luck <tony.luck@intel.com>
Sponsored by: The FreeBSD Foundation
Currently, we process IPv6 fragments with 0 bytes of payload, add them
to the reassembly queue, and do not recognize them as duplicating or
overlapping with adjacent 0-byte fragments. An attacker can exploit this
to create long fragment queues.
There is no legitimate reason for a fragment with no payload. However,
because IPv6 packets with an empty payload are acceptable, allow an
"atomic" fragment with no payload.
Reviewed by: jhb
Security: FreeBSD-SA-18:10.ip
Security: CVE-2018-6923
There is a hashing algorithm which should distribute IPv6 reassembly
queues across the available buckets in a relatively even way. However,
if there is a flaw in the hashing algorithm which allows a large number
of IPv6 fragment reassembly queues to end up in a single bucket, a per-
bucket limit could help mitigate the performance impact of this flaw.
Implement such a limit, with a default of twice the maximum number of
reassembly queues divided by the number of buckets. Recalculate the
limit any time the maximum number of reassembly queues changes.
However, allow the user to override the value using a sysctl
(net.inet6.ip6.maxfragbucketsize).
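A sketch of the default computation described above (the names are
hypothetical stand-ins for the sysctl-backed variables):

static void
update_maxfragbucketsize(int maxfragpackets, int nbuckets, int *bucketsize)
{
    int limit;

    /* Default: twice the queue limit, spread across the buckets. */
    limit = (2 * maxfragpackets) / nbuckets;
    if (limit < 1)
        limit = 1;
    *bucketsize = limit;    /* overridable via the sysctl */
}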
Reviewed by: jhb
Security: FreeBSD-SA-18:10.ip
Security: CVE-2018-6923
The IPv4 fragment reassembly code supports a limit on the number of
fragments per packet. The default limit is currently 17 fragments.
Among other things, this limit serves to limit the number of fragments
the code must parse when trying to reassemble a packet.
Add a limit to the IPv6 reassembly code. By default, limit a packet
to 65 fragments (64 on the queue, plus one final fragment to complete
the packet). This allows an average fragment size of 1,008 bytes, which
should be sufficient to hold a fragment. (Recall that the IPv6 minimum
MTU is 1280 bytes. Therefore, this configuration allows a full-size
IPv6 packet to be fragmented on a link with the minimum MTU and still
carry approximately 272 bytes of headers before the fragmented portion
of the packet.)
Users can adjust this limit using the net.inet6.ip6.maxfragsperpacket
sysctl.
Reviewed by: jhb
Security: FreeBSD-SA-18:10.ip
Security: CVE-2018-6923
The IPv6 reassembly fragment limit is based on the number of mbuf clusters,
which are a global resource. However, the limit is currently applied
on a per-VNET basis. Given enough VNETs (or given sufficient customization
on enough VNETs), it is possible that the sum of all the VNET fragment
limits will exceed the number of mbuf clusters available in the system.
Given the fact that the fragment limits are intended (at least in part) to
regulate access to a global resource, the IPv6 fragment limit should
be applied on a global basis.
Note that it is still possible to disable fragmentation for a particular
VNET by setting the net.inet6.ip6.maxfragpackets sysctl to 0 for that
VNET. In addition, it is now possible to disable fragmentation globally
by setting the net.inet6.ip6.maxfrags sysctl to 0.
Reviewed by: jhb
Security: FreeBSD-SA-18:10.ip
Security: CVE-2018-6923
There is a hashing algorithm which should distribute IPv4 reassembly
queues across the available buckets in a relatively even way. However,
if there is a flaw in the hashing algorithm which allows a large number
of IPv4 fragment reassembly queues to end up in a single bucket, a per-
bucket limit could help mitigate the performance impact of this flaw.
Implement such a limit, with a default of twice the maximum number of
reassembly queues divided by the number of buckets. Recalculate the
limit any time the maximum number of reassembly queues changes.
However, allow the user to override the value using a sysctl
(net.inet.ip.maxfragbucketsize).
Reviewed by: jhb
Security: FreeBSD-SA-18:10.ip
Security: CVE-2018-6923
The IP reassembly fragment limit is based on the number of mbuf clusters,
which are a global resource. However, the limit is currently applied
on a per-VNET basis. Given enough VNETs (or given sufficient customization
of enough VNETs), it is possible that the sum of all the VNET limits
will exceed the number of mbuf clusters available in the system.
Given the fact that the fragment limit is intended (at least in part) to
regulate access to a global resource, the fragment limit should
be applied on a global basis.
VNET-specific limits can be adjusted by modifying the
net.inet.ip.maxfragpackets and net.inet.ip.maxfragsperpacket
sysctls.
To disable fragment reassembly globally, set net.inet.ip.maxfrags to 0.
To disable fragment reassembly for a particular VNET, set
net.inet.ip.maxfragpackets to 0.
Reviewed by: jhb
Security: FreeBSD-SA-18:10.ip
Security: CVE-2018-6923
Currently, all IPv6 fragment reassembly queues are kept in a flat
linked list. This has a number of implications. Two significant
implications are: all reassembly operations share a common lock,
and it is possible for the linked list to grow quite large.
Improve IPv6 reassembly performance by hashing fragments into buckets,
each of which has its own lock. Calculate the hash key using a Jenkins
hash with a random seed.
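A hedged sketch of the bucket selection (jenkins_hash32() is the kernel's
<sys/hash.h> helper; the key layout and names here are illustrative, not
the actual frag6 code):

#include <sys/types.h>
#include <sys/hash.h>
#include <netinet/in.h>

struct frag6_hashkey {
    struct in6_addr src;
    struct in6_addr dst;
    uint32_t        ident;
};

static u_int
frag6_bucket(const struct frag6_hashkey *key, uint32_t seed, u_int nbuckets)
{
    uint32_t h;

    h = jenkins_hash32((const uint32_t *)key,
        sizeof(*key) / sizeof(uint32_t), seed);
    return (h & (nbuckets - 1));    /* nbuckets is a power of two */
}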
Reviewed by: jhb
Security: FreeBSD-SA-18:10.ip
Security: CVE-2018-6923
Currently, IPv4 fragments are hashed into buckets based on a 32-bit
key which is calculated by (src_ip ^ ip_id) and combined with a random
seed. However, because an attacker can control the values of src_ip
and ip_id, it is possible to construct an attack which causes very
deep chains to form in a given bucket.
To ensure more uniform distribution (and lower predictability for
an attacker), calculate the hash based on a key which includes all
the fields we use to identify a reassembly queue (dst_ip, src_ip,
ip_id, and the ip protocol) as well as a random seed.
Reviewed by: jhb
Security: FreeBSD-SA-18:10.ip
Security: CVE-2018-6923
We always zero the invalidated PTE/PDE for superpages, which means that
the L1TF CPU vulnerability (CVE-2018-3620) can only be used for reading
from the page at zero.
Note that both i386 and amd64 exclude the page from phys_avail[]
array, so this change is redundant, but I think that phys_avail[] on
UEFI-boot does not need to do that. Eventually the blacklisting
should be made conditional on CPUs which report that they are not
vulnerable to L1TF.
Reviewed by: emaste, jhb
Sponsored by: The FreeBSD Foundation
curpmap.
When performing a context switch on a machine without PCID, if the
current %cr3 equals the new pmap's %cr3, which is typical for
kernel_pmap vs. a kernel process, I overlooked updating the PCPU
curpmap value. Remove the check for %cr3 not being equal to pm_cr3
when doing the update. It is believed that this case cannot happen
at all, due to other changes in this revision.
Also, do not set the very first curpmap to kernel_pmap; it should be
the vmspace0 pmap instead, to match curproc.
Move the common code to activate the initial pmap both on BSP and APs
into pmap_activate_boot() helper.
Reported by: eadler, ambrisko
Discussed with: kevans
Reviewed by: alc, markj (previous version)
Tested by: ambrisko (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D16618
Rather than hard-coding the number of CPUs to 2, look up the PVPE field
in MVPConf0, as the valid VPE numbers are from 0 to PVPE inclusive.
Submitted by: "James Clarke" <jrtc4@cam.ac.uk>
Reviewed by: br
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D16644
At some point memcpy() may become an ifunc; ifunc resolution cannot be done
until CPU identification has been performed, and CPU identification must
be done after loading any microcode updates.
X-MFC with: r337715
Sponsored by: The FreeBSD Foundation
Trap reads to the arm64 ID registers and write a safe value into them. This
will allow us to put more useful values in these later and have userland
check them to find what features the hardware supports.
These are currently safe defaults, but will later be populated with better
values from the hardware.
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D16533
It was lost when tryforward appeared. Now ip[6]_tryforward will be enabled
only when sending redirects for the corresponding IP version is disabled via
sysctl. Otherwise the default forwarding function will be used.
PR: 221137
Submitted by: mckay@
MFC after: 2 weeks
r330951 by smh fixed the mps driver to avoid deadlocks when panicking.
The same code is needed for mpr, so port it here, along with the fix
which allows the scheduled CCBs to complete, avoiding at least a scary
message and likely other unintended consequences.
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D16663
When we're shutting down, we send a number of start/stop commands to
the known targets. We have to wait for them to complete. During a
panic, the interrupts are off, and using pause to wait for them to
fire and complete won't work: we have to poll after pause returns so
that the completion routines of the CCBs run and we decrement the
outstanding work counts.
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D16663
xpt_sim_poll takes the sim to poll as an argument. It will do the
proper locking protocol, call the SIM polling routine, and then call
camisr_runqueue to process completions on any CCBs the SIM's poll
routine completed. It will be used during late shutdown when a SIM is
waiting for CCBs it sent during shutdown to finish and the scheduler
isn't running because we've panic'd.
This sequence was used twice in cam_xpt, so refactor those to use this
new function.
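A sketch of the helper as described (the struct cam_sim field and function
names here are recalled from the CAM code and should be treated as
approximate, not verified):

void
xpt_sim_poll(struct cam_sim *sim)
{
    struct mtx *mtx;

    mtx = sim->mtx;             /* the SIM's lock; may be NULL */
    if (mtx != NULL)
        mtx_lock(mtx);
    (*sim->sim_poll)(sim);      /* run the SIM's polling routine */
    if (mtx != NULL)
        mtx_unlock(mtx);
    camisr_runqueue();          /* complete any CCBs the poll finished */
}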
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D16663
Move the evdev_ev_kbd_event() helper from evdev to kbd.c, as otherwise evdev
unconditionally requires all keyboard and console stuff to be compiled
into the kernel. This dependency arises because the evdev_ev_kbd_event()
helper references the kbdsw global variable defined in kbd.c through the
kbdd_ioctl() macro.
While here, make all keyboard drivers respect evdev_rcpt_mask while setting
the typematic rate and LEDs via the evdev interface.
Requested by: Milan Obuch <bsd@dino.sk>
Reviewed by: hselasky, gonzo
Differential Revision: https://reviews.freebsd.org/D16614
Now the softc should be retrieved from the struct evdev * pointer
with the evdev_get_softc() helper.
wmt(4) is a sample of a driver that supports both KPIs.
Reviewed by: hselasky, gonzo
Differential Revision: https://reviews.freebsd.org/D16614
Newer chips may require assert/deassert after power down for proper
startup. Check the respective flag in DEVIDLE_CTRL and perform the
operation if necessary.
PR: 221777
Submitted by: marc.priggemeyer@gmail.com
Obtained from: DragonFly BSD
Tested on: Thinkpad T470
Updates in the format described in section 9.11 of the Intel SDM can
now be applied as one of the first steps in booting the kernel. Updates
that are loaded this way are automatically re-applied upon exit from
ACPI sleep states, in contrast with the existing cpucontrol(8)-based
method. For the time being only Intel updates are supported.
Microcode update files are passed to the kernel via loader(8). The
file type must be "cpu_microcode" in order for the file to be recognized
as a candidate microcode update. Updates for multiple CPU types may be
concatenated together into a single file, in which case the kernel
will select and apply a matching update. Memory used to store the
update file will be freed back to the system once the update is applied,
so this approach will not consume more memory than required.
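As a usage sketch, the loader(8) side can be driven from loader.conf; the
variable names below are as documented in loader.conf(5), while the file
path is illustrative:

cpu_microcode_load="YES"
cpu_microcode_name="/boot/firmware/intel-ucode.bin"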
Reviewed by: kib
MFC after: 6 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D16370
If faultin() was called outside the swapper (from PHOLD()), do not allow
the swapper to initiate additional swap-ins. Swapper-initiated swap-ins
are serialized because they are synchronous and executed in the
context of thread0. With the added limitation, we only allow
parallel swap-ins from PHOLD(), which is up to PHOLD() users to
manage; usually they do not need to.
Rate-limit the swapper's swap-ins to one in each MAXSLP / 2 seconds
interval, counting faultin() swap-ins.
Suggested by: alc
Reviewed by: alc, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D16610
This is the output of
$ cat opcodes opcodes-rvc-pseudo opcodes-rvc opcodes-custom |
./parse-opcodes -c
It is confirmed by author that the output of parse-opcodes is
in the public domain.
This will be required for DDB disassembler.
Discussed with: Andrew Waterman <waterman@eecs.berkeley.edu>
Obtained from: https://github.com/riscv/riscv-opcodes
Sponsored by: DARPA, AFRL
LACP needs to manage the link state itself. Unlike other
lagg protocols, the ability of LACP to pass traffic
depends not only on the lagg members having link, but also
on the LACP protocol converging to a distributing state with the
link partner.
If we prematurely mark the link as up, then we will send a
gratuitous ARP (via arp_handle_ifllchange()) before the LACP
interface is capable of passing traffic. When this happens,
the gratuitous ARP is lost, and our link partner may cache
a stale MAC address (e.g., when the base MAC address for the
lagg bundle changes due to a BIOS change re-ordering NIC
unit numbers).
Reviewed by: jtl, hselasky
Sponsored by: Netflix
Fix a NULL dereference that would occur any time an ioctl() was done, due to a
missing ipmi_enqueue_request callback. Just use the default for now, until we
decide to properly enable IPMI interrupts.
Reported by: kbowling
TODO: KSTAT_TYPE_NAMED support
commit 5e021f56d3437d3523904652fe3cc23ea1f4cb70
Author: Giuseppe Di Natale <dinatale2@users.noreply.github.com>
Date: Mon Jan 29 10:24:52 2018 -0800
Add dbuf hash and dbuf cache kstats
Introduce kstats about the dbuf hash and dbuf cache
to make it easier to inspect state. This should help
with debugging and understanding of these portions
of the codebase.
Correct format of dbuf kstat file.
Introduce a dbc column to dbufs kstat to indicate if
a dbuf is in the dbuf cache.
Introduce field filtering in the dbufstat python script.
Introduce a no header option to the dbufstat python script.
Introduce a test case to test basic mru->mfu list movement
in the ARC.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #6906
commit fc5bb51f08a6c91ff9ad3559d0266eeeab0b1f61
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date: Thu Aug 26 10:52:00 2010 -0700
Fix stack dbuf_hold_impl()
This commit preserves the recursive function dbuf_hold_impl() but moves
the local variables and function arguments to the heap to minimize
the stack frame size. Enough space is initially allocated on the
stack for 20 levels of recursion. This technique was based on commit
34229a2f2ac07363f64ddd63e014964fff2f0671 which reduced stack usage of
traverse_visitbp().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
commit 60948de1ef976aabaa3630707bcc8b5867508507
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date: Thu Aug 26 10:58:36 2010 -0700
Fix stack noinline
Certain functions must never be automatically inlined by gcc because
they are stack heavy or called recursively. This patch flags all
such functions I've found as 'noinline' to prevent gcc from making
the optimization.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
commit 81edd3e83409218879e7af293daa86b0c40eb015
Author: Peng <peng.hse@xtaotech.com>
Date: Wed Jun 8 15:22:07 2016 +0800
Fix PANIC: metaslab_free_dva(): bad DVA X:Y:Z
The following scenario can result in garbage in the dn_spill field.
The db->db_blkptr must be set to NULL when DNODE_FLAG_SPILL_BLKPTR
is clear to ensure the dn_spill field is cleared.
Current txg = A.
* A new spill buffer is created. Its dbuf is initialized with
db_blkptr = NULL and it's dirtied.
Current txg = B.
* The spill buffer is modified. It's marked as dirty in this txg.
* Additional changes make the spill buffer unnecessary because the
xattr fits into the bonus buffer, so it's removed. The dbuf is
undirtied in this txg, but it's still referenced and cannot be
destroyed.
Current txg = C.
* Starts syncing of txg A
* dbuf_sync_leaf() is called for the spill buffer. Since db_blkptr
is NULL, dbuf_check_blkptr() is called.
* The dbuf starts being written and it reaches the ready state
(not done yet).
* A new change makes the spill buffer necessary again.
sa_build_layouts() ends up calling dbuf_find() to locate the
dbuf. It finds the old dbuf because it has not been destroyed yet
(it will be destroyed when the previous write is done and there
are no more references). The old dbuf has db_blkptr != NULL.
* txg A write is complete and the dbuf released. However it's still
referenced, so it's not destroyed.
Current txg = D.
* Starts syncing of txg B
* dbuf_sync_leaf() is called for the bonus buffer. Its contents are
directly copied into the dnode, overwriting the blkptr area because,
in txg B, the bonus buffer was big enough to hold the entire xattr.
* At this point, the db_blkptr of the spill buffer used in txg C
gets corrupted.
Signed-off-by: Peng <peng.hse@xtaotech.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3937
NB: disabled pending the addition of KSTAT_TYPE_RAW support to the
SPL
commit e0b0ca983d6897bcddf05af2c0e5d01ff66f90db
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date: Wed Oct 2 17:11:19 2013 -0700
Add visibility in to cached dbufs
Currently there is no mechanism to inspect which dbufs are being
cached by the system. There are some coarse counters in arcstats
but they only give a rough idea of what's being cached. This patch
aims to improve the current situation by adding a new dbufs kstat.
When read this new kstat will walk all cached dbufs linked in to
the dbuf_hash. For each dbuf it will dump detailed information
about the buffer. It will also dump additional information about
the referenced arc buffer and its related dnode. This provides a
more complete view in to exactly what is being cached.
With this generic infrastructure in place utilities can be written
to post-process the data to understand exactly how the caching is
working. For example, the data could be processed to show a list
of all cached dnodes and how much space they're consuming. Or a
similar list could be generated based on dnode type. Many other
ways to interpret the data exist based on what kinds of questions
you're trying to answer.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
commit 50c957f702ea6d08a634e42f73e8a49931dd8055
Author: Ned Bass <bass6@llnl.gov>
Date: Wed Mar 16 18:25:34 2016 -0700
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced from 33 to two: one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill blocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by a large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
Taken from:
commit f6046738365571bd647f804958dfdff8a32fbde4
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date: Sat May 30 09:57:53 2015 -0500
Make arc_prune() asynchronous
As described in the comment above arc_adapt_thread() it is critical
that the arc_adapt_thread() function never sleep while holding a hash
lock. This behavior was possible in the Linux implementation because
the arc_prune() logic was implemented to be synchronous. Under
illumos the analogous dnlc_reduce_cache() function is asynchronous.
To address this the arc_do_user_prune() function has been reworked
into two new functions as follows:
* arc_prune_async() is an asynchronous implementation which dispatches
the prune callback to be run by the system taskq. This makes it
suitable to use in the context of the arc_adapt_thread().
* arc_prune() is a synchronous implementation which depends on the
arc_prune_async() implementation but blocks until the outstanding
callbacks complete. This is used in arc_kmem_reap_now() where it
is safe, and expected, that memory will be freed.
This patch additionally adds the zfs_arc_meta_strategy module option
which allows the meta reclaim strategy to be configured. It defaults
to a balanced strategy which has been proven to work well under Linux
but the illumos meta-only strategy can be enabled.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
There used to be one control queue per adapter (the mgmtq) that was
initialized during adapter init and one per port that was initialized
later during port init. This change moves all the control queues (one
per port/channel) to the adapter so that they are initialized during
adapter init and are available before any port is up. This allows the
driver to issue ctrlq work requests over any channel without having to
bring up any port.
MFH: 2 weeks
Sponsored by: Chelsio Communications
In addition, import the most recent arc_prune_async implementation as a
dependency.
commit 25458cbef9e59ef9ee6a7e729ab2522ed308f88f
Author: Tim Chase <tim@chase2k.com>
Date: Wed Jul 13 07:42:40 2016 -0500
Limit the amount of dnode metadata in the ARC
Metadata-intensive workloads can cause the ARC to become permanently
filled with dnode_t objects as they're pinned by the VFS layer.
Subsequent data-intensive workloads may only benefit from about
25% of the potential ARC (arc_c_max - arc_meta_limit).
In order to help track metadata usage more precisely, the other_size
metadata arcstat has been replaced with dbuf_size, dnode_size and bonus_size.
The new zfs_arc_dnode_limit tunable, which defaults to 10% of
zfs_arc_meta_limit, defines the minimum number of bytes which is desirable
to be consumed by dnodes. Attempts to evict non-metadata will trigger
async prune tasks if the space used by dnodes exceeds this limit.
The new zfs_arc_dnode_reduce_percent tunable specifies the amount by
which the excess dnode space is attempted to be pruned as a percentage of
the amount by which zfs_arc_dnode_limit is being exceeded. By default,
it tries to unpin 10% of the dnodes.
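The pruning arithmetic amounts to something like this (a sketch; the
function name and exact rounding are assumed):

    #include <stdint.h>

    static uint64_t
    dnode_prune_target(uint64_t dnode_size, uint64_t dnode_limit,
        uint64_t reduce_percent)
    {
        if (dnode_size <= dnode_limit)
            return (0);
        /* Try to unpin reduce_percent (default 10%) of the excess. */
        return ((dnode_size - dnode_limit) * reduce_percent / 100);
    }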
The problem of dnode metadata pinning was observed with the following
testing procedure (in this example, zfs_arc_max is set to 4GiB):
- Create a large number of small files until arc_meta_used exceeds
arc_meta_limit (3GiB with default tuning) and arc_prune
starts increasing.
- Create a 3GiB file with dd. Observe arc_meta_used. It will still
be around 3GiB.
- Repeatedly read the 3GiB file and observe arc_meta_used as before.
It will continue to stay around 3GiB.
With this modification, space for the 3GiB file is gradually made
available as subsequent demands on the ARC are made. The previous behavior
can be restored by setting zfs_arc_dnode_limit to the same value as the
zfs_arc_meta_limit.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4345
Issue #4512
Issue #4773
Closes #4858
The pfi_skip_if() function only skipped groups correctly when the members
of the group used the group name as a name prefix.
This is often the case, e.g. group lo usually contains lo0, lo1, ...,
but not always.
Rather than relying on the name, explicitly check for group membership.
Obtained from: OpenBSD (pf_if.c,v 1.62, pf_if.c,v 1.63)
Sponsored by: Essen Hackathon
commit 417104bdd3c7ce07ec58674dd078f9891c3bc780
Author: Ned Bass <bass6@llnl.gov>
Date: Thu Feb 26 12:24:11 2015 -0800
Use cached feature info in spa_add_feature_stats()
Avoid issuing I/O to the pool when retrieving feature flags information.
Trying to read the ZAPs from disk means that zpool clear would hang if
the pool is suspended and recovery would require a reboot. To keep the
feature stats resident in memory, we hang a cached nvlist off of the
spa. It is built up from disk the first time spa_add_feature_stats() is
called, and refreshed thereafter using the cached feature reference
counts. spa_add_feature_stats() gets called at pool import time so we
can be sure the cached nvlist will be available if the pool is later
suspended.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3082
The idea was to get the uncontroversial mechanical change out of the way,
then get the meatier functional changes reviewed subsequently. I had not
realized that the immediately adjacent issue was addressed in a different
direction in r334506 (see Warner's guidance in D15592).
Discussion continues, trying to determine if there is a secondary issue
still[1] and how best to fix it. With 12-related activities coming up,
while that is ongoing, just take this back for now.
[1]: Shutdown-time eventhandler events fire normally during panic's reboot
path. Driver callbacks that attempt to issue and wait on interrupt-
completed IO may never complete, hanging the system. This is particularly
obnoxious in the shutdown/panic path, as the debugger cannot be entered
anymore and the hang prevents reboot restoring availability.
(There's nothing CAM-specific about this problem -- any shutdown
event-triggered driver could do something like this during panic. But most
NICs, etc. don't try to send spin-down commands at shutdown. ;-))
Discussed with: imp, markj
msgbufinit may be called multiple times as we initialize the msgbuf into a
progressively larger buffer. This doesn't happen as of now on head, but it
may happen in the future and we generally support this. As such, only print
the boot tag if we've just initialized the buffer for the first time.
The boot tag also now has a newline appended to it for better visibility,
and has been switched to a normal printf, by request of bde, after we've
denoted that the msgbuf is mapped.
Bring in the functionality for timespec_get from NetBSD. I've lightly
edited the .c file to remove _DIAGASSERT because FreeBSD doesn't have
that functionality and the typical #define'ing it to assert isn't
right here. The man page is verbatim from NetBSD, but will be revised
as part of a larger cleanup of the time man pages (they are
inconsistent and vague in all the wrong places).
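Usage is the standard C11 interface, for example:

    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    int
    main(void)
    {
        struct timespec ts;

        if (timespec_get(&ts, TIME_UTC) == TIME_UTC)
            printf("%jd.%09ld\n", (intmax_t)ts.tv_sec, ts.tv_nsec);
        return (0);
    }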
Differential Revision: https://reviews.freebsd.org/D16649
The comment above ieee80211_ageq_cleanup specifically notes that the queue
is assumed to be empty, and in order to make it so, ieee80211_ageq_drain
must be used.
Submitted by: Augustin Cavalier <waddlesplash@gmail.com>
Obtained from: Haiku (dffc3e235360cd7b71261239ee8507b7d62a1471)
MFC after: 1 week
MFV:
commit ee36c709c3d5f7040e1bd11f5c75318aa03e789f
Author: Gvozden Neskovic <neskovic@gmail.com>
Date: Sat Aug 27 20:12:53 2016 +0200
perf: 2.75x faster ddt_entry_compare()
First 256bits of ddt_key_t is a block checksum, which are expected
to be close to random data. Hence, on average, comparison only needs to
look at first few bytes of the keys. To reduce number of conditional
jump instructions, the result is computed as: sign(memcmp(k1, k2)).
Sign of an integer 'a' can be obtained as: `(0 < a) - (a < 0)` := {-1, 0, 1} ,
which is computed efficiently. Synthetic performance evaluation of
original and new algorithm over 1G random keys on 2.6GHz Intel(R) Xeon(R)
CPU E5-2660 v3:
old 6.85789 s
new 2.49089 s
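The branch-reducing trick looks roughly like this (a sketch, not the
in-tree ddt_entry_compare()):

    #include <string.h>

    static int
    key_compare(const void *k1, const void *k2, size_t len)
    {
        int r = memcmp(k1, k2, len);

        /* sign(r) as {-1, 0, 1}, without conditional jumps */
        return ((0 < r) - (r < 0));
    }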
perf: 2.8x faster vdev_queue_offset_compare() and vdev_queue_timestamp_compare()
Compute the result directly instead of using conditionals
perf: zfs_range_compare()
Speedup between 1.1x - 2.5x, depending on compiler version and
optimization level.
perf: spa_error_entry_compare()
`bcmp()` is not suitable for comparator use. Use `memcmp()` instead.
perf: 2.8x faster metaslab_compare() and metaslab_rangesize_compare()
perf: 2.8x faster zil_bp_compare()
perf: 2.8x faster mze_compare()
perf: faster dbuf_compare()
perf: faster compares in spa_misc
perf: 2.8x faster layout_hash_compare()
perf: 2.8x faster space_reftree_compare()
perf: libzfs: faster avl tree comparators
perf: guid_compare()
perf: dsl_deadlist_compare()
perf: perm_set_compare()
perf: 2x faster range_tree_seg_compare()
perf: faster unique_compare()
perf: faster vdev_cache_compare()
perf: faster vdev_uberblock_compare()
perf: faster fuid_compare()
perf: faster zfs_znode_hold_compare()
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Richard Elling <richard.elling@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#5033
The canonical form of sync is:
sync L, E (if Category Elemental Memory Barriers implemented)
The L field (2 bits) denotes the type of sync:
0 -- hwsync
1 -- lwsync
2 -- ptesync or hwsync
It's been found that most 32-bit CPUs designed prior to the introduction of
lwsync will ignore the L bits. However, some cores, particularly the e500 core,
will trigger an illegal instruction exception. Adding these variants will make
it easier to see which sync variant is actually being used in case of a trap.
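For reference, the common variants map onto the L field like this (a
FreeBSD-style sketch using the standard mnemonics; on cores that predate
lwsync, such as the e500, only plain sync is safe):

    static __inline void
    powerpc_hwsync(void)
    {
        __asm __volatile("sync" ::: "memory");      /* L = 0 */
    }

    static __inline void
    powerpc_lwsync(void)
    {
        __asm __volatile("lwsync" ::: "memory");    /* L = 1 */
    }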
in ipf_nat_checkout() and report it in the frb_natv4out and frb_natv4in
dtrace probes.
This is currently being used to diagnose NAT failures in PR/208566. It's
rather handy so this commit makes it available for future diagnosis and
debugging efforts.
PR: 208566
MFC after: 1 week
No functional change.
Note that this change is careful to set the CCB header xflags after
foo_fill_bar() routines, which generally zero existing flags. An earlier
version of this patch mistakenly set the flag before the fill routines.
Submitted by: Scott Ferris <sferris AT isilon.com>, jhibbits@
Reviewed by: bdrewery@, markj@, and non-committer FreeBSD contributor Anton Rang
Sponsored by: Dell EMC Isilon
Before r329882 the target would be computed after lowmem handlers run
and free pages. On some systems a significant amount of page
reclamation happens this way. However, with r329882 the target is
computed first, which can lead to unnecessary reclamation from the
page cache, and this in turn may result in excessive swapping.
Instead, adjust the target after running lowmem handlers. Don't
invoke the lowmem handlers before the PID controller, though, since
that would hide the true rate of page allocation.
Reviewed by: alc, kib (previous version)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D16606
BOOT_TAG lived shortly in sys/msgbuf.h, but this wasn't necessarily great
for changing it or removing it. Move it into subr_prf.c and add options for
it to opt_printf.h.
One can specify both the BOOT_TAG and BOOT_TAG_SZ (really, size of the
buffer that holds the BOOT_TAG). We expose it as kern.boot_tag and also add
a loader tunable by the same name that we'll fetch upon initialization of
the msgbuf.
This allows for flexibility and also ensures that there's a consistent way
to figure out the boot tag of the running kernel, rather than relying on
headers to be in-sync.
Prodded super-super-lightly by: imp
own region in the TCAM starting with T6, unlike previous chips where
they were in the same region as normal filters.
These filters "hit" before anything else in the LE's lookup. The exact
order is:
a) High priority filters
b) TOE's active region (TCAM and/or hash)
c) Servers (TOE hw listeners)
d) Normal filters
MFC after: 1 week
Sponsored by: Chelsio Communications
On PowerPC (and possibly other architectures) where EARLY_AP_STARTUP is
not used, the config task queue may be used before it is initialized.
This was observed while trying to mount the root fs from NFS, as
reported here: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=230168.
This patch has 2 main changes:
1- Perform a basic initialization of qgroup_config, similar to
what is done in taskqgroup_adjust, but simpler.
This makes qgroup_config ready to be used during NFS root mount.
2- When EARLY_AP_STARTUP is not used, call inm_init() and
in6m_init() right before SI_SUB_ROOT_CONF, because bootp needs
to send multicast packets to request an IP.
PR: Bug 230168
Reported by: sbruno
Reviewed by: jhibbits, mmacy, sbruno
Approved by: jhibbits
Differential Revision: https://reviews.freebsd.org/D16633
We use ps to collect information about all processes in textdump. But
it doesn't contain process arguments which however sometimes are very
useful for debugging. The new 'a' modifier adds that capability.
While here, remove 'm' modifier from ddb.4. It was in the manual page
from its very first revision, but I could not find any evidence of the
code ever supporting it.
Submitted by: Terry Hu <thu@panzura.com>
Reviewed by: kib
MFC after: 1 week
Sponsored by: Panzura
Differential Revision: https://reviews.freebsd.org/D16603
From the "newly licensed to drive" PR department, add a BOOT_TAG marker (by
default, --<<BOOT>>--, to the beginning of each boot's dmesg. This makes it
easier to do textproc magic to locate the start of each boot and, of
particular interest to some, the dmesg of the current boot.
The PR has a dmesg(8) component as well that I've opted not to include for
the moment- it was the more contentious part of this PR.
bde@ also made the statement that this boot tag should be written with an
ordinary printf, which I've, for the moment, declined to change in this
patch to keep it more transparent to observers of the boot process.
PR: 43434
Submitted by: dak <aurelien.nephtali@wanadoo.fr> (basically rewritten)
MFC after: maybe never
If OPAL_RTC_READ is busy and does not return the information on the first
call, returning OPAL_BUSY_EVENT instead, the system will crash since the
ymd and hmsm variables will contain junk values.
This happens because we were not calling OPAL_RTC_READ again after
OPAL_POLL_EVENTS returned, which would finally replace the old/junk hmsm
and ymd values.
The code was also mixing up OPAL_RTC_READ and OPAL_POLL_EVENTS return values.
This patch fixes this logic and guarantees that we call OPAL_RTC_READ again
after OPAL_POLL_EVENTS returns, and that the code only proceeds once
OPAL_RTC_READ returns OPAL_SUCCESS.
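A sketch of the fixed flow (assuming the opal_call() wrapper; argument
marshalling and endianness handling are elided, and the function name is
hypothetical):

    static int
    opal_rtc_read_retry(uint64_t ymd_pa, uint64_t hmsm_pa)
    {
        int rv;

        do {
            rv = opal_call(OPAL_RTC_READ, ymd_pa, hmsm_pa);
            if (rv == OPAL_BUSY_EVENT)
                opal_call(OPAL_POLL_EVENTS, 0);
            /* Retry OPAL_RTC_READ itself after polling so the junk
             * ymd/hmsm values are replaced before they are used. */
        } while (rv == OPAL_BUSY_EVENT);

        return (rv == OPAL_SUCCESS ? 0 : ENXIO);
    }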
Reviewed by: jhibbits
Approved by: jhibbits (mentor)
Differential Revision: https://reviews.freebsd.org/D16617
After a re-read of the appropriate section of RFC5661, I decided that a
few things should be changed related to LayoutRecall callback handling.
Here are the things fixed by this patch.
- For two of the three cases that LayoutRecall is done, I now think
setting the clora_changed argument false is correct.
- All errors other than NFSERR_DELAY returned by LayoutRecall appear
permanent, so don't retry for any of them. (NFSERR_DELAY is retried by
newnfs_request(), so it is not affected by this patch.)
- Instead of waiting "forever" (actually until the process is SIGTERM'd)
for Layouts to be returned during a mirror copy, fail and return
ENXIO after about 1 minute.
Waiting for a <ctrl>C made sense when pnfsdscopymr() was done by itself,
but did not make sense when done via find(1).
This patch only affects the pNFS server.
add support for explicitly requesting that pmap_enter() create a 1 MB page
mapping. (Essentially, this feature allows the machine-independent layer
to create superpage mappings preemptively, and not wait for automatic
promotion to occur.)
Export pmap_ps_enabled() to the machine-independent layer.
Add a flag to pmap_pv_insert_pte1() that specifies whether it should fail
or reclaim a PV entry when one is not available.
Refactor pmap_enter_pte1() into two functions, one by the same name, that
is a general-purpose function for creating pte1 mappings, and another,
pmap_enter_1mpage(), that is used to prefault 1 MB read- and/or execute-
only mappings for execve(2), mmap(2), and shmat(2).
In addition, as an optimization to pmap_enter(..., psind=0), eliminate the
use of pte2_is_managed() from pmap_enter(). Unlike the x86 pmap
implementations, armv6 does not have a managed bit defined within the PTE.
So, pte2_is_managed() is actually a call to PHYS_TO_VM_PAGE(), which is O(n)
in the number of vm_phys_segs[]. All but one call to PHYS_TO_VM_PAGE() in
pmap_enter() can be avoided.
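The machine-independent layer can then request the large mapping up front,
along the lines of (fragment; the flag combination shown is illustrative):

    int rv;

    /* psind == 1 asks pmap_enter() for a 1 MB (pte1) mapping directly
     * instead of waiting for automatic promotion. */
    if (pmap_ps_enabled(pmap))
        rv = pmap_enter(pmap, va, m, prot,
            PMAP_ENTER_NOSLEEP | PMAP_ENTER_NOREPLACE, 1);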
Reviewed by: kib, markj, mmel
Tested by: mmel
MFC after: 6 weeks
Differential Revision: https://reviews.freebsd.org/D16555
These were found by the Undefined Behaviour GsoC project at NetBSD:
Do not change the signedness bit with a left shift.
While there, avoid signed integer overflow.
Address both issues by using an unsigned type.
msdosfs_fat.c:512:42, left shift of 1 by 31 places cannot be represented
in type 'int'
msdosfs_fat.c:521:44, left shift of 1 by 31 places cannot be represented
in type 'int'
msdosfs_fat.c:744:14, left shift of 1 by 31 places cannot be represented
in type 'int'
msdosfs_fat.c:744:24, signed integer overflow: -2147483648 - 1 cannot be
represented in type 'int [20]'
msdosfs_fat.c:840:13, left shift of 1 by 31 places cannot be represented
in type 'int'
msdosfs_fat.c:840:36, signed integer overflow: -2147483648 - 1 cannot be
represented in type 'int [20]'
Detected with micro-UBSan in user mode.
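The fix pattern is simply to shift an unsigned operand (illustrative shape,
not the exact msdosfs source):

    #include <stdint.h>

    static inline uint32_t
    fat_bit_mask(unsigned bit)
    {
        /* was: 1 << bit -- undefined when bit == 31 for a signed int */
        return (1U << bit);
    }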
Hinted from: NetBSD (CVS 1.33)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D16615
Do not allow creating more than EXT4_LINK_MAX links to a directory when
dir_nlink is not set, as is done in the fresh e2fsprogs updates.
MFC after: 3 months
The checksum updating functions were not called when a dir index inode was
split, or when a dir entry that was first in its block was removed.
Fix this, and move the dir entry adding logic for the i_count == 0 case
into a new function.
MFC after: 3 months
Before swp_pager_meta_build replaces an old swapblk with a new one,
it frees the old one. To allow such freeing of blocks to be
aggregated, have swp_pager_meta_build return the old swap block, and
make the caller responsible for freeing it.
Define a pair of short static functions, swp_pager_init_freerange and
swp_pager_update_freerange, to do the initialization and updating of
blk addresses and counters used in aggregating blocks to be freed.
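A minimal sketch of the aggregation pattern (signatures assumed; the real
helpers live in vm/swap_pager.c and free via swp_pager_freeswapspace()):

    static void
    swp_pager_init_freerange(daddr_t *start, daddr_t *num)
    {
        *start = SWAPBLK_NONE;    /* no pending run yet */
        *num = 0;
    }

    static void
    swp_pager_update_freerange(daddr_t *start, daddr_t *num, daddr_t blk)
    {
        if (*start + *num == blk) {
            (*num)++;             /* extend the contiguous run */
        } else {
            if (*start != SWAPBLK_NONE)
                swp_pager_freeswapspace(*start, *num);
            *start = blk;
            *num = 1;
        }
    }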
Submitted by: Doug Moore <dougm@rice.edu>
Reviewed by: kib, markj (an earlier version)
Tested by: pho
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D13707
traffic class for rate limiting.
Add experimental knobs that allow the user to specify a default pktsize
and burstsize for traffic classes associated with a port:
dev.<ifname>.<instance>.tc.pktsize
dev.<ifname>.<instance>.tc.burstsize
Sponsored by: Chelsio Communications
The code in newnfs_request() retries RPCs that get a reply of NFSERR_DELAY,
but exempts certain NFSv4 operations. However, for callback RPCs, there
should not be any exemptions at this time. The code would have erroneously
exempted the CBRECALL callback, since it has the same operation number as
the CLOSE operation.
This patch fixes this by checking for a callback RPC (indicated by clp != NULL)
and not checking for exempt operations for callbacks.
This would have only affected the NFSv4 server when delegations are enabled
(they are not enabled by default) and the client replies to CBRECALL with
NFSERR_DELAY. This may never actually happen.
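The shape of the check is roughly (sketch; the exemption helper name is
hypothetical):

    if (nd->nd_repstat == NFSERR_DELAY &&
        (clp != NULL || !nfsv4_op_is_exempt(op))) {
        /* Callback RPCs (clp != NULL) are never exempt, so a
         * CBRECALL reply of NFSERR_DELAY is retried here. */
        goto tryagain;
    }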
Spotted during code inspection.
MFC after: 2 weeks
If a recvmsg(2) or recvmmsg(2) caller doesn't provide sufficient space
for all control messages, the kernel sets MSG_CTRUNC in the message
flags to indicate truncation of the control messages. In the case
of SCM_RIGHTS messages, however, we were failing to dispose of the
rights that had already been externalized into the recipient's file
descriptor table. Add a new function and mbuf type to handle this
cleanup task, and use it any time we fail to copy control messages
out to the recipient. To simplify cleanup, control message truncation
is now only performed at control message boundaries.
The change also fixes a few related bugs:
- Rights could be leaked to the recipient process if an error occurred
while copying out a message's contents.
- We failed to set MSG_CTRUNC if the truncation occurred on a control
message boundary, e.g., if the caller received two control messages
and provided only the exact amount of buffer space needed for the
first.
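From userspace, the truncation is visible like this (a minimal sketch):

    #include <sys/socket.h>
    #include <stdio.h>

    static void
    recv_and_check(int s)
    {
        char data[128];
        char cbuf[CMSG_SPACE(sizeof(int))];    /* room for one fd only */
        struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
        struct msghdr msg = {
            .msg_iov = &iov, .msg_iovlen = 1,
            .msg_control = cbuf, .msg_controllen = sizeof(cbuf),
        };

        if (recvmsg(s, &msg, 0) >= 0 && (msg.msg_flags & MSG_CTRUNC))
            fprintf(stderr, "control messages truncated\n");
    }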
PR: 131876
Reviewed by: ed (previous version)
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D16561
The VGA "text mode" buffer has a pair of bytes for each character: One
byte for the character symbol, and an "attribute" byte encoding the
foreground and background colours. When updating the screen, we were
writing these two bytes separately.
On some virtualized systems, every write results in a glyph being redrawn
into a (graphical) virtual screen; writing these two bytes separately
results in twice as much work being done to draw characters, whereas if
we perform a single 16-bit write instead, the character only needs to be
redrawn once.
On an EC2 c5.4xlarge instance, this change cuts 1.30s from the kernel boot,
speeding it up from 8.90s to 7.60s.
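The idea in miniature (geometry and address are the classic VGA defaults,
shown for illustration only):

    #include <stdint.h>

    #define VGA_COLS 80
    static volatile uint16_t *vga =
        (volatile uint16_t *)(uintptr_t)0xb8000;

    static void
    vga_putc(int row, int col, uint8_t ch, uint8_t attr)
    {
        /* One 16-bit store: attribute in the high byte, glyph in the low. */
        vga[row * VGA_COLS + col] = (uint16_t)attr << 8 | ch;
    }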
MFC after: 1 week
is defined in sys/socket.h, where its value is 28.
A bit of trivia: On NetBSD AF_INET6 is defined as 24. On Solaris it is
defined as 26. This is probably why Darren defaulted to 26, because
ipfilter was originally written for SunOS 4 and Solaris many moons ago.
MFC after: 2 weeks
The <sys/cdefs.h> and <stdatomic.h> headers already included support for
C11 atomics via intrinsics in modern versions of GCC, but these versions
tried to "hide" atomic variables inside a wrapper structure. This wrapper
is not compatible with GCC's internal <stdatomic.h> header, so that if
GCC's <stdatomic.h> was used together with <sys/cdefs.h>, use of C11
atomics would fail to compile. Fix this by not hiding atomic variables
in a structure for modern versions of GCC. The headers already avoid
using a wrapper structure on clang.
Note that this wrapper was only used if C11 was not enabled (e.g.
via -std=c99). Compiling with -std=c11 but with FreeBSD's <stdatomic.h>
instead of GCC's <stdatomic.h> also failed, and this change fixes that
case as well.
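To illustrate the failure mode: with the struct wrapper in place, ordinary
C11 code like the following would not compile once GCC's own <stdatomic.h>
was picked up, because the generic atomic_*() macros expect a real
_Atomic-qualified object rather than a struct:

    #include <stdatomic.h>

    static _Atomic(int) counter;

    int
    bump(void)
    {
        return (atomic_fetch_add(&counter, 1));
    }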
Reported by: Mark Millard
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D16585
them display the current value of the bitfield rather than the fixed
value that was provided when the sysctl node was created.
MFC after: 1 week
Sponsored by: Chelsio Communications
a smaller number of larger TRIM requests. The hope had been to have
the full TRIM consolidation in place for 12.0, but the algorithms
are still under development and need further testing. With this
framework in place it will be possible to easily add TRIM consolidation
once the optimal strategy has been found.
The only functional change with this patch is the elimination of TRIM
requests for blocks that are freed before they are likely to have
been written.
Reviewed by: kib
Discussed with: Warner Losh and Chuck Silvers
Sponsored by: Netflix
Currently, the per-queue limit is a function of the receive buffer
size and the MSS. In certain cases (such as connections with large
receive buffers), the per-queue segment limit can be quite large.
Because we process segments as a linked list, large queues may not
perform acceptably.
The better long-term solution is to make the queue more efficient.
But, in the short-term, we can provide a way for a system
administrator to set the maximum queue size.
We set the default queue limit to 100. This is an effort to balance
performance with a sane resource limit. Depending on their
environment, goals, etc., an administrator may choose to modify this
limit in either direction.
Reviewed by: jhb
Approved by: so
Security: FreeBSD-SA-18:08.tcp
Security: CVE-2018-6922
Now that aw_sid exposes the nvmem interface, use it to read the calibration
data.
Add support for the H5 SoC.
Fix the bindings: we used to have non-upstreamed bindings. Switch to the
ones that have been sent upstream. They are not stable yet, so we switch
from custom, wrong bindings to correct, proposed ones.
Rework aw_sid so it can work with the nvmem interface.
Each SoC exposes a set of fuses (for now the rootkey/boardid and, if
available, the thermal calibration data). A fuse can be private or public;
reading a private fuse needs to be done via some registers instead of
reading it directly.
Each fuse is exposed as a sysctl.
For now, leave the possibility for a driver to read any fuse without using
the nvmem interface, as the awg and emac drivers use this to generate a MAC
address.
At least on x86, fhandle_t is a packed structure, so I believe an
assignment will copy all the bits. However, for some current/future
architectures, there might be padding in the structure that doesn't get
copied via an assignment.
Since NFS assumes a file handle is an opaque blob of bits that can be
compared via memcmp()/bcmp(), all the bits including any padding must be
copied.
This patch replaces the assignments with a call to a byte copy function.
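Sketch of the change (FreeBSD declares fhandle_t in <sys/mount.h>):

    #include <string.h>
    #include <sys/param.h>
    #include <sys/mount.h>

    static void
    fh_copy(fhandle_t *dst, const fhandle_t *src)
    {
        /* was: *dst = *src; -- may skip padding bits on some arches */
        memcpy(dst, src, sizeof(*dst));
    }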
Spotted during code inspection.