This unbreaks the i386 kqueue timer tests after a recent change switched
NOTE_ABSTIME over to using microseconds. Notably, the data argument (which
holds useconds) is an int64_t, but we were passing it to timer2sbintime
which takes an intptr_t. Perhaps in a previous incarnation, intptr_t would
have made sense, but now it just leads to the timestamp getting truncated
and subsequently rejected when it no longer fits in an intptr_t.
PR: 245768
Reported by: lwhsu / CI
MFC after: 1 week
The arm_physmem interface found in arm's MD code provides a convenient
set of routines for adding/excluding physical memory regions and
initializing important kernel globals such as Maxmem, realmem,
phys_avail[], and dump_avail[]. It is especially convenient for FDT
systems, since we can use FDT parsing functions and pass the result
directly to one of these physmem routines. This interface is already in
use on arm and arm64, and can be used to simplify this early
initialization on RISC-V as well.
This requires only a couple trivial changes:
- Move arm_physmem_kernel_addr to arm/machdep.c. It is unused on arm64,
and manipulated entirely in arm MD code.
- Convert arm32_btop/arm64_btop to atop. This is equivalently defined
on all architectures.
- Drop the "arm" prefix.
Reviewed by: manu, emaste ("looks reasonable")
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D24153
If we create two (vnet) jails and create a bridge interface in each we end up
with the same mac address on both bridge interfaces.
These very often conflicts, resulting in same mac address in both jails.
Mitigate this problem by including the jail name in the mac address.
Reviewed by: kevans, melifaro
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D24383
It must contain fully restored contigous run of the wired pages from
the object, except possible trimmed tail.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
It is true for zfs, but it is not for e.g. vnode or buffer pagers.
When fixing bogus pages, fix them in both places. Rely on the fact
that pa[0] must have been invalid so it cannot be bogus.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
A later change, currently being iterated on in D24459, will in-fact change
the lock type to an sx so that TTY drivers can sleep on it if they need to.
Committing this ahead of time to make the review in question a little more
palatable.
tty_lock_assert() is unfortunately still needed for now in two places to
make sure that the tty lock has not been recursed upon, for those scenarios
where it's supplied by the TTY driver and possibly a mutex that is allowed
to recurse.
Suggested by: markj
Use AUXARGS_ENTRY_PTR to export these pointers. This is a followup to
r359987 and r359988.
Reviewed by: jhb
Obtained from: CheriBSD
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D24446
prison0's hostuuid will get set by the hostid rc script, either after
generating it and saving it to /etc/hostid or by simply reading /etc/hostid.
Some things (e.g. arbitrary MAC address generation) may use the hostuuid as
a factor in early boot, so providing a way to read /etc/hostid (if it's
available) and using it before userland starts up is desirable. The code is
written such that the preload doesn't *have* to be /etc/hostid, thus not
assuming that there will be newline at the end of the buffer or even the
exact shape of the newline. White trailing whitespace/non-printables
trimmed, the result will be validated as a valid uuid before it's used for
early boot purposes.
The preload can be turned off with hostuuid_load="NO" in /boot/loader.conf,
just as other preloads; it's worth noting that this is a 37-byte file, the
overhead is believed to be generally minimal.
It doesn't seem necessary at this time to be concerned with kern.hostid.
One does wonder if we should consider validating hostuuids coming in
via jail_set(2); some bits seem to care about uuid form and we bother
validating format of smbios-provided uuid and in-fact whatever uuid comes
from /etc/hostid.
Reviewed by: karels, delphij, jamie
MFC after: 1 week (don't preload by default, probably)
Differential Revision: https://reviews.freebsd.org/D24288
This simplifies discovery of these values, potentially with reducing the
number of syscalls we need to make at runtime. Longer term, we wish to
convert the startup process to pass an auxargs pointer to _start() and
use that rather than walking off the end of envv. This is cleaner,
more C-friendly, and for systems with strong bounds (e.g. CHERI)
necessary.
Reviewed by: kib
Obtained from: CheriBSD
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D24407
This makes the naming annoyance (validate_uuid vs. parse_uuid) less of an
issue and centralizes all of the functionality into the new KPI while still
making the extra validation optional. The end-result is all the same as far
as hostuuid validation-only goes.
Hide debug print showing use of sign extended ioctl command argument
under INVARIANTS. The print is available to all and can easily fill
up the logs.
No functional change intended.
MFC after: 1 week
Sponsored by: Mellanox Technologies
This new KPI, validate_uuid, strictly validates the formatting of the input
UUID and, optionally, populates a given struct uuid.
As noted in the header, the key differences are that the new KPI won't
recognize an empty string as a nil UUID and it won't do any kind of semantic
validation on it. Also key is that populating a struct uuid is optional, so
the caller doesn't necessarily need to allocate a bogus one on the stack
just to validate the string.
This KPI has specifically been broken out in support of D24288, which will
preload /etc/hostid in loader so that early boot hostuuid users (e.g.
anything that calls ether_gen_addr) can have a valid hostuuid to work with
once it's been stashed in /etc/hostid.
There's no point in pre-checking that we can access the user's rmtp
pointer before we do it in copyout().
While here, improve style(9) compliance.
Reviewed by: imp
MFC after: 1 week
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D24409
Copy the CP, PTRIN, etc macros from freebsd32.h into a sys/abi_compat.h
and replace existing definitation with includes where required. This
eliminates duplicate code and allows Linux and FreeBSD compatability
headers to be included in the same files.
Input from: cem, jhb
Obtained from: CheriBSD
MFC after: 2 weeks
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D24275
Include a temporarily compatibility shim as well for kernels predating
close_range, since closefrom is used in some critical areas.
Reviewed by: markj (previous version), kib
Differential Revision: https://reviews.freebsd.org/D24399
sonewconn() emits debug-level messages when a listen socket's queue
overflows. Currently, sonewconn() tracks overflows on a global basis. It
will only log one message every 60 seconds, regardless of how many sockets
experience overflows. And, when it next logs at the end of the 60 seconds,
it records a single message referencing a single PCB with the total number
of overflows across all sockets.
This commit changes to per-socket overflow tracking. The code will now
log one message every 60 seconds per socket. And, the code will provide
per-socket queue length and overflow counts. It also provides a way to
change the period between log messages using a sysctl.
Reviewed by: jhb (previous version), bcr (manpages)
MFC after: 2 weeks
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D24316
When a socket's listen queue overflows, sonewconn() emits a debug-level
log message. These messages are sometimes useful to systems administrators
in highlighting a process which is not keeping up with its listen queue.
This commit attempts to enhance the usefulness of this message by printing
more details about the socket's address. If all else fails, it will at
least print the domain name of the socket.
Reviewed by: bz, jhb, kbowling
MFC after: 2 weeks
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D24272
While the original implementation of unmapped mbufs was a large
step forward in terms of reducing cache misses by enabling mbufs
to carry more than a single page for sendfile, they are rather
cache unfriendly when accessing the ext_pgs metadata and
data. This is because the ext_pgs part of the mbuf is allocated
separately, and almost guaranteed to be cold in cache.
This change takes advantage of the fact that unmapped mbufs
are never used at the same time as pkthdr mbufs. Given this
fact, we can overlap the ext_pgs metadata with the mbuf
pkthdr, and carry the ext_pgs meta directly in the mbuf itself.
Similarly, we can carry the ext_pgs data (TLS hdr/trailer/array
of pages) directly after the existing m_ext.
In order to be able to carry 5 pages (which is the minimum
required for a 16K TLS record which is not perfectly aligned) on
LP64, I've had to steal ext_arg2. The only user of this in the
xmit path is sendfile, and I've adjusted it to use arg1 when
using unmapped mbufs.
This change is almost entirely mechanical, except that we
change mb_alloc_ext_pgs() to no longer allow allocating
pkthdrs, the change to avoid ext_arg2 as mentioned above,
and the removal of the ext_pgs zone,
This change saves roughly 2% "raw" CPU (~59% -> 57%), or over
3% "scaled" CPU on a Netflix 100% software kTLS workload at
90+ Gb/s on Broadwell Xeons.
In a follow-on commit, I plan to remove some hacks to avoid
access ext_pgs fields of mbufs, since they will now be in
cache.
Many thanks to glebius for helping to make this better in
the Netflix tree.
Reviewed by: hselasky, jhb, rrs, glebius (early version)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D24213
Similar to mmap'ing vnodes, posixshm should count any mapping where maxprot
contains VM_PROT_WRITE (i.e. fd opened r/w with no write-seal applied) as
writable and thus blocking of any write-seal.
The memfd tests have been amended to reflect the fixes here, which notably
includes:
1. Fix for error return bug; EPERM is not a documented failure mode for mmap
2. Fix rejection of write-seal with active mappings that can be upgraded via
mprotect(2).
Reported by: markj
Discussed with: markj, kib
Previously the unpcb pointer of the newly connected remote socket was
not initialized correctly, so attempting to lock it would result in a
null pointer dereference.
Reported by: syzkaller
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
When creating a private mapping of a POSIX shared memory object,
VM_PROT_WRITE should always be included in maxprot regardless of
permissions on the underlying FD. Otherwise it is possible to open a
shm object read-only, map it with MAP_PRIVATE and PROT_WRITE, and
violate the invariant in vm_map_insert() that (prot & maxprot) == prot.
Reported by: syzkaller
Reviewed by: kevans, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D24398
close_range will clamp the range between [0, fdp->fd_lastfile], but failed
to take into account that fdp->fd_lastfile can become -1 if all fds are
closed. =-( In this scenario, just return because there's nothing further we
can do at the moment.
Add a test case for this, fork() and simply closefrom(0) twice in the child;
on the second invocation, fdp->fd_lastfile == -1 and will trigger a panic
before this change.
X-MFC-With: r359836
close_range(min, max, flags) allows for a range of descriptors to be
closed. The Python folk have indicated that they would much prefer this
interface to closefrom(2), as the case may be that they/someone have special
fds dup'd to higher in the range and they can't necessarily closefrom(min)
because they don't want to hit the upper range, but relocating them to lower
isn't necessarily feasible.
sys_closefrom has been rewritten to use kern_close_range() using ~0U to
indicate closing to the end of the range. This was chosen rather than
requiring callers of kern_close_range() to hold FILEDESC_SLOCK across the
call to kern_close_range for simplicity.
The flags argument of close_range(2) is currently unused, so any flags set
is currently EINVAL. It was added to the interface in Linux so that future
flags could be added for, e.g., "halt on first error" and things of this
nature.
This patch is based on a syscall of the same design that is expected to be
merged into Linux.
Reviewed by: kib, markj, vangyzen (all slightly earlier revisions)
Differential Revision: https://reviews.freebsd.org/D21627
If LOCAL_CREDS is set on a unix socket and sendfile() is called,
sendfile will call uipc_send(PRUS_NOTREADY), prepending a control
message to the M_NOTREADY mbufs. uipc_send() then calls
sbappendcontrol() instead of sbappend(), and sbappendcontrol() would
erroneously clear M_NOTREADY.
Pass send flags to sbappendcontrol(), like we do for sbappend(), to
preserve M_READY when necessary.
Reported by: syzkaller
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D24333
When transmitting over a unix socket, data is placed directly into the
receiving socket's receive buffer, instead of the transmitting socket's
send buffer. This means that when pru_ready is called during
sendfile(), the passed socket does not contain M_NOTREADY mbufs in its
buffers; uipc_ready() must locate the linked socket.
Currently uipc_ready() frees the mbufs if the socket is disconnected,
but this is wrong since the mbufs may still be present in the receiving
socket's buffer after a disconnect. This can result in a use-after-free
and potentially a double free if the receive buffer is flushed after
uipc_ready() frees the mbufs.
Fix the problem by trying harder to locate the correct socket buffer and
calling sbready(): use the global list of SOCK_STREAM unix sockets to
search for a sockbuf containing the now-ready mbufs. Only free the
mbufs if we fail this search.
Reviewed by: jah, kib
Reported and tested by: pho
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D24332
This NFS lock device driver was replaced by the kernel NLM around FreeBSD7 and
has not normally been used since then.
To use it, the kernel had to be built without "options NFSLOCKD" and
the nfslockd.ko had to be deleted as well.
Since it uses Giant and is no longer used, this patch removes it.
With this device driver removed, there is now a lot of unused code
in the userland rpc.lockd. That will be removed on a future commit.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D22933
Also, print a little more information for otherwise unhandled inhibited states.
Finally, improve the grammar of some prints. Some of the print statements
missing verb.
Sponsored by: Dell EMC Isilon
Modern debuggers and process tracers use ptrace() rather than procfs
for debugging. ptrace() has a supserset of functionality available
via procfs and new debugging features are only added to ptrace().
While the two debugging services share some fields in struct proc,
they each use dedicated fields and separate code. This results in
extra complexity to support a feature that hasn't been enabled in the
default install for several years.
PR: 244939 (exp-run)
Reviewed by: kib, mjg (earlier version)
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D23837
KTLS uses the embedded header and trailer fields of unmapped
mbufs. This can lead to "silly" buffer lengths, where we have an
mbuf chain that will create a scatter/gather lists with a
regular pattern of 13 bytes followed by 16 bytes between each
adjacent TLS record.
For software ktls we typically wind up with a pattern where we
have several TLS records encrypted, and made ready at once. When
these records are made ready, we can coalesce these silly buffers
in sbready_compress by copying 13b TLS header of the next record
into the 16b TLS trailer of the current record. After doing so,
we now have a small 29 byte chunk between each TLS record.
This marginally increases PCIe bus efficiency. We've seen an
almost 1Gb/s increase in peak throughput on Broadwell based Xeons
running a 100% software TLS workload with Mellanox ConnectX-4
NICs.
Note that this change is ifdef'ed for KTLS, as KTLS is currently
the only user of the hdr/trailer feature of unmapped mbufs, and
peeking into them is expensive, since the ext_pgs struct lives in
separately allocated memory, and may be cold in cache.
This optimization is not applicable to HW ("NIC") TLS, as that
depends on having the entire TLS record described by a single
unmapped mbuf, so we cannot shift parts of the record between
mbufs for HW TLS.
Reviewed by: jhb, hselasky, scottl
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D24204
- Do not call into a vnode pager while leaving some pages from the
same block as the current run, xbusy. This immediately deadlocks if
pager needs to instantiate the buffer.
- Only relookup bogus pages after io finished, otherwise we might
obliterate the valid pages by out of date disk content. While there,
expand the comment explaining this pecularity.
- Do not double-unbusy on error. Split unbusy for error case, which
is left in the sendfile_swapin(), from the more properly coded
normal case in sendfile_iodone().
- Add an XXXKIB comment explaining the serious bug in the validation
algorithm, not fixed by this patch series.
PR: 244713
Reviewed by: glebius, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D24038
It is already done by sendfile_iodone(), now consistently for all errors.
This de-facto reverts r358597, after r359466.
Reviewed by: glebius, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D24038
Now sfio leaks are more easily seen in the malloc statistics than
e.g. just wired or busy pages leak.
Reviewed by: glebius, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D24038
taskqgroup initialization was broken into two steps:
1. allocate the taskqgroup structure, at SI_SUB_TASKQ;
2. initialize taskqueues, start taskqueue threads, enqueue "binder"
tasks to bind threads to specific CPUs, at SI_SUB_SMP.
Step 2 tries to handle the case where tasks have already been attached
to a queue, by migrating them to their intended queue. In particular,
tasks can't be enqueued before step 2 has completed. This breaks NFS
mountroot on systems using an iflib-based driver when EARLY_AP_STARTUP
is not defined, since mountroot happens before SI_SUB_SMP in this case.
Simplify initialization: do all initialization except for CPU binding at
SI_SUB_TASKQ. This means that until CPU binding is completed, group
tasks may be executed on a CPU other than that to which they were bound,
but this should not be a problem for existing users of the taskqgroup
KPIs.
Reported by: sbruno
Tested by: bdragon, sbruno
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D24188
- The linked list of cryptoini structures used in session
initialization is replaced with a new flat structure: struct
crypto_session_params. This session includes a new mode to define
how the other fields should be interpreted. Available modes
include:
- COMPRESS (for compression/decompression)
- CIPHER (for simply encryption/decryption)
- DIGEST (computing and verifying digests)
- AEAD (combined auth and encryption such as AES-GCM and AES-CCM)
- ETA (combined auth and encryption using encrypt-then-authenticate)
Additional modes could be added in the future (e.g. if we wanted to
support TLS MtE for AES-CBC in the kernel we could add a new mode
for that. TLS modes might also affect how AAD is interpreted, etc.)
The flat structure also includes the key lengths and algorithms as
before. However, code doesn't have to walk the linked list and
switch on the algorithm to determine which key is the auth key vs
encryption key. The 'csp_auth_*' fields are always used for auth
keys and settings and 'csp_cipher_*' for cipher. (Compression
algorithms are stored in csp_cipher_alg.)
- Drivers no longer register a list of supported algorithms. This
doesn't quite work when you factor in modes (e.g. a driver might
support both AES-CBC and SHA2-256-HMAC separately but not combined
for ETA). Instead, a new 'crypto_probesession' method has been
added to the kobj interface for symmteric crypto drivers. This
method returns a negative value on success (similar to how
device_probe works) and the crypto framework uses this value to pick
the "best" driver. There are three constants for hardware
(e.g. ccr), accelerated software (e.g. aesni), and plain software
(cryptosoft) that give preference in that order. One effect of this
is that if you request only hardware when creating a new session,
you will no longer get a session using accelerated software.
Another effect is that the default setting to disallow software
crypto via /dev/crypto now disables accelerated software.
Once a driver is chosen, 'crypto_newsession' is invoked as before.
- Crypto operations are now solely described by the flat 'cryptop'
structure. The linked list of descriptors has been removed.
A separate enum has been added to describe the type of data buffer
in use instead of using CRYPTO_F_* flags to make it easier to add
more types in the future if needed (e.g. wired userspace buffers for
zero-copy). It will also make it easier to re-introduce separate
input and output buffers (in-kernel TLS would benefit from this).
Try to make the flags related to IV handling less insane:
- CRYPTO_F_IV_SEPARATE means that the IV is stored in the 'crp_iv'
member of the operation structure. If this flag is not set, the
IV is stored in the data buffer at the 'crp_iv_start' offset.
- CRYPTO_F_IV_GENERATE means that a random IV should be generated
and stored into the data buffer. This cannot be used with
CRYPTO_F_IV_SEPARATE.
If a consumer wants to deal with explicit vs implicit IVs, etc. it
can always generate the IV however it needs and store partial IVs in
the buffer and the full IV/nonce in crp_iv and set
CRYPTO_F_IV_SEPARATE.
The layout of the buffer is now described via fields in cryptop.
crp_aad_start and crp_aad_length define the boundaries of any AAD.
Previously with GCM and CCM you defined an auth crd with this range,
but for ETA your auth crd had to span both the AAD and plaintext
(and they had to be adjacent).
crp_payload_start and crp_payload_length define the boundaries of
the plaintext/ciphertext. Modes that only do a single operation
(COMPRESS, CIPHER, DIGEST) should only use this region and leave the
AAD region empty.
If a digest is present (or should be generated), it's starting
location is marked by crp_digest_start.
Instead of using the CRD_F_ENCRYPT flag to determine the direction
of the operation, cryptop now includes an 'op' field defining the
operation to perform. For digests I've added a new VERIFY digest
mode which assumes a digest is present in the input and fails the
request with EBADMSG if it doesn't match the internally-computed
digest. GCM and CCM already assumed this, and the new AEAD mode
requires this for decryption. The new ETA mode now also requires
this for decryption, so IPsec and GELI no longer do their own
authentication verification. Simple DIGEST operations can also do
this, though there are no in-tree consumers.
To eventually support some refcounting to close races, the session
cookie is now passed to crypto_getop() and clients should no longer
set crp_sesssion directly.
- Assymteric crypto operation structures should be allocated via
crypto_getkreq() and freed via crypto_freekreq(). This permits the
crypto layer to track open asym requests and close races with a
driver trying to unregister while asym requests are in flight.
- crypto_copyback, crypto_copydata, crypto_apply, and
crypto_contiguous_subsegment now accept the 'crp' object as the
first parameter instead of individual members. This makes it easier
to deal with different buffer types in the future as well as
separate input and output buffers. It's also simpler for driver
writers to use.
- bus_dmamap_load_crp() loads a DMA mapping for a crypto buffer.
This understands the various types of buffers so that drivers that
use DMA do not have to be aware of different buffer types.
- Helper routines now exist to build an auth context for HMAC IPAD
and OPAD. This reduces some duplicated work among drivers.
- Key buffers are now treated as const throughout the framework and in
device drivers. However, session key buffers provided when a session
is created are expected to remain alive for the duration of the
session.
- GCM and CCM sessions now only specify a cipher algorithm and a cipher
key. The redundant auth information is not needed or used.
- For cryptosoft, split up the code a bit such that the 'process'
callback now invokes a function pointer in the session. This
function pointer is set based on the mode (in effect) though it
simplifies a few edge cases that would otherwise be in the switch in
'process'.
It does split up GCM vs CCM which I think is more readable even if there
is some duplication.
- I changed /dev/crypto to support GMAC requests using CRYPTO_AES_NIST_GMAC
as an auth algorithm and updated cryptocheck to work with it.
- Combined cipher and auth sessions via /dev/crypto now always use ETA
mode. The COP_F_CIPHER_FIRST flag is now a no-op that is ignored.
This was actually documented as being true in crypto(4) before, but
the code had not implemented this before I added the CIPHER_FIRST
flag.
- I have not yet updated /dev/crypto to be aware of explicit modes for
sessions. I will probably do that at some point in the future as well
as teach it about IV/nonce and tag lengths for AEAD so we can support
all of the NIST KAT tests for GCM and CCM.
- I've split up the exising crypto.9 manpage into several pages
of which many are written from scratch.
- I have converted all drivers and consumers in the tree and verified
that they compile, but I have not tested all of them. I have tested
the following drivers:
- cryptosoft
- aesni (AES only)
- blake2
- ccr
and the following consumers:
- cryptodev
- IPsec
- ktls_ocf
- GELI (lightly)
I have not tested the following:
- ccp
- aesni with sha
- hifn
- kgssapi_krb5
- ubsec
- padlock
- safe
- armv8_crypto (aarch64)
- glxsb (i386)
- sec (ppc)
- cesa (armv7)
- cryptocteon (mips64)
- nlmsec (mips64)
Discussed with: cem
Relnotes: yes
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D23677
The goal of this change is to make the atomic_load_acq_{8,16},
atomic_testandset{,_acq}_long, and atomic_testandclear_long primitives
available in MI-namespace.
The second goal is to get this draft out of my local tree, as anything that
requires a full tinderbox is a big burden out of tree. MD specifics can be
refined individually afterwards.
The generic implementations may not be ideal for your architecture; feel
free to implement better versions. If no subword_atomic definitions are
needed, the include can be removed from your arch's machine/atomic.h.
Generic definitions are guarded by defined macros of the same name. To
avoid picking up conflicting generic definitions, some macro defines are
added to various MD machine/atomic.h to register an existing implementation.
Include _atomic_subword.h in arm and arm64 machine/atomic.h.
For some odd reason, KCSAN only generates some versions of primitives.
Generate the _acq variants of atomic_load.*_8, atomic_load.*_16, and
atomic_testandset.*_long. There are other questionably disabled primitives,
but I didn't run into them, so I left them alone. KCSAN is only built for
amd64 in tinderbox for now.
Add atomic_subword implementations of atomic_load_acq_{8,16} implemented
using masking and atomic_load_acq_32.
Add generic atomic_subword implementations of atomic_testandset_long(),
atomic_testandclear_long(), and atomic_testandset_acq_long(), using
atomic_fcmpset_long() and atomic_fcmpset_acq_long().
On x86, add atomic_testandset_acq_long as an alias for
atomic_testandset_long.
Reviewed by: kevans, rlibby (previous versions both)
Differential Revision: https://reviews.freebsd.org/D22963
r353150 added mnt_rootvnode and this seems to have broken NFS mounts when the
VFS_STATFS() called just after VFS_MOUNT() returns an error.
Then the code calls VFS_UNMOUNT(), which calls vflush(), which returns EBUSY.
Then the thread get stuck sleeping on "mntref" in vfs_mount_destroy().
This patch fixes this problem.
Reviewed by: kib, mjg
Differential Revision: https://reviews.freebsd.org/D24022
Otherwise nothing synchronizes with a concurrent conversion of the
socket to a listening socket.
Only the PF_LOCAL protocols implement pru_sense, and it is safe to hold
the socket lock there, so do so for now.
Reported by: syzbot+4801f1b79ea40953ca8e@syzkaller.appspotmail.com
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Remove a goto and an unneeded local variable, and fix style. No
functional change intended.
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
unp_connectat() no longer holds the link lock across calls to
sonewconn(), so the recursion described in r303855 can no longer occur.
No functional change intended.
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
filecaps_free_prep() bzeros the capabilities structure and we need to be
careful to synchronize with unlocked readers, which expect a consistent
rights structure.
Reviewed by: kib, mjg
Reported by: syzbot+5f30b507f91ddedded21@syzkaller.appspotmail.com
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D24120
The Capsicum system calls modify file descriptor table entries. To
ensure that readers observe a consistent snapshot of descriptor writes,
the system calls need to signal to unlocked readers that an update is
pending.
Note that ioctl rights are always checked with the descriptor table lock
held, so it is not strictly necessary to signal unlocked readers.
However, we probably want to enable lockless ioctl checks eventually, so
use seqc_write_begin() in kern_cap_ioctls_limit() too.
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D24119
When I implemented MD DYNAMIC parsing, I was originally passing a
linker_file_t so that the MD code could relocate pointers.
However, it turns out this isn't even filled in until later, so it was
always 0.
Just pass the load base (ef->address) directly, as that's really the only
thing we were interested in in the first place.
This fixes a crash on RB800 where it was trying to write to an unmapped
address when updating the GOT.
Reviewed by: jhibbits
Sponsored by: Tag1 Consulting, Inc.
Differential Revision: https://reviews.freebsd.org/D24105
Boot IDs are random, opaque 128-bit identifiers that distinguish distinct
system boots. A new ID is generated each time the system boots. Unlike
kern.boottime, the value is not modified by NTP adjustments. It remains fixed
until the machine is restarted.
PR: 244867
Reported by: Ricardo Fraile <rfraile AT rfraile.eu>
MFC after: I do not intend to, but feel free
The intent seems to be zeroing all of the cc_cpu array, or its singleton on
such platforms. The assumption made is that the BSP is always zero. The
code smell was introduced in r326218, which changed the prior explicit zero
to 'curcpu'. The change is only valid if curcpu continues to be zero,
contrary to the aim expressed in that commit message.
So, more succinctly, the expression could be: memset(cc_cpu,0,sizeof(cc_cpu)).
However, there's no point. cc_cpu lives in the data section and has a zero
initial value already. So this revision just removes the problematic
statement.
No functional change. Appeases a (false positive, ish) Coverity CID.
CID: 1383567
Reported by: Puneeth Jothaiah <puneethkumar.jothaia AT dell.com>
Reviewed by: kib
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D24089
If a user spplies a non-\0 terminated osrelease parameter reading it back
may disclose kernel memory.
This is a problem in case of nested jails (children.max > 0, which is not
the default). Otherwise root outside the jail has access to kernel memory
by other means and root inside a jail cannot create a child jail.
Add the proper \0 check at the end of a supplied osrelease parameter and
make sure any copies of the field will be \0-terminated.
Submitted by: Hans Christian Woithe (chwoithe yahoo.com)
MFC after: 3 days
When clearing sigfastblock, either by sigfastblock(UNSETPTR) call or
implicitly on execve(2), kernel must check for pending signals and
reschedule them if needed.
E.g. on execve, all other threads are terminated, and current thread
fast block pointer is cleaned. If any signal was left pending, it can
now be delivered to the current thread, and we should prepare for
ast() on return to userspace to notice the signals.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
It was used after sigfastblock_setpend() call in in ast() when current
thread fast-blocks signals. Add a flag to sigfastblock_setpend() to
request reschedule, and remove the direct use of the function from
subr_trap.c
Tested by: pho
Sponsored by: The FreeBSD Foundation
Return ENOMEM if one of the buffer cannot be created even with the
minimal size. This should avoid subsequent spurious ENOMEM errors
from write(2) when buffer cannot be allocated on the fly, after we
reported that the pipe was create succesfully.
Reported by: Keno Fischer <keno@juliacomputing.com>
Reviewed by: markj (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D23993
When I did the use_numa support, I missed the fact that there is
a separate hash function for send tag nic selection. So when
use_numa is enabled, ktls offload does not work properly, as it
does not reliably allocate a send tag on the proper egress nic
since different egress nics are selected for send-tag allocation
and packet transmit. To fix this, this change:
- refectors lacp_select_tx_port_by_hash() and
lacp_select_tx_port() to make lacp_select_tx_port_by_hash()
always called by lacp_select_tx_port()
- pre-shifts flowids to convert them to hashes when calling lacp_select_tx_port_by_hash()
- adds a numa_domain field to if_snd_tag_alloc_params
- plumbs the numa domain into places where we allocate send tags
In testing with NIC TLS setup on a NUMA machine, I see thousands
of output errors before the change when enabling
kern.ipc.tls.ifnet.permitted=1. After the change, I see no
errors, and I see the NIC sysctl counters showing active TLS
offload sessions.
Reviewed by: rrs, hselasky, jhb
Sponsored by: Netflix
This has a side effect of eliminating filedesc slock/sunlock during path
lookup, which in turn removes contention vs concurrent modifications to the fd
table.
Reviewed by: markj, kib
Differential Revision: https://reviews.freebsd.org/D23889
The aim is to reduce the boilerplate needed today to define and
initialize global counters. Also add SI_SUB_COUNTER to the sysinit
ordering.
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D23977
file systems to safely access their disk devices, and adapt FFS to use it.
Also add a new BO_NOBUFS flag to allow enforcing that file systems using
mntfs vnodes do not accidentally use the original devfs vnode to create buffers.
Reviewed by: kib, mckusick
Approved by: imp (mentor)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D23787
Ucred is passed to bread(9) so that non-local filesystems use proper
credentials. But, since clean buffer might be cached unless
buf_pager_relbuf is not enabled, it makes credentials to have extra
reference until buffer is reclaimed. Ucred reference would prevent
jail from destroying if creds are jailed.
Dereferencing the read credentials on the valid buffer avoid that, and
should be fine because the buffer is valid and does not need re-read.
PR: 238032
Reported by: bz
Reproduced and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D23775
refcount that we took earlier that represents the I/O that ended up
not being started.
Reviewed by: glebius
Approved by: imp (mentor)
Sponsored by: Netflix
The results of ktls_get_cpu() are stored in u_int and NETISR_CPUID_NONE
requires u_int. Adjust uint16_t to uint_t in order to make RSS kernels
compile some more again.
HPTS still has to be fixed, which is a bit more complicated.
Reviewed by: jhb, gallatin, rrs
Differential Revision: https://reviews.freebsd.org/D23726
They are spurious since introduction of struct pwd, which provides them
implicitly.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23885
The new structure is copy-on-write. With the assumption that path lookups are
significantly more frequent than chdirs and chrooting this is a win.
This provides stable root and jail root vnodes without the need to reference
them on lookup, which in turn means less work on globally shared structures.
Note this also happens to fix a bug where jail vnode was never referenced,
meaning subsequent access on lookup could run into use-after-free.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23884
Otherwise we can fail to handle translation faults on curthread, leading
to a panic.
Reviewed by: alc, rlibby
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D23895
refcount(9) was recently extended to support waiting on a refcount to
drop to zero, as this was needed for a lockless VM object
paging-in-progress counter. However, this adds overhead to all uses of
refcount(9) and doesn't really match traditional refcounting semantics:
once a counter has dropped to zero, the protected object may be freed at
any point and it is not safe to dereference the counter.
This change removes that extension and instead adds a new set of KPIs,
blockcount_*, for use by VM object PIP and busy.
Reviewed by: jeff, kib, mjg
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D23723
This provides the potential to force a lazy (tick based) SMR to advance
when there are blocking waiters by decoupling the wr_seq value from the
ticks value.
Add some missing compiler barriers.
Reviewed by: rlibby
Differential Revision: https://reviews.freebsd.org/D23825
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.
This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.
Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT
Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718
Now we execute sendfile_iodone() in all possible cases, which
guarantees that vm_object_pip_wakeup() is called and sfio structure
is freed.
At the beginning of sendfile initialize sfio->m to NULL, that would
indicate that the mbuf chain either doesn't exist, or belongs to the
syscall (not to I/O completion). Fill sfio->m only at a point when
we are positive that there are I/Os ongoing and before releasing
syscall's reference on sfio.
In sendfile_iodone() perform vm_object_pip_wakeup() once last
reference is released, then check for sfio->m. NULL pointer
indicates that we need only to free the memory.
Reviewed by: jtl, gallatin
Quiet a variety of Wwrite-strings warnings in sys/kern at low-impact
sites. This patch avoids addressing certain others which would need to
plumb const through structure definitions.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D23798