Commit Graph

17398 Commits

Author SHA1 Message Date
Kyle Evans
59dafcde62 kqueue: fix conversion of timer data to sbintime
This unbreaks the i386 kqueue timer tests after a recent change switched
NOTE_ABSTIME over to using microseconds. Notably, the data argument (which
holds useconds) is an int64_t, but we were passing it to timer2sbintime
which takes an intptr_t. Perhaps in a previous incarnation, intptr_t would
have made sense, but now it just leads to the timestamp getting truncated
and subsequently rejected when it no longer fits in an intptr_t.

PR:		245768
Reported by:	lwhsu / CI
MFC after:	1 week
2020-04-21 03:57:30 +00:00
Mitchell Horne
49439183ce Convert arm's physmem interface to MI code
The arm_physmem interface found in arm's MD code provides a convenient
set of routines for adding/excluding physical memory regions and
initializing important kernel globals such as Maxmem, realmem,
phys_avail[], and dump_avail[]. It is especially convenient for FDT
systems, since we can use FDT parsing functions and pass the result
directly to one of these physmem routines. This interface is already in
use on arm and arm64, and can be used to simplify this early
initialization on RISC-V as well.

This requires only a couple trivial changes:
  - Move arm_physmem_kernel_addr to arm/machdep.c. It is unused on arm64,
    and manipulated entirely in arm MD code.
  - Convert arm32_btop/arm64_btop to atop. This is equivalently defined
    on all architectures.
  - Drop the "arm" prefix.

Reviewed by:	manu, emaste ("looks reasonable")
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D24153
2020-04-19 00:12:30 +00:00
Kristof Provost
3f8bc99c4c ethersubr: Make the mac address generation more robust
If we create two (vnet) jails and create a bridge interface in each we end up
with the same mac address on both bridge interfaces.
These very often conflicts, resulting in same mac address in both jails.

Mitigate this problem by including the jail name in the mac address.

Reviewed by:	kevans, melifaro
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D24383
2020-04-18 07:50:30 +00:00
Konstantin Belousov
acf65eef23 sendfile: When all io finished, assert that sfio->pa[] is in expected state.
It must contain fully restored contigous run of the wired pages from
the object, except possible trimmed tail.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2020-04-18 03:09:25 +00:00
Konstantin Belousov
e795a04083 The pa argument for sendfile_iodone() is not necessary a slice of sfio->pa.
It is true for zfs, but it is not for e.g. vnode or buffer pagers.
When fixing bogus pages, fix them in both places.  Rely on the fact
that pa[0] must have been invalid so it cannot be bogus.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2020-04-18 03:07:18 +00:00
Kyle Evans
23d5326823 tty: convert tty_lock_assert to tty_assert_locked to hide lock type
A later change, currently being iterated on in D24459, will in-fact change
the lock type to an sx so that TTY drivers can sleep on it if they need to.
Committing this ahead of time to make the review in question a little more
palatable.

tty_lock_assert() is unfortunately still needed for now in two places to
make sure that the tty lock has not been recursed upon, for those scenarios
where it's supplied by the TTY driver and possibly a mutex that is allowed
to recurse.

Suggested by:	markj
2020-04-17 18:34:49 +00:00
Brooks Davis
b24e6ac8b7 Convert canary, execpathp, and pagesizes to pointers.
Use AUXARGS_ENTRY_PTR to export these pointers.  This is a followup to
r359987 and r359988.

Reviewed by:	jhb
Obtained from:	CheriBSD
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D24446
2020-04-16 21:53:17 +00:00
Brooks Davis
19112958cf style(9): end continued line with operator. 2020-04-16 17:24:13 +00:00
Kyle Evans
c318828929 Preload hostuuid for early-boot use
prison0's hostuuid will get set by the hostid rc script, either after
generating it and saving it to /etc/hostid or by simply reading /etc/hostid.

Some things (e.g. arbitrary MAC address generation) may use the hostuuid as
a factor in early boot, so providing a way to read /etc/hostid (if it's
available) and using it before userland starts up is desirable. The code is
written such that the preload doesn't *have* to be /etc/hostid, thus not
assuming that there will be newline at the end of the buffer or even the
exact shape of the newline. White trailing whitespace/non-printables
trimmed, the result will be validated as a valid uuid before it's used for
early boot purposes.

The preload can be turned off with hostuuid_load="NO" in /boot/loader.conf,
just as other preloads; it's worth noting that this is a 37-byte file, the
overhead is believed to be generally minimal.

It doesn't seem necessary at this time to be concerned with kern.hostid.

One does wonder if we should consider validating hostuuids coming in
via jail_set(2); some bits seem to care about uuid form and we bother
validating format of smbios-provided uuid and in-fact whatever uuid comes
from /etc/hostid.

Reviewed by:	karels, delphij, jamie
MFC after:	1 week (don't preload by default, probably)
Differential Revision:	https://reviews.freebsd.org/D24288
2020-04-16 00:54:06 +00:00
Brooks Davis
9df1c38bbc Export argc, argv, envc, envv, and ps_strings in auxargs.
This simplifies discovery of these values, potentially with reducing the
number of syscalls we need to make at runtime.  Longer term, we wish to
convert the startup process to pass an auxargs pointer to _start() and
use that rather than walking off the end of envv.  This is cleaner,
more C-friendly, and for systems with strong bounds (e.g. CHERI)
necessary.

Reviewed by:	kib
Obtained from:	CheriBSD
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D24407
2020-04-15 20:23:55 +00:00
Brooks Davis
397df744f9 Make ps_strings in struct image_params into a pointer.
This is a prepratory commit for D24407.

Reviewed by:	kib
Obtained from:	CheriBSD
Sponsored by:	DARPA
2020-04-15 20:21:30 +00:00
Kyle Evans
3fb92d4cb1 validate_uuid: absorb the rest of parse_uuid with a flags arg
This makes the naming annoyance (validate_uuid vs. parse_uuid) less of an
issue and centralizes all of the functionality into the new KPI while still
making the extra validation optional. The end-result is all the same as far
as hostuuid validation-only goes.
2020-04-15 18:39:12 +00:00
Pawel Biernacki
f65eac0fc0 sysctl_handle_string: Put logical or in parentheses.
Reported by:	rdivacky
Approved by:	kib (mentor)
Pointy-hat to:	kaktus
2020-04-15 16:55:38 +00:00
Pawel Biernacki
1627b1fd9d sysctl(9): fix handling string tunables.
r357614 changed internals of handling string sysctls, and inadvertently
broke setting string tunables.  Take them into account.

PR:		245463
Reported by:	jhb, np
Reviewed by:	imp, jhb, kib
Approved by:	kib (mentor)
Differential Revision:	https://reviews.freebsd.org/D24429
2020-04-15 16:33:55 +00:00
Hans Petter Selasky
a90fb6cf3c Cast all ioctl command arguments through uint32_t internally.
Hide debug print showing use of sign extended ioctl command argument
under INVARIANTS. The print is available to all and can easily fill
up the logs.

No functional change intended.

MFC after:	1 week
Sponsored by:	Mellanox Technologies
2020-04-15 13:20:51 +00:00
Kyle Evans
142ffb8bdc kern uuid: break format validation out into a separate KPI
This new KPI, validate_uuid, strictly validates the formatting of the input
UUID and, optionally, populates a given struct uuid.

As noted in the header, the key differences are that the new KPI won't
recognize an empty string as a nil UUID and it won't do any kind of semantic
validation on it. Also key is that populating a struct uuid is optional, so
the caller doesn't necessarily need to allocate a bogus one on the stack
just to validate the string.

This KPI has specifically been broken out in support of D24288, which will
preload /etc/hostid in loader so that early boot hostuuid users (e.g.
anything that calls ether_gen_addr) can have a valid hostuuid to work with
once it's been stashed in /etc/hostid.
2020-04-15 03:59:26 +00:00
Brooks Davis
618a20d4f9 Remove bogus use of useracc() in (clock_)nanosleep.
There's no point in pre-checking that we can access the user's rmtp
pointer before we do it in copyout().

While here, improve style(9) compliance.

Reviewed by:	imp
MFC after:	1 week
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D24409
2020-04-14 20:53:12 +00:00
Brooks Davis
562894f0dc Centralize compatability translation macros.
Copy the CP, PTRIN, etc macros from freebsd32.h into a sys/abi_compat.h
and replace existing definitation with includes where required. This
eliminates duplicate code and allows Linux and FreeBSD compatability
headers to be included in the same files.

Input from:	cem, jhb
Obtained from:	CheriBSD
MFC after:	2 weeks
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D24275
2020-04-14 20:30:48 +00:00
Kyle Evans
e19b97f7a0 sysent: re-roll after r359930 2020-04-14 18:11:26 +00:00
Kyle Evans
7d03e08112 Mark closefrom(2) COMPAT12, reimplement in libc to wrap close_range
Include a temporarily compatibility shim as well for kernels predating
close_range, since closefrom is used in some critical areas.

Reviewed by:	markj (previous version), kib
Differential Revision:	https://reviews.freebsd.org/D24399
2020-04-14 18:07:42 +00:00
Jonathan T. Looney
fb401f1bba Make sonewconn() overflow messages have per-socket rate-limits and values.
sonewconn() emits debug-level messages when a listen socket's queue
overflows. Currently, sonewconn() tracks overflows on a global basis. It
will only log one message every 60 seconds, regardless of how many sockets
experience overflows. And, when it next logs at the end of the 60 seconds,
it records a single message referencing a single PCB with the total number
of overflows across all sockets.

This commit changes to per-socket overflow tracking. The code will now
log one message every 60 seconds per socket. And, the code will provide
per-socket queue length and overflow counts. It also provides a way to
change the period between log messages using a sysctl.

Reviewed by:	jhb (previous version), bcr (manpages)
MFC after:	2 weeks
Sponsored by:	Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D24316
2020-04-14 15:38:18 +00:00
Jonathan T. Looney
f6ab9795d4 Print more detail as part of the sonewconn() overflow message.
When a socket's listen queue overflows, sonewconn() emits a debug-level
log message. These messages are sometimes useful to systems administrators
in highlighting a process which is not keeping up with its listen queue.

This commit attempts to enhance the usefulness of this message by printing
more details about the socket's address. If all else fails, it will at
least print the domain name of the socket.

Reviewed by:	bz, jhb, kbowling
MFC after:	2 weeks
Sponsored by:	Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D24272
2020-04-14 15:30:34 +00:00
Andrew Gallatin
23feb56348 KTLS: Re-work unmapped mbufs to carry ext_pgs in the mbuf itself.
While the original implementation of unmapped mbufs was a large
step forward in terms of reducing cache misses by enabling mbufs
to carry more than a single page for sendfile, they are rather
cache unfriendly when accessing the ext_pgs metadata and
data. This is because the ext_pgs part of the mbuf is allocated
separately, and almost guaranteed to be cold in cache.

This change takes advantage of the fact that unmapped mbufs
are never used at the same time as pkthdr mbufs. Given this
fact, we can overlap the ext_pgs metadata with the mbuf
pkthdr, and carry the ext_pgs meta directly in the mbuf itself.
Similarly, we can carry the ext_pgs data (TLS hdr/trailer/array
of pages) directly after the existing m_ext.

In order to be able to carry 5 pages (which is the minimum
required for a 16K TLS record which is not perfectly aligned) on
LP64, I've had to steal ext_arg2. The only user of this in the
xmit path is sendfile, and I've adjusted it to use arg1 when
using unmapped mbufs.

This change is almost entirely mechanical, except that we
change mb_alloc_ext_pgs() to no longer allow allocating
pkthdrs, the change to avoid ext_arg2 as mentioned above,
and the removal of the ext_pgs zone,

This change saves roughly 2% "raw" CPU (~59% -> 57%), or over
3% "scaled" CPU on a Netflix 100% software kTLS workload at
90+ Gb/s on Broadwell Xeons.

In a follow-on commit, I plan to remove some hacks to avoid
access ext_pgs fields of mbufs, since they will now be in
cache.

Many thanks to glebius for helping to make this better in
the Netflix tree.

Reviewed by:	hselasky, jhb, rrs, glebius (early version)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D24213
2020-04-14 14:46:06 +00:00
Kyle Evans
51a16c8412 posixshm: fix counting of writable mappings
Similar to mmap'ing vnodes, posixshm should count any mapping where maxprot
contains VM_PROT_WRITE (i.e. fd opened r/w with no write-seal applied) as
writable and thus blocking of any write-seal.

The memfd tests have been amended to reflect the fixes here, which notably
includes:

1. Fix for error return bug; EPERM is not a documented failure mode for mmap
2. Fix rejection of write-seal with active mappings that can be upgraded via
    mprotect(2).

Reported by:	markj
Discussed with:	markj, kib
2020-04-14 13:32:03 +00:00
Mark Johnston
b36871af6d Fix sendto() on unconnected SOCK_STREAM/SEQPACKET unix sockets.
Previously the unpcb pointer of the newly connected remote socket was
not initialized correctly, so attempting to lock it would result in a
null pointer dereference.

Reported by:	syzkaller
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2020-04-13 19:22:05 +00:00
Mark Johnston
c7841c6b8e Relax restrictions on private mappings of POSIX shm objects.
When creating a private mapping of a POSIX shared memory object,
VM_PROT_WRITE should always be included in maxprot regardless of
permissions on the underlying FD.  Otherwise it is possible to open a
shm object read-only, map it with MAP_PRIVATE and PROT_WRITE, and
violate the invariant in vm_map_insert() that (prot & maxprot) == prot.

Reported by:	syzkaller
Reviewed by:	kevans, kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D24398
2020-04-13 19:20:39 +00:00
Kyle Evans
605c4cda2f close_range/closefrom: fix regression from close_range introduction
close_range will clamp the range between [0, fdp->fd_lastfile], but failed
to take into account that fdp->fd_lastfile can become -1 if all fds are
closed. =-( In this scenario, just return because there's nothing further we
can do at the moment.

Add a test case for this, fork() and simply closefrom(0) twice in the child;
on the second invocation, fdp->fd_lastfile == -1 and will trigger a panic
before this change.

X-MFC-With:	r359836
2020-04-13 17:55:31 +00:00
Kyle Evans
3d224fc909 sysent: re-roll after introduction of close_range in r359836 2020-04-12 21:23:51 +00:00
Kyle Evans
472ced39ef Implement a close_range(2) syscall
close_range(min, max, flags) allows for a range of descriptors to be
closed. The Python folk have indicated that they would much prefer this
interface to closefrom(2), as the case may be that they/someone have special
fds dup'd to higher in the range and they can't necessarily closefrom(min)
because they don't want to hit the upper range, but relocating them to lower
isn't necessarily feasible.

sys_closefrom has been rewritten to use kern_close_range() using ~0U to
indicate closing to the end of the range. This was chosen rather than
requiring callers of kern_close_range() to hold FILEDESC_SLOCK across the
call to kern_close_range for simplicity.

The flags argument of close_range(2) is currently unused, so any flags set
is currently EINVAL. It was added to the interface in Linux so that future
flags could be added for, e.g., "halt on first error" and things of this
nature.

This patch is based on a syscall of the same design that is expected to be
merged into Linux.

Reviewed by:	kib, markj, vangyzen (all slightly earlier revisions)
Differential Revision:	https://reviews.freebsd.org/D21627
2020-04-12 21:23:19 +00:00
Konstantin Belousov
d25f1b21c9 sendfile_iodone: correct calculation of the page index for relookup.
This is yet another bug in r359473.

Reported and tested by:	delphij
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2020-04-12 05:10:48 +00:00
Mark Johnston
25f4ddfb2b sbappendcontrol() needs to avoid clearing M_NOTREADY on data mbufs.
If LOCAL_CREDS is set on a unix socket and sendfile() is called,
sendfile will call uipc_send(PRUS_NOTREADY), prepending a control
message to the M_NOTREADY mbufs.  uipc_send() then calls
sbappendcontrol() instead of sbappend(), and sbappendcontrol() would
erroneously clear M_NOTREADY.

Pass send flags to sbappendcontrol(), like we do for sbappend(), to
preserve M_READY when necessary.

Reported by:	syzkaller
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D24333
2020-04-10 20:42:11 +00:00
Mark Johnston
a50b1900a0 Properly handle disconnected sockets in uipc_ready().
When transmitting over a unix socket, data is placed directly into the
receiving socket's receive buffer, instead of the transmitting socket's
send buffer.  This means that when pru_ready is called during
sendfile(), the passed socket does not contain M_NOTREADY mbufs in its
buffers; uipc_ready() must locate the linked socket.

Currently uipc_ready() frees the mbufs if the socket is disconnected,
but this is wrong since the mbufs may still be present in the receiving
socket's buffer after a disconnect.  This can result in a use-after-free
and potentially a double free if the receive buffer is flushed after
uipc_ready() frees the mbufs.

Fix the problem by trying harder to locate the correct socket buffer and
calling sbready(): use the global list of SOCK_STREAM unix sockets to
search for a sockbuf containing the now-ready mbufs.  Only free the
mbufs if we fail this search.

Reviewed by:	jah, kib
Reported and tested by:	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D24332
2020-04-10 20:41:59 +00:00
Konstantin Belousov
f709eee61a Do not pass bogus page to mbufs.
This is a bug in r359473.

Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2020-04-10 01:28:47 +00:00
Kirk McKusick
9a79b99003 When running with a kernel compiled with DEBUG_LOCKS, before
panic'ing for recusing on a non-recursive lock, print out the
kernel stack where the lock was originally acquired.
2020-04-09 23:42:13 +00:00
Konstantin Belousov
90f29198c3 Remove extra call to vfs_op_exit() from vfs_write_suspend() when VFS_SYNC() fails.
The vfs_write_resume() handler already does vfs_op_exit() for us.

Reported by:	pho
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
2020-04-09 18:38:00 +00:00
Rick Macklem
8de97f394e Remove the old NFS lock device driver that uses Giant.
This NFS lock device driver was replaced by the kernel NLM around FreeBSD7 and
has not normally been used since then.
To use it, the kernel had to be built without "options NFSLOCKD" and
the nfslockd.ko had to be deleted as well.
Since it uses Giant and is no longer used, this patch removes it.

With this device driver removed, there is now a lot of unused code
in the userland rpc.lockd. That will be removed on a future commit.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D22933
2020-04-09 14:44:46 +00:00
Conrad Meyer
1e9ee2b596 ddb(4): show lockchain: Don't dereference LK_KERNPROC
Also, print a little more information for otherwise unhandled inhibited states.

Finally, improve the grammar of some prints.  Some of the print statements
missing verb.

Sponsored by:	Dell EMC Isilon
2020-04-02 20:47:51 +00:00
John Baldwin
59838c1a19 Retire procfs-based process debugging.
Modern debuggers and process tracers use ptrace() rather than procfs
for debugging.  ptrace() has a supserset of functionality available
via procfs and new debugging features are only added to ptrace().
While the two debugging services share some fields in struct proc,
they each use dedicated fields and separate code.  This results in
extra complexity to support a feature that hasn't been enabled in the
default install for several years.

PR:		244939 (exp-run)
Reviewed by:	kib, mjg (earlier version)
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D23837
2020-04-01 19:22:09 +00:00
Jason A. Harmening
8abaf6a7c8 deadlkres: include thread name in panic messages
Reviewed by:	markj
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D24235
2020-04-01 04:51:39 +00:00
Andrew Gallatin
c4ad247b7a KTLS: Coalesce adjacent TLS trailers & headers to improve PCIe bus efficiency
KTLS uses the embedded header and trailer fields of unmapped
mbufs. This can lead to "silly" buffer lengths, where we have an
mbuf chain that will create a scatter/gather lists with a
regular pattern of 13 bytes followed by 16 bytes between each
adjacent TLS record.

For software ktls we typically wind up with a pattern where we
have several TLS records encrypted, and made ready at once. When
these records are made ready, we can coalesce these silly buffers
in sbready_compress by copying 13b TLS header of the next record
into the 16b TLS trailer of the current record. After doing so,
we now have a small 29 byte chunk between each TLS record.

This marginally increases PCIe bus efficiency. We've seen an
almost 1Gb/s increase in peak throughput on Broadwell based Xeons
running a 100% software TLS workload with Mellanox ConnectX-4
NICs.

Note that this change is ifdef'ed for KTLS, as KTLS is currently
the only user of the hdr/trailer feature of unmapped mbufs, and
peeking into them is expensive, since the ext_pgs struct lives in
separately allocated memory, and may be cold in cache.

This optimization is not applicable to HW ("NIC") TLS, as that
depends on having the entire TLS record described by a single
unmapped mbuf, so we cannot shift parts of the record between
mbufs for HW TLS.

Reviewed by:	jhb, hselasky, scottl
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D24204
2020-03-30 23:29:53 +00:00
Konstantin Belousov
c506a6386f kern_sendfile.c: fix bugs with handling of busy page states.
- Do not call into a vnode pager while leaving some pages from the
  same block as the current run, xbusy. This immediately deadlocks if
  pager needs to instantiate the buffer.
- Only relookup bogus pages after io finished, otherwise we might
  obliterate the valid pages by out of date disk content.  While there,
  expand the comment explaining this pecularity.
- Do not double-unbusy on error.  Split unbusy for error case, which
  is left in the sendfile_swapin(), from the more properly coded
  normal case in sendfile_iodone().
- Add an XXXKIB comment explaining the serious bug in the validation
  algorithm, not fixed by this patch series.

PR:	244713
Reviewed by:	glebius, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24038
2020-03-30 22:13:32 +00:00
Konstantin Belousov
d866353677 kern_sendfile.c: do not release sfio reference on error.
It is already done by sendfile_iodone(), now consistently for all errors.
This de-facto reverts r358597, after r359466.

Reviewed by:	glebius, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24038
2020-03-30 22:01:36 +00:00
Konstantin Belousov
59e1ac9d79 kern_sendfile.c: wait for all in-flight ios completion before unwiring pages.
Reviewed by:	glebius, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24038
2020-03-30 21:57:28 +00:00
Konstantin Belousov
8f0a223c78 kern_sendfile.c: add specific malloc type.
Now sfio leaks are more easily seen in the malloc statistics than
e.g. just wired or busy pages leak.

Reviewed by:	glebius, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24038
2020-03-30 21:50:51 +00:00
Konstantin Belousov
abfdf76791 VOP_GETPAGES_ASYNC(): consistently call iodone() callback in case of error.
Reviewed by:	glebius, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24038
2020-03-30 21:44:30 +00:00
Konstantin Belousov
bbca4bd7cd buffer pager: skip bogus pages.
We cannot validate bogus page by reading a buffer.

PR:	244713
Reviewed by:	glebius, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24038
2020-03-30 21:42:46 +00:00
Konstantin Belousov
0ac8511aef kern_sendfile.c style: order headers alphabetically.
Reviewed by:	glebius, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24038
2020-03-30 21:40:35 +00:00
Ed Maste
fa9ce9ac29 capabilities.conf: provide information about capmode permitted syscalls
Reviewed by:	jhb (earlier)
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D24118
2020-03-30 18:24:07 +00:00
Mark Johnston
9b1d850be8 Remove the "config" taskqgroup and its KPIs.
Equivalent functionality is already provided by taskqueue(9), just use
that instead.

MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2020-03-30 14:24:03 +00:00
Mark Johnston
59d50fe5ef Simplify taskqgroup inititialization.
taskqgroup initialization was broken into two steps:

1. allocate the taskqgroup structure, at SI_SUB_TASKQ;
2. initialize taskqueues, start taskqueue threads, enqueue "binder"
   tasks to bind threads to specific CPUs, at SI_SUB_SMP.

Step 2 tries to handle the case where tasks have already been attached
to a queue, by migrating them to their intended queue.  In particular,
tasks can't be enqueued before step 2 has completed.  This breaks NFS
mountroot on systems using an iflib-based driver when EARLY_AP_STARTUP
is not defined, since mountroot happens before SI_SUB_SMP in this case.

Simplify initialization: do all initialization except for CPU binding at
SI_SUB_TASKQ.  This means that until CPU binding is completed, group
tasks may be executed on a CPU other than that to which they were bound,
but this should not be a problem for existing users of the taskqgroup
KPIs.

Reported by:	sbruno
Tested by:	bdragon, sbruno
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D24188
2020-03-30 14:22:52 +00:00
John Baldwin
c034143269 Refactor driver and consumer interfaces for OCF (in-kernel crypto).
- The linked list of cryptoini structures used in session
  initialization is replaced with a new flat structure: struct
  crypto_session_params.  This session includes a new mode to define
  how the other fields should be interpreted.  Available modes
  include:

  - COMPRESS (for compression/decompression)
  - CIPHER (for simply encryption/decryption)
  - DIGEST (computing and verifying digests)
  - AEAD (combined auth and encryption such as AES-GCM and AES-CCM)
  - ETA (combined auth and encryption using encrypt-then-authenticate)

  Additional modes could be added in the future (e.g. if we wanted to
  support TLS MtE for AES-CBC in the kernel we could add a new mode
  for that.  TLS modes might also affect how AAD is interpreted, etc.)

  The flat structure also includes the key lengths and algorithms as
  before.  However, code doesn't have to walk the linked list and
  switch on the algorithm to determine which key is the auth key vs
  encryption key.  The 'csp_auth_*' fields are always used for auth
  keys and settings and 'csp_cipher_*' for cipher.  (Compression
  algorithms are stored in csp_cipher_alg.)

- Drivers no longer register a list of supported algorithms.  This
  doesn't quite work when you factor in modes (e.g. a driver might
  support both AES-CBC and SHA2-256-HMAC separately but not combined
  for ETA).  Instead, a new 'crypto_probesession' method has been
  added to the kobj interface for symmteric crypto drivers.  This
  method returns a negative value on success (similar to how
  device_probe works) and the crypto framework uses this value to pick
  the "best" driver.  There are three constants for hardware
  (e.g. ccr), accelerated software (e.g. aesni), and plain software
  (cryptosoft) that give preference in that order.  One effect of this
  is that if you request only hardware when creating a new session,
  you will no longer get a session using accelerated software.
  Another effect is that the default setting to disallow software
  crypto via /dev/crypto now disables accelerated software.

  Once a driver is chosen, 'crypto_newsession' is invoked as before.

- Crypto operations are now solely described by the flat 'cryptop'
  structure.  The linked list of descriptors has been removed.

  A separate enum has been added to describe the type of data buffer
  in use instead of using CRYPTO_F_* flags to make it easier to add
  more types in the future if needed (e.g. wired userspace buffers for
  zero-copy).  It will also make it easier to re-introduce separate
  input and output buffers (in-kernel TLS would benefit from this).

  Try to make the flags related to IV handling less insane:

  - CRYPTO_F_IV_SEPARATE means that the IV is stored in the 'crp_iv'
    member of the operation structure.  If this flag is not set, the
    IV is stored in the data buffer at the 'crp_iv_start' offset.

  - CRYPTO_F_IV_GENERATE means that a random IV should be generated
    and stored into the data buffer.  This cannot be used with
    CRYPTO_F_IV_SEPARATE.

  If a consumer wants to deal with explicit vs implicit IVs, etc. it
  can always generate the IV however it needs and store partial IVs in
  the buffer and the full IV/nonce in crp_iv and set
  CRYPTO_F_IV_SEPARATE.

  The layout of the buffer is now described via fields in cryptop.
  crp_aad_start and crp_aad_length define the boundaries of any AAD.
  Previously with GCM and CCM you defined an auth crd with this range,
  but for ETA your auth crd had to span both the AAD and plaintext
  (and they had to be adjacent).

  crp_payload_start and crp_payload_length define the boundaries of
  the plaintext/ciphertext.  Modes that only do a single operation
  (COMPRESS, CIPHER, DIGEST) should only use this region and leave the
  AAD region empty.

  If a digest is present (or should be generated), it's starting
  location is marked by crp_digest_start.

  Instead of using the CRD_F_ENCRYPT flag to determine the direction
  of the operation, cryptop now includes an 'op' field defining the
  operation to perform.  For digests I've added a new VERIFY digest
  mode which assumes a digest is present in the input and fails the
  request with EBADMSG if it doesn't match the internally-computed
  digest.  GCM and CCM already assumed this, and the new AEAD mode
  requires this for decryption.  The new ETA mode now also requires
  this for decryption, so IPsec and GELI no longer do their own
  authentication verification.  Simple DIGEST operations can also do
  this, though there are no in-tree consumers.

  To eventually support some refcounting to close races, the session
  cookie is now passed to crypto_getop() and clients should no longer
  set crp_sesssion directly.

- Assymteric crypto operation structures should be allocated via
  crypto_getkreq() and freed via crypto_freekreq().  This permits the
  crypto layer to track open asym requests and close races with a
  driver trying to unregister while asym requests are in flight.

- crypto_copyback, crypto_copydata, crypto_apply, and
  crypto_contiguous_subsegment now accept the 'crp' object as the
  first parameter instead of individual members.  This makes it easier
  to deal with different buffer types in the future as well as
  separate input and output buffers.  It's also simpler for driver
  writers to use.

- bus_dmamap_load_crp() loads a DMA mapping for a crypto buffer.
  This understands the various types of buffers so that drivers that
  use DMA do not have to be aware of different buffer types.

- Helper routines now exist to build an auth context for HMAC IPAD
  and OPAD.  This reduces some duplicated work among drivers.

- Key buffers are now treated as const throughout the framework and in
  device drivers.  However, session key buffers provided when a session
  is created are expected to remain alive for the duration of the
  session.

- GCM and CCM sessions now only specify a cipher algorithm and a cipher
  key.  The redundant auth information is not needed or used.

- For cryptosoft, split up the code a bit such that the 'process'
  callback now invokes a function pointer in the session.  This
  function pointer is set based on the mode (in effect) though it
  simplifies a few edge cases that would otherwise be in the switch in
  'process'.

  It does split up GCM vs CCM which I think is more readable even if there
  is some duplication.

- I changed /dev/crypto to support GMAC requests using CRYPTO_AES_NIST_GMAC
  as an auth algorithm and updated cryptocheck to work with it.

- Combined cipher and auth sessions via /dev/crypto now always use ETA
  mode.  The COP_F_CIPHER_FIRST flag is now a no-op that is ignored.
  This was actually documented as being true in crypto(4) before, but
  the code had not implemented this before I added the CIPHER_FIRST
  flag.

- I have not yet updated /dev/crypto to be aware of explicit modes for
  sessions.  I will probably do that at some point in the future as well
  as teach it about IV/nonce and tag lengths for AEAD so we can support
  all of the NIST KAT tests for GCM and CCM.

- I've split up the exising crypto.9 manpage into several pages
  of which many are written from scratch.

- I have converted all drivers and consumers in the tree and verified
  that they compile, but I have not tested all of them.  I have tested
  the following drivers:

  - cryptosoft
  - aesni (AES only)
  - blake2
  - ccr

  and the following consumers:

  - cryptodev
  - IPsec
  - ktls_ocf
  - GELI (lightly)

  I have not tested the following:

  - ccp
  - aesni with sha
  - hifn
  - kgssapi_krb5
  - ubsec
  - padlock
  - safe
  - armv8_crypto (aarch64)
  - glxsb (i386)
  - sec (ppc)
  - cesa (armv7)
  - cryptocteon (mips64)
  - nlmsec (mips64)

Discussed with:	cem
Relnotes:	yes
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D23677
2020-03-27 18:25:23 +00:00
Mark Johnston
b7e3a3b6e1 Remove unused SYSINIT macros for capability rights.
Static rights are initialized in cap_rights_sysinit().

MFC after:	1 week
2020-03-26 15:02:37 +00:00
Conrad Meyer
ca0ec73c11 Expand generic subword atomic primitives
The goal of this change is to make the atomic_load_acq_{8,16},
atomic_testandset{,_acq}_long, and atomic_testandclear_long primitives
available in MI-namespace.

The second goal is to get this draft out of my local tree, as anything that
requires a full tinderbox is a big burden out of tree.  MD specifics can be
refined individually afterwards.

The generic implementations may not be ideal for your architecture; feel
free to implement better versions.  If no subword_atomic definitions are
needed, the include can be removed from your arch's machine/atomic.h.
Generic definitions are guarded by defined macros of the same name.  To
avoid picking up conflicting generic definitions, some macro defines are
added to various MD machine/atomic.h to register an existing implementation.

Include _atomic_subword.h in arm and arm64 machine/atomic.h.

For some odd reason, KCSAN only generates some versions of primitives.
Generate the _acq variants of atomic_load.*_8, atomic_load.*_16, and
atomic_testandset.*_long.  There are other questionably disabled primitives,
but I didn't run into them, so I left them alone.  KCSAN is only built for
amd64 in tinderbox for now.

Add atomic_subword implementations of atomic_load_acq_{8,16} implemented
using masking and atomic_load_acq_32.

Add generic atomic_subword implementations of atomic_testandset_long(),
atomic_testandclear_long(), and atomic_testandset_acq_long(), using
atomic_fcmpset_long() and atomic_fcmpset_acq_long().

On x86, add atomic_testandset_acq_long as an alias for
atomic_testandset_long.

Reviewed by:	kevans, rlibby (previous versions both)
Differential Revision:	https://reviews.freebsd.org/D22963
2020-03-25 23:12:43 +00:00
Konstantin Belousov
8cf8c2f65a kern_copy_file_range(): check the file type.
The syscall can only operate on valid vnode types.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
2020-03-24 17:16:52 +00:00
Rick Macklem
f9122b6488 Fix an NFS mount attempt where VFS_STATFS() fails.
r353150 added mnt_rootvnode and this seems to have broken NFS mounts when the
VFS_STATFS() called just after VFS_MOUNT() returns an error.
Then the code calls VFS_UNMOUNT(), which calls vflush(), which returns EBUSY.
Then the thread get stuck sleeping on "mntref" in vfs_mount_destroy().
This patch fixes this problem.

Reviewed by:	kib, mjg
Differential Revision:	https://reviews.freebsd.org/D24022
2020-03-22 18:18:30 +00:00
Mark Johnston
99258935eb Lock the socket in soo_stat().
Otherwise nothing synchronizes with a concurrent conversion of the
socket to a listening socket.

Only the PF_LOCAL protocols implement pru_sense, and it is safe to hold
the socket lock there, so do so for now.

Reported by:	syzbot+4801f1b79ea40953ca8e@syzkaller.appspotmail.com
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2020-03-20 20:09:00 +00:00
Mark Johnston
db38b699b4 Simplify uipc_detach() slightly.
Remove a goto and an unneeded local variable, and fix style.  No
functional change intended.

Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2020-03-20 16:18:54 +00:00
Mark Johnston
569a186d02 Remove UNP_NASCENT, reverting r303855.
unp_connectat() no longer holds the link lock across calls to
sonewconn(), so the recursion described in r303855 can no longer occur.
No functional change intended.

Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2020-03-20 16:17:54 +00:00
Mark Johnston
429537caeb kern_dup(): Call filecaps_free_prep() in a write section.
filecaps_free_prep() bzeros the capabilities structure and we need to be
careful to synchronize with unlocked readers, which expect a consistent
rights structure.

Reviewed by:	kib, mjg
Reported by:	syzbot+5f30b507f91ddedded21@syzkaller.appspotmail.com
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D24120
2020-03-19 15:40:05 +00:00
Mark Johnston
2d896b816b Enter a write sequence when updating rights.
The Capsicum system calls modify file descriptor table entries.  To
ensure that readers observe a consistent snapshot of descriptor writes,
the system calls need to signal to unlocked readers that an update is
pending.

Note that ioctl rights are always checked with the descriptor table lock
held, so it is not strictly necessary to signal unlocked readers.
However, we probably want to enable lockless ioctl checks eventually, so
use seqc_write_begin() in kern_cap_ioctls_limit() too.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D24119
2020-03-19 15:39:45 +00:00
Brandon Bergren
3069380898 [PowerPC][Book-E] Fix missing load base in elf_cpu_parse_dynamic().
When I implemented MD DYNAMIC parsing, I was originally passing a
linker_file_t so that the MD code could relocate pointers.

However, it turns out this isn't even filled in until later, so it was
always 0.

Just pass the load base (ef->address) directly, as that's really the only
thing we were interested in in the first place.

This fixes a crash on RB800 where it was trying to write to an unmapped
address when updating the GOT.

Reviewed by:	jhibbits
Sponsored by:	Tag1 Consulting, Inc.
Differential Revision:	https://reviews.freebsd.org/D24105
2020-03-18 02:58:18 +00:00
Conrad Meyer
34086d5bda Implement sysctl kern.boot_id
Boot IDs are random, opaque 128-bit identifiers that distinguish distinct
system boots.  A new ID is generated each time the system boots.  Unlike
kern.boottime, the value is not modified by NTP adjustments.  It remains fixed
until the machine is restarted.

PR:		244867
Reported by:	Ricardo Fraile <rfraile AT rfraile.eu>
MFC after:	I do not intend to, but feel free
2020-03-17 22:27:16 +00:00
Conrad Meyer
a99c321802 Remove misleading / redundant bzero in callout_callwheel_init
The intent seems to be zeroing all of the cc_cpu array, or its singleton on
such platforms.  The assumption made is that the BSP is always zero.  The
code smell was introduced in r326218, which changed the prior explicit zero
to 'curcpu'.  The change is only valid if curcpu continues to be zero,
contrary to the aim expressed in that commit message.

So, more succinctly, the expression could be: memset(cc_cpu,0,sizeof(cc_cpu)).

However, there's no point.  cc_cpu lives in the data section and has a zero
initial value already.  So this revision just removes the problematic
statement.

No functional change.  Appeases a (false positive, ish) Coverity CID.

CID:		1383567
Reported by:	Puneeth Jothaiah <puneethkumar.jothaia AT dell.com>
Reviewed by:	kib
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D24089
2020-03-16 22:25:25 +00:00
Bjoern A. Zeeb
1b786d0191 kern_jail: missing \0 termination check on osrelease parameter
If a user spplies a non-\0 terminated osrelease parameter reading it back
may disclose kernel memory.
This is a problem in case of nested jails (children.max > 0, which is not
the default).  Otherwise root outside the jail has access to kernel memory
by other means and root inside a jail cannot create a child jail.

Add the proper \0 check at the end of a supplied osrelease parameter and
make sure any copies of the field will be \0-terminated.

Submitted by:	Hans Christian Woithe (chwoithe yahoo.com)
MFC after:	3 days
2020-03-14 14:04:55 +00:00
Michael Tuexen
db4493f7b6 sendfile() does currently not support SCTP sockets.
Therefore, fail the call.

Reviewed by:		markj@
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D24059
2020-03-13 18:38:28 +00:00
Conrad Meyer
7a119578a4 kern_shutdown: Add missing EKCD ifdef
Submitted by:	Puneeth Jothaiah <puneethkumar.jothaia AT dell.com>
Reviewed by:	bdrewery
Sponsored by:	Dell EMC Isilon
2020-03-12 21:26:36 +00:00
Konstantin Belousov
00ebd80972 Fix signal delivery might be on sigfastblock clearing.
When clearing sigfastblock, either by sigfastblock(UNSETPTR) call or
implicitly on execve(2), kernel must check for pending signals and
reschedule them if needed.

E.g. on execve, all other threads are terminated, and current thread
fast block pointer is cleaned.  If any signal was left pending, it can
now be delivered to the current thread, and we should prepare for
ast() on return to userspace to notice the signals.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
2020-03-10 20:25:03 +00:00
Konstantin Belousov
0bc52b0bdb Return reschedule_signals() to being static again.
It was used after sigfastblock_setpend() call in in ast() when current
thread fast-blocks signals.  Add a flag to sigfastblock_setpend() to
request reschedule, and remove the direct use of the function from
subr_trap.c

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2020-03-10 20:04:38 +00:00
Konstantin Belousov
2d3c083fd7 pipe: explain why not deallocating inode number is fine.
Suggested and reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24009
2020-03-09 23:40:25 +00:00
Konstantin Belousov
c6d3d601c9 Preallocate pipe buffers on pipe creation.
Return ENOMEM if one of the buffer cannot be created even with the
minimal size.  This should avoid subsequent spurious ENOMEM errors
from write(2) when buffer cannot be allocated on the fly, after we
reported that the pipe was create succesfully.

Reported by:	Keno Fischer <keno@juliacomputing.com>
Reviewed by:	markj (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D23993
2020-03-09 21:55:26 +00:00
Konstantin Belousov
1213de28f8 Style.
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D23993
2020-03-09 19:46:28 +00:00
Andrew Gallatin
98085bae8c make lacp's use_numa hashing aware of send tags
When I did the use_numa support, I missed the fact that there is
a separate hash function for send tag nic selection. So when
use_numa is enabled, ktls offload does not work properly, as it
does not reliably allocate a send tag on the proper egress nic
since different egress nics are selected for send-tag allocation
and packet transmit. To fix this, this change:

- refectors lacp_select_tx_port_by_hash() and
     lacp_select_tx_port() to make lacp_select_tx_port_by_hash()
     always called by lacp_select_tx_port()

-   pre-shifts flowids to convert them to hashes when calling lacp_select_tx_port_by_hash()

-   adds a numa_domain field to if_snd_tag_alloc_params

-   plumbs the numa domain into places where we allocate send tags

In testing with NIC TLS setup on a NUMA machine, I see thousands
of output errors before the change when enabling
kern.ipc.tls.ifnet.permitted=1. After the change, I see no
errors, and I see the NIC sysctl counters showing active TLS
offload sessions.

Reviewed by:	rrs, hselasky, jhb
Sponsored by:	Netflix
2020-03-09 13:44:51 +00:00
Mateusz Guzik
d2222aa0e9 fd: use smr for managing struct pwd
This has a side effect of eliminating filedesc slock/sunlock during path
lookup, which in turn removes contention vs concurrent modifications to the fd
table.

Reviewed by:	markj, kib
Differential Revision:	https://reviews.freebsd.org/D23889
2020-03-08 00:23:36 +00:00
Mark Johnston
d869a17e62 Use COUNTER_U64_DEFINE_EARLY() in places where it simplifies things.
Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D23978
2020-03-06 19:10:00 +00:00
Mark Johnston
fffcb56f7a Add COUNTER_U64_SYSINIT() and COUNTER_U64_DEFINE_EARLY().
The aim is to reduce the boilerplate needed today to define and
initialize global counters.  Also add SI_SUB_COUNTER to the sysinit
ordering.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D23977
2020-03-06 19:09:01 +00:00
Chuck Silvers
f15ccf8836 Add a new "mntfs" pseudo file system which provides private device vnodes for
file systems to safely access their disk devices, and adapt FFS to use it.
Also add a new BO_NOBUFS flag to allow enforcing that file systems using
mntfs vnodes do not accidentally use the original devfs vnode to create buffers.

Reviewed by:	kib, mckusick
Approved by:	imp (mentor)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D23787
2020-03-06 18:41:37 +00:00
Konstantin Belousov
695e0701a0 buffer pager: deref ucred immediately after read.
Ucred is passed to bread(9) so that non-local filesystems use proper
credentials.  But, since clean buffer might be cached unless
buf_pager_relbuf is not enabled, it makes credentials to have extra
reference until buffer is reclaimed.  Ucred reference would prevent
jail from destroying if creds are jailed.

Dereferencing the read credentials on the valid buffer avoid that, and
should be fine because the buffer is valid and does not need re-read.

PR:	238032
Reported by:	bz
Reproduced and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D23775
2020-03-05 15:52:34 +00:00
Mateusz Guzik
8d4d271e92 execve: use LOCKSHARED when looking up the interpreter
Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D23956
2020-03-04 19:52:34 +00:00
Chuck Silvers
37bf88e790 if vm_pager_get_pages_async() returns an error, release the sfio->nios
refcount that we took earlier that represents the I/O that ended up
not being started.

Reviewed by:	glebius
Approved by:	imp (mentor)
Sponsored by:	Netflix
2020-03-04 00:22:50 +00:00
Bjoern A. Zeeb
a2fba2a700 upic_ktrls: make RSS compile again here
The results of ktls_get_cpu() are stored in u_int and NETISR_CPUID_NONE
requires u_int.  Adjust uint16_t to uint_t in order to make RSS kernels
compile some more again.

HPTS still has to be fixed, which is a bit more complicated.

Reviewed by:	jhb, gallatin, rrs
Differential Revision:	https://reviews.freebsd.org/D23726
2020-03-03 14:07:44 +00:00
Mark Johnston
4cf919edb9 Fix the malloc type used in sys_shm_unlink() after r354808.
PR:		244563
Reported by:	swills
2020-03-03 00:28:37 +00:00
Pawel Biernacki
b05ca4290c sys/: Document few more sysctls.
Submitted by:	Antranig Vartanian <antranigv@freebsd.am>
Reviewed by:	kaktus
Commented by:	jhb
Approved by:	kib (mentor)
Sponsored by:	illuria security
Differential Revision:	https://reviews.freebsd.org/D23759
2020-03-02 15:30:52 +00:00
Mateusz Guzik
2f423bce54 vfs: stop taking additional refs on root vnode during lookup
They are spurious since introduction of struct pwd, which provides them
implicitly.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D23885
2020-03-01 21:54:28 +00:00
Mateusz Guzik
8d03b99b9d fd: move vnodes out of filedesc into a dedicated structure
The new structure is copy-on-write. With the assumption that path lookups are
significantly more frequent than chdirs and chrooting this is a win.

This provides stable root and jail root vnodes without the need to reference
them on lookup, which in turn means less work on globally shared structures.
Note this also happens to fix a bug where jail vnode was never referenced,
meaning subsequent access on lookup could run into use-after-free.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D23884
2020-03-01 21:53:46 +00:00
Mateusz Guzik
8243063f9b fd: make fgetvp_rights work without the filedesc lock
Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D23883
2020-03-01 21:50:13 +00:00
Mark Johnston
5aa5420ff2 Ensure that arm64 thread structures are allocated from the direct map.
Otherwise we can fail to handle translation faults on curthread, leading
to a panic.

Reviewed by:	alc, rlibby
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D23895
2020-02-29 18:41:48 +00:00
Jeff Roberson
6be21eb778 Provide a lock free alternative to resolve bogus pages. This is not likely
to be much of a perf win, just a nice code simplification.

Reviewed by:	markj, kib
Differential Revision:	https://reviews.freebsd.org/D23866
2020-02-28 21:42:48 +00:00
Jeff Roberson
7aaf252c96 Convert a few triviail consumers to the new unlocked grab API.
Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D23847
2020-02-28 20:34:30 +00:00
Jeff Roberson
f72eaaeb03 Use unlocked grab for uipc_shm/tmpfs.
Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D23865
2020-02-28 20:33:28 +00:00
Mark Johnston
46994ec2b1 Fix standalone builds of systrace.ko after r357912.
Sponsored by:	The FreeBSD Foundation
2020-02-28 17:05:04 +00:00
Mark Johnston
c99d0c5801 Add a blocking counter KPI.
refcount(9) was recently extended to support waiting on a refcount to
drop to zero, as this was needed for a lockless VM object
paging-in-progress counter.  However, this adds overhead to all uses of
refcount(9) and doesn't really match traditional refcounting semantics:
once a counter has dropped to zero, the protected object may be freed at
any point and it is not safe to dereference the counter.

This change removes that extension and instead adds a new set of KPIs,
blockcount_*, for use by VM object PIP and busy.

Reviewed by:	jeff, kib, mjg
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D23723
2020-02-28 16:05:18 +00:00
Jeff Roberson
561af25fa7 Simplify lazy advance with a 64bit atomic cmpset.
This provides the potential to force a lazy (tick based) SMR to advance
when there are blocking waiters by decoupling the wr_seq value from the
ticks value.

Add some missing compiler barriers.

Reviewed by:	rlibby
Differential Revision:	https://reviews.freebsd.org/D23825
2020-02-27 19:05:26 +00:00
Warner Losh
729ea680be Remove trailing white space. 2020-02-26 16:22:28 +00:00
Pawel Biernacki
7029da5c36 Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE.  All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT

Approved by:	kib (mentor, blanket)
Commented by:	kib, gallatin, melifaro
Differential Revision:	https://reviews.freebsd.org/D23718
2020-02-26 14:26:36 +00:00
Gleb Smirnoff
6bc27f086a Generalize resources freeing in sendfile with different scenarios.
Now we execute sendfile_iodone() in all possible cases, which
guarantees that vm_object_pip_wakeup() is called and sfio structure
is freed.

At the beginning of sendfile initialize sfio->m to NULL, that would
indicate that the mbuf chain either doesn't exist, or belongs to the
syscall (not to I/O completion).  Fill sfio->m only at a point when
we are positive that there are I/Os ongoing and before releasing
syscall's reference on sfio.

In sendfile_iodone() perform vm_object_pip_wakeup() once last
reference is released, then check for sfio->m.  NULL pointer
indicates that we need only to free the memory.

Reviewed by:	jtl, gallatin
2020-02-25 19:29:05 +00:00
Gleb Smirnoff
f85e1a806b Make ktls_frame() never fail. Caller must supply correct mbufs.
This makes sendfile code a bit simplier.
2020-02-25 19:26:40 +00:00
Gleb Smirnoff
69302907d6 When sendfile_swapin() sweeps through pages in search for a bogus page
skip first and last pages.  This is a micro optimisation.
2020-02-25 19:11:20 +00:00
Ryan Libby
fe20aaec0a sys/kern: quiet -Wwrite-strings
Quiet a variety of Wwrite-strings warnings in sys/kern at low-impact
sites.  This patch avoids addressing certain others which would need to
plumb const through structure definitions.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D23798
2020-02-23 03:32:16 +00:00
Ryan Libby
2782c00c04 vfs: quiet -Wwrite-strings
Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D23797
2020-02-23 03:32:11 +00:00
Ryan Libby
eaa17d4291 sys/vm: quiet -Wwrite-strings
Discussed with:	kib
Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D23796
2020-02-23 03:32:04 +00:00