Commit Graph

18073 Commits

Author SHA1 Message Date
Mateusz Guzik
f837888a3e thread: use tdfind in sysctl_kern_proc_kstack
This treads linear scans for locked lookup, but more importantly removes
the only consumer of thread_find.
2020-11-10 01:57:19 +00:00
Mateusz Guzik
94275e3e69 threads: remove the unused TID_BUFFER_SIZE macro 2020-11-10 01:31:06 +00:00
Mateusz Guzik
934e7e5ec9 thread: adds newer bits for r367537
The committed patch was an older version.
2020-11-10 01:13:58 +00:00
Mateusz Guzik
35bb59edc5 threads: reimplement tid allocation on top of a bitmap
There are workloads with very bursty tid allocation and since unr tries very
hard to have small-sized bitmaps it keeps reallocating memory. Just doing
buildkernel gives almost 150k calls to free coming from unr.

This also gets rid of the hack which tried to postpone TID reuse.

Reviewed by:	kib, markj
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D27101
2020-11-09 23:05:28 +00:00
Mateusz Guzik
1bd3cf5de5 threads: introduce a limit for total number
The intent is to replace the current id allocation method and a known upper
bound will be useful.

Reviewed by:	kib (previous version), markj (previous version)
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D27100
2020-11-09 23:04:30 +00:00
Mateusz Guzik
f6dd1aefb7 vfs: group mount per-cpu vars into one struct
While here move frequently read stuff into the same cacheline.

This shrinks struct mount by 64 bytes.

Tested by:	pho
2020-11-09 23:02:13 +00:00
Mateusz Guzik
f0c90a0931 malloc: provide 384 byte zone
Total page count after buildworld on ZFS for 384 (if present) and 512 zones:
before: 29713
after: 25946

per-zone page use:
vm.uma.malloc_384.keg.domain.1.pages: 11621
vm.uma.malloc_384.keg.domain.0.pages: 11597
vm.uma.malloc_512.keg.domain.1.pages: 1280
vm.uma.malloc_512.keg.domain.0.pages: 1448

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D27145
2020-11-09 22:59:41 +00:00
Mateusz Guzik
8e6526e966 malloc: retire mt_stats_zone in favor of pcpu_zone_64
Reviewed by:	markj, imp
Differential Revision:	https://reviews.freebsd.org/D27142
2020-11-09 22:58:29 +00:00
Mateusz Guzik
3a440a421d Add more per-cpu zones.
This covers powers of 2 up to 64.

Example pending user is ZFS.
2020-11-09 00:34:23 +00:00
Mateusz Guzik
523d66730c procdesc: convert the zone to a malloc type
The object is 128 bytes in size.
2020-11-09 00:05:21 +00:00
Mateusz Guzik
e90afaa015 kqueue: save space by using only one func pointer for assertions 2020-11-09 00:04:35 +00:00
Edward Tomasz Napierala
a1bd83fede Move syscall_thread_{enter,exit}() into the slow path. This is only
needed for syscalls from unloadable modules.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	EPSRC
Differential Revision:	https://reviews.freebsd.org/D26988
2020-11-08 15:54:59 +00:00
Kyle Evans
8c28aa5e45 imgact_binmisc: limit the extent of match on incoming entries
imgact_binmisc matches magic/mask from imgp->image_header, which is only a
single page in size mapped from the first page of an image. One can specify
an interpreter that matches on, e.g., --offset 4096 --size 256 to read up to
256 bytes past the mapped first page.

The limitation is that we cannot specify a magic string that exceeds a
single page, and we can't allow offset + size to exceed a single page
either.  A static assert has been added in case someone finds it useful to
try and expand the size, but it does seem a little unlikely.

While this looks kind of exploitable at a sideways squinty-glance, there are
a couple of mitigating factors:

1.) imgact_binmisc is not enabled by default,
2.) entries may only be added by the superuser,
3.) trying to exploit this information to read what's mapped past the end
  would be worse than a root canal or some other relatably painful
  experience, and
4.) there's no way one could pull this off without it being completely
  obvious.

The first page is mapped out of an sf_buf, the implementation of which (or
lack thereof) depends on your platform.

MFC after:	1 week
2020-11-08 04:24:29 +00:00
Michael Tuexen
f908d8247e The ioctl() calls using FIONREAD, FIONWRITE, FIONSPACE, and SIOCATMARK
access the socket send or receive buffer. This is not possible for
listening sockets since r319722.
Because send()/recv() calls fail on listening sockets, fail also ioctl()
indicating EINVAL.

PR:			250366
Reported by:		Yong-Hao Zou
Reviewed by:		glebius, rscheff
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D26897
2020-11-07 21:17:49 +00:00
Kyle Evans
1024ef27fe imgact_binmisc: move some calculations out of the exec path
The offset we need to account for in the interpreter string comes in two
variants:

1. Fixed - macros other than #a that will not vary from invocation to
   invocation
2. Variable - #a, which is substitued with the argv0 that we're replacing

Note that we don't have a mechanism to modify an existing entry.  By
recording both of these offset requirements when the interpreter is added,
we can avoid some unnecessary calculations in the exec path.

Most importantly, we can know up-front whether we need to grab
calculate/grab the the filename for this interpreter. We also get to avoid
walking the string a first time looking for macros. For most invocations,
it's a swift exit as they won't have any, but there's no point entering a
loop and searching for the macro indicator if we already know there will not
be one.

While we're here, go ahead and only calculate the argv0 name length once per
invocation. While it's unlikely that we'll have more than one #a, there's no
reason to recalculate it every time we encounter an #a when it will not
change.

I have not bothered trying to benchmark this at all, because it's arguably a
minor and straightforward/obvious improvement.

MFC after:	1 week
2020-11-07 18:07:55 +00:00
Mateusz Guzik
42e7abd5db rms: several cleanups + debug read lockers handling
This adds a dedicated counter updated with atomics when INVARIANTS
is used. As a side effect one can reliably determine the lock is held
for reading by at least one thread, but it's still not possible to
find out whether curthread has the lock in said mode.

This should be good enough in practice.

Problem spotted by avg.
2020-11-07 16:57:53 +00:00
Kyle Evans
ecb4fdf943 imgact_binmisc: reorder members of struct imgact_binmisc_entry (NFC)
This doesn't change anything at the moment since the out-of-order elements
were a pair of uint32_t, but future additions may have caused unnecessary
padding by following the existing precedent.

MFC after:	1 week
2020-11-07 16:41:59 +00:00
Michal Meloun
eb20867f52 Add a method to determine whether given interrupt is per CPU or not.
MFC after:	2 weeks
2020-11-07 14:58:01 +00:00
Edward Tomasz Napierala
da45ea6bc6 Move TDB_USERWR check under 'if (traced)'.
If we hadn't been traced in the first place when syscallenter()
started executing, we can ignore TDB_USERWR.  TDB_USERWR can get set,
sure, but if it does, it's because the debugger raced with the syscall,
and it cannot depend on winning that race.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	EPSRC
Differential Revision:	https://reviews.freebsd.org/D26585
2020-11-07 13:09:51 +00:00
Kyle Evans
2192cd125f imgact_binmisc: abstract away the list lock (NFC)
This module handles relatively few execs (initial qemu-user-static, then
qemu-user-static handles exec'ing itself for binaries it's already running),
but all execs pay the price of at least taking the relatively expensive
sx/slock to check for a match when this module is loaded. Future work will
almost certainly swap this out for another lock, perhaps an rmslock.

The RLOCK/WLOCK phrasing was chosen based on what the callers are really
wanting, rather than using the verbiage typically appropriate for an sx.

MFC after:	1 week
2020-11-07 05:10:46 +00:00
Kyle Evans
7d3ed9777a imgact_binmisc: validate flags coming from userland
We may want to reserve bits in the future for kernel-only use, so start
rejecting any that aren't the two that we're currently expecting from
userland.

MFC after:	1 week
2020-11-07 04:10:23 +00:00
Kyle Evans
7667824ade epoch: support non-preemptible epochs checking in_epoch()
Previously, non-preemptible epochs could not check; in_epoch() would always
fail, usually because non-preemptible epochs don't imply THREAD_NO_SLEEPING.

For default epochs, it's easy enough to verify that we're in the given
epoch: if we're in a critical section and our record for the given epoch
is active, then we're in it.

This patch also adds some additional INVARIANTS bookkeeping. Notably, we set
and check the recorded thread in epoch_enter/epoch_exit to try and catch
some edge-cases for the caller. It also checks upon freeing that none of the
records had a thread in the epoch, which may make it a little easier to
diagnose some improper use if epoch_free() took place while some other
thread was inside.

This version differs slightly from what was just previously reviewed by the
below-listed, in that in_epoch() will assert that no CPU has this thread
recorded even if it *is* currently in a critical section. This is intended
to catch cases where the caller might have somehow messed up critical
section nesting, we can catch both if they exited the critical section or if
they exited, migrated, then re-entered (on the wrong CPU).

Reviewed by:	kib, markj (both previous version)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D27098
2020-11-07 03:29:04 +00:00
Kyle Evans
80083216cb imgact_binmisc: minor re-organization of imgact_binmisc_exec exits
Notably, streamline error paths through the existing 'done' label, making it
easier to quickly verify correct cleanup.

Future work might add a kernel-only flag to indicate that a interpreter uses
#a. Currently, all executions via imgact_binmisc pay the penalty of
constructing sname/fname, even if they will not use it. qemu-user-static
doesn't need it, the stock rc script for qemu-user-static certainly doesn't
use it, and I suspect these are the vast majority of (if not the only)
current users.

MFC after:	1 week
2020-11-07 03:28:32 +00:00
Mateusz Guzik
e25d8b67c3 malloc: tweak the version check in r367432 to include type name
While here fix a whitespace problem.
2020-11-07 01:32:16 +00:00
Mateusz Guzik
bdcc222644 malloc: move malloc_type_internal into malloc_type
According to code comments the original motivation was to allow for
malloc_type_internal changes without ABI breakage. This can be trivially
accomplished by providing spare fields and versioning the struct, as
implemented in the patch below.

The upshots are one less memory indirection on each alloc and disappearance
of mt_zone.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D27104
2020-11-06 21:33:59 +00:00
Konstantin Belousov
f10845877e Suspend all writeable local filesystems on power suspend.
This ensures that no writes are pending in memory, either metadata or
user data, but not including dirty pages not yet converted to fs writes.

Only filesystems declared local are suspended.

Note that this does not guarantee absence of the metadata errors or
leaks if resume is not done: for instance, on UFS unlinked but opened
inodes are leaked and require fsck to gc.

Reviewed by:	markj
Discussed with:	imp
Tested by:	imp (previous version), pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D27054
2020-11-05 20:52:49 +00:00
Mateusz Guzik
16b971ed6d malloc: add a helper returning size allocated for given request
Sample usage: kernel modules can decide whether to stick to malloc or
create their own zone.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D27097
2020-11-05 16:21:21 +00:00
Mateusz Guzik
2dee296a3d Rationalize per-cpu zones.
The 2 provided zones had inconsistent naming between each other
("int" and "64") and other allocator zones (which use bytes).

Follow malloc by naming them "pcpu-" + size in bytes.

This is a step towards replacing ad-hoc per-cpu zones with
general slabs.
2020-11-05 15:08:56 +00:00
Mateusz Guzik
ea33cca971 poll/select: change selfd_zone into a malloc type
On a sample box vmstat -z shows:

ITEM                   SIZE  LIMIT     USED     FREE      REQ
64:                      64,      0, 1043784, 4367538,3698187229
selfd:                   64,      0,    1520,   13726,182729008

But at the same time:
vm.uma.selfd.keg.domain.1.pages: 121
vm.uma.selfd.keg.domain.0.pages: 121

Thus 242 pages got pulled even though the malloc zone would likely accomodate
the load without using extra memory.
2020-11-05 12:24:37 +00:00
Mateusz Guzik
2fbb45c601 vfs: change nt_zone into a malloc type
Elements are small in size and allocated for short periods.
2020-11-05 12:06:50 +00:00
Kyle Evans
df69035d7f imgact_binmisc: fix up some minor nits
- Removed a bunch of redundant headers
- Don't explicitly initialize to 0
- The !error check prior to setting imgp->interpreter_name is redundant, all
  error paths should and do return or go to 'done'. We have larger problems
  otherwise.
2020-11-05 04:19:48 +00:00
Mateusz Guzik
3c50616fc1 fd: make all f_count uses go through refcount_* 2020-11-05 02:12:33 +00:00
Mateusz Guzik
d737e9eaf5 fd: hide _fdrop 0 count check behind INVARIANTS
While here use refcount_load and make sure to report the tested value.
2020-11-05 02:12:08 +00:00
Mateusz Guzik
331c21dd5e pipe: whitespace nit in previous 2020-11-04 23:17:41 +00:00
Mateusz Guzik
c22ba7bb06 pipe: fix POLLHUP handling if no events were specified
Linux allows polling without any events specified and it happens to be the case
in FreeBSD as well. POLLHUP has to be delivered regardless of the event mask
and this works fine if the condition is already present. However, if it is
missing, selrecord is only called if the eventmask has relevant bits set. This
in particular leads to a conditon where pipe_poll can return 0 events and
neglect to selrecord, while kern_poll takes it as an indication it has to go to
sleep, but then there is nobody to wake it up.

While the problem seems systemic to *_poll handlers the least we can do is fix
it up for pipes.

Reported by:	Jeremie Galarneau <jeremie.galarneau at efficios.com>
Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D27094
2020-11-04 23:11:54 +00:00
Mateusz Guzik
6fc2b069ca rms: fixup concurrent writer handling and add more features
Previously the code had one wait channel for all pending writers.
This could result in a buggy scenario where after a writer switches
the lock mode form readers to writers goes off CPU, another writer
queues itself and then the last reader wakes up the latter instead
of the former.

Use a separate channel.

While here add features to reliably detect whether curthread has
the lock write-owned. This will be used by ZFS.
2020-11-04 21:18:08 +00:00
Mark Johnston
f7db0c9532 vmspace: Convert to refcount(9)
This is mostly mechanical except for vmspace_exit().  There, use the new
refcount_release_if_last() to avoid switching to vmspace0 unless other
processes are sharing the vmspace.  In that case, upon switching to
vmspace0 we can unconditionally release the reference.

Remove the volatile qualifier from vm_refcnt now that accesses are
protected using refcount(9) KPIs.

Reviewed by:	alc, kib, mmel
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27057
2020-11-04 16:30:56 +00:00
Brooks Davis
19647e76fc sysvshm: pass relevant uap members as arguments
Alter shmget_allocate_segment and shmget_existing to take the values
they want from struct shmget_args rather than passing the struct
around.  In general, uap structures should only be the interface to
sys_<foo> functions.

This makes on small functional change and records the allocated space
rather than the requested space.  If this turns out to be a problem (e.g.
if software tries to find undersized segments by exact size rather than
using keys), we can correct that easily.

Reviewed by:	kib
Obtained from:	CheriBSD
MFC after:	1 week
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D27077
2020-11-03 19:14:03 +00:00
Conrad Meyer
2de07e4096 unix(4): Add SOL_LOCAL:LOCAL_CREDS_PERSISTENT
This option is intended to be semantically identical to Linux's
SOL_SOCKET:SO_PASSCRED.  For now, it is mutually exclusive with the
pre-existing sockopt SOL_LOCAL:LOCAL_CREDS.

Reviewed by:	markj (penultimate version)
Differential Revision:	https://reviews.freebsd.org/D27011
2020-11-03 01:17:45 +00:00
Mateusz Guzik
e1b6a7f83f malloc: prefix zones with malloc-
Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D27038
2020-11-02 17:39:15 +00:00
Mateusz Guzik
828afdda17 malloc: export kernel zones instead of relying on them being power-of-2
Reviewed by:	markj (previous version)
Differential Revision:	https://reviews.freebsd.org/D27026
2020-11-02 17:38:08 +00:00
Stefan Eßer
1ebef47735 Make sysctl user.local a tunable that can be written at run-time
This sysctl value had been provided as a read-only variable that is
compiled into the C library based on the value of _PATH_LOCALBASE in
paths.h.

After this change, the value is compiled into the kernel as an empty
string, which is translated to _PATH_LOCALBASE by the C library.

This empty string can be overridden at boot time or by a privileged
user at run time and will then be returned by sysctl.

When set to an empty string, the value returned by sysctl reverts to
_PATH_LOCALBASE.

This update does not change the behavior on any system that does
not modify the default value of user.localbase.

I consider this change as experimental and would prefer if the run-time
write permission was reconsidered and the sysctl variable defined with
CLFLAG_RDTUN instead to restrict it to be set at boot time.

MFC after:	1 month
2020-10-31 23:48:41 +00:00
Mateusz Guzik
82c174a3b4 malloc: delegate M_EXEC handling to dedicacted routines
It is almost never needed and adds an avoidable branch.

While here do minior clean ups in preparation for larger changes.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D27019
2020-10-30 20:02:32 +00:00
Stefan Eßer
147eea393f Add read only sysctl variable user.localbase
The value is provided by the C library as for other sysctl variables in
the user tree. It is compiled in and returns the value of _PATH_LOCALBASE
defined in paths.h.

Reviewed by:	imp, scottl
Differential Revision:	https://reviews.freebsd.org/D27009
2020-10-30 18:48:09 +00:00
Mateusz Guzik
0685574968 vfs: change vnode poll to just a malloc type
The size is 120, close fit for 128 and rarely used. The infrequent use
avoidably populates per-CPU caches and ends up with more memory.
2020-10-30 14:02:56 +00:00
Mateusz Guzik
4bfebc8d2c cache: add cache_vop_mkdir and rename cache_rename to cache_vop_rename 2020-10-30 10:46:35 +00:00
John Baldwin
36e0a362ac Add m_snd_tag_alloc() as a wrapper around if_snd_tag_alloc().
This gives a more uniform API for send tag life cycle management.

Reviewed by:	gallatin, hselasky
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D27000
2020-10-29 23:28:39 +00:00
Mateusz Guzik
62568e886a vfs: add NAMEI_DBG_HADSTARTDIR handling lost in rewrite
Noted by:	rpokala
2020-10-29 18:43:37 +00:00
Mateusz Guzik
eebc2e450f vfs: add NDREINIT to facilitate repeated namei calls
struct nameidata mixes caller arguments, internal state and output, which
can be quite error prone.

Recent addition of valdiating ni_resflags uncovered a caller which could
repeatedly call namei, effectively operating on partially populated state.

Add bare minimium validation this does not happen. The real fix would
decouple aforementioned state.

Reported by:	pho
Tested by:	pho (different variant)
2020-10-29 12:56:02 +00:00
John Baldwin
521eac97f3 Support hardware rate limiting (pacing) with TLS offload.
- Add a new send tag type for a send tag that supports both rate
  limiting (packet pacing) and TLS offload (mostly similar to D22669
  but adds a separate structure when allocating the new tag type).

- When allocating a send tag for TLS offload, check to see if the
  connection already has a pacing rate.  If so, allocate a tag that
  supports both rate limiting and TLS offload rather than a plain TLS
  offload tag.

- When setting an initial rate on an existing ifnet KTLS connection,
  set the rate in the TCP control block inp and then reset the TLS
  send tag (via ktls_output_eagain) to reallocate a TLS + ratelimit
  send tag.  This allocates the TLS send tag asynchronously from a
  task queue, so the TLS rate limit tag alloc is always sleepable.

- When modifying a rate on a connection using KTLS, look for a TLS
  send tag.  If the send tag is only a plain TLS send tag, assume we
  failed to allocate a TLS ratelimit tag (either during the
  TCP_TXTLS_ENABLE socket option, or during the send tag reset
  triggered by ktls_output_eagain) and ignore the new rate.  If the
  send tag is a ratelimit TLS send tag, change the rate on the TLS tag
  and leave the inp tag alone.

- Lock the inp lock when setting sb_tls_info for a socket send buffer
  so that the routines in tcp_ratelimit can safely dereference the
  pointer without needing to grab the socket buffer lock.

- Add an IFCAP_TXTLS_RTLMT capability flag and associated
  administrative controls in ifconfig(8).  TLS rate limit tags are
  only allocated if this capability is enabled.  Note that TLS offload
  (whether unlimited or rate limited) always requires IFCAP_TXTLS[46].

Reviewed by:	gallatin, hselasky
Relnotes:	yes
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D26691
2020-10-29 00:23:16 +00:00
Konstantin Belousov
3cbf9dc81c Check for process group change in tty_wait_background().
The calling process's process group can change between PROC_UNLOCK(p)
and PGRP_LOCK(pg) in tty_wait_background(), e.g. by a setpgid() call
from another process. If that happens, the signal is not sent to the
calling process, even if the prior checks determine that one should be
sent.  Re-check that the process group hasn't changed after acquiring
the pgrp lock, and if it has, redo the checks.

PR:	250701
Submitted by:	Jakub Piecuch <j.piecuch96@gmail.com>
MFC after:	2 weeks
2020-10-28 22:12:47 +00:00
Edward Tomasz Napierala
bdc0cb4e2c Add local variable to store the sysent pointer. Just a cleanup,
no functional changes.

Reviewed by:	kib (earlier version)
MFC after:	2 weeks
Sponsored by:	EPSRC
Differential Revision:	https://reviews.freebsd.org/D26977
2020-10-28 14:43:38 +00:00
Edward Tomasz Napierala
bce7ee9d41 Drop "All rights reserved" from all my stuff. This includes
Foundation copyrights, approved by emaste@.  It does not include
files which carry other people's copyrights; if you're one
of those people, feel free to make similar change.

Reviewed by:	emaste, imp, gbe (manpages)
Differential Revision:	https://reviews.freebsd.org/D26980
2020-10-28 13:46:11 +00:00
Mateusz Guzik
11743b6e47 vfs: tidy up vnlru_free
Apart from cosmeatic changes make sure to only decrease the recycled counter
if vtryrecycle succeeded.

Tested by:	pho
2020-10-27 18:13:09 +00:00
Mateusz Guzik
68ac2b804c vfs: fix vnode reclaim races against getnwevnode
All vnodes allocated by UMA are present on the global list used by
vnlru. getnewvnode modifies the state of the vnode (most notably
altering v_holdcnt) but never locks it. Moreover filesystems also
modify it in arbitrary manners sometimes before taking the vnode
lock or adding any other indicator that the vnode can be used.

Picking up such a vnode by vnlru would be problematic.

To that end there are 2 fixes:
- vlrureclaim, not recycling v_holdcnt == 0 vnodes, takes the
interlock and verifies that v_mount has been set. It is an
invariant that the vnode lock is held by that point, providing
the necessary serialisation against locking after vhold.
- vnlru_free_locked, only wanting to free v_holdcnt == 0 vnodes,
now makes sure to only transition the count 0->1 and newly allocated
vnodes start with v_holdcnt == VHOLD_NO_SMR. getnewvnode will only
transition VHOLD_NO_SMR->1 once more making the hold fail

Tested by:	pho
2020-10-27 18:12:07 +00:00
Mateusz Guzik
d681c51d36 cache: add missing NIRES_ABS handling 2020-10-26 18:01:18 +00:00
Alexander Motin
3c0177b887 Enable bioq 'car limit' added at r335066 at 128 bios.
Without the 'car limit' enabled (before this), running sequential ZFS scrub
on HDD without command queuing support, I've measured latency on concurrent
random reads reaching 4 seconds (surprised that not more).  Enabling this
reduced the latency to 65 milliseconds, while scrub still doing ~180MB/s.

For disks with command queuing this does not make much difference (if any),
since most time all the requests are queued down to the disk or HBA, leaving
nothing in the queue to sort.  And even if something does not fit, staying on
the queue, it is likely not for long.  To not limit sorting in such bursty
scenarios I've added batched counter zeroing when the queue is getting empty.

The internal scheduler of the SAS HDD I was testing seems to be even more
loyal to random I/O, reducing the scrub speed to ~120MB/s.  So in case
somebody worried this is limit is too strict -- it actually looks relaxed.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2020-10-26 04:04:06 +00:00
Alexander Motin
8b220f8915 Fix asymmetry in devstat(9) calls by GEOM.
Before this GEOM passed bio pointer to transaction start, but not end.
It was irrelevant until devstat(9) got DTrace hooks, that appeared to
provide bio pointer on I/O completion, but not on submission.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2020-10-24 21:07:10 +00:00
Ruslan Bukin
f32f0095e9 o Add iommu de-initialization method for MSI interface.
o Add iommu_unmap_msi() to release the msi GAS entry.
o Provide default implementations for iommu init/deinit methods.

Reviewed by:	kib
Sponsored by:	Innovate DSbD
Differential Revision:	https://reviews.freebsd.org/D26906
2020-10-24 20:09:27 +00:00
Ryan Moeller
e58483c4fb sysctl+kern_sysctl: Honor SKIP for descendant nodes
Ensure we also skip descendants of SKIP nodes when iterating through children
of an explicitly specified node.

Reported by:	np
Reviewed by:	np
MFC after:	1 week
Sponsored by:	iXsystems, Inc.
Differential Revision:	https://reviews.freebsd.org/D26833
2020-10-24 16:17:07 +00:00
Ryan Moeller
0595c12484 kern_sysctl: Misc code cleanup
Remove unused oidpp parameter from sysctl_sysctl_next_ls and
add high level comments to describe how it works.

No functional change.

Reviewed by:	imp
MFC after:	1 week
Sponsored by:	iXsystems, Inc.
Differential Revision:	https://reviews.freebsd.org/D26854
2020-10-24 14:46:38 +00:00
Kyle Evans
275c821d3d audit: correct reporting of *execve(2) success
r326145 corrected do_execve() to return EJUSTRETURN upon success so that
important registers are not clobbered. This had the side effect of tapping
out 'failures' for all *execve(2) audit records, which is less than useful
for auditing purposes.

Audit exec returns earlier, where we can know for sure that EJUSTRETURN
translates to success. Note that this unsets TDP_AUDITREC as we commit the
audit record, so the usual audit in the syscall return path will do nothing.

PR:		249179
Reported by:	Eirik Oeverby <ltning-freebsd anduin net>
Reviewed by:	csjp, kib
MFC after:	1 week
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D26922
2020-10-24 14:39:17 +00:00
Mateusz Guzik
eb65cde4f5 cache: assorted typo fixes 2020-10-24 13:31:40 +00:00
Mateusz Guzik
029cfccc71 cache: add the missing NC_NOMAKEENTRY and NC_KEEPPOSENTRY to lockless lookup
They are de facto ignored.
2020-10-24 13:31:25 +00:00
Mateusz Guzik
7cc1718613 vfs: fix a race where reclaim vholds freed vnodes
Reported by:	pho
Tested by:	pho (previous version)
Fixes:	r366974 ("vfs: stop taking the interlock in vnode reclaim")
2020-10-24 13:30:37 +00:00
Mateusz Guzik
acb41008f3 cache: batch updates to numcache in case of mass removal 2020-10-24 01:14:52 +00:00
Mateusz Guzik
208cb7c4b6 cache: refactor alloc/free
This in particular centralizes manipulation of numcache.
2020-10-24 01:14:17 +00:00
Mateusz Guzik
1d44405690 cache: fold branch prediction into cache_ncp_canuse 2020-10-24 01:13:47 +00:00
Mateusz Guzik
c13d7d1f98 cache: fix some typos 2020-10-24 01:13:16 +00:00
Mateusz Guzik
f878526f20 cache: drop write-only vars 2020-10-24 01:13:02 +00:00
Ruslan Bukin
9729b14985 Move the iommu stubs to a generic place, so they are available on all the
platforms.

This allows to not depend on the IOMMU macro in AHCI driver.

Requested by:	kib
Suggested by:	andrew
Reviewed by:	kib
Sponsored by:	Innovate DSbD
Differential Revision:	https://reviews.freebsd.org/D26887
2020-10-23 21:27:48 +00:00
Mateusz Guzik
3862838921 cache: reduce memory waste in struct namecache
The previous scheme for calculating the total size was doing sizeof
on the struct and then adding the wanted space for the buffer.

nc_name is at offset 58 while sizeof(struct namecache) is 64.
With CACHE_PATH_CUTOFF of 39 bytes and 1 byte of padding we were
allocating 104 bytes for the entry and never accounting for the 6
byte padding, wasting that space.
2020-10-23 15:56:22 +00:00
Mateusz Guzik
703f3fafa5 vfs: stop taking the interlock in vnode reclaim
It no longer protects any of tested fields, keeping all the checks racy.

While here make vtryrecycle drop the vnode on its own. Avoids an additional
lock trip.
2020-10-23 15:49:18 +00:00
Mateusz Guzik
c7520caa4f vfs: prevent avoidable evictions on mkdir of existing directories
mkdir -p /foo/bar/baz will mkdir each path component and ignore EEXIST.

The NOCACHE lookup will make the namecache unnecessarily evict the existing entry,
and then fallback to the fs lookup routine eventually leading namei to return an
error as the directory is already there.

For invocations like mkdir -p /usr/obj/usr/src/sys/GENERIC/modules this triggers
fallbacks to the slowpath for concurrently executing lookups.

Tested by:	pho
Discussed with:	kib
2020-10-22 19:28:12 +00:00
Mateusz Guzik
54f09403a3 cache: assert the created entry does not point to itself 2020-10-22 19:22:34 +00:00
Konstantin Belousov
18b8496c23 sysv_sem: semusz depends on semume.
Size of the per-process semaphore undo structure (semusz) depends on
the number of the per-process undos.  If kern.ipc.semume is adjusted,
semusz must be adjusted as well, and it makes no sense to delegate
adjustment to user.  Make it automatic.

Reported and tested by:	Olef <o.vandestadt@gmail.com>
PR:	250361
Reviewed by:	jhb, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D26826
2020-10-22 09:28:11 +00:00
Hans Petter Selasky
2ae634c6db Implement mbuf hashing routines for IP over infiniband, IPoIB.
No functional change intended.

Differential Revision:	https://reviews.freebsd.org/D26254
Reviewed by:		melifaro@
MFC after:		1 week
Sponsored by:		Mellanox Technologies // NVIDIA Networking
2020-10-22 09:17:56 +00:00
Brooks Davis
44ca4575ea vmapbuf: don't smuggle address or length in buf
Instead, add arguments to vmapbuf.  Since this argument is
always a pointer use a type of void * and cast to vm_offset_t in
vmapbuf.  (In CheriBSD we've altered vm_fault_quick_hold_pages to
take a pointer and check its bounds.)

In no other situtation does b_data contain a user pointer and vmapbuf
replaces b_data with the actual mapping.

Suggested by:	jhb
Reviewed by:	imp, jhb
Obtained from:	CheriBSD
MFC after:	1 week
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D26784
2020-10-21 16:00:15 +00:00
Mateusz Guzik
2f1c35053c cache: drop the spurious slash_prefixed argument 2020-10-21 05:57:25 +00:00
Mateusz Guzik
ab21ed17ed vfs: drop the de facto curthread argument from VOP_INACTIVE 2020-10-20 07:19:03 +00:00
Mateusz Guzik
8ecd87a3e7 vfs: drop spurious cred argument from VOP_VPTOCNP 2020-10-20 07:18:27 +00:00
Konstantin Belousov
c0baa3dc4a vgonel(): avoid recursing into VOP_INACTIVE().
It is a common pattern for filesystems' VOP_INACTIVE() implementation
to forcibly reclaim the vnode when its state is final.  For instance,
UFS vnode with zero link count is removed, and since it is
inactivated, the last open reference on it is dropped.

On the other hand, vnode might get spurious usecount reference for
many reasons.  If the spurious reference exists while vgonel() checks
for active state of the vnode, it would recurse into VOP_INACTIVE().

Fix it by checking and not doing inactivation when vgone() was called
from inactive VOP.

Reported and tested by:	pho
Discussed with:	mjg
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2020-10-19 19:20:23 +00:00
Mateusz Guzik
6d5d469fc1 cache: promote negative entries based on more than one hit
During tinderbox and similar workloads negative entries get at least one
hit before they get evicted. In the current scheme this avoidably promotes
them.

Be conservative and stick to 2 hits for now.
2020-10-19 18:51:51 +00:00
John Baldwin
6bcf3c46d8 Check TF_TOE not the tod pointer to determine if TOE is active.
The TF_TOE flag is the check used in the rest of the network stack to
determine if TOE is active on a socket.  There is at least one path in
the cxgbe(4) TOE driver that can leave the tod pointer non-NULL on a
socket not using TOE.

Reported by:	Sony Arpita Das <sonyarpitad@chelsio.com>
Reviewed by:	np
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D26803
2020-10-19 18:24:06 +00:00
Mark Johnston
d80126a6f4 link_elf_obj: Colour VM objects
This will cause the VM to back sufficiently large .text sections, such
as those in zfs.ko or amdgpu.ko on amd64, with superpage mappings when
possible.

Reviewed by:	alc, kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26802
2020-10-19 16:57:59 +00:00
Mark Johnston
6351771b7c vmem: Allocate btags before looping in vmem_xalloc()
BT_MAXALLOC (4) is the number of boundary tags required to complete an
allocation in the worst case: two to clip a free segment, and two to
import from a parent arena.  vmem_xalloc() preallocates four boundary
tags before attempting a search to simplify the segment allocation code.
It implements a loop that:
1) ensures that BT_MAXALLOC boundary tags are available,
2) attempts to find and clip a free segment satisfying the allocation
   constraints, and failing that,
3) attempts to import a segment.

On !UMA_MD_SMALL_ALLOC platforms the btag zone has to handle recusion:
it needs boundary tags to allocate boundary tags.  Thus we reserve
2 * BT_MAXALLOC * mp_ncpus tags for use when recursing: the factor of 2
is because there are two layers of vmem arenas, the per-domain arena and
global arena.  For a single thread, 2 * BT_MAXALLOC tags should be
sufficient.

Because of the way the loop is structured, BT_MAXALLOC tags are not
sufficient.  The first bt_fill() call may allocate BT_MAXALLOC tags,
then import a segment (consuming two tags), then attempt to top up the
preallocation before carving into the imported free segment, thus
requiring up to six tags in the worst case.  Because we don't
preallocate that many, this bug can cause deadlocks in rare scenarios.

Fix the problem by moving the preallocation out the loop.  This assumes
that only a single import is ever required to satisfy an allocation
request.

Thanks to manu, emaste and lwhsu for helping test debug patches.

Reported by:	Jenkins (hardware CI lab)
Reviewed by:	alc, kib, rlibby
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26770
2020-10-19 16:54:06 +00:00
Mark Johnston
33a9bce62f vmem: Simplify bt_fill() callers a bit
No functional change intended.

Reviewed by:	alc, kib, rlibby
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26769
2020-10-19 16:52:27 +00:00
Ruslan Bukin
e707c8be4e Manage MSI iommu pages.
This allows the interrupt controller driver only need a small change to
create a map for the page the device will write to raise an interrupt.

Submitted by:	andrew
Reviewed by:	kib
Sponsored by:	Innovate DSbD
Differential Revision:	https://reviews.freebsd.org/D26705
2020-10-19 13:10:21 +00:00
Mateusz Guzik
665c8c3e7d cache: refactor negative promotion/demotion handling
This will simplify policy changes.
2020-10-19 09:52:52 +00:00
Bjoern A. Zeeb
f7a0bb0dec ddb: add show sysinit command
Add a show sysinit command to ddb (similar to show vnet_sysinit) which
proved to be helpful to debug some ordering issues on early-mid kernel
start panics.
2020-10-17 22:47:08 +00:00
Mateusz Guzik
4c4aa84848 cache: shorten names of debug stats 2020-10-17 21:30:46 +00:00
Mateusz Guzik
676557143f cache: don't automatically evict negative entries if usage is low
The previous scheme only looked at negative entry count in relation to the
total count, leading to tons of spurious evictions if the cache is not
significantly populated.

Instead, only try the above if negative entry count goes beyond namecache
capacity.
2020-10-17 21:22:40 +00:00
Mateusz Guzik
e98c3bc667 cache: erwork sysctl vfs.cache tree
Split everything into neg, debug, param and stat categories.

The legacy nchstats sysctl (queried e.g., by systat) remains untouched.

While here rename some vars to be easier on the eye.
2020-10-17 13:06:29 +00:00
Mateusz Guzik
fa7c73d30c cache: factor negative lookup out of cache_fplookup_next 2020-10-17 13:04:46 +00:00
Mateusz Guzik
41e6b18422 cache: avoid smr in cache_neg_evict in favoro of the already held bucket lock 2020-10-17 13:04:25 +00:00
Mateusz Guzik
c38d8e1eb2 cache: rework parts of negative entry management
- declutter sysctl vfs.cache by moving relevant entries into
vfs.cache.neg
- add a little more parallelism to eviction by replacing the
global lock with an atomically modified counter
- track more statistics

The code needs further effort.
2020-10-17 08:48:58 +00:00
Mateusz Guzik
b31b5e9cfd cache: remove entries before trying to add new ones, not after
Should allow positive entries to replace negative ones in case
the cache is full.
2020-10-17 08:48:32 +00:00
Mateusz Guzik
ad89066af4 vfs: annotate mountlist_mtx with __exclusive_cache_line 2020-10-17 08:47:08 +00:00
Mateusz Guzik
d6eee35004 cache: add a probe reporting addition of duplicate entries 2020-10-17 00:27:26 +00:00
Mateusz Guzik
a59b0ac3aa cache: flip inverted condition in previous
It happened to not affect correctness in that the fallback code would
simply neglect to promote the entry.
2020-10-16 02:19:33 +00:00
Mateusz Guzik
e7602e04c7 cache: support negative entry promotion in slowpath smr 2020-10-16 00:56:13 +00:00
Mateusz Guzik
571bc3d1af cache: elide vhold/vdrop around promoting negative entry 2020-10-16 00:55:57 +00:00
Mateusz Guzik
640e6162ee cache: dedup code for negative promotion 2020-10-16 00:55:31 +00:00
Mateusz Guzik
c97c8746c0 cache: neglist -> nl; negstate -> ns
No functional changes.
2020-10-16 00:55:09 +00:00
Mateusz Guzik
43777a207d cache: split hotlist between existing negative lists
This simplifies the code while allowing for concurrent negative eviction
down the road.

Cache misses increased slightly due to higher rate of evictions allowed by
the change.

The current algorithm remains too aggressive.
2020-10-15 17:44:17 +00:00
Mateusz Guzik
430dc4518d cache: make neglist an array given the static size 2020-10-15 17:42:22 +00:00
Brooks Davis
16e4a0c89c physio: Don't store user addresses in bio_data
Only assign the address from the iovec to bio_data if it is a kernel
address.  This was the single place where bio_data stored (however
briefly) a userspace pointer.

Reviewed by:	imp, markj
Obtained from:	CheriBSD
MFC after:	1 week
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D26783
2020-10-15 17:05:21 +00:00
Mateusz Guzik
214eccf4b6 vfs: add VOP_EAGAIN
Can be used to stub fplookup for example.
2020-10-15 04:48:14 +00:00
Konstantin Belousov
6f3b523c9a Avoid dump_avail[] redefinition.
Move dump_avail[] extern declaration and inlines into a new header
vm/vm_dumpset.h.  This fixes default gcc build for mips.

Reviewed by:	alc, scottph
Tested by:	kevans (previous version)
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D26741
2020-10-14 22:51:40 +00:00
John Baldwin
c2a8fd6f05 Permit sending empty fragments for TLS 1.0.
Due to a weakness in the TLS 1.0 protocol, OpenSSL will periodically
send empty TLS records ("empty fragments").  These TLS records have no
payload (and thus a page count of zero).  m_uiotombuf_nomap() was
returning NULL instead of an empty mbuf, and a few places needed to be
updated to treat an empty TLS record as having a page count of "1" as
0 means "no work to do" (e.g. nothing to encrypt, or nothing to mark
ready via sbready()).

Reviewed by:	gallatin
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D26729
2020-10-13 17:30:34 +00:00
Warner Losh
e59db46854 newbus: use ssize_t to match sb's len and size, fix ordering of space check
Both s_len and s_size are ssize_t, so their differece is also more
properly a ssize_t not a size_t. Also, assert that len is <= size when
we enter. This should always be the case. Ensure that we have that one
byte that we write to the end of the buffer before we do so, though
the error should already be set on the buffer if not, and the only
times we supply 'partial' buffers they should be plenty large.

Reviewed by: cem, jhb (prior version, I did cem's suggestion)
Differential Revsion: https://reviews.freebsd.org/D26752
2020-10-12 22:07:44 +00:00
Conrad Meyer
f8e8a06d23 random(4) FenestrasX: Push root seed version to arc4random(3)
Push the root seed version to userspace through the VDSO page, if
the RANDOM_FENESTRASX algorithm is enabled.  Otherwise, there is no
functional change.  The mechanism can be disabled with
debug.fxrng_vdso_enable=0.

arc4random(3) obtains a pointer to the root seed version published by
the kernel in the shared page at allocation time.  Like arc4random(9),
it maintains its own per-process copy of the seed version corresponding
to the root seed version at the time it last rekeyed.  On read requests,
the process seed version is compared with the version published in the
shared page; if they do not match, arc4random(3) reseeds from the
kernel before providing generated output.

This change does not implement the FenestrasX concept of PCPU userspace
generators seeded from a per-process base generator.  That change is
left for future discussion/work.

Reviewed by:	kib (previous version)
Approved by:	csprng (me -- only touching FXRNG here)
Differential Revision:	https://reviews.freebsd.org/D22839
2020-10-10 21:52:00 +00:00
Mateusz Guzik
dd28b379cb vfs: support lockless dirfd lookups 2020-10-10 03:48:17 +00:00
Bryan Drewery
c2c6fb90e0 Use unlocked page lookup for inmem() to avoid object lock contention
Reviewed By:	kib, markj
Submitted by:	mlaier
Sponsored by:	Dell EMC
Differential Revision:	https://reviews.freebsd.org/D26653
2020-10-09 23:49:42 +00:00
Mateusz Guzik
deb1339f3f vfs: fix a panic when truncating comming from copy_file_range
Truncating requires an exclusive lock, but it was not taken if the
filesystem indicates support for shared writes. This only concerns
ZFS.

In particular fixes cp of files which have trailing holes.

Reported by:	bdrewery
2020-10-09 20:31:42 +00:00
John Baldwin
7e8bd70cff Don't invoke semunload() if seminit() fails during MOD_LOAD.
The module handler code invokes a MOD_UNLOAD event immediately if
MOD_LOAD fails.  The result was that if seminit() failed, semunload()
was invoked twice.  semunload() is not idempotent however and would
try to remove it's process_exit eventhandler twice resulting in a
panic.

Reviewed by:	kib, markj
Obtained from:	CheriBSD
MFC after:	1 month
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D26696
2020-10-09 20:20:42 +00:00
Mateusz Guzik
eb88fed446 cache: fix vexec panic when racing against vgone
Use of dead_vnodeops would result in a panic instead of returning the intended
EOPNOTSUPP error.

While here make sure to abort, not just try to return a partial result.
The former allows the regular lookup to restart from scratch, while the latter
makes it stuck with an unusable vnode.

Reported by:	kevans
2020-10-09 19:10:00 +00:00
Rick Macklem
19fe23fa2b Make vn_generic_copy_file_range() interruptible via a signal.
Without this patch, when vn_generic_copy_file_range() is
doing a large copy, it will remain in the function for a
considerable amount of time, delaying handling of any
outstanding signals until the copy completes.

This patch adds checks for signals that need to be
processed after each successful data copy cycle.
When sig_intr() returns non-zero, vn_generic_copy_file_range()
will return.
The check "if (len < savlen)" ensures that some data
has been copied, so that progress will be made.

Note that, since copy_file_range(2) is allowed to
return fewer bytes copied than requested, it
will never return EINTR/ERESTART when sig_intr()
returns non-zero.

Reviewed by:	kib, asomers
Differential Revision:	https://reviews.freebsd.org/D26620
2020-10-09 01:04:28 +00:00
Konstantin Belousov
203dda8a63 sig_intr(9): return early if AST is not scheduled.
Check td_flags for relevant AST requests lock-less.  This opens the
race slightly wider where sig_intr() returns false negative, but might
be it is worth it.

Requested by:	mjg
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2020-10-08 22:34:34 +00:00
Konstantin Belousov
4ea4966009 Do not allow to use O_BENEATH as an oracle.
Specifically, if lookup() returned any error and the topping directory
was not latched, which means that (non-existent) path did not returned
to the topping location, give ENOTCAPABLE a priority over the lookup()
error.

PR:	249960
Reviewed by:	emaste, ngie
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D26695
2020-10-08 22:31:11 +00:00
Mitchell Horne
841dad02e9 Fix a loop condition
The correct way to identify the end of the metadata is two adjacent
entries set to zero/MODINFO_END. I made a typo and this was checking the
first entry twice.

Reported by:	rpokala
Sponsored by:	NetApp, Inc.
Sponsored by:	Klara, Inc.
2020-10-08 18:29:17 +00:00
Mitchell Horne
22e6a67086 Add a routine to dump boot metadata
The boot metadata (also referred to as modinfo, or preload metadata)
provides information about the size and location of the kernel,
pre-loaded modules, and other metadata (e.g. the EFI framebuffer) to be
consumed during by the kernel during early boot. It is encoded as a
series of type-length-value entries and is usually constructed by
loader(8) and passed to the kernel. It is also faked on some
architectures when booted by other means.

Although much of the module information is available via kldstat(8),
there is no easy way to debug the metadata in its entirety. Add some
routines to parse this data and allow it to be printed to the console
during early boot or output via a sysctl.

Since the output can be lengthly, printing to the console is gated
behind the debug.dump_modinfo_at_boot kenv variable as well as the
BOOTVERBOSE flag. The sysctl to print the metadata is named
debug.dump_modinfo.

Reviewed by:	tsoome
Sponsored by:	NetApp, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D26687
2020-10-08 18:02:05 +00:00
Hans Petter Selasky
eccb214897 The ethernet header structure is read-only. Add const keyword.
(This is a diff reduction towards D26254)

MFC after:		1 week
Sponsored by:		Mellanox Technologies // NVIDIA Networking
2020-10-08 11:25:19 +00:00
Mitchell Horne
44c705cf15 Handle kmod local relocation failures gracefully
It is possible for elf_reloc_local() to fail in the unlikely case of
an unsupported relocation type. If this occurs, do not continue to
process the file.

Reviewed by:	kib, markj (earlier version)
MFC after:	1 week
Sponsored by:	NetApp, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D26701
2020-10-07 23:14:49 +00:00
Warner Losh
bc683a89a3 Move kernel env global variables, etc to sys/kenv.h
The kernel globals for kenv are confined to 2 files that need them and
a few that likely shouldn't (but as written the code does). Move them
from sys/systm.h to sys/kenv.h. This removed a XXX from systm.h and
cleans it up a little bit...
2020-10-07 06:16:37 +00:00
Mitchell Horne
6debfd4b13 Remove unused function cpu_boot()
The prototype was added with the creation of kern_shutdown.c in r17658,
but it appears to have never been implemented. Remove it now.

Reviewed by:	cem, kib
Differential Revision:	https://reviews.freebsd.org/D26702
2020-10-06 23:16:56 +00:00
John Baldwin
56fb710f1b Store the send tag type in the common send tag header.
Both cxgbe(4) and mlx5(4) wrapped the existing send tag header with
their own identical headers that stored the type that the
type-specific tag structures inherited from, so in practice it seems
drivers need this in the tag anyway.  This permits removing these
extra header indirections (struct cxgbe_snd_tag and struct
mlx5e_snd_tag).

In addition, this permits driver-independent code to query the type of
a tag, e.g. to know what type of tag is being queried via
if_snd_query.

Reviewed by:	gallatin, hselasky, np, kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D26689
2020-10-06 17:58:56 +00:00
Ryan Moeller
92e17803cd Enable iterating all sysctls, even ones with CTLFLAG_SKIP
Add an "nextnoskip" sysctl that allows for listing of sysctls intended to be
normally skipped for cost reasons.

This makes it so the names/descriptions of those sysctls can be discovered with
sysctl -aN/sysctl -ad/sysctl -at.

It also makes it so children are visited when a node flagged with CTLFLAG_SKIP
is explicitly requested.

The intended use case is to mark the root "kstat" node with CTLFLAG_SKIP so that
the extensive and expensive stats are skipped by default but may still be easily
obtained without having to know them all (which may not even be possible) and
request each one-by-one.

Reviewed by:	jhb
MFC after:	2 weeks
Relnotes:	yes
Sponsored by:	iXsystems, Inc.
Differential Revision:	https://reviews.freebsd.org/D26560
2020-10-05 20:13:22 +00:00
Mateusz Guzik
4e2266100d cache: fix pwd use-after-free in setting up fallback
Since the code exits smr section prior to calling pwd_hold, the used
pwd can be freed and a new one allocated with the same address, making
the comparison erroneously true.

Note it is very unlikely anyone ran into it.
2020-10-05 19:38:51 +00:00
Mark Johnston
780766eb52 Remove sysctl_kern_consmute()
It is a trivial wrapper for sysctl_handle_int() since r184521.  Also
remove the NEEDGIANT flag, cn_mute is accessed locklessly.

MFC after:	1 week
2020-10-05 15:54:19 +00:00
Konstantin Belousov
0400be45e9 Add sig_intr(9).
It gives the answer would the thread sleep according to current state
of signals and suspensions.  Of course the answer is racy and allows
for false-negatives (no sleep when signal is delivered after process
lock is dropped).  Also the answer might change due to signal
rescheduling among threads in multi-threaded process.

Still it is the best approximation I can provide, to answering the
question was the thread interrupted.

Reviewed by:	markj
Tested by:	pho, rmacklem
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D26628
2020-10-04 16:33:42 +00:00
Konstantin Belousov
0c82fb267b Refactor sleepq_catch_signals().
- Extract suspension check into sig_ast_checksusp() helper.
- Extract signal check and calculation of the interruption errno into
  sig_ast_needsigchk() helper.
The helpers are moved to kern_sig.c which is the proper place for
signal-related code.

Improve control flow in sleepq_catch_signals(), to handle ret == 0
(can sleep) and ret != 0 (interrupted) only once, by separating
checking code into sleepq_check_ast_sq_locked(), which return value is
interpreted at single location.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D26628
2020-10-04 16:30:05 +00:00
Edward Tomasz Napierala
4658877815 Move KTRUSERRET() from userret() to ast(). It's a really long
detour - it writes ktrace entries to the filesystem - so the overhead
of ast() won't make any difference.

Reviewed by:	kib
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D26404
2020-10-03 12:03:08 +00:00
Mark Johnston
c88285c54a Fix the INVARIANTS build for 32-bit platforms
Reported by:	Jenkins
MFC with:	r366368
2020-10-02 18:54:37 +00:00
Mark Johnston
f31695cc64 Implement sparse core dumps
Currently we allocate and map zero-filled anonymous pages when dumping
core.  This can result in lots of needless disk I/O and page
allocations.  This change tries to make the core dumper more clever and
represent unbacked ranges of virtual memory by holes in the core dump
file.

Add a new page fault type, VM_FAULT_NOFILL, which causes vm_fault() to
clean up and return an error when it would otherwise map a zero-filled
page.  Then, in the core dumper code, prefault all user pages and handle
errors by simply extending the size of the core file.  This also fixes a
bug related to the fact that vn_io_fault1() does not attempt partial I/O
in the face of errors from vm_fault_quick_hold_pages(): if a truncated
file is mapped into a user process, an attempt to dump beyond the end of
the file results in an error, but this means that valid pages
immediately preceding the end of the file might not have been dumped
either.

The change reduces the core dump size of trivial programs by a factor of
ten simply by excluding unaccessed libc.so pages.

PR:		249067
Reviewed by:	kib
Tested by:	pho
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26590
2020-10-02 17:50:22 +00:00
Mark Johnston
fec41f0751 Simplify the check for non-dumpable VM object types
OBJT_DEFAULT, _SWAP, _VNODE and _PHYS is exactly the set of
non-fictitious object types, so just check for OBJ_FICTITIOUS.  The
check no longer excludes dead objects, but such objects have to be
handled regardless.

No functional change intended.

Reviewed by:	alc, dougm, kib
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26589
2020-10-02 17:49:13 +00:00
Mateusz Guzik
aa34e791fa cache: update the commentary for path parsing 2020-10-02 14:50:03 +00:00
Bryan Drewery
9ceba22462 Revert r366340.
CR wasn't finished and it breaks the build.
2020-10-01 20:08:27 +00:00
Bryan Drewery
2398cd1103 Use unlocked page lookup for inmem() to avoid object lock contention
Reviewed By:	kib, markj
Sponsored by:	Dell EMC Isilon
Submitted by:	mlaier
Differential Revision:	https://reviews.freebsd.org/D26597
2020-10-01 19:17:03 +00:00
Edward Tomasz Napierala
4c6f466cb4 Only clear TDP_NERRNO when needed, ie when it's previously been set.
Reviewed by:	kib
Tested by:	pho
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D26612
2020-10-01 18:45:31 +00:00
Mateusz Guzik
b5ab177a99 cache: properly report ENOTDIR on foo/bar lookups where foo is a file
Reported by:	fernape
2020-10-01 08:46:21 +00:00
Rick Macklem
961afe3c99 Clip the "len" argument to vn_generic_copy_file_range() at a
hole size boundary.

By clipping the len argument of vn_generic_copy_file_range() to end at
an exact multiple of hole size, holes are more likely to be maintained
during the copy.
A hole can still straddle the boundary at the end of the
copy range, resulting in a block being allocated in the
output file as it is being grown in size, but this will reduce the
likelyhood of this happening.

While here, also modify setting of blksize to better handle the
case where _PC_MIN_HOLE_SIZE is returned as 1.

Reviewed by:	asomers
Differential Revision:	https://reviews.freebsd.org/D26570
2020-10-01 00:33:44 +00:00
John Baldwin
8128c65b4c Avoid a dubious assignment to bio_data in aio_qbio().
A user pointer is not a suitable value for bio_data and the next block
of code always overwrites bio_data anyway.  Just use cb->aio_buf
directly in the call to vm_fault_quick_hold_pages().

Reviewed by:	kib
Obtained from:	CheriBSD
MFC after:	1 month
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D26595
2020-09-30 17:49:06 +00:00
Mateusz Guzik
4301a5a794 cache: push the lock into cache_purge_impl 2020-09-30 17:08:34 +00:00
Mateusz Guzik
d4cac59429 cache: use cache_has_entries where appropriate instead of opencoding it 2020-09-30 04:27:38 +00:00
Rick Macklem
164aa1e941 Make copy_file_range(2) Linux compatible for overflow of offset + len.
Without this patch, if a call to copy_file_range(2) specifies an input file
offset + len that would wrap around, EINVAL is returned.
I thought that was the Linux behaviour, but recent testing showed that
Linux accepts this case and does the copy_file_range() to EOF.

This patch changes the FreeBSD code to exhibit the same behaviour as
Linux for this case.

Reviewed by:	asomers, kib
Differential Revision:	https://reviews.freebsd.org/D26569
2020-09-30 02:18:09 +00:00
Edward Tomasz Napierala
3409864922 Use the 'traced' variable instead of comparing p->p_flag again.
Reviewed by:	kib
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D26577
2020-09-29 11:18:48 +00:00
Kyle Evans
5f0601fd19 Address whitespace nits in subr_rtc.c
These were separated out from a nearby patch from Andrew Gierth.

MFC after:	3 days
2020-09-28 17:19:57 +00:00
Warner Losh
ab3f5b6ef2 For mulitcons boot, report it and which console is primary
Until we can do proper /etc/rc output on both consoles in multicons
boot (or all of them if we ever generalize), report when we are
booting multicons. Also report the primary console. This will be a big
hint why output stops after this line (though some slow USB discovery
still happens after mountroot / init starts).

Reviewed by: scottl@, tsoome@
Differential Revision: https://reviews.freebsd.org/D26574
2020-09-28 16:19:29 +00:00
Edward Tomasz Napierala
1e2521ffae Get rid of sa->narg. It serves no purpose; use sa->callp->sy_narg instead.
Reviewed by:	kib
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D26458
2020-09-27 18:47:06 +00:00
Edward Tomasz Napierala
0c5bd5f993 Regen after r366145.
Sponsored by:	DARPA
2020-09-25 10:05:38 +00:00
Alan Somers
5710395f4d Fix some signed/unsigned comparison warnings in NFS
Reviewed by:	rmacklem
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D26533
2020-09-24 15:38:01 +00:00
Konstantin Belousov
5dca94ee82 Remove pointless local variable.
Reported by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	6 days
2020-09-24 12:14:25 +00:00
Mateusz Guzik
1b2edd6e2b cache: eliminate cache_zap_locked_vnode
It is only ever called for negative entries and for those it is
just a wrapper around cache_zap_negative_locked_vnode_kl which
always succeeds.

This also fixes a bug where cache_lookup_fallback should have been
calling cache_zap_locked_bucket instead. Note that in order to trigger
the bug NOCACHE must not be set, which currently only happens when
creating a new coredump (and then the coredump-to-be has to have a
negative entry).
2020-09-24 03:38:32 +00:00
Mark Johnston
78257765f2 Add a vmparam.h constant indicating pmap support for large pages.
Enable SHM_LARGEPAGE support on arm64.

Reviewed by:	alc, kib
Sponsored by:	Juniper Networks, Inc., Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D26467
2020-09-23 19:34:21 +00:00
Konstantin Belousov
aaf78c16f5 Do not leak oldvmspace if image activation failed
and current address space is already destroyed, so kern_execve()
terminates the process.

While there, clean up some internals of post_execve() inlined in init_main.

Reported by:	Peter <pmc@citylink.dinoex.sub.org>
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D26525
2020-09-23 18:03:07 +00:00
Mateusz Guzik
a3d9bf49b5 cache: drop the force flag from purgevfs
The optional scan is wasteful, thus it is removed altogether from unmount.

Callers which always want it anyway remain unaffected.
2020-09-23 10:46:07 +00:00
Mateusz Guzik
a952fefff2 cache: reimplement purgevfs to iterate vnodes instead of the entire hash
The entire cache scan was a leftover from the old implementation.

It is incredibly wasteful in presence of several mount points and does not
win much even for single ones.
2020-09-23 10:44:49 +00:00
Mateusz Guzik
efeec5f0c6 cache: clean up atomic ops on numneg and numcache
- use subtract instead of adding -1
- drop the useless _rel fence

Note this should be converted to a scalable scheme.
2020-09-23 10:42:41 +00:00
Konstantin Belousov
1317da4349 Add O_RESOLVE_BENEATH and AT_RESOLVE_BENEATH to mimic Linux' RESOLVE_BENEATH.
It is like O_BENEATH, but disables to walk out of the subtree rooted
in the starting directory. O_BENEATH does not care if path walks out
if it returned.

Requested by:	Dan Gohman <dev@sunfishcode.online>
PR:	248335
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D25886
2020-09-22 22:48:12 +00:00
Konstantin Belousov
6a9c72d901 Change O_BENEATH to handle relative paths same as absolute.
Do not care if path walks out of the topping directory if it returns back.

Requested and reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D25886
2020-09-22 22:43:32 +00:00
Konstantin Belousov
07e7ad2b98 Only clear latch for BENEATH when we walk out of the startdir,
not unconditionally on any dotdot component.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D25886
2020-09-22 22:36:02 +00:00
Konstantin Belousov
4a0b316d2a Add open2nameif()
the helper to calculate namei flags both for open(2) and creat(2).

Suggested and reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D25886
2020-09-22 22:23:58 +00:00
Konstantin Belousov
861f039df1 Add at2cnpflags()
the helper to convert AT_ flags for *at() syscalls to namei flags.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D25886
2020-09-22 22:22:29 +00:00
Konstantin Belousov
c7de3d6f0b Add NIRES_STRICTREL.
Stop abusing internal namei flag NI_LCF_STRICTRELATIVE as indicator of
cap-restricted lookup.  Add designated returned flag NIRES_STRICTREL
to inform kern_openat() that lookup was restricted.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D25886
2020-09-22 22:06:20 +00:00
Konstantin Belousov
f9e46c9bf1 lookup: Track last lookup component if it is directory.
This makes open("/a/../a", O_BENEATH) with cwd == "/a" work.

Reviewed by:	markj
Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D25886
2020-09-22 21:59:18 +00:00
Konstantin Belousov
44619a5e86 Improve comment above nameicap_check_dotdot().
Explain why tracker is needed at all.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D25886
2020-09-22 21:54:30 +00:00
Mitchell Horne
624a7e1f4f Use getenv_is_true() in init_static_kenv()
A small example of how these functions can be used to simplify checks of
this nature.

Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D26271
2020-09-21 15:44:23 +00:00
Mitchell Horne
cba446e2c2 Add getenv(9) boolean parsing functions
This adds the getenv_bool() function, to parse a boolean value from a
kernel environment variable or tunable. This works for traditional
boolean values like "0" and "1", and also "true" and "false"
(case-insensitive). These semantics do not yet apply to sysctls declared
using SYSCTL_BOOL with CTLFLAG_TUN (they still only parse 1 and 0).

Also added are two wrapper functions, getenv_is_true() and
getenv_is_false(). These are slightly simpler for callers wishing to
perform a single check of a configuration variable.

Reviewed by:	jhb (slightly earlier version)
Sponsored by:	NetApp, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D26270
2020-09-21 15:24:44 +00:00
Michal Meloun
95a85c125d Add NetBSD compatible bus_space_peek_N() and bus_space_poke_N() functions.
One problem with the bus_space_read_N() and bus_space_write_N() family of
functions is that they provide no protection against exceptions which can
occur when no physical hardware or device responds to the read or write
cycles. In such a situation, the system typically would panic due to a
kernel-mode bus error. The bus_space_peek_N() and bus_space_poke_N() family
of functions provide a mechanism to handle these exceptions gracefully
without the risk of crashing the system.

Typical example is access to PCI(e) configuration space in bus enumeration
function on badly implemented PCI(e) root complexes (RK3399 or Neoverse
N1 N1SDP and/or access to PCI(e) register when device is in deep sleep state.

This commit adds a real implementation for arm64 only. The remaining
architectures have bus_space_peek()/bus_space_poke() emulated by using
bus_space_read()/bus_space_write() (without exception handling).

MFC after:	1 month
Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D25371
2020-09-19 11:06:41 +00:00
Eric van Gyzen
f9cc8410e1 vm_ooffset_t is now unsigned
vm_ooffset_t is now unsigned. Remove some tests for negative values,
or make other adjustments accordingly.

Reported by:	Coverity
Reviewed by:	kib markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D26214
2020-09-18 16:48:08 +00:00
Warner Losh
fd0a41d241 Move to a more robust and conservative alloation scheme for devctl messages
Change the zone setup:
- Allow slabs to be returned to the OS
- Set the number of slots to the max devctl will queue before discarding
- Reserve 2% of the max (capped at 100) for low memory allocations
- Disable per-cpu caching since we don't need it and we avoid some pathologies

Change the alloation strategiy a bit:
- If a normal allocation fails, try to get the reserve
- If a reserve allocation fails, re-use the oldest-queued entry for storage
- If there's a weird race/failure and nothing on the queue to steal, return NULL

This addresses two main issues in the old code:
- If devd had died, and we're generating a lot of messages, we have an
  unbounded leak. This new scheme avoids the issue that lead to this.
- The MPASS that was 'sure' the allocation couldn't have failed turned out
  to be wrong in some rare cases. The new code doesn't make this assumption.

Since we reserve only 2% of the space, we go from about 1MB of
allocation all the time to more like 50kB for the reserve.

Reviewed by: markj@
Differential Revision: https://reviews.freebsd.org/D26448
2020-09-17 17:29:33 +00:00
Edward Tomasz Napierala
70890254b3 Get rid of sv_errtbl and SV_ABI_ERRNO().
Reviewed by:	kib
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D26388
2020-09-17 11:39:33 +00:00
Konstantin Belousov
dd90d96342 Put calls to check_pgrp_jobc() in fixjobc_kill() under INVARIANTS.
Reported by:	Michael Butler <imb@protected-networks.net>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2020-09-17 00:07:15 +00:00
Konstantin Belousov
182cfe6ff4 Add check_pgrp_jobc() calls into process exit path.
Both before and after job control adjustments.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D26416
2020-09-16 21:49:19 +00:00
Konstantin Belousov
2f5f11f533 Fix fixjobc+orhpanage.
Orphans affect job control state, we must account for them when
changing pg_jobc.

Instead of p_pptr, use proc_realparent() to get parent relevant for
job control.

Use correct calculation of the parent for exiting process.  For jobc
purposes, we must use realparent, but if it is also exiting, we should
fall to reaper, then recursively find non-exiting reaper.

Reported by:	trasz
PR:	249257
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D26416
2020-09-16 21:46:57 +00:00
Konstantin Belousov
928b85384a Assert that P_TREE_GRPEXITED is set only once.
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D26416
2020-09-16 21:40:32 +00:00
Konstantin Belousov
844219f471 proc_realparent: if p_oppid does not match pid of the current parent
for non-orphaned process, return reaper instead of init.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D26416
2020-09-16 21:38:24 +00:00
Konstantin Belousov
82207cd246 Improve ddb 'show pgrpdump' command.
Use ddb pager.
Make lines more compact.
Eliminate unneeded casts.
Print more job-control related info when reporting process group.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D26416
2020-09-16 21:34:18 +00:00
Warner Losh
9ea860660f Use standard bool type, instead of non-standard boolean_t 2020-09-16 06:02:30 +00:00
Mark Johnston
b4e07e3da5 Fix locking in uipc_accept().
Reported by:	cy
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2020-09-15 23:03:56 +00:00
Konstantin Belousov
3c484f325e Convert page cache read to VOP.
There are several negative side-effects of not calling into VOP layer
at all for page cache reads.  The biggest is the missed activation of
EVFILT_READ knotes.

Also, it allows filesystem to make more fine grained decision to
refuse read from page cache.

Keep VIRF_PGREAD flag around, it is still useful for nullfs, and for
asserts.

Reviewed by:	markj
Tested by:	pho
Discussed with:	mjg
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D26346
2020-09-15 22:06:36 +00:00
Konstantin Belousov
888636655d vfs_subr.c: export io_hold_cnt and vn_read_from_obj().
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D26346
2020-09-15 22:00:58 +00:00
Konstantin Belousov
96474d2a3f Do not copy vp into f_data for DTYPE_VNODE files.
The pointer to vnode is already stored into f_vnode, so f_data can be
reused.  Fix all found users of f_data for DTYPE_VNODE.

Provide finit_vnode() helper to initialize file of DTYPE_VNODE type.

Reviewed by:	markj (previous version)
Discussed with:	freqlabs (openzfs chunk)
Tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D26346
2020-09-15 21:55:21 +00:00
Mark Johnston
448000279e Fix locking in uipc_accept().
This function wasn't converted to use the new locking protocol in
r333744.  Make it use the PCB lock for synchronizing connection state.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26300
2020-09-15 19:23:42 +00:00
Mark Johnston
ccdadf1a9b Simplify unix socket connection peer locking.
unp_pcb_owned_lock2() has some sharp edges and forces callers to deal
with a bunch of cases.  Simplify it:

- Rename to unp_pcb_lock_peer().
- Return the connected peer instead of forcing callers to load it
  beforehand.
- Handle self-connected sockets.
- In unp_connectat(), just lock the accept socket directly.  It should
  not be possible for the nascent socket to participate in any other
  lock orders.
- Get rid of connect_internal().  It does not provide any useful
  checking anymore.
- Block in unp_connectat() when a different thread is concurrently
  attempting to lock both sides of a connection.  This provides simpler
  semantics for callers of unp_pcb_lock_peer().
- Make unp_connectat() return EISCONN if the socket is already
  connected.  This fixes a race[1] when multiple threads attempt to
  connect() to different addresses using the same datagram socket.
  Upper layers will disconnect a connected datagram socket before
  calling the protocol connect's method, but there is no synchronization
  between this and protocol-layer code.

Reported by:	syzkaller [1]
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26299
2020-09-15 19:23:22 +00:00
Mark Johnston
ed92e1c78c Avoid an unnecessary malloc() when connecting dgram sockets.
The allocated memory is only required for SOCK_STREAM and SOCK_SEQPACKET
sockets.

Reviewed by:	kevans
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26298
2020-09-15 19:23:01 +00:00
Mark Johnston
f0317f868b Simplify unp_disconnect() callers.
In all cases, PCBs are unlocked after unp_disconnect() returns.  Since
unp_disconnect() may release the last PCB reference, callers may have to
bump the refcount before the call just so that they can release them
again.

Change unp_disconnect() to release PCB locks as well as connection
references; this lets us remove several refcount manipulations.  Tighten
assertions.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26297
2020-09-15 19:22:37 +00:00
Mark Johnston
4820bf6ac2 Rename unp_pcb_lock2().
unp_pcb_lock_pair() seems like a better name.  Also make it handle the
case where the two sockets are the same instead of making callers do it.
No functional change intended.

Reviewed by:	glebius, kevans, kib
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26296
2020-09-15 19:22:16 +00:00
Mark Johnston
5362170da7 Improve unix socket PCB refcounting.
- Use refcount_init().
- Define an INVARIANTS-only zone destructor to assert that various
  bits of PCB state aren't left dangling.
- Annotate unp_pcb_rele() with __result_use_check.
- Simplify control flow.

Reviewed by:	glebius, kevans, kib
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26295
2020-09-15 19:21:58 +00:00
Mark Johnston
d5cbccecd8 Update unix domain socket locking comments.
- Define a locking key for unpcb members.
- Rewrite some of the locking protocol description to make it less
  verbose and avoid referencing some subroutines which will be renamed.
- Reorder includes.

Reviewed by:	glebius, kevans, kib
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26294
2020-09-15 19:21:33 +00:00
Edward Tomasz Napierala
fecb19e431 Move td_softdep_cleanup() from userret() to ast(); it's infrequent
at best.  The schedule_cleanup() function already sets TDF_ASTPENDING.

Reviewed by:	kib, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D26375
2020-09-14 10:17:07 +00:00
Edward Tomasz Napierala
60f083efe2 Move TDP_GEOM check from userret() to ast(); this code path is quite
infrequent.

Reviewed by:	kib
No objections:	mav
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D26374
2020-09-14 10:14:03 +00:00
Edward Tomasz Napierala
30d158eecc Move racct/rctl throttling from userret() to ast(). There's no reason
for it to sit in the syscall fast path.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D26368
2020-09-14 09:44:24 +00:00
Scott Long
74c781ed91 Refine the busdma template interface. Provide tools for filling in fields
that can be extended, but also ensure compile-time type checking.  Refactor
common code out of arch-specific implementations.  Move the mpr and mps
drivers to this new API.  The template type remains visible to the consumer
so that it can be allocated on the stack, but should be considered opaque.
2020-09-14 05:58:12 +00:00
Konstantin Belousov
7978363417 Fix interaction between largepages and seals/writes.
On write with SHM_GROW_ON_WRITE, use proper truncate.
Do not allow to grow largepage shm if F_SEAL_GROW is set. Note that
shrinks are not supported at all due to unmanaged mappings.
Call to vm_pager_update_writecount() is only valid for swap objects,
skip it for unmanaged largepages.
Largepages cannot support write sealing.
Do not writecnt largepage mappings.

Reported by:	kevans
Reviewed by:	kevans, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D26394
2020-09-10 20:54:44 +00:00
Konstantin Belousov
d301b3580f Support for userspace non-transparent superpages (largepages).
Created with shm_open2(SHM_LARGEPAGE) and then configured with
FIOSSHMLPGCNF ioctl, largepages posix shared memory objects guarantee
that all userspace mappings of it are served by superpage non-managed
mappings.

Only amd64 for now, both 2M and 1G superpages can be requested, the
later requires CPU feature.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24652
2020-09-09 22:12:51 +00:00
Konstantin Belousov
25f44824ba uipc_shm.c: Move comment where it belongs.
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24652
2020-09-09 21:00:11 +00:00
Gleb Smirnoff
022c2f5570 In r354148 the goal was to check THREAD_CAN_SLEEP() only once for the
purpose of epoch_trace() and for calling subsequent panic, but to keep
code fully under INVARIANTS, so don't use bare function call to panic().
However, at the last stage of review a true value slipped in, while
always false was assumed.  I checked that in email archive with kib@.

Noticed by:	trasz
2020-09-09 16:13:33 +00:00
Konstantin Belousov
fbf2a77876 Convert allocations of the phys pager to vm_pager_allocate().
Future changes would require additional initialization of OBJT_PHYS
objects, and vm_object_allocate() is not suitable for it.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24652
2020-09-08 23:38:49 +00:00
Mateusz Guzik
54052edaa0 fd: fix fhold on an uninitialized var in fdcopy_remapped
Reported by:	gcc9
2020-09-08 16:07:47 +00:00
Mateusz Guzik
da62ed4f1a cache: drop write-only tvp_seqc vars 2020-09-08 16:06:46 +00:00
Mateusz Guzik
2bcfa5ba6f vfs: drop a write-only var in vfs_periodic_msync_inactive 2020-09-08 16:06:26 +00:00
Konstantin Belousov
7de1bc13e2 imgact_elf.c: unify check for phdr fitting into the first page.
Similar to the userspace rtld check.

Reviewed by:	dim, emaste (previous versions)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D26339
2020-09-07 21:37:16 +00:00
Chuck Silvers
a0a36d4886 vfs: avoid exposing partially constructed vnodes
If multiple threads race calling vfs_hash_insert() while creating vnodes
with the same identity, all of the vnodes which lose the race must be
destroyed before any other thread can see them. Previously this was
accomplished by the vput() in vfs_hash_insert() resulting in the vnode's
VOP_INACTIVE() method calling vgone() before the vnode lock was unlocked,
but at some point changes to the the vnode refcount/inactive logic have caused
that to no longer work, leading to crashes, so instead vfs_hash_insert()
must call vgone() itself before calling vput() on vnodes which lose the race.

Reviewed by:	mjg, kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D26291
2020-09-05 00:26:03 +00:00
Bjoern A. Zeeb
d29a3de296 uipc_ktls: remove unused static function
m_segments() was added with r363464 but never used. Remove it to
avoid warnings when compiling kernels.

Reported by:	rmacklem (also says jhb)
Reviewed by:	gallatin, jhb
Differential Revision:	https://reviews.freebsd.org/D26330
2020-09-05 00:19:40 +00:00
Andrew Gallatin
9675d8895a ktls: Check for a NULL send tag in ktls_cleanup()
When using ifnet ktls, and when ktls_reset_send_tag()
fails to allocate a replacement tag, it leaves
the tls session's snd_tag pointer NULL. ktls_cleanup()
tries to release the send tag, and will trip over
this NULL pointer and panic unless NULL is checked for.

Reviewed by:	jhb
Sponsored by:	Netflix
2020-09-04 17:36:15 +00:00
Brooks Davis
18f917a90e Always report ENOSYS in init
While rare, encountering an unimplemented system call early in init is
catastrophic and difficult to debug.  Even after a SIGSYS handler is
registered, such configurations are problematic.  As such, always report
such events for pid 1 (following kern.lognosys if non-zero).

Reviewed by:	kevans, imp
Obtained from:	CheriBSD (plus suggestions from kevans)
MFC after:	1 week
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D26288
2020-09-02 23:17:33 +00:00
Mateusz Guzik
b1a824b684 vfs: retire vholdl as a symbol
Similarly to vrefl in r364283.
2020-09-02 19:21:37 +00:00
Mateusz Guzik
2b4632aee9 vfs: purge cache entries early on vgone
There is no reason for them to linger across reclaim and it is an
invariant that doomed vnodes are not added to the namecache.
2020-09-02 19:21:10 +00:00
Mark Johnston
a0efcf6400 Add sysctl(8) formatting for hw.pagesizes.
- Change the type of hw.pagesizes to OPAQUE, since it returns an array.
- Modify the handler to only truncate the returned length if the caller
  supplied an output buffer.  This allows use of the trick of passing a
  NULL output buffer to fetch the output size, while preserving
  compatibility if MAXPAGESIZES is increased.
- Add a "S,pagesize" formatter to sysctl(8).

Reviewed by:	alc, kib
MFC after:	2 weeks
Sponsored by:	Juniper Networks, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D26239
2020-09-02 18:17:08 +00:00
Hans Petter Selasky
624677fad7 Assert that cc_exec_drain(cc, direct) is NULL before assigning a new value.
Suggested by:	markj@
Tested by:	callout_test
MFC after:	1 week
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2020-09-02 10:00:30 +00:00
Hans Petter Selasky
0d0053d7ed Micro optimise _callout_stop_safe() by removing dead code.
The CS_DRAIN flag cannot be set at the same time like the async-drain function
pointer is set. These are orthogonal features. Assert this at the beginning
of the function.

Before:
        if (flags & CS_DRAIN) {
                /* FALLTHROUGH */
        } else if (xxx) {
                return yyy;
        }
        if (drain) {
                zzz = drain;
        }
After:
        if (flags & CS_DRAIN) {
                /* FALLTHROUGH */
        } else if (xxx) {
                return yyy;
        } else {
                if (drain) {
                        zzz = drain;
                }
        }

Reviewed by:	markj@
Tested by:	callout_test
Differential Revision:	https://reviews.freebsd.org/D26285
MFC after:	1 week
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2020-09-02 09:44:00 +00:00
Mateusz Guzik
6fed89b179 kern: clean up empty lines in .c and .h files 2020-09-01 22:12:32 +00:00
Kyle Evans
5dd47b52e5 posixshm: fix setting of shm_flags
Noted in D24652, we currently set shmfd->shm_flags on every
shm_open()/shm_open2(). This wasn't properly thought out; one shouldn't be
able to specify incompatible flags on subsequent opens of non-anon shm.

Move setting of shm_flags explicitly to the two places shmfd are created, as
we do with seals, and validate when we're opening a pre-existing mapping
that we've either passed no flags or we've passed the exact same flags as
the first time.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D26242
2020-08-31 15:07:15 +00:00
Andrew Gallatin
796d4eb89e make m_getm2() resilient to zone_jumbop exhaustion
When the zone_jumbop is exhausted, most things using
using sosend* (like sshd)  will eventually
fail or hang if allocations are limited to the
depleted jumbop zone.  This makes it imossible to
communicate with a box which is under an attach which
exhausts the jumbop zone.

Rather than depending on the page size zone, also try cluster
allocations to satisfy larger requests.  This allows me
to ssh to, and serve 100Gb/s of traffic from a server which
under attack and has had its page-sized zone exhausted.

Reviewed by:	glebius, markj, rmacklem
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D26150
2020-08-31 13:53:14 +00:00
Vladimir Kondratyev
5d4bf0578f LinuxKPI: Implement ksize() function.
In Linux, ksize() gets the actual amount of memory allocated for a given
object. This commit adds malloc_usable_size() to FreeBSD KPI which does
the same. It also maps LinuxKPI ksize() to newly created function.

ksize() function is used by drm-kmod.

Reviewed by:	hselasky, kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D26215
2020-08-29 19:26:31 +00:00
Warner Losh
f6c941f347 We don't need to INCLUDENUL, so turn it off to avoid assertion...
sbuf_new_for_sysctl turns on INCLUDENUL, but we don't need it. And we assert for
it in the new bus_pnpinfo_sb and bus_location_sb strings.
2020-08-29 11:46:50 +00:00
Warner Losh
6dd5b77a15 Use sbuf_cat instead of sbuf_cpy
sbuf_cpy doesn't work with sysctl sbufs because of the drain function.
2020-08-29 11:18:10 +00:00
Warner Losh
5eade881a8 Avoid NULL pointer dereferences
Add back NULL pointer checks accidentally dropped in r364946. We need
to append a NUL character when that happens.
2020-08-29 09:59:52 +00:00
Warner Losh
17c219fd6f Move to using sbuf for some sysctl in newbus
Convert two different sysctl to using sbuf. First, for all the default
sysctls we implement for each device driver that's attached. This is a
pure sbuf conversion.

Second, convert sysctl_devices to fill its buffer with sbuf rather
than a hand-rolled crappy thing I wrote years ago.

Reviewed by: cem, markj
Differential Revision: https://reviews.freebsd.org/D26206
2020-08-29 04:30:12 +00:00
Warner Losh
887611b122 Retire devctl_notify_f()
devctl_notify_f isn't needed, so retire it. The flags argument is now
unused, so rather than keep it around, retire it. Convert all old
users of it to devctl_notify(). This path no longer sleeps, so is safe
to call from any context. Since it doesn't sleep, it doesn't need to
know if it is OK to sleep or not.

Reviewed by: markj@
Differential Revision: https://reviews.freebsd.org/D26140
2020-08-29 04:30:06 +00:00
Warner Losh
bca8f35f28 devctl: move to using a uma zone
Convert the memory management of devctl.  Rewrite if to make better
use of memory. This eliminates several mallocs (5? worse case) needed
to send a message. It's now possible to always send a message, though
if things are really backed up the oldest message will be dropped to
free up space for the newest.

Add a static bus_child_{location,pnpinfo}_sb to start migrating to
sbuf instead of buffer + length. Use it in the new code.  Other code
will be converted later (bus_child_*_str is only used inside of
subr_bus.c, though implemented in ~100 places in the tree).

Reviewed by: markj@
Differential Revision: https://reviews.freebsd.org/D26140
2020-08-29 04:29:53 +00:00
Kirk McKusick
66ac5b2c5a Add a comment to clarify when and why cached names are deleted
during pathname lookup.

Reviewed by:  kib
MFC after:    3 days
Sponsored by: Netflix
2020-08-27 22:14:58 +00:00
Mark Johnston
6255e8c8e2 Fix writing of the final block of encrypted, compressed kernel dumps.
Previously any residual data in the final block of a compressed kernel
dump would be written unencrypted.  Note, such a configuration already
does not work properly when using AES-CBC since the compressed data is
typically not a multiple of the AES block length in size and EKCD does
not implement any padding scheme.  However, EKCD more recently gained
support for using the ChaCha20 cipher, which being a stream cipher does
not have this problem.

Submitted by:	sigsys@gmail.com
Reviewed by:	cem
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D26188
2020-08-27 17:36:06 +00:00
Mateusz Guzik
84ecea90b7 cache: don't update timestmaps on found entry 2020-08-27 06:31:55 +00:00
Mateusz Guzik
5f08d440b0 cache: assorted clean ups
In particular remove spurious comments, duplicate assertions and the
inconsistently done KTR support.
2020-08-27 06:31:27 +00:00
Mateusz Guzik
12441fcbe2 cache: ncp = NULL early to account for sdt probes in ailure path
CID:	1432106
2020-08-27 06:30:40 +00:00
Warner Losh
cbda6f66f4 Implement FLUSHO
Turn FLUSHO on/off with ^O (or whatever VDISCARD is). Honor that to
throw away output quickly. This tries to remain true to 4.4BSD
behavior (since that was the origin of this feature), with any
corrections NetBSD has done. Since the implemenations are a little
different, though, some edge conditions may be handled differently.

Reviewed by: kib, kevans
Differential Review: https://reviews.freebsd.org/D26148
2020-08-27 05:11:15 +00:00
Rick Macklem
df665abd34 Fix a "v_seqc_users == 0 not met" panic when VFS_STATFS() fails during mount.
r363210 introduced v_seqc_users to the vnodes.  This change requires
a vn_seqc_write_end() to match the vn_seqc_write_begin() in
vfs_cache_root_clear().
mjg@ provided this patch which seems to fix the panic.

Tested for an NFS mount where the VFS_STATFS() call will fail.

Submitted by:	mjg
Reviewed by:	mjg
Differential Revision:	https://reviews.freebsd.org/D26160
2020-08-26 21:49:43 +00:00
Mark Johnston
41c6838786 vmem: Avoid allocating span tags when segments are never released.
vmem uses span tags to delimit imported segments, so that they can be
released if the segment becomes free in the future.  However, the
per-domain kernel KVA arenas never release resources, so the span tags
between imported ranges are unused when the ranges are contiguous.
Furthermore, such span tags prevent coalescing of free segments across
KVA_QUANTUM boundaries, resulting in internal fragmentation which
inhibits superpage promotion in the kernel map.

Stop allocating span tags in arenas that never release resources.  This
saves a small amount of memory and allows free segements to coalesce
across import boundaries.  This manifests as improved kernel superpage
usage during poudriere runs, which also helps to reduce physical memory
fragmentation by reducing the number of broken partially populated
reservations.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D24548
2020-08-26 14:31:35 +00:00
Mateusz Guzik
1e9a0b391d cache: relock on failure in cache_zap_locked_vnode
This gets rid of bogus scheme of yielding in hopes the blocking thread will
make progress.
2020-08-26 12:54:18 +00:00
Mateusz Guzik
075f58f231 cache: stop null checking in cache_free 2020-08-26 12:53:16 +00:00
Mateusz Guzik
66fa11c898 cache: make it mandatory to request both timestamps or neither 2020-08-26 12:52:54 +00:00
Mateusz Guzik
eef63775b6 cache: convert bucketlocks to a mutex
By now bucket locks are almost never taken for anything but writing and
converting to mutex simplifies the code.
2020-08-26 12:52:17 +00:00
Mateusz Guzik
32f3d0821c cache: only evict negative entries on CREATE when ISLASTCN is set 2020-08-26 12:50:57 +00:00
Mateusz Guzik
935e15187c cache: decouple smr and locked lookup in the slowpath
Tested by:	pho
2020-08-26 12:50:10 +00:00
Mateusz Guzik
d3476daddc cache: factor dotdot lookup out of cache_lookup
Tested by:	pho
2020-08-26 12:49:39 +00:00
Alan Somers
e6f6d0c9bc crypto(9): add CRYPTO_BUF_VMPAGE
crypto(9) functions can now be used on buffers composed of an array of
vm_page_t structures, such as those stored in an unmapped struct bio.  It
requires the running to kernel to support the direct memory map, so not all
architectures can use it.

Reviewed by:	markj, kib, jhb, mjg, mat, bcr (manpages)
MFC after:	1 week
Sponsored by:	Axcient
Differential Revision:	https://reviews.freebsd.org/D25671
2020-08-26 02:37:42 +00:00
Mateusz Guzik
a459a6cfe7 vfs: respect PRIV_VFS_LOOKUP in vaccess_smr
Reported by:	novel
2020-08-25 14:18:50 +00:00
Rick Macklem
22df1ffd81 Fix hangs with processes stuck sleeping on btalloc on i386.
r358097 introduced a problem for i386, where kernel builds will intermittently
get hung, typically with many processes sleeping on "btalloc".
I know nothing about VM, but received assistance from rlibby@ and markj@.

rlibby@ stated the following:
   It looks like the problem is that
   for systems that do not have UMA_MD_SMALL_ALLOC, we do
           uma_zone_set_allocf(vmem_bt_zone, vmem_bt_alloc);
   but we haven't set an appropriate free function.  This is probably why
   UMA_ZONE_NOFREE was originally there.  When NOFREE was removed, it was
   appropriate for systems with uma_small_alloc.

   So by default we get page_free as our free function.  That calls
   kmem_free, which calls vmem_free ... but we do our allocs with
   vmem_xalloc.  I'm not positive, but I think the problem is that in
   effect we vmem_xalloc -> vmem_free, not vmem_xfree.

   Three possible fixes:
    1: The one you tested, but this is not best for systems with
       uma_small_alloc.
    2: Pass UMA_ZONE_NOFREE conditional on UMA_MD_SMALL_ALLOC.
    3: Actually provide an appropriate vmem_bt_free function.

   I think we should just do option 2 with a comment, it's simple and it's
   what we used to do.  I'm not sure how much benefit we would see from
   option 3, but it's more work.

This patch implements #2. I haven't done a comment, since I don't know
what the problem is.

markj@ noted the following:
   I think the suggested patch is ok, but not for the reason stated.
   On platforms without a direct map the problem is:
   to allocate btags we need a slab,
   and to allocate a slab we need to map a page, and to map a page we need
   to allocate btags.

   We handle this recursion using a custom slab allocator which specifies
   M_USE_RESERVE, allowing it to dip into a reserve of free btags.
   Because the returned slab can be used to keep the reserve populated,
   this ensures that there are always enough free btags available to
   handle the recursion.

   UMA_ZONE_NOFREE ensures that we never reclaim free slabs from the zone.
   However, when it was removed, an apparent bug in UMA was exposed:
   keg_drain() ignores the reservation set by uma_zone_reserve()
   in vmem_startup().
   So under memory pressure we reclaim the free btags that are needed to
   break the recursion.
   That's why adding _NOFREE back fixes the problem: it disables the
   reclamation.

   We could perhaps fix it more cleverly, by modifying keg_drain() to always
   leave uk_reserve slabs available.

markj@'s initial patch failed testing, so committing this patch was agreed
upon as the interim solution.
Either rlibby@ or markj@ might choose to add a comment to it.

PR:		248008
Reviewed by:	rlibby, markj
2020-08-25 00:58:14 +00:00
Alexander V. Chernikov
592d300e34 Remove RT_LOCK mutex from rte.
rtentry lock traditionally served 2 purposed: first was protecting refcounts,
 the second was assuring consistent field access/changes.
Since route nexthop introduction, the need for the former disappeared and
 the need for the latter reduced.
To be more precise, the following rte field are mutable:

rt_nhop (nexthop pointer, updated with RIB_WLOCK, passed in rib_cmd_info)
rte_flags (only RTF_HOST and RTF_UP, where RTF_UP gets changed at rte removal)
rt_weight (relative weight, updated with RIB_WLOCK, passed in rib_cmd_info)
rt_expire (time when rte deletion is scheduled, updated with RIB_WLOCK)
rt_chain (deletion chain pointer, updated with RIB_WLOCK)
All of them are updated under RIB_WLOCK, so the only remaining concern is the reading.

rt_nhop and rt_weight (addressed in this review) are read under rib lock and
 stored in the rib_cmd_info, so the caller has no problem with consitency.
rte_flags is currently read unlocked in rtsock reporting (however the scope
 is only RTF_UP flag, which is pretty static).
rt_expire is currently read unlocked in rtsock reporting.
rt_chain accesses are safe, as this is only used at route deletion.

rt_expire and rte_flags reads will be dealt in a separate reviews soon.

Differential Revision:	https://reviews.freebsd.org/D26162
2020-08-24 20:23:34 +00:00
Warner Losh
f87655ec76 Change the resume notification event from 'kern' to 'kernel'
We have both a system of 'kern' and of 'kernel'. Prefer the latter and
convert this notification to use 'kernel' instead of 'kern'. As a
transition period, continue to also generate the 'kern' notification
until sometime after FreeBSD 13 is branched.

MFC After: 3 days
2020-08-24 19:35:15 +00:00
Mateusz Guzik
f9cdb0775e cache: remove leftover assert in vn_fullpath_any_smr
It is only valid when !slash_prefixed. For slash_prefixed the length
is properly accounted for later.

Reported by:	markj (syzkaller)
2020-08-24 18:23:58 +00:00
Mateusz Guzik
e35406c8f7 cache: lockless reverse lookup
This enables fully scalable operation for getcwd and significantly improves
realpath.

For example:
PATH_CUSTOM=/usr/src ./getcwd_processes -t 104
before:  1550851
after: 380135380

Tested by:	pho
2020-08-24 09:00:57 +00:00
Mateusz Guzik
feabaaf995 cache: drop the always curthread argument from reverse lookup routines
Note VOP_VPTOCNP keeps getting it as temporary compatibility for zfs.

Tested by:	pho
2020-08-24 08:57:02 +00:00
Mateusz Guzik
f0696c5e4b cache: perform reverse lookup using v_cache_dd if possible
Tested by:	pho
2020-08-24 08:55:55 +00:00
Mateusz Guzik
ce575cd0e2 cache: populate v_cache_dd for non-VDIR entries
It makes v_cache_dd into a little bit of a misnomer and it may be addressed later.

Tested by:	pho
2020-08-24 08:55:04 +00:00
Mateusz Guzik
f0d9c77e52 vfs: validate ndp state after the lookup
The intent is to remove known-to-be-nops NDFREE calls after many lookups.
2020-08-23 21:06:41 +00:00
Mateusz Guzik
4b5001196a vfs: convert nameiop into an enum
While here change the field size from long to int and move it into the
gap next to cn_flags.

Shrinks struct componentname from 64 to 56 bytes on amd64.
2020-08-23 21:05:39 +00:00