Commit Graph

17410 Commits

Author SHA1 Message Date
Jeff Roberson
a4d50e49da Add more precise SMR entry asserts. 2020-02-13 20:50:21 +00:00
Kyle Evans
b30ab6d8fe sys/kern sysent: re-add dependency on capabilities.conf
r356868 inadvertently removed this, so changes to capabilities.conf no longer
caused the generated files to be considered outdated.
2020-02-12 19:06:34 +00:00
Ed Maste
fe16bad415 regen sysent after r357831, r357838
Capability mode changes allowing fdatasync and getloginclass.

Sponsored by:	The FreeBSD Foundation
2020-02-12 19:05:10 +00:00
Ed Maste
e953765f15 Allow getloginclass in capability mode
As with e.g. getgroups and getlogin it allows querying current process
credential state.

Reported by:	sigsys@gmail.com via kevans
Sponsored by:	The FreeBSD Foundation
2020-02-12 18:59:00 +00:00
Ed Maste
9cdfb2d69a Allow fdatasync in capability mode
fdatasync is essentially a subset of fsync (and may be exactly fsync,
depending on filesystem and development effort) and operates only on
a provided fd.

MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2020-02-12 17:12:26 +00:00
Mateusz Guzik
4602214772 vfs: refactor vputx and add more comments
Reviewed by:	jeff (previous version)
Tested by:	pho (previous version)
Differential Revision:	https://reviews.freebsd.org/D23530
2020-02-12 11:19:07 +00:00
Mateusz Guzik
ed67a63c39 vfs: drop remaining zpcpu casts 2020-02-12 11:18:12 +00:00
Mateusz Guzik
123c519731 vfs: switch to smp_rendezvous_cpus_retry for vfs_op_thread_enter/exit
In particular on amd64 this eliminates an atomic op in the common case,
trading it for IPIs in the uncommon case of catching CPUs executing the
code while the filesystem is getting suspended or unmounted.
2020-02-12 11:17:45 +00:00
Mateusz Guzik
00ac9d2632 rms: use smp_rendezvous_cpus_retry instead of a hand-rolled variant 2020-02-12 11:17:18 +00:00
Mateusz Guzik
e4f584971b Add smp_rendezvous_cpus_retry
This is a wrapper around smp_rendezvous_cpus which enables use of IPI
handlers which can fail and require retrying.

The wait_func argument is added to provide a routine which can be used to
poll the CPU of interest for when the IPI can be retried.

Handlers which succeed must call smp_rendezvous_cpus_done to denote that
fact.

Discussed with:	 jeff
Differential Revision:	https://reviews.freebsd.org/D23582
2020-02-12 11:16:55 +00:00
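
A rough sketch of the retry pattern this enables. The helper names (try_apply, cpu_ready) and the exact argument shapes are illustrative assumptions; only smp_rendezvous_cpus_retry and smp_rendezvous_cpus_done come from the commit itself.

	/* Sketch: an IPI action that may fail and request a retry. */
	static void
	rendezvous_action(void *arg)
	{
		if (!try_apply(arg))		/* illustrative: cannot act right now */
			return;			/* caller will poll and re-send the IPI */
		smp_rendezvous_cpus_done(arg);	/* record success for this CPU */
	}

	/* Sketch: the wait_func polls the CPU of interest until a retry makes sense. */
	static void
	rendezvous_wait(void *arg, int cpu)
	{
		while (!cpu_ready(arg, cpu))	/* illustrative predicate */
			cpu_spinwait();
	}
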
Mateusz Guzik
3acb6572fc Store offset into zpcpu allocations in the per-cpu area.
This shortens zpcpu_get and allows more optimizations.

Reviewed by:	jeff
Differential Revision:	https://reviews.freebsd.org/D23570
2020-02-12 11:11:22 +00:00
Mateusz Guzik
48baf00f54 epoch: convert zpcpu_get_cpua(.., curcpu) to zpcpu_get 2020-02-12 11:10:10 +00:00
Gleb Smirnoff
4426b2e64b Add flag to struct task to mark the task as requiring network epoch.
When processing a taskqueue and a task has an associated epoch, enter it
for the duration of the task.  If consecutive tasks belong to the same
epoch, batch them.  For now this covers the network epoch only.

Shrink the ta_priority size to 8 bits.  No current consumers use a
priority that won't fit into 8 bits.  Also, the complexity of
taskqueue_enqueue() is quadratic in the maximum priority value, so we
are unlikely to ever want to go over UCHAR_MAX here.

Reviewed by:	hselasky
Differential Revision:	https://reviews.freebsd.org/D23518
2020-02-11 18:48:07 +00:00
Mateusz Guzik
57349a4f41 vfs: fix vhold race in mnt_vnode_next_lazy_relock
vdrop can set the hold count to 0 and wait for the ->mnt_listmtx held by the
mnt_vnode_next_lazy_relock caller. The routine incorrectly asserted that the
count has to be > 0.

Reported by:	pho
Tested by:	pho
2020-02-11 18:19:56 +00:00
Mateusz Guzik
1b853b62f3 capsicum: restore the cap_rights_contains symbol
It is expected to be provided by libc.

PR:		244033
Reported by:	 Jan Kokemueller
2020-02-11 18:13:53 +00:00
Mateusz Guzik
2e57c8fde7 vfs: fix device count leak on vrele racing with vgone
The race is:

CPU1                                CPU2
                                    devfs_reclaim_vchr
make v_usecount 0
                                      VI_LOCK
                                      sees v_usecount == 0, no updates
                                      vp->v_rdev = NULL;
                                      ...
                                      VI_UNLOCK
VI_LOCK
v_decr_devcount
  sees v_rdev == NULL, no updates

In this scenario si_devcount decrement is not performed.

Note this can only happen if the vnode lock is not held.

Reviewed by:	kib
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D23529
2020-02-10 22:28:54 +00:00
Li-Wen Hsu
37d4ece7c5 Restore the behavior of allowing empty string in a string sysctl
Added as a special case to avoid unnecessary memory operations.

Reviewed by:	delphij
Sponsored by:	The FreeBSD Foundation
2020-02-10 20:53:59 +00:00
Hans Petter Selasky
f912e8f2ff Fix for unbalanced EPOCH(9) usage in the generic kernel interrupt
handler.

Interrupt handlers are removed via intr_event_execute_handlers() when
IH_DEAD is set. The thread removing the interrupt is woken up and
calls intr_event_update(). When this happens, the ie_hflags are
cleared and rebuilt from all the remaining handlers sharing the
event. When the last IH_NET handler is removed, the IH_NET flag will
be cleared from ie_hflags (or ie_hflags may still be in the middle of
being rebuilt in a different context), and ithread_execute_handlers()
may return with ie_hflags missing IH_NET. This can lead to a scenario
where IH_NET was present before calling ithread_execute_handlers and is
not present at its return, meaning the need for epoch must be cached
locally.

This can happen when loading and unloading network drivers. Also make
sure ie_hflags is not cleared before being updated.

This is a regression issue after r357004.

Backtrace:
panic()
# trying to access epoch tracker on stack of dead thread
_epoch_enter_preempt()
ifunit_ref()
ifioctl()
fo_ioctl()
kern_ioctl()
sys_ioctl()
syscallenter()
amd64_syscall()

Differential Revision:	https://reviews.freebsd.org/D23483
Reviewed by:	glebius@, gallatin@, mav@, jeff@ and kib@
Sponsored by:	Mellanox Technologies
2020-02-10 20:23:08 +00:00
Mateusz Guzik
cd951a0d8e vfs: fix lock recursion in vrele
vrele is supposed to be called with an unlocked vnode, but this was never
asserted if v_usecount was > 0. For such counts the lock is never touched
by the routine. As a result the kernel has several consumers which expect
vunref semantics and get away with calling vrele since they happen to never do
it when this is the last reference (and for some of them this may happen to be
a guarantee).

Work around the problem by changing vrele semantics to tolerate being called
with a lock. This eliminates a possible bug where the lock is already held and
vputx takes it anyway.

Reviewed by:	kib
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D23528
2020-02-10 13:54:34 +00:00
Konstantin Belousov
48fcb46311 Add sysctl kern.proc.sigfastblk for reporting sigfastblock word address.
Tested by:	pho
Discussed with:	cem, emaste, jilles
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D12773
2020-02-09 12:29:51 +00:00
Konstantin Belousov
944cf37bb5 Add AT_BSDFLAGS auxv entry.
The intent is to provide BSD-specific flags relevant to the interpreter
and C runtime.  I did not want to reuse AT_FLAGS, which is a common ELF
auxv entry.

Use bsdflags to report kernel support for sigfastblock(2).  This
allows rtld and libthr to safely infer the syscall presence without
SIGSYS.  The tunable kern.elf{32,64}.sigfastblock blocks reporting.

Tested by:	pho
Discussed with:	cem, emaste, jilles
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D12773
2020-02-09 12:10:37 +00:00
Konstantin Belousov
f88c67a625 Regen. 2020-02-09 11:53:37 +00:00
Konstantin Belousov
146fc63fce Add a way to manage thread signal mask using shared word, instead of syscall.
A new syscall sigfastblock(2) is added which registers a uint32_t
variable as containing the count of blocks for signal delivery.  Its
content is read by the kernel on each syscall entry and on AST
processing; a non-zero block count is interpreted the same as a signal
mask blocking all signals.

The biggest downside of the feature that I see is that memory
corruption affecting the registered fast sigblock location would cause
quite strange application misbehavior. For instance, the process would
be immune to ^C (but killable by SIGKILL).

With consumers (rtld and libthr added), benchmarks do not show a
slow-down of the syscalls in micro-measurements, and macro benchmarks
like buildworld do not demonstrate a difference. Part of the reason is
that buildworld time is dominated by compiler, and clang already links
to libthr. On the other hand, small utilities typically used by shell
scripts have the total number of syscalls cut by half.

The syscall is not exported from the stable libc version namespace on
purpose.  It is intended to be used only by our C runtime
implementation internals.

Tested by:	pho
Discussed with:	cem, emaste, jilles
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D12773
2020-02-09 11:53:12 +00:00
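
A hedged, userland-flavoured sketch of the scheme described above, as a C runtime might use it. SIGBLOCK_INC, SIGBLOCK_PEND and do_sigfastblock_unblock() are illustrative stand-ins, not the real rtld/libthr internals, and the actual flag layout may differ; only the overall increment/decrement idea follows the commit message.

	#include <stdint.h>

	#define	SIGBLOCK_INC	0x10	/* illustrative: one "block" unit */
	#define	SIGBLOCK_PEND	0x01	/* illustrative: kernel deferred a signal */

	static _Thread_local uint32_t fast_sigblock;	/* word registered via sigfastblock(2) */

	void do_sigfastblock_unblock(void);	/* hypothetical wrapper around the syscall */

	static void
	rt_block_signals(void)
	{
		/* Non-zero count: the kernel treats all signals as blocked. */
		fast_sigblock += SIGBLOCK_INC;
	}

	static void
	rt_unblock_signals(void)
	{
		fast_sigblock -= SIGBLOCK_INC;
		/* Only if a signal arrived while blocked do we pay for a syscall. */
		if (fast_sigblock == SIGBLOCK_PEND)
			do_sigfastblock_unblock();
	}
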
Mateusz Guzik
2f7f11b7de vfs: tidy up vget_finish and vn_lock
- remove assertion which duplicates vn_lock
- use VNPASS instead of retyping the failure
- report what flags were passed if panicking on them
2020-02-08 15:52:20 +00:00
Mateusz Guzik
3eb6b656c2 vfs: remove now useless ENODEV handling from vn_fullpath consumers
Noted by:	ngie
2020-02-08 15:51:08 +00:00
Konstantin Belousov
300b525d29 Correct the function name in the comment.
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2020-02-08 15:06:06 +00:00
Mateusz Guzik
ea77ce6ef9 rms: use newly added zpcpu routines instead of direct access where appropriate 2020-02-07 22:44:41 +00:00
Jeff Roberson
a40068e524 Fix a race in smr_advance() that could result in unnecessary poll calls.
This was relatively harmless but surprising to see in counters.  The
race occurred when rd_seq was read after the goal was updated and we
incorrectly calculated the delta between them.

Reviewed by:	rlibby
Differential Revision:	https://reviews.freebsd.org/D23464
2020-02-06 20:51:46 +00:00
Jeff Roberson
8d7f16a5db Add some global counters for SMR. These may eventually become per-smr
counters.  In my stress test there is only one poll for every 15,000
frees.  This means we are effectively amortizing the cache coherency
overhead even with very high write rates (3M/s/core).

Reviewed by:	markj, rlibby
Differential Revision:	https://reviews.freebsd.org/D23463
2020-02-06 20:10:21 +00:00
Pawel Biernacki
210176ad76 sysctl(9): add CTLFLAG_NEEDGIANT flag
Add CTLFLAG_NEEDGIANT flag (modelled after D_NEEDGIANT) that will be used to
mark sysctls that still require locking Giant.

Rewrite sysctl_handle_string() to use internal locking instead of locking
Giant.

Mark SYSCTL_STRING, SYSCTL_OPAQUE and their variants as MPSAFE.

Add infrastructure support for enforcing proper use of CTLFLAG_NEEDGIANT
and CTLFLAG_MPSAFE flags with SYSCTL_PROC and SYSCTL_NODE, not enabled yet.

Reviewed by:	kib (mentor)
Approved by:	kib (mentor)
Differential Revision:	https://reviews.freebsd.org/D23378
2020-02-06 12:45:58 +00:00
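
A small kernel-side sketch of what the new annotations look like on SYSCTL_PROC declarations. The oid names and handler are made up for illustration; the flags are the ones this commit introduces and starts enforcing.

	#include <sys/param.h>
	#include <sys/kernel.h>
	#include <sys/sysctl.h>

	static int example_val;

	static int
	sysctl_example(SYSCTL_HANDLER_ARGS)
	{
		int v = example_val;

		return (sysctl_handle_int(oidp, &v, 0, req));
	}

	/* Handler needs no Giant: say so explicitly. */
	SYSCTL_PROC(_kern, OID_AUTO, example_mpsafe,
	    CTLTYPE_INT | CTLFLAG_RD | CTLFLAG_MPSAFE, NULL, 0,
	    sysctl_example, "I", "illustrative MPSAFE sysctl");

	/* Legacy handler that still relies on Giant. */
	SYSCTL_PROC(_kern, OID_AUTO, example_giant,
	    CTLTYPE_INT | CTLFLAG_RD | CTLFLAG_NEEDGIANT, NULL, 0,
	    sysctl_example, "I", "illustrative Giant-locked sysctl");
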
Mark Johnston
d3631aa582 Avoid releasing object PIP in vn_sendfile() if no pages were grabbed.
sendfile(2) optionally takes a set of headers that get prepended to the
file data.  If the request length is less than that of the headers,
sendfile may not allocate an sfio structure, in which case its pointer
is null and we should be careful not to dereference.  This was
introduced in r356902.

Reported by:	syzkaller
Sponsored by:	The FreeBSD Foundation
2020-02-05 16:09:21 +00:00
Leandro Lupori
eb5a41cf2f Add SYSCTL to get KERNBASE and relocated KERNBASE
This change adds 2 new SYSCTLs, to retrieve the original and relocated KERNBASE
values. This provides an easy, architecture independent way to calculate the
running kernel displacement (current/load address minus original base address).

The initial goal for this change is to add a new libkvm function that returns
the kernel displacement, both for live kernels and crashdumps. This would in
turn be used by kgdb to find out how to relocate kernel symbols (if needed).

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D23284
2020-02-05 11:34:10 +00:00
Mateusz Guzik
1a9fe4528b fd: always nullify *fdp in fget* routines
Some consumers depend on the pointer being NULL if an error is returned.

The guarantee got broken in r357469.

Reported by:	https://syzkaller.appspot.com/bug?extid=0c9b05e2b727aae21eef
Noted by:	markj
2020-02-05 00:20:26 +00:00
Ryan Libby
10c8fb47d9 uma: convert mbuf_jumbo_alloc to UMA_ZONE_CONTIG & tag others
Remove mbuf_jumbo_alloc and let large mbuf zones use the new uma default
contig allocator (a copy of mbuf_jumbo_alloc).  Tag other zones which
require contiguous objects, even if they don't use the new default
contig allocator, so that uma knows about their constraints.

Reviewed by:	jeff, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D23238
2020-02-04 22:40:23 +00:00
Konstantin Belousov
0783b70974 Remove unneeded assert for curproc. Simplify.
Reported by:	syzkaller by markj
Sponsored by:	The FreeBSD Foundation
2020-02-04 21:02:08 +00:00
Mark Johnston
60185d649b Correct the malloc tag used when freeing the temporary semop(2) buffer.
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2020-02-04 20:00:45 +00:00
Dmitry Chagin
cbc1089190 For code reuse in the Linuxulator, rename get_proccess_cputime()
and get_thread_cputime() and add prototypes for them to <sys/syscallsubr.h>.

As both functions become a public interface add process lock assert
to ensure that the process is not exiting under it.

Fix whitespace nit while here.

Reviewed by:		kib
Differential Revision:	https://reviews.freebsd.org/D23340
MFC after:	2 weeks
2020-02-04 05:25:51 +00:00
Jeff Roberson
bc6509845d Implement a deferred write advancement feature that can be used to further
amortize shared cacheline writes.

Discussed with: rlibby
Differential Revision:	https://reviews.freebsd.org/D23462
2020-02-04 02:44:52 +00:00
Jeff Roberson
c8ea36e881 Fix a recursion on the thread lock by acquiring it after calling rtp_to_pri().
Reported by:	swills
Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D23495
2020-02-04 02:42:54 +00:00
Mark Johnston
e489450589 Fix the !SMP case in sched_add() after r355779.
If the thread's lock is already that of the runqueue, don't recurse on
the queue lock.

Reviewed by:	jeff, kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D23492
2020-02-03 22:49:05 +00:00
Mateusz Guzik
8151b6e92a fd: partially unengrish the previous commit 2020-02-03 22:34:50 +00:00
Mateusz Guzik
e10f063b30 fd: streamline fget_unlocked
clang has the unfortunate property of paying little attention to prediction
hints when faced with a loop spanning the majority of the routine.

In particular fget_unlocked has an unlikely corner case where it starts almost
from scratch. Faced with this clang generates a maze of taken jumps, whereas
gcc produces jump-free code (in the expected case).

Work around the problem by providing a variant which only tries once and
resorts to calling the original code if anything goes wrong.

While here, note that the 'seq' parameter is almost never passed; the few
users who need it are redirected to call the original code directly.
2020-02-03 22:32:49 +00:00
Mateusz Guzik
52604ed792 fd: remove the seq argument from fget_unlocked
It is almost always NULL.
2020-02-03 22:27:55 +00:00
Mateusz Guzik
7f1566f884 fd: remove the seq argument from fget routines
It is almost always NULL.
2020-02-03 22:27:03 +00:00
Mateusz Guzik
0a1427c5ab ktrace: provide ktrstat_error
This eliminates a branch from its consumers trading it for an extra call
if ktrace is enabled for curthread. Given that this is almost never true,
the tradeoff is worth it.
2020-02-03 22:26:00 +00:00
Gleb Smirnoff
0017b2adac A couple of protocol drain routines (frag6_drain and sctp_drain) may send
packets, which is unexpected behaviour for a memory reclamation routine.
Either way, we need to enter the network epoch for doing that.
2020-02-03 20:48:57 +00:00
Kyle Evans
3d62f685d5 namei: preserve errors from fget_cap_locked
Most notably, we want to make sure we don't clobber any capabilities-related
errors. This is a regression from r357412 (O_SEARCH) that was picked up by
the capsicum tests.

PR:		243839
Reviewed by:	kib (committed form recommended by)
Tested by:	lwhsu
Differential Revision:	https://reviews.freebsd.org/D23479
2020-02-03 18:59:07 +00:00
Warner Losh
58aa35d429 Remove sparc64 kernel support
Remove all sparc64 specific files
Remove all sparc64 ifdefs
Remove indirect sparc64 ifdefs
2020-02-03 17:35:11 +00:00
Mateusz Guzik
bcd1cf4f03 capsicum: faster cap_rights_contains
Instead of doing a 2-iteration loop (determined at runtime), take advantage
of the fact that the size is already known.

While here, provide cap_check_inline so that fget_unlocked does not have to
do a function call.

Verified with the capsicum suite in /usr/tests.
2020-02-03 17:08:11 +00:00
Mateusz Guzik
fee204544e fd: fix f_count acquire in fget_unlocked
The code was using a hand-rolled fcmpset loop, while in other places the same
count is manipulated with the refcount API.

This turned from a stylistic issue into a bug after the API got extended
to support flags. As a result the hand-rolled loop could bump the count high
enough to set the flag bit. Another bump + refcount_release would then free
the file prematurely.

The bug is only present in -CURRENT.
2020-02-03 14:28:31 +00:00
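
An illustrative fragment (not the actual fget_unlocked code) contrasting the two patterns the commit describes: once reserved flag bits share the word with the count, a hand-rolled increment loop can walk the count into them, while the refcount(9) helper knows about the reserved bits.

	/* Hand-rolled acquire: blindly adds 1, ignoring any reserved flag bits. */
	count = fp->f_count;
	for (;;) {
		if (count == 0)
			return (false);
		if (atomic_fcmpset_int(&fp->f_count, &count, count + 1))
			break;
	}

	/* Preferred: the refcount(9) API performs the same acquire safely. */
	if (!refcount_acquire_if_not_zero(&fp->f_count))
		return (false);
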
Mateusz Guzik
f1fa1ba3d0 Fix up various vnode-related asserts which did not dump the used vnode 2020-02-03 14:25:32 +00:00
Kyle Evans
6a5abb1ee5 Provide O_SEARCH
O_SEARCH is defined by POSIX [0] to open a directory for searching, skipping
permissions checks on the directory itself after the initial open(). This is
close to the semantics we've historically applied for O_EXEC on a directory,
which is UB according to POSIX. Conveniently, O_SEARCH on a file is also
explicitly undefined behavior according to POSIX, so O_EXEC would be a fine
choice. The spec goes on to state that O_SEARCH and O_EXEC need not be
distinct values, but they're not defined to be the same value.

This was pointed out as an incompatibility with other systems that had made
its way into libarchive, which had assumed that O_EXEC was an alias for
O_SEARCH.

This defines compatibility O_SEARCH/FSEARCH (equivalent to O_EXEC and FEXEC
respectively) and expands our UB for O_EXEC on a directory. O_EXEC on a
directory is checked in vn_open_vnode already, so for completeness we add a
NOEXECCHECK when O_SEARCH has been specified on the top-level fd and do not
re-check that when descending in namei.

[0] https://pubs.opengroup.org/onlinepubs/9699919799/

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D23247
2020-02-02 16:34:57 +00:00
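
A minimal userland example of the semantics described above; the path names are made up, the O_SEARCH/openat pattern is the point.

	#include <fcntl.h>
	#include <unistd.h>

	int
	main(void)
	{
		/* Open the directory for searching only; no read permission needed. */
		int dfd = open("/var/db/pkg", O_SEARCH | O_DIRECTORY);
		if (dfd == -1)
			return (1);
		/* Lookups under dfd skip further permission checks on dfd itself. */
		int fd = openat(dfd, "local.sqlite", O_RDONLY);
		if (fd != -1)
			close(fd);
		close(dfd);
		return (0);
	}
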
Mateusz Guzik
2568d5bb79 fd: sprinkle some predicts around fget
clang inlines fget -> _fget into kern_fstat and eliminates several checks,
but prior to this change it would assume fget_unlocked was likely to fail
and consequently avoidable jumps got generated.
2020-02-02 09:38:40 +00:00
Mateusz Guzik
da4f45ea5c fd: use atomic_load_ptr instead of hand-rolled cast through volatile
No change in assembly.
2020-02-02 09:37:16 +00:00
Mateusz Guzik
6698e11f4b vfs: remove the now empty vop_unlock_post 2020-02-02 09:36:32 +00:00
Mateusz Guzik
7739d92766 cache: replace kern___getcwd with vn_getcwd
The previous routine was resulting in extra data copies most notably in
linux_getcwd.
2020-02-01 20:38:38 +00:00
Mateusz Guzik
921e7210f8 cache: return the total length from vn_fullpath1
This removes strlen from getcwd.
2020-02-01 20:37:11 +00:00
Mateusz Guzik
4511dd9d41 cache: remove vnode -> path lookup disablement
It seems to be of little to no use even when debugging.

Interested parties can resurrect it and gate compilation with a macro.
2020-02-01 20:36:35 +00:00
Mateusz Guzik
45757984f8 vfs: consistently use size_t for buflen around VOP_VPTOCNP 2020-02-01 20:34:43 +00:00
Mateusz Guzik
643656cfaf vfs: replace VOP_MARKATIME with VOP_MMAPPED
The routine is only provided by ufs and is only used on mmap and exec.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D23422
2020-02-01 06:46:55 +00:00
Mateusz Guzik
90f4ec3328 vfs: save on atomics on the root vnode for absolute lookups
There are 2 back-to-back atomics on the vnode, but we can check upfront if one
is sufficient. Similarly we can handle relative lookups where current working
directory == root directory.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D23427
2020-02-01 06:40:35 +00:00
Mateusz Guzik
21c4f1041e vfs: add vrefactn
Differential Revision:	https://reviews.freebsd.org/D23427
2020-02-01 06:39:49 +00:00
Jeff Roberson
915c367e8e Add two missing fences with comments describing them. These were found by
inspection and after a lengthy discussion with jhb and kib.  They have not
produced test failures.

Don't pointer chase through cpu0's smr.  Use the current CPU's smr even when
not in a critical section to reduce the likelihood of false sharing.
2020-01-31 22:21:15 +00:00
Mark Johnston
1c29da0279 Reimplement stack capture of running threads on i386 and amd64.
After r355784 the td_oncpu field is no longer synchronized by the thread
lock, so the stack capture interrupt cannot be delievered precisely.
Fix this using a loop which drops the thread lock and restarts if the
wrong thread was sampled from the stack capture interrupt handler.

Change the implementation to use a regular interrupt instead of an NMI.
Now that we drop the thread lock, there is no advantage to the latter.

Simplify the KPIs.  Remove stack_save_td_running() and add a return
value to stack_save_td().  On platforms that do not support stack
capture of running threads, stack_save_td() returns EOPNOTSUPP.  If the
target thread is running in user mode, stack_save_td() returns EBUSY.

Reviewed by:	kib
Reported by:	mjg, pho
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D23355
2020-01-31 15:43:33 +00:00
Mateusz Guzik
0f4d8b77c0 vfs: revert the overzealous assert added in r357285 to vgone
The intent was to make it more likely to catch filesystems with custom
need_inactive routines which fail to call vn_need_pageq_flush (or do an
equivalent).

One immediate case which is missed is vgone from called by inactive itself.

A better assertion may land later. The routine is not added to vputx because
it is of no use to tmpfs et al.

Reported by:	syzbot+5f697ec11f89b60941db@syzkaller.appspotmail.com
2020-01-31 11:31:14 +00:00
Mateusz Guzik
1a78ac2416 Add rms_try_rlock and rms_wowned. 2020-01-31 08:36:49 +00:00
Mateusz Guzik
cedad2916e Remove an overzealous assert from rms_runlock. 2020-01-31 08:36:23 +00:00
Jeff Roberson
da6e9935e4 Don't use "All rights reserved" in new copyrights.
Requested by:	rgrimes
2020-01-31 02:08:09 +00:00
Jeff Roberson
d4665eaa66 Implement a safe memory reclamation feature that is tightly coupled with UMA.
This is in the same family of algorithms as Epoch/QSBR/RCU/PARSEC but is
a unique algorithm.  This has 3x the performance of epoch in a write heavy
workload with less than half of the read side cost.  The memory overhead
is significantly lessened by limiting the free-to-use latency.  A synthetic
test uses 1/20th of the memory vs Epoch.  There is significant further
discussion in the comments and code review.

This code should be considered experimental.  I will write a man page after
it has settled.  After further validation the VM will begin using this
feature to permit lockless page lookups.

Both markj and cperciva tested on arm64 at large core counts to verify
fences on weaker ordering architectures.  I will commit a stress testing
tool in a follow-up.

Reviewed by:	mmacy, markj, rlibby, hselasky
Discussed with:	sbahara
Differential Revision:	https://reviews.freebsd.org/D22586
2020-01-31 00:49:51 +00:00
Mateusz Guzik
3ff65f71cb Remove duplicated empty lines from kern/*.c
No functional changes.
2020-01-30 20:05:05 +00:00
Mateusz Guzik
2823710f05 Tidy up 2 comments in smp_rendezvous_cpus. 2020-01-30 20:02:14 +00:00
Mateusz Guzik
7ab99925fd Assert that smp_rendezvous_cpus is called with interrupts enabled. 2020-01-30 19:38:51 +00:00
Mateusz Guzik
d53d924f60 vfs: keep the mount point referenced across sys_quotactl
Otherwise we risk running into use-after-free.

In particular this codepath ends up dropping all protection before
suspending writes:

ufs_quotactl -> quotaoff_inchange -> vfs_write_suspend_umnt

Reported by:	pho
2020-01-30 19:38:12 +00:00
John Baldwin
fbb9879c0c Fix use of an uninitialized variable.
ctx (and thus ctx.flags) is stack garbage at the start of this
function, so initialize ctx.flags to an explicit value instead of
using binary operations on the garbage.

Reported by:	gcc9
Reviewed by:	imp
Differential Revision:	https://reviews.freebsd.org/D23368
2020-01-30 18:28:02 +00:00
Mateusz Guzik
c2ef6aa3d5 vfs: assert that doomed vnodes don't need to call vm_object_page_clean
... after the optional inactive processing.
2020-01-30 04:59:08 +00:00
Mateusz Guzik
07c6e2f4ab vfs: unlazy before dooming the vnode
With this change having the listmtx lock held postpones dooming the vnode.
Use this fact to simplify iteration over the lazy list. It also allows
filters to safely access ->v_data.

Reviewed by:	kib (early version)
Differential Revision:	https://reviews.freebsd.org/D23397
2020-01-30 02:12:52 +00:00
Gleb Smirnoff
79674264df Fix text format definition for kern.maxvnodes, vfs.wantfreevnodes. This
is a regression from r356642, r356645.
2020-01-30 00:18:00 +00:00
Conrad Meyer
07a65f9d38 hwpstate_intel(4): Silence/fix Coverity reports
These were all introduced in the initial import of hwpstate_intel(4).

Reported by:	Coverity
CIDs:		1413161, 1413164, 1413165, 1413167
X-MFC-With:	r357002
2020-01-29 03:15:34 +00:00
Warner Losh
42ec4f05a3 Make mqueue objects work across a fork again.
In r110908 (2003) alfred added DFLAG_PASSABLE to tag those types of FD
that can be passed via unix pipes, but mqueuefs didn't exist
yet. Later, in r152825 (2005) davidxu neglected to include
DFLAG_PASSABLE since people don't normally pass these things via unix
sockets (it's a FreeBSD implementation detail that it's a file
descriptor, nobody noticed). Then r223866 (2011) by jonathan used the
new flag in fdcopy, which fork uses. Due to that, mqueuefs actually
broke mqueue objects being propagated by fork. No mention of mqueuefs
was made in r223866, so I think it was an unintended consequence.

Fix this by tagging mqueuefs as passable as well. They were prior to
alfred's change (and it's clear there's no intent in his change to
change this behavior), and POSIX requires this to be the case as well.

PR: 243103
Reviewed by: kib@, jilles@
Differential Revision: https://reviews.freebsd.org/D23038
2020-01-27 22:36:54 +00:00
John Baldwin
425e5f9dcf Revert accidental change from r357146. 2020-01-26 14:23:27 +00:00
John Baldwin
c73222d0e6 Fix some misleading indentation warnings reported by recent clang.
These should not be any functional change.  While the change in
emul10kx-pcm.c looks like a real bug fix (as opposed to inconsistent
whitespace), the extra statements were not harmful.

Reviewed by:	kib
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D23363
2020-01-26 14:20:57 +00:00
Mateusz Guzik
1513f80391 vfs: do an unlocked check before iterating the lazy list
For most filesystems it is expected to be empty most of the time.
2020-01-26 07:06:18 +00:00
Mateusz Guzik
cd0e46c66b vfs: remove vop loop from vop_sigdefer
All ops are guaranteed to be present since r357131.
2020-01-26 07:05:06 +00:00
Mateusz Guzik
6d69e665dd vfs: fix freevnodes count update race against preemption
vdbatch_process leaves the critical section too early, opening a time
window where another thread can get scheduled and modify vd->freevnodes.
Once the preempted thread gets back it overwrites the value with 0.

Just move critical_exit to the end of the function.
2020-01-26 00:40:27 +00:00
Mateusz Guzik
dc9a1cb60b vfs: predict vn_lock failure as unlikely in vget 2020-01-26 00:34:57 +00:00
Jason A. Harmening
a9aa06f7b1 Implement cycle-detecting garbage collector for AF_UNIX sockets
The existing AF_UNIX socket garbage collector destroys any socket
which may potentially be in a cycle, as indicated by its file reference
count being equal to its enqueue count. However, this can produce false
positives for in-flight sockets which aren't part of a cycle but are
part of one or more SCM_RIGHTS mssages and which have been closed
on the sending side. If the garbage collector happens to run at
exactly the wrong time, destruction of these sockets will render them
unusable on the receiving side, such that no previously-written data
may be read.

This change rewrites the garbage collector to precisely detect cycles:

1. The existing check of msgcount==f_count is still used to determine
   whether the socket is potentially in a cycle.
2. The socket is now placed on a local "dead list", which is used to
   reduce iteration time (and therefore contention on the global
   unp_link_rwlock).
3. The first pass through the dead list removes each potentially-dead
   socket's outgoing references from the graph of potentially-dead
   sockets, using a gc-specific copy of the original reference count.
4. The second series of passes through the dead list removes from the
   list any socket whose remaining gc refcount is non-zero, as this
   indicates the socket is actually accessible outside of any possible
   cycle.  Iteration is repeated until no further sockets are removed
   from the dead list.
5. Sockets remaining in the dead list are destroyed as before.

PR:		227285
Submitted by:	jan.kokemueller@gmail.com (prior version)
Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D23142
2020-01-25 08:57:26 +00:00
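
An illustrative-only pseudo-C sketch of the scan described in steps 3-5 above; the names here (dead_list, gc_refs, and the helpers) are invented and do not match the actual unp_gc() implementation.

	bool removed;

	do {
		removed = false;
		LIST_FOREACH_SAFE(s, &dead_list, gc_link, tmp) {
			if (s->gc_refs > 0) {
				/* Reachable from outside the dead set: not garbage.
				   Give its outgoing references back to the graph so
				   the next pass sees them (illustrative helper). */
				restore_outgoing_refs(s);
				LIST_REMOVE(s, gc_link);
				removed = true;
			}
		}
	} while (removed);

	/* Anything still on dead_list is only reachable from within a cycle. */
	LIST_FOREACH_SAFE(s, &dead_list, gc_link, tmp)
		destroy_socket(s);	/* illustrative */
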
Mark Johnston
a89c2c8c34 Revert r357050.
It seems to have introduced a couple of regressions.

Reported by:	cy, pho
2020-01-24 14:58:02 +00:00
Edward Tomasz Napierala
b3fb13eb55 Add kern_unmount() and use in Linuxulator. No functional changes.
Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22646
2020-01-24 11:57:55 +00:00
Mateusz Guzik
28eb39a5ab vfs: allow v_usecount to transition 0->1 without the interlock
There is nothing to do but to bump the count even during said transition.
There are 2 places which can do it:
- vget only does this after locking the vnode, meaning there is no change in
  contract versus inactive or reclamantion
- vref only ever did it with the interlock held which did not protect against
  either (that is, it would always succeed)

VCHR vnodes retain special casing due to the need to maintain dev use count.

Reviewed by:	jeff, kib
Tested by:	pho (previous version)
Differential Revision:	https://reviews.freebsd.org/D23185
2020-01-24 07:47:44 +00:00
Mateusz Guzik
d93762b94d vfs: stop handling VI_OWEINACT in vget
vget is almost always called with LK_SHARED, meaning the flag (if present) is
almost guaranteed to get cleared. Stop handling it in the first place and
instead let the thread which wanted to do inactive handle the bumped usecount.

Reviewed by:	jeff
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D23184
2020-01-24 07:45:59 +00:00
Mateusz Guzik
74c4b7cc60 vfs: stop unlocking the vnode upfront in vput
Doing so runs into races with filesystems which make half-constructed vnodes
visible to other users, while depending on the chain vput -> vinactive ->
vrecycle to be executed without dropping the vnode lock.

Impediments for making this work got cleared up (notably vop_unlock_post now
does not do anything and lockmgr stops touching the lock after the final
write). Stacked filesystems keep vhold/vdrop across unlock, which arguably can
now be eliminated.

Reviewed by:	jeff
Differential Revision:	https://reviews.freebsd.org/D23344
2020-01-24 07:44:25 +00:00
Mateusz Guzik
c00115f108 lockmgr: don't touch the lock past unlock
This evens it up with other locking primitives.

Note lock profiling still touches the lock, which again is in line with the
rest.

Reviewed by:	jeff
Differential Revision:	https://reviews.freebsd.org/D23343
2020-01-24 07:42:57 +00:00
Mark Johnston
1bfca40c57 Set td_oncpu before dropping the thread lock during a switch.
After r355784 we no longer hold a thread's thread lock when switching it
out.  Preserve the previous synchronization protocol for td_oncpu by
setting it together with td_state, before dropping the thread lock
during a switch.

Reported and tested by:	pho
Reviewed by:	kib
Discussed with:	jeff
Differential Revision:	https://reviews.freebsd.org/D23270
2020-01-23 16:24:51 +00:00
Jeff Roberson
91e31c3c08 Consistently use busy and vm_page_valid() rather than touching page bits
directly.  This improves API compliance, asserts, etc.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D23283
2020-01-23 04:54:49 +00:00
Jeff Roberson
1eb13fce84 Block the thread lock in sched_throw() and use cpu_switch() to unblock
it.  The introduction of lockless switch in r355784 created a race to
re-use the exiting thread that was only possible to hit on a hypervisor.

Reported/Tested by:	rlibby
Discussed with:	rlibby, jhb
2020-01-23 03:36:50 +00:00
Gleb Smirnoff
ad3980121b DEVICE_POLLING is an alternative to network interrupts and also
needs to enter epoch.  Assert that in the netisr_poll() and do
the work for the idle poll routine.
2020-01-23 01:30:50 +00:00
Gleb Smirnoff
511d1afb6b Enter the network epoch for interrupt handlers of INTR_TYPE_NET.
Provide tunable to limit how many times handlers may be executed
without reentering epoch.

Differential Revision:	https://reviews.freebsd.org/D23242
2020-01-23 01:24:47 +00:00
Gleb Smirnoff
c4eb66309f Add ie_hflags to struct intr_event, which accumulates flags from all
handlers on this event.  For now handle only IH_ENTROPY in that manner.
2020-01-23 01:20:59 +00:00
Conrad Meyer
4577cf3744 cpufreq(4): Add support for Intel Speed Shift
Intel Speed Shift is Intel's technology to control frequency in hardware,
with hints from software.

Let's get a working version of this in the tree and we can refine it from
here.

Submitted by:	bwidawsk, scottph
Reviewed by:	bcr (manpages), myself
Discussed with:	jhb, kib (earlier versions)
With feedback from:	Greg V, gallatin, freebsdnewbie AT freenet.de
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D18028
2020-01-22 23:28:42 +00:00
Hans Petter Selasky
1f69a50940 Make sure the VNET is properly set when calling tcp_drop() from
the ktls taskqueue callback function.

A valid VNET is needed when updating statistics.

panic()
tcp_state_change()
tcp_drop()
ktls_reset_send_tag()
taskqueue_run_locked()
taskqueue_thread_loop()

Sponsored by:	Mellanox Technologies
2020-01-21 11:43:25 +00:00
Mateusz Guzik
6403455301 cache: revert r352613 now that vhold does not take locks 2020-01-20 19:52:23 +00:00
Mateusz Guzik
8bba93c7e0 cache: make numcachehv use counter(9) on all archs
Requested by:	kib
2020-01-20 14:42:11 +00:00
Jeff Roberson
d6e13f3b4d Don't hold the object lock while calling getpages.
The vnode pager does not want the object lock held.  Moving this out allows
further object lock scope reduction in callers.  While here add some missing
paging in progress calls and an assert.  The object handle is now protected
explicitly with pip.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D23033
2020-01-19 23:47:32 +00:00
Mateusz Guzik
a9099e5b10 vfs: switch vop_stdunlock to call lockmgr_unlock
Since the flags argument is now always 0 the new call provides the same
behavior.
2020-01-19 21:41:34 +00:00
Jeff Roberson
811d05fcb7 Provide an API for interlocked refcount sleeps.
Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D22908
2020-01-19 18:18:17 +00:00
Mateusz Guzik
28479aaae2 vfs: allow v_holdcnt to transition 0->1 without the interlock
Since r356672 ("vfs: rework vnode list management") there is nothing to do
apart from altering freevnodes count, but this much can be safely done based
on the result of atomic_fetchadd.

Reviewed by:	kib
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D23186
2020-01-19 17:47:04 +00:00
Mateusz Guzik
059cb4843b cache: counter_u64_add_protected -> counter_u64_add
Fixes booting on RISC-V, where the two happen not to be equivalent.

Reported by:	lwhsu
2020-01-19 17:05:26 +00:00
Mateusz Guzik
1399033590 cache: convert numcachehv to counter(9) on 64-bit platforms 2020-01-19 05:37:27 +00:00
Mateusz Guzik
512fa9a4e0 vfs: plug a conditional assignment of lo_name in getnewvnode
It only matters for witness. No functional changes.
2020-01-19 05:36:45 +00:00
Kyle Evans
05d7dd739c sysent targets: further cleanup and deduplication
r355473 vastly improved the readability and cleanliness of these Makefiles.
Every single one of them follows the same pattern and duplicates the exact
same logic.

Now that we have GENERATED/SRCS, split SRCS up into the two parameters we'll
use for ${MAKESYSCALLS} rather than assuming a specific ordering of SRCS and
include a common sysent.mk to handle the rest. This makes it less tedious to
make sweeping changes.

Some default values are provided for GENERATED/SYSENT_*; almost all of these
just use a 'syscalls.master' and 'syscalls.conf' in cwd, and they all use
effectively the same filenames with an arbitrary prefix. Most ABIs will be
able to get away with just setting GENERATED_PREFIX and including
^/sys/conf/sysent.mk, while others only need light additions. kern/Makefile
is the notable exception, as it doesn't take a SYSENT_CONF and the generated
files are spread out between ^/sys/kern and ^/sys/sys, but it otherwise fits
the pattern enough to use the common version.

Reviewed by:	brooks, imp
Nice!:		emaste
Differential Revision:	https://reviews.freebsd.org/D23197
2020-01-18 20:37:45 +00:00
Mateusz Guzik
2d0c620272 vfs: distribute freevnodes counter per-cpu
It gets rolled up to the global when deferred requeueing is performed.
A dedicated read routine makes sure to return a value only off by a certain
amount.

This soothes a global serialisation point for all 0<->1 hold count transitions.

Reviewed by:	jeff
Differential Revision:	https://reviews.freebsd.org/D23235
2020-01-18 01:29:02 +00:00
Mateusz Guzik
d3cc535474 vfs: provide F_ISUNIONSTACK as a kludge for libc
Prior to the introduction of this op, libc's readdir would call fstatfs(2),
in effect unnecessarily copying kilobytes of data just to check the fs name
and a mount flag.

Reviewed by:	kib (previous version)
Differential Revision:	https://reviews.freebsd.org/D23162
2020-01-17 14:42:25 +00:00
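
A short userland sketch of the kind of check this enables; the return-value convention assumed here is illustrative, the fcntl command name is the one the commit adds.

	#include <fcntl.h>

	static int
	dir_is_unionstack(int dirfd)
	{
		/* Replaces the old fstatfs(2)-based check used by readdir(3). */
		return (fcntl(dirfd, F_ISUNIONSTACK, 0) > 0);
	}
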
Mateusz Guzik
1ad72b270c vfs: shorten lock hold time in vdbatch_process 2020-01-17 14:39:00 +00:00
Gleb Smirnoff
66c6c556b6 Change argument order of epoch_call() to more natural, first function,
then its argument.

Reviewed by:	imp, cem, jhb
2020-01-17 06:10:24 +00:00
Mateusz Guzik
66f67d5e5e vfs: increment numvnodes without the vnode list lock unless under pressure
The vnode list lock is only needed to reclaim free vnodes or kick the vnlru
thread, or to block so as not to miss a wake up (note the sleep has a timeout,
so this would not be a correctness issue). Try to get away without the lock by
just doing an atomic increment.

The lock is contended e.g., during poudriere -j 104 where about half of all
acquires come from vnode allocation code.

Note the entire scheme needs a rewrite, the above just reduces its SMP impact.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D23140
2020-01-16 21:45:21 +00:00
Mateusz Guzik
b7f50b9ad1 vfs: refactor vnode allocation
Semantics are almost identical. Some code is deduplicated and there are
fewer memory accesses.

Reviewed by:	kib, jeff
Differential Revision:	https://reviews.freebsd.org/D23158
2020-01-16 21:43:13 +00:00
Mateusz Guzik
875cfc082d vfs: reimplement vlrureclaim to actually use LRU
Take advantage of global ordering introduced in r356672.

Reviewed by:	mckusick (previous version)
Differential Revision:	https://reviews.freebsd.org/D23067
2020-01-16 10:44:02 +00:00
Jeff Roberson
a81c400e75 Simplify VM and UMA startup by eliminating boot pages. Instead use careful
ordering to allocate early pages in the same way boot pages were but only
as needed.  After the KVA allocator has started up we allocate the KVA that
we consumed during boot.  This also makes the boot pages freeable since they
have vm_page structures allocated with the rest of memory.

Parts of this patch were written and tested by markj.

Reviewed by:	glebius, markj
Differential Revision:	https://reviews.freebsd.org/D23102
2020-01-16 05:01:21 +00:00
Kirk McKusick
bbb1e07d65 Peter Holm reports that his test that does an umount(8) on an active
mount point while numerous tests are running that are writing to
files on that mount point cause the unmount(8) to hang forever.

The unmount(8) system call is handled in the kernel by the dounmount()
function. The cause of the hang is that prior to dounmount() calling
VFS_UNMOUNT() it is calling VFS_SYNC(mp, MNT_WAIT). The MNT_WAIT
flag indicates that VFS_SYNC() should not return until all the dirty
buffers associated with the mount point have been written to disk.
Because user processes are allowed to continue writing and can do
so faster than the data can be written to disk, the call to VFS_SYNC()
can never finish.

Unlike VFS_SYNC(), the VFS_UNMOUNT() routine can suspend all processes
when they request to do a write thus having a finite number of dirty
buffers to write that cannot be expanded. There is no need to call
VFS_SYNC() before calling VFS_UNMOUNT(), because VFS_UNMOUNT() needs
to flush everything again anyway after suspending writes, to catch
anything that was dirtied between the VFS_SYNC() and writes being
suspended.

The fix is to simply remove the unnecessary call to VFS_SYNC() from
dounmount().

Reported by:  Peter Holm
Analysis by:  Chuck Silvers
Tested by:    Peter Holm
MFC after:    7 days
Sponsored by: Netflix
2020-01-15 18:53:32 +00:00
Gleb Smirnoff
9074694339 Since this code uses if_ref()/if_rele() it must include if_var.h
explicitly, not via header pollution.
2020-01-15 03:39:11 +00:00
Gleb Smirnoff
3264dcadc9 - Move global network epoch definition to epoch.h, as more different
subsystems tend to need to know about it, and including if_var.h is
  huge header pollution for them.  Polluting possible non-network
  users with a single symbol seems a much lesser evil.
- Remove non-preemptible network epoch.  Not used yet, and unlikely
  to get used in the near future.
2020-01-15 03:34:21 +00:00
Mateusz Guzik
cda3176851 vfs: in vop_stdadd_writecount only vlazy vnodes on mounts using msync
The only reason to vlazy there is to (overzealously) ensure all vnodes
which need to be visited by the msync scan can be found there.

In particular this is of no use to zfs and tmpfs.

While here depessimize the check.
2020-01-15 01:34:05 +00:00
Ryan Libby
51871224c0 malloc: remove assumptions about MINALLOCSIZE
Remove assumptions about the minimum MINALLOCSIZE, in order to allow
testing of smaller MINALLOCSIZE.  A following patch will lower the
MINALLOCSIZE, but not so much that the present patch is required for
correctness at these sites.

Reviewed by:	jeff, markj
Sponsored by:	Dell EMC Isilon
2020-01-14 02:14:02 +00:00
Konstantin Belousov
fedab1b499 Code must not unlock a mutex while owning the thread lock.
Reviewed by:	hselasky, markj
Sponsored by:	Mellanox Technologies
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D23150
2020-01-13 14:30:19 +00:00
Mateusz Guzik
0c236d3d52 vfs: per-cpu batched requeuing of free vnodes
Constant requeuing adds significant lock contention in certain
workloads. Lessen the problem by batching it.

Per-cpu areas are locked in order to synchronize against UMA freeing
memory.

vnode's v_mflag is converted to short to prevent the struct from
growing.

Sample result from an incremental make -s -j 104 bzImage on tmpfs:
stock:   122.38s user 1780.45s system 6242% cpu 30.480 total
patched: 144.84s user 985.90s system 4856% cpu 23.282 total

Reviewed by:	jeff
Tested by:	pho (in a larger patch, previous version)
Differential Revision:	https://reviews.freebsd.org/D22998
2020-01-13 02:39:41 +00:00
Mateusz Guzik
cc3593fbd9 vfs: rework vnode list management
The current notion of an active vnode is eliminated.

Vnodes transition between 0<->1 hold counts all the time and the
associated traversal between different lists induces significant
scalability problems in certain workloads.

Introduce a global list containing all allocated vnodes. They get
unlinked only when UMA reclaims memory and are only requeued when
hold count reaches 0.

Sample result from an incremental make -s -j 104 bzImage on tmpfs:
stock:   118.55s user 3649.73s system 7479% cpu 50.382 total
patched: 122.38s user 1780.45s system 6242% cpu 30.480 total

Reviewed by:	jeff
Tested by:	pho (in a larger patch, previous version)
Differential Revision:	https://reviews.freebsd.org/D22997
2020-01-13 02:37:25 +00:00
Mateusz Guzik
57083d2576 vfs: add per-mount vnode lazy list and use it for deferred inactive + msync
This obviates the need to scan the entire active list looking for vnodes
of interest.

msync is handled by adding all vnodes with write count to the lazy list.

deferred inactive directly adds vnodes as it sets the VI_DEFINACT flag.

Vnodes get dequeued from the list when their hold count reaches 0.

Newly added MNT_VNODE_FOREACH_LAZY* macros support filtering so that
spurious locking is avoided in the common case.

Reviewed by:	jeff
Tested by:	pho (in a larger patch, previous version)
Differential Revision:	https://reviews.freebsd.org/D22995
2020-01-13 02:34:02 +00:00
Conrad Meyer
365cd52245 Fix a typo in r356667 comment
No functional change.

Reported by:	bdragon
Approved by:	csprng(markm), earlier version
X-MFC-With:	r356667
2020-01-12 23:52:16 +00:00
Conrad Meyer
86def3dcd6 getrandom(2): Add Linux GRND_INSECURE API flag
Treat it as a synonym for GRND_NONBLOCK.  The reasoning is this:

We have two choices for handling Linux's GRND_INSECURE API flag.

1. We could ignore it completely (like GRND_RANDOM).  However, this might
produce the surprising result of GRND_INSECURE requests blocking, when the
Linux API does not block.

2. Alternatively, we could treat GRND_INSECURE requests as requests for
GRND_NONBLOCK.  Here, the surprising result for Linux programs is that
invocations with unseeded random(4) will produce EAGAIN, rather than
garbage.

Honoring the flag in the way Linux does seems fraught.  If we actually use
the output of a random(4) implementation prior to seeding, we leak some
entropy (in an information theory and also practical sense) from what will
be the initial seed to attackers (or allow attackers to arbitrary DoS
initial seeding, if we don't leak).  This seems unacceptable -- it defeats
the purpose of blocking on initial seeding.

Secondary to that concern, before seeding we may have arbitrarily little
entropy collected; producing output from zero or a handful of entropy bits
does not seem particularly useful to userspace.

If userspace can accept garbage, insecure, non-random bytes, they can create
their own insecure garbage with srandom(time(NULL)) or similar.  Any program
which would be satisfied with a 3-bit key CTR stream has no need for CSPRNG
bytes.  So asking the kernel to produce such an output from the secure
getrandom(2) API seems inane.

For now, we've elected to emulate GRND_INSECURE as an alternative spelling
of GRND_NONBLOCK (2).  Consider this API not-quite stable for now.  We
guarantee it will never block.  But we will attempt to monitor actual port
uptake of this bizarre API and may revise our plans for the unseeded
behavior (prior stable/13 branching).

Approved by:	csprng(markm), manpages(bcr)
See also:	https://lwn.net/ml/linux-kernel/cover.1577088521.git.luto@kernel.org/
See also:	https://lwn.net/ml/linux-kernel/20200107204400.GH3619@mit.edu/
Differential Revision:	https://reviews.freebsd.org/D23130
2020-01-12 20:47:38 +00:00
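
A small userland example of the behaviour chosen above; nothing beyond getrandom(2) and the new flag is assumed.

	#include <sys/types.h>
	#include <sys/random.h>
	#include <errno.h>
	#include <stdint.h>

	int
	main(void)
	{
		uint8_t buf[16];
		ssize_t n = getrandom(buf, sizeof(buf), GRND_INSECURE);

		if (n == -1 && errno == EAGAIN) {
			/* random(4) not yet seeded: FreeBSD reports EAGAIN (as
			   with GRND_NONBLOCK) rather than returning unseeded
			   bytes the way Linux's GRND_INSECURE does. */
			return (1);
		}
		return (0);
	}
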
Edward Tomasz Napierala
ca603bb1ee Add kern_getpriority(), make Linuxulator use it.
Reviewed by:	kib, emaste
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22842
2020-01-12 14:25:44 +00:00
Edward Tomasz Napierala
7a0ef283e6 Add kern_setpriority(), use it in Linuxulator.
Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22841
2020-01-12 13:38:51 +00:00
Mateusz Guzik
d199ad3b44 Add "panicked" boolean which can be tested instead of panicstr
The test is performed all the time and reading entire panicstr to do it
wastes space.
2020-01-12 06:09:10 +00:00
Mateusz Guzik
879e0604ee Add KERNEL_PANICKED macro for use in place of direct panicstr tests 2020-01-12 06:07:54 +00:00
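
A brief sketch of the substitution these two commits enable; the surrounding check is generic, the macro and variable names are from the commits.

	/* Before: every caller dereferenced the panicstr pointer. */
	if (panicstr != NULL)
		return;

	/* After: test the dedicated boolean via the new macro. */
	if (KERNEL_PANICKED())
		return;
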
Mateusz Guzik
91de98e6d4 vfs: only recalculate watermarks when limits are changing
Previously they would get recalculated all the time, in particular in:
getnewvnode -> vcheckspace -> vspace
2020-01-11 23:00:57 +00:00
Mateusz Guzik
e6ae744e0e vfs: deduplicate vnode allocation logic
This creates a dedicated routine (vn_alloc) to allocate vnodes.

As a side effect, code duplication with getnewvnode_reserve is eliminated.

Add vn_free for symmetry.
2020-01-11 22:59:44 +00:00
Mateusz Guzik
b52d50cf69 vfs: prealloc vnodes in getnewvnode_reserve
Having a reserved vnode count does not guarantee that getnewvnode won't
block later. Said blocking partially defeats the purpose of reserving in
the first place.

Preallocate instead. The only consumer was always passing "1" as count
and never nesting reservations.
2020-01-11 22:58:14 +00:00
Mateusz Guzik
6928306764 vfs: incomplete pass at converting more ints to u_long
Most notably numvnodes and freevnodes were u_long, but parameters used to
govern them remained as ints.
2020-01-11 22:56:20 +00:00
Mateusz Guzik
bf62296f35 vfs: add missing CTLFLAG_MPSAFE annotations
This covers all kern/vfs_*.c files.
2020-01-11 22:55:12 +00:00
Kyle Evans
1171c633fb Set .ORDER for makesyscalls generated files
When either makesyscalls.lua or syscalls.master changes, all of the
${GENERATED} targets are now out-of-date. With make jobs > 1, this means we
will run the makesyscalls script in parallel for the same ABI, generating
the same set of output files.

Prior to r356603, there was a large window for interlacing output for some
of the generated files that we were generating in-place rather than staging
in a temp dir. After that, we still shouldn't need to run the script more
than once per-ABI as the first invocation should update all of them. Add
.ORDER to do so cleanly.

Reviewed by:	brooks
Discussed with:	sjg
Differential Revision:	https://reviews.freebsd.org/D23099
2020-01-10 18:24:17 +00:00
Mark Johnston
dc727127f1 Change malloc_domain() to return the allocation size to the caller.
Otherwise the malloc type accounting in malloc_domainset(9) is wrong
after r355203.

Reviewed by:	rlibby
Reported by:	kaktus
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D23095
2020-01-09 15:02:48 +00:00
Kyle Evans
6a38cd3a54 kern/Makefile: systrace_args.c is also generated 2020-01-09 06:10:25 +00:00
Kyle Evans
39eae263cd shmfd: posix_fallocate(2): only take rangelock for section we need
Other mechanisms that resize the shmfd grab a write lock from 0 to OFF_MAX
for safety, so we still get proper synchronization of shmfd->shm_size in
effect. There's no need to block readers/writers of earlier segments when
we're just reserving more space, so narrow the scope -- it would likely be
safe to narrow it completely to just the section of the range that extends
beyond our current size, but this likely isn't worth it since the size isn't
stable until the writelock is granted the first time.

Suggested by:	cem (passing comment)
2020-01-09 04:03:17 +00:00
Kyle Evans
f10405323a posixshm: implement posix_fallocate(2)
Linux expects to be able to use posix_fallocate(2) on a memfd. Other places
would use this with shm_open(2) to act as a smarter ftruncate(2).

Test has been added to go along with this.

Reviewed by:	kib (earlier version)
Differential Revision:	https://reviews.freebsd.org/D23042
2020-01-08 19:08:44 +00:00
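
A minimal userland example of the new capability; SHM_ANON and posix_fallocate(2) are standard FreeBSD interfaces, the size here is arbitrary.

	#include <sys/mman.h>
	#include <fcntl.h>
	#include <unistd.h>

	int
	main(void)
	{
		int fd = shm_open(SHM_ANON, O_RDWR, 0600);

		if (fd == -1)
			return (1);
		/* Previously only vnode-backed fds supported this; now shmfd does too. */
		if (posix_fallocate(fd, 0, 1024 * 1024) != 0) {
			close(fd);
			return (1);
		}
		close(fd);
		return (0);
	}
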
Kyle Evans
2856d85ecb posix_fallocate: push vnop implementation into the fileop layer
This opens the door for other descriptor types to implement
posix_fallocate(2) as needed.

Reviewed by:	kib, bcr (manpages)
Differential Revision:	https://reviews.freebsd.org/D23042
2020-01-08 19:05:32 +00:00
Mateusz Guzik
a9a047bc87 vfs: handle doomed vnodes in vdefer_inactive
vgone dooms the vnode while keeping VI_OWEINACT set and then drops the
interlock.

vputx can pick up the interlock and pass it to vdefer_inactive since the
flag is set.

The race is harmless, just don't defer anything as vgone will take care of it.

Reported by:	pho
2020-01-07 20:24:21 +00:00
Mateusz Guzik
c8b3463dd0 vfs: reimplement deferred inactive to use a dedicated flag (VI_DEFINACT)
The previous behavior of leaving VI_OWEINACT vnodes on the active list without
a hold count is eliminated. Hold count is kept and inactive processing gets
explicitly deferred by setting the VI_DEFINACT flag. The syncer is then
responsible for vdrop.

Reviewed by:	kib (previous version)
Tested by:	pho (in a larger patch, previous version)
Differential Revision:	https://reviews.freebsd.org/D23036
2020-01-07 15:56:24 +00:00
Mateusz Guzik
b7cc9d1847 vfs: trylock in vfs_msync and refactor the func
- use LK_NOWAIT instead of calling VOP_ISLOCKED before deciding to lock
- evaluate flags before looping over vnodes

Reviewed by:	kib
Tested by:	pho (in a larger patch, previous version)
Differential Revision:	https://reviews.freebsd.org/D23035
2020-01-07 15:44:19 +00:00
Mateusz Guzik
c92fe112a7 vfs: use a dedicated counter for free vnode recycling
Otherwise vlrureclaim activity is mixed in and it is hard to tell which
vnodes got reclaimed.
2020-01-07 15:42:01 +00:00
Mateusz Guzik
cc2b586d69 vfs: prevent numvnodes and freevnodes re-reads when appropriate
Otherwise in code like this:
if (numvnodes > desiredvnodes)
	vnlru_free_locked(numvnodes - desiredvnodes, NULL);

numvnodes can drop below desiredvnodes prior to the call and if the
compiler generated another read the subtraction would get a negative
value.
2020-01-07 04:34:03 +00:00
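
The fix amounts to snapshotting the counter once; a hedged sketch of the before/after shape (the local variable name is illustrative):

	/* Racy: the compiler may legally re-read the global between the test and the use. */
	if (numvnodes > desiredvnodes)
		vnlru_free_locked(numvnodes - desiredvnodes, NULL);

	/* Fixed: read once, so the argument cannot go negative. */
	u_long rnumvnodes = atomic_load_long(&numvnodes);

	if (rnumvnodes > desiredvnodes)
		vnlru_free_locked(rnumvnodes - desiredvnodes, NULL);
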
Mateusz Guzik
37fe521a6f vfs: annotate numvnodes and vnode_free_list_mtx with __exclusive_cache_line 2020-01-07 04:30:49 +00:00
Mateusz Guzik
478368ca41 vfs: eliminate v_tag from struct vnode
There was only one consumer and it was using it incorrectly.

It is given an equivalent hack.

Reviewed by:	jeff
Differential Revision:	https://reviews.freebsd.org/D23037
2020-01-07 04:29:34 +00:00
Mateusz Guzik
a91190c63e vfs: add a helper for allocating marker vnodes 2020-01-07 04:27:40 +00:00
Pawel Biernacki
91c4b68fa3 kern_sysctl: make sysctl.debug work as intended
r136999 introduced SYSCTL_DEBUG but apparently "opt_sysctl.h" was never
included making the option ignored.

r322954 introduced sysctl.reuse_test with OID number equal to 0, effectively
shadowing the very special sysctl.debug one. Use OID_AUTO as it doesn't need
any special treatment.

Reviewed by:	kib (mentor)
Approved by:	kib (mentor)
Differential Revision:	https://reviews.freebsd.org/D23056
2020-01-06 19:47:59 +00:00
Mateusz Guzik
2e77cad11d locks: add default delay struct
Use it for all primitives. This makes everything fit in 8 bytes.
2020-01-05 12:48:19 +00:00
Mateusz Guzik
6b8dd26e7c locks: convert delay times to u_short
int is just a waste of space for this purpose.
2020-01-05 12:47:29 +00:00
Mateusz Guzik
d6ae918835 Mark mtxpool_sleep as read mostly, not frequently.
The latter is not justified.
2020-01-05 12:46:35 +00:00
Kyle Evans
535b1df993 shm: correct KPI mistake introduced around memfd_create
When file sealing and shm_open2 were introduced, we should have grown a new
kern_shm_open2 helper that did the brunt of the work with the new interface
while kern_shm_open remains the same. Instead, more complexity was
introduced to kern_shm_open to handle the additional features and consumers
had to keep changing in somewhat awkward ways, and a kern_shm_open2 was
added to wrap kern_shm_open.

Backpedal on this and correct the situation- kern_shm_open returns to the
interface it had prior to file sealing being introduced, and neither
function needs an initial_seals argument anymore as it's handled in
kern_shm_open2 based on the shmflags.
2020-01-05 04:06:40 +00:00
Kyle Evans
58366f05c0 shmfd/mmap: restrict maxprot with MAP_SHARED + F_SEAL_WRITE
If a write seal is set on a shared mapping, we must exclude VM_PROT_WRITE as
the fd is effectively read-only. This was discovered by running
devel/linux-ltp, which mmaps with acceptable protections specified and then
attempts to raise them to PROT_READ|PROT_WRITE with mprotect(2), which we
allowed.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D22978
2020-01-05 03:15:16 +00:00
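A hedged sketch of the restriction described above, as it would appear in the
shm mmap path (helper, field, and parameter names are illustrative, not the
exact sys_shm.c code):

    /* Illustrative helper: clamp maxprot for write-sealed shared mappings. */
    static vm_prot_t
    shm_clamp_maxprot(struct shmfd *shmfd, int flags, vm_prot_t maxprot)
    {
            if ((flags & MAP_SHARED) != 0 &&
                (shmfd->shm_seals & F_SEAL_WRITE) != 0)
                    maxprot &= ~VM_PROT_WRITE;
            return (maxprot);
    }
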
Mateusz Guzik
7e2ea5772b vfs: factor out avoidable branches in _vn_lock 2020-01-05 01:00:11 +00:00
Mateusz Guzik
8dbc63520c vfs: drop thread argument from vinactive 2020-01-05 00:59:47 +00:00
Mateusz Guzik
867fd730c6 vfs: patch up vnode count assertions to report found value 2020-01-05 00:59:16 +00:00
Jeff Roberson
727c691857 Use a separate lock for the zone and keg. This provides concurrency
between populating buckets from the slab layer and fetching full buckets
from the zone layer.  Eliminate some nonsense locking patterns where
we lock to fetch a single variable.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D22828
2020-01-04 03:15:34 +00:00
Mateusz Guzik
b249ce48ea vfs: drop the mostly unused flags argument from VOP_UNLOCK
Filesystems which want to use it in limited capacity can employ the
VOP_UNLOCK_FLAGS macro.

Reviewed by:	kib (previous version)
Differential Revision:	https://reviews.freebsd.org/D21427
2020-01-03 22:29:58 +00:00
Kyle Evans
3a22f09cbf emulated atomic64: disable interrupts as the lock mechanism on !SMP
Reviewed by:	jhibbits, bdragon
Differential Revision:	https://reviews.freebsd.org/D23015
2020-01-03 18:29:20 +00:00
Brandon Bergren
9aafc7c052 [PowerPC] [MIPS] Implement 32-bit kernel emulation of atomic64 operations
This is a lock-based emulation of 64-bit atomics for kernel use, split off
from an earlier patch by jhibbits.

This is needed to unblock future improvements that reduce the need for
locking on 64-bit platforms by using atomic updates.

The implementation allows for future integration with userland atomic64,
but as that implies going through sysarch for every use, the current
status quo of userland doing its own locking may be for the best.

Submitted by:	jhibbits (original patch), kevans (mips bits)
Reviewed by:	jhibbits, jeff, kevans
Differential Revision:	https://reviews.freebsd.org/D22976
2020-01-02 23:20:37 +00:00
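A hedged sketch of what lock-based emulation of a 64-bit atomic looks like on a
32-bit kernel (names are illustrative; the in-tree code lives in the MD atomic
support and its locking discipline may differ):

    #include <sys/param.h>
    #include <sys/kernel.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>

    static struct mtx atomic64_mtx;
    MTX_SYSINIT(atomic64, &atomic64_mtx, "atomic64 emul", MTX_SPIN);

    /* Illustrative emulation of a 64-bit fetch-and-add on a 32-bit CPU. */
    uint64_t
    atomic64_fetchadd_emul(volatile uint64_t *p, uint64_t v)
    {
            uint64_t old;

            mtx_lock_spin(&atomic64_mtx);
            old = *p;
            *p = old + v;
            mtx_unlock_spin(&atomic64_mtx);
            return (old);
    }
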
Konstantin Belousov
478ca4b004 Rename umtxq_check_susp() to thread_check_susp()
and make it usable outside of kern_umtx.c.  To be used in several
future changes.

Discussed with:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2020-01-02 22:13:59 +00:00
Konstantin Belousov
8f4d74eb1e Style: remove trailing spaces/tabs.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2020-01-02 22:07:03 +00:00
Pawel Biernacki
e8cdbb4815 sysctl: hide 2.x era compat node
r23081 introduced the kern.dummy oid as a semi ABI compat for kern.maxsockbuf
that was moved to a new namespace.  It never functioned as an alias of any
kind and just returned 0 unconditionally, hence it was probably provided to
keep some 3rd party programmes happy by having sysctl(3) not report an error
for a non-existent oid.
After nearly 23 years it seems reasonable to just hide it from the sysctl(8)
list so as not to cause unnecessary confusion about its purpose.

Reported by:	Antranig Vartanian <antranigv@freebsd.am>
Reviewed by:	kib (mentor)
Approved by:	kib (mentor)
Differential Revision:	https://reviews.freebsd.org/D22982
2020-01-02 01:23:43 +00:00
Mateusz Guzik
57db0e12c8 vfs: drop an always-false check from vlrureclaim
The vnode gets held a few lines prior, making the VI_FREE condition
illegal.
2020-01-01 22:51:17 +00:00
Alexander V. Chernikov
c83dda362e Split gigantic rtsock route_output() into smaller functions.
Amount of changes to the original code has been intentionally minimised
to ease diffing.
The changes are mostly mechanical, with the following exceptions:

* lltable handler is now called directly based on RTF_LLINFO flag presence.
* "report" logic for updating rtm in RTM_GET/RTM_DELETE has been simplified,
  fixing several potential use-after-free cases in rt_addrinfo.
* lltable asserts have been replaced with error returns, preventing kernel
  crashes when the lltable gw af family is invalid (root required).

MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D22864
2019-12-31 17:26:53 +00:00
Alexander Motin
024932aae9 Use atomic for start_count in devstat_start_transaction().
Combined with the earlier nstart/nend removal it allows removing several locks
from the request path of GEOM and a few other places.  It would be cool if we had
more SMP-friendly statistics, but this helps too.

Sponsored by:	iXsystems, Inc.
2019-12-30 03:13:38 +00:00
Mark Johnston
9f5632e6c8 Remove page locking for queue operations.
With the previous reviews, the page lock is no longer required in order
to perform queue operations on a page.  It is also no longer needed in
the page queue scans.  This change effectively eliminates remaining uses
of the page lock and also the false sharing caused by multiple pages
sharing a page lock.

Reviewed by:	jeff
Tested by:	pho
Sponsored by:	Netflix, Intel
Differential Revision:	https://reviews.freebsd.org/D22885
2019-12-28 19:04:00 +00:00
Justin Hibbits
741dfd86b3 Fix the powerpc copyout fixup from r356113
Summary:
r356113 used an older patch, which predated the
freebsd_copyout_auxargs() addition.  Fix this by using a private
powerpc_copyout_auxargs() instead, and keep it private to powerpc, not in MI
files.

Reviewed by:	kib, bdragon
Differential Revision:	https://reviews.freebsd.org/D22935
2019-12-27 17:38:25 +00:00
Mateusz Guzik
3983dc32d7 Plug a warning in read-mostly spinlocks reported by gcc. 2019-12-27 13:37:19 +00:00
Mateusz Guzik
eb9764615d vfs: remove production kernel checks and mp == NULL support from vdrop
1. The only place in the tree which calls getnewvnode with mp == NULL does it
for vp_crossmp, which will never execute this codepath. Any vnode which legally
has ->v_mount == NULL is also doomed, which once more won't execute this code.
2. Remove an assertion for v_holdcnt from production kernels. It gets taken care
of by refcount macros in debug kernels.

Any code which would want to pass NULL mp can construct a fake one instead.

Reviewed by:	kib (previous version)
Differential Revision:	https://reviews.freebsd.org/D22722
2019-12-27 11:26:12 +00:00
Mateusz Guzik
1f162fef76 Add read-mostly sleepable locks
To be used like rmlocks, except when sleeping readers need to be
allowed. See the manpage for more information.

Reviewed by:	kib (previous version)
Differential Revision:	https://reviews.freebsd.org/D22823
2019-12-27 11:19:57 +00:00
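A hedged usage sketch of the new read-mostly sleepable lock (function names
follow the interface added by this change; consult the manpage referenced
above for the authoritative API):

    static struct rmslock data_lock;

    rms_init(&data_lock, "data rms");       /* one-time setup */

    rms_rlock(&data_lock);                  /* read side; sleeping allowed */
    /* ... inspect shared state, possibly sleeping ... */
    rms_runlock(&data_lock);

    rms_wlock(&data_lock);                  /* write side; exclusive, rare */
    /* ... modify shared state ... */
    rms_wunlock(&data_lock);
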
Justin Hibbits
6b45727338 Fix the build from r356113.
Types had changed from when the patch was first created, and a final build
was not done pre-commit.
2019-12-27 04:52:17 +00:00
Justin Hibbits
adea0d6368 Eliminate the last MI difference in AT_* definitions (for powerpc).
Summary:
As a transition aide, implement an alternative elfN_freebsd_fixup which
is called for old powerpc binaries.  Similarly, add a translation to rtld to
convert old values to new ones (as expected by a new rtld).

Translation of old<->new values  is incomplete, but sufficient to allow an
installworld of a new userspace from an old one when a new kernel is running.

Test Plan:
Someone needs to see how a new kernel/rtld/libc works with an old
binary.  If it works we can probably ship this.  If not we probably need
some more compat bits.

Submitted by:	brooks
Reviewed by:	jhibbits
Differential Revision:	https://reviews.freebsd.org/D20799
2019-12-27 04:07:03 +00:00
Conrad Meyer
f3bae413e9 random(9): Deprecate random(9), remove meaningless srandom(9)
srandom(9) is meaningless on SMP systems or any system with, say,
interrupts.  One could never rely on random(9) to produce a reproducible
sequence of outputs on the basis of a specific srandom() seed because the
global state was shared by all kernel contexts.  As such, removing it is
literally indistinguishable to random(9) consumers (as compared with
retaining it).

Mark random(9) as deprecated and slated for quick removal.  This is not to
say we intend to remove all fast, non-cryptographic PRNG(s) in the kernel.
It/they just won't be random(9), as it exists today, in either name or
implementation.

Before random(9) is removed, a replacement will be provided and in-tree
consumers will be converted.

Note that despite the name, the random(9) interface does not bear any
resemblance to random(3).  Instead, it is the same crummy 1988 Park-Miller
LCG used in libc rand(3).
2019-12-26 19:41:09 +00:00
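For reference, the 1988 Park-Miller "minimal standard" generator mentioned
above boils down to a single multiplicative step (a userland-compilable
sketch, not the kernel source):

    #include <stdint.h>

    /* x' = (7^5 * x) mod (2^31 - 1); one 32-bit word of state. */
    static uint32_t seed = 1;       /* must stay within [1, 2^31 - 2] */

    static uint32_t
    parkmiller_next(void)
    {
            seed = (uint32_t)(((uint64_t)seed * 16807) % 2147483647);
            return (seed);
    }

With every kernel context hammering the same word of state, no srandom() seed
can make the output sequence reproducible, which is the point made above.
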
Conrad Meyer
af00898b5d gone_in(9): Trivial string grammar and style cleanups 2019-12-26 18:25:07 +00:00
Kyle Evans
f46412c021 kern_cons: add a stub kbdinit for configs with no keyboard/console drivers
A weak symbol here is decidedly cleaner than any #ifdef soup or relocating
kbdinit, the former leading to maintenance required on addition of any
console/keyboard drivers and the latter pushing kbd init bits away from
where they're used.
2019-12-26 15:47:19 +00:00
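A hedged sketch of the weak-symbol stub pattern referred to above (the
signature and location are illustrative; __weak_symbol comes from sys/cdefs.h):

    #include <sys/cdefs.h>

    /* Overridden by the real kbdinit() whenever a keyboard driver is built in. */
    void __weak_symbol
    kbdinit(void)
    {

            /* No keyboard or console drivers configured; nothing to do. */
    }
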
Kyle Evans
3ed7166aca kbd: merge linker set drivers into standard kbd driver list
This leads to the revert of r355806; this reduces duplication in keyboard
registration and driver switch lookup and leaves us with one authoritative
source for currently registered drivers. The reduced duplication is nice later
on, as there is more procedure involved in keyboard setup.

keyboard_driver->flags is used to more quickly detect bogus adds/removes.
From KPI consumers' perspective, nothing changes- kbd_add_driver of an
already-registered driver will succeed, and a single kbd_delete_driver will
later remove it as expected. In contrast to historical behavior,
kbd_delete_driver on a driver registered via linker set will now actually
de-register the driver so that it may not be used -- e.g. if kbdmux's
MOD_LOAD handler fails somewhere.

Detection for already-registered drivers in kbd_add_driver has improved, as
the previous SLIST_NEXT(driver) != NULL check would not have caught a driver
that's at the tail end.

kbdinit is now called from cninit() rather than via SYSINIT so that keyboard
drivers are available as early as console drivers. This is particularly
important as cnprobe will, in both syscons and vt, attempt to do any early
configuration of keyboard drivers built-in (see: kbd_configure).

Reviewed by:	imp (earlier version, pre-cninit change)
Differential Revision:	https://reviews.freebsd.org/D22835
2019-12-26 15:21:34 +00:00
Conrad Meyer
fea73412a0 sleep(9), sleepqueue(9): const'ify wchan pointers
_sleep(9), wakeup(9), sleepqueue(9), et al do not dereference or modify the
channel pointers provided in any way; they are merely used as intptrs into a
dictionary structure to match waiters with wakers.  Correctly annotate this
such that _sleep() and wakeup() may be used on const pointers without
invoking ugly patterns like __DECONST().  Plumb const through all of the
underlying sleepqueue bits.

No functional change.

Reviewed by:	rlibby
Discussed with:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D22914
2019-12-24 16:19:33 +00:00
Brandon Bergren
7821a820d0 [PowerPC] Implement Secure-PLT jump table processing for ppc32.
Due to clang and LLD's tendency to use a PLT for builtins, and as they
don't have full support for EABI, we sometimes have to deal with a PLT in
.ko files in a clang-built kernel.

As such, augment the in-kernel linker to support jump table processing.

As there is no particular reason to support lazy binding in kernel modules,
only implement Secure-PLT immediate binding.

As part of these changes, add elf_cpu_parse_dynamic() to the MD API of the
in-kernel linker (except on platforms that use raw object files.)

The new function will allow MD code to act on MD tags in _DYNAMIC.

Use this new function in the PowerPC MD code to ensure BSS-PLT modules using
PLT will be rejected during insertion, and to poison the runtime resolver to
ensure we get a clear panic reason if a call is made to the resolver.

Reviewed by:	jhibbits
Differential Revision:	https://reviews.freebsd.org/D22608
2019-12-24 15:56:24 +00:00
Conrad Meyer
e30f025ff9 kern_synch: Fix some UB
It is UB to evaluate pointer comparisons when pointers do not point within
the same object.  Instead, convert the pointers to numbers and compare the
numbers.

Reported by:	kib
Discussed with:	rlibby
2019-12-24 06:08:29 +00:00
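A hedged sketch of the pattern involved (hypothetical helper; the point is that
ordered comparison of pointers into different objects is undefined, while
comparing their integer representations is not):

    #include <stddef.h>
    #include <stdint.h>

    /* Well-defined range check: compare addresses as integers. */
    static int
    addr_in_range(const void *p, const void *base, size_t len)
    {
            uintptr_t a = (uintptr_t)p;
            uintptr_t b = (uintptr_t)base;

            return (a >= b && a - b < len);
    }
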
Konstantin Belousov
52f3524cfd Do not use waitable allocation of pbuf when creating cluster for write.
Previously just ensuring that we do not sleep when clustering for
md(4) vnode was enough.  Now, with the switch of the pbuf allocator to
uma and completely broken per-subsystem pbuf limits, it might cause
unbounded sleep even for non-md(4) vnodes.

Reported and tested by:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D22899
2019-12-23 20:15:19 +00:00
Philip Paeps
1b6dd6d772 Print upper 32 bits in devmap table entries
If the devmap entry uses the upper 32 bits they wouldn't be printed in
devmap_dump_table(). This fixes that.

Submitted by:   Nicholas O'Brien <nickisobrien_gmail.com>
Sponsored by:   Axiado
2019-12-20 03:40:53 +00:00
Mark Johnston
5d3bd9e242 Fix SIGINFO stack collection to ignore threads with swapped-out stacks.
We by definition cannot trace the stack of such a thread.  Also remove a
redundant stack_zero() call in the SIGINFO handler, the stack structure
is cleared by the MD stack_capture().

Sponsored by:	The FreeBSD Foundation
2019-12-19 19:34:25 +00:00
Jeff Roberson
d8d5f03610 Fix a bug in r355784. I missed a sched_add() call that needed to reacquire
the thread lock.

Reported by:	mjg
2019-12-19 18:22:11 +00:00
Hans Petter Selasky
cc79ea3a26 Restore important comment in RCU/EPOCH support in FreeBSD after r355784.
Sponsored by:	Mellanox Technologies
2019-12-18 09:30:32 +00:00
Mateusz Guzik
6fa079fc3f vfs: flatten vop vectors
This eliminates the following loop from all VOP calls:

while(vop != NULL && \
    vop->vop_spare2 == NULL && vop->vop_bypass == NULL)
        vop = vop->vop_default;

Reviewed by:	jeff
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D22738
2019-12-16 00:06:22 +00:00
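A hedged sketch of what the flattening amounts to: unfilled slots are copied
from the default vector once at registration time, so each VOP call becomes a
single indirect call (vop_lock1/vop_unlock/vop_default are real vop_vector
members, but the helper itself is illustrative, not the actual vfs code):

    static void
    vop_vector_flatten(struct vop_vector *vop)
    {
            struct vop_vector *def;

            for (def = vop->vop_default; def != NULL; def = def->vop_default) {
                    if (vop->vop_lock1 == NULL)
                            vop->vop_lock1 = def->vop_lock1;
                    if (vop->vop_unlock == NULL)
                            vop->vop_unlock = def->vop_unlock;
                    /* ... repeat for every other vop_* slot ... */
            }
    }
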
Mateusz Guzik
3fd19ce7a5 mtx: eliminate recursion support from thread lock
Now that it is not used after schedlock changes got merged.

Note the unlock routine temporarily still checks for it on account of just using
regular spin unlock.

This is a prelude towards a general clean up.
2019-12-16 00:04:33 +00:00
Jeff Roberson
686bcb5c14 schedlock 4/4
Don't hold the scheduler lock while doing context switches.  Instead we
unlock after selecting the new thread and switch within a spinlock
section leaving interrupts and preemption disabled to prevent local
concurrency.  This means that mi_switch() is entered with the thread
locked but returns without.  This dramatically simplifies scheduler
locking because we will not hold the schedlock while spinning on
blocked lock in switch.

This change has not been made to 4BSD but in principle it would be
more straightforward.

Discussed with:	markj
Reviewed by:	kib
Tested by:	pho
Differential Revision: https://reviews.freebsd.org/D22778
2019-12-15 21:26:50 +00:00
Jeff Roberson
1c81a87efd schedlock 3/4
Eliminate lock recursion from turnstiles.  This was simply used to avoid
tracking the top-level turnstile lock.  Instead, explicitly check for it before
picking up and dropping locks.

Reviewed by:	kib
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D22746
2019-12-15 21:19:41 +00:00
Jeff Roberson
045b7c6084 schedlock 2/4
Do all sleepqueue post-processing in sleepq_remove_thread() so that we
do not require the thread lock after a context switch.

Reviewed by:	jhb, kib
Differential Revision:	https://reviews.freebsd.org/D22745
2019-12-15 21:18:07 +00:00
Ian Lepore
b19c9dea3e Rewrite arm kernel stack unwind code to work when unwinding through modules.
The arm kernel stack unwinder has apparently never been able to unwind when
the path of execution leads through a kernel module. There was code that
tried to handle modules by looking for the unwind data in them, but it did
so by trying to find symbols which have never existed in arm kernel
modules. That caused the unwind code to panic, and because part of panic
handling calls into the unwind code, that just created a recursion loop.

Locating the unwind data in a loaded module requires accessing the Elf
section headers to find the SHT_ARM_EXIDX section. For preloaded modules
those headers are present in a metadata blob. For dynamically loaded
modules, the headers are present only while the loading is in progress; the
memory is freed once the module is ready to use. For that reason, there is
new code in kern/link_elf.c, wrapped in #ifdef __arm__, to extract the
unwind info while the headers are loaded. The values are saved into new
fields in the linker_file structure which are also conditional on __arm__.

In arm/unwind.c there is new code to locally cache the per-module info
needed to find the unwind tables. The local cache is crafted for lockless
read access, because the unwind code often needs to run in context where
sleeping is not allowed.  A large comment block describes the local cache
list, so I won't repeat it all here.
2019-12-15 21:16:35 +00:00
Jeff Roberson
61a74c5ccd schedlock 1/4
Eliminate recursion from most thread_lock consumers.  Return from
sched_add() without the thread_lock held.  This eliminates unnecessary
atomics and lock word loads as well as reducing the hold time for
scheduler locks.  This will eventually allow for lockless remote adds.

Discussed with:	kib
Reviewed by:	jhb
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D22626
2019-12-15 21:11:15 +00:00
Jeff Roberson
d29f674f2e Fix a mistake in r355765. We need to activate the page if it is not yet
on a pagequeue.

Reported by:	pho
2019-12-15 06:26:47 +00:00
Jeff Roberson
a808177864 Add a deferred free mechanism for freeing swap space that does not require
an exclusive object lock.

Previously swap space was freed on a best effort basis when a page that
had valid swap was dirtied, thus invalidating the swap copy.  This may be
done inconsistently and requires the object lock which is not always
convenient.

Instead, track when swap space is present.  The first dirty is responsible
for deleting space or setting PGA_SWAP_FREE which will trigger background
scans to free the swap space.

Simplify the locking in vm_fault_dirty() now that we can reliably identify
the first dirty.

Discussed with:	alc, kib, markj
Differential Revision:	https://reviews.freebsd.org/D22654
2019-12-15 03:15:06 +00:00
Jeff Roberson
af00971419 Handle pagein clustering in vm_page_grab_valid() so that it can be used by
exec_map_first_page().  This will also enable pagein clustering for other
interested consumers (tmpfs, md, etc).

Discussed with:	alc
Approved by:	kib
Differential Revision:	https://reviews.freebsd.org/D22731
2019-12-15 02:00:32 +00:00
Doug Moore
9f70442a04 Simplify the processing of a leaf mask to find big-enough ranges of set
bits, by storing and modifying the complement of the original leaf
mask, and by avoiding some unnecessary intermediate variables in
computing the shift amounts. The logic is similar to what has recently
been committed to sys/sys/bitstring.h.

Compute better hint updates for the case when the cursor starts in
mid-leaf and eliminates some otherwise viable solutions. Assume the
worst case, that all the eliminated offsets could have been solutions,
and you can still compute a better hint than we use now.

Eliminate some unnecessary conditional control flow.

Approved by: alc
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22666
2019-12-14 19:44:42 +00:00
Mateusz Guzik
6f836483ec Remove the useless return value from proc_set_cred 2019-12-14 00:43:17 +00:00
John Baldwin
4b28d96e5d Remove the deprecated timeout(9) interface.
All in-tree consumers have been converted to callout(9).

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D22602
2019-12-13 21:03:12 +00:00
Warner Losh
b832a7e505 Create new wrapper function: bus_delayed_attach_children()
Delay the attachment of children, when requested, until after interrupts are
running. This is often needed to allow children to run transactions on i2c or
spi busses. It's a common enough idiom that it will be useful to have its own
wrapper.

Reviewed by: ian
Differential Revision: https://reviews.freebsd.org/D21465
2019-12-13 19:39:33 +00:00
John Baldwin
bf2276f378 Use callout(9) instead of deprecated timeout(9) for fail points.
Allocate the callout structure on-demand from
fail_point_use_timeout_path() since most fail points do not use
timeouts.

Reviewed by:	markj (earlier version), cem
Differential Revision:	https://reviews.freebsd.org/D22599
2019-12-13 19:26:04 +00:00
Edward Tomasz Napierala
34ad5ac242 Add kern_kill() and use it in Linuxulator. It's just a cleanup,
no functional changes.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22645
2019-12-13 18:44:02 +00:00
Edward Tomasz Napierala
be2cfdbc86 Add kern_getsid() and use it in Linuxulator; no functional changes.
Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22647
2019-12-13 18:39:36 +00:00
Ryan Libby
9825eadf2c bitset: rename confusing macro NAND to ANDNOT
s/BIT_NAND/BIT_ANDNOT/, and for CPU and DOMAINSET too.  The actual
implementation is "and not" (or "but not"), i.e. A but not B.
Fortunately this does appear to be what all existing callers want.

Don't supply a NAND (not (A and B)) operation at this time.

Discussed with:	jeff
Reviewed by:	cem
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D22791
2019-12-13 09:32:16 +00:00
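The per-word effect of the renamed operation, sketched for clarity
(illustrative only; the real macros in sys/bitset.h expand over
__bitset_words(size) for variable-size sets, where d and s are bitset
pointers):

    /* "d but not s": keep the bits of d that are not set in s. */
    for (i = 0; i < __bitset_words(size); i++)
            d->__bits[i] &= ~s->__bits[i];

    /* A true NAND would instead be: d->__bits[i] = ~(d->__bits[i] & s->__bits[i]); */
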
Conrad Meyer
cd5650407e kern/subr_unit: Rip srandomdev, random(3) out of dead code
The simulation cannot be reproduced, so the value of using a deterministic PRNG
like random(3) is dubious.  The number of repetitions used in the sample isn't a
problem for the Chacha implementation of arc4random we have today.  (Also, no
one actually runs this code; it was provided as an example of the work the
author did validating the implementation.  It's not even test code.)
2019-12-13 04:48:20 +00:00
Rick Macklem
ea9a16b252 r355677 requires that vop_stdioctl() be global so it can be called from NFS.
r355677 modified the NFS client so that it does lseek(SEEK_DATA/SEEK_HOLE)
for NFSv4.2, but calls vop_stdioctl() otherwise. As such, vop_stdioctl()
needs to be a global function.

Missed during the code merge for r355677.
2019-12-13 00:14:12 +00:00
Edward Tomasz Napierala
d6fee74a0c Add kern_sync(9), and make kernel code call it instead of going
via sys_sync(2).  Minor cleanup, no functional changes.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D19366
2019-12-12 18:45:31 +00:00
Mark Johnston
7789ab32b3 Rename tdq_ipipending and clear it in sched_switch().
This fixes a regression after r355311.  Specifically, sched_preempt()
may trigger a context switch by calling thread_lock(), since
thread_lock() calls critical_exit() in its slow path and the interrupted
thread may have already been marked for preemption.  This would happen
before tdq_ipipending is cleared, blocking further preemption IPIs.  The
CPU can be left in this state indefinitely if the interrupted thread
migrates.

Rename tdq_ipipending to tdq_owepreempt.  Any switch satisfies a remote
preemption request, so clear tdq_owepreempt in sched_switch() instead of
sched_preempt() to avoid subtle problems of the sort described above.

Reviewed by:	jeff, kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22758
2019-12-12 02:43:24 +00:00
Mateusz Guzik
c8b29d1212 vfs: locking primitives which elide ->v_vnlock and shared locking disablement
Both of these features are not needed by many consumers and result in avoidable
reads which in turn put them on profiles due to cache-line ping ponging.

On top of that the current lockmgr entry point is slower than necessary when
single-threaded. As an attempted clean-up preparing for other changes,
provide new routines which don't support any of the aforementioned features.

With these patches in place vop_stdlock and vop_stdunlock disappear from
flamegraphs during -j 104 buildkernel.

Reviewed by:	jeff (previous version)
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D22665
2019-12-11 23:11:21 +00:00
Mateusz Guzik
55eb92db8d fd: static-ize and devolatile openfiles
Almost all access is using atomics. The only read is sysctl which should use
a whole-int-at-a-time friendly read internally.
2019-12-11 23:09:12 +00:00
Andriy Gapon
64ebbdd54d add a sanity check to the system call registration code
A system call number should be at least reserved.
We do not expect an attempt to register a fixed number system call
when nothing at all is known about it.

MFC after:	3 weeks
Sponsored by:	Panzura
2019-12-11 15:52:29 +00:00
John Baldwin
a8a03706fb Add a callout_func_t typedef for functions used with callout_*().
This typedef is the same as timeout_t except that it is in the callout
namespace and header.

Use this typedef in various places of the callout implementation that
were either using the raw type or timeout_t.

While here, add <sys/callout.h> to the manpage.

Reviewed by:	kib, imp
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D22751
2019-12-10 21:58:30 +00:00
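For reference, a hedged sketch of the typedef's shape and a hypothetical
consumer (see sys/callout.h for the authoritative declaration):

    /* Same shape as timeout_t, but in the callout namespace. */
    typedef void callout_func_t(void *);

    static callout_func_t   example_tick;   /* hypothetical consumer */
    static struct callout   example_callout;

    static void
    example_tick(void *arg)
    {
            /* ... periodic work ... */
            callout_reset(&example_callout, hz, example_tick, arg);
    }
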
Mateusz Guzik
ff4486e827 vfs: refactor vhold and vdrop
No functional changes.
2019-12-10 00:08:05 +00:00
John Baldwin
d8010b1175 Copy out aux args after the argument and environment vectors.
Partially revert r354741 and r354754 and go back to allocating a
fixed-size chunk of stack space for the auxiliary vector.  Keep
sv_copyout_auxargs but change it to accept the address at the end of
the environment vector as an input stack address and no longer
allocate room on the stack.  It is now called at the end of
copyout_strings after the argv and environment vectors have been
copied out.

This should fix a regression in r354754 that broke the stack alignment
for newer Linux amd64 binaries (and probably broke Linux arm64 as
well).

Reviewed by:	kib
Tested on:	amd64 (native, linux64 (only linux-base-c7), and i386)
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D22695
2019-12-09 19:17:28 +00:00
Mateusz Guzik
abd80ddb94 vfs: introduce v_irflag and make v_type smaller
The current vnode layout is not smp-friendly by having frequently read data
avoidably sharing cachelines with very frequently modified fields. In
particular v_iflag inspected for VI_DOOMED can be found in the same line with
v_usecount. Instead make it available in the same cacheline as the v_op, v_data
and v_type which all get read all the time.

v_type is avoidably 4 bytes while the necessary data will easily fit in 1.
Shrinking it frees up 3 bytes, 2 of which get used here to introduce a new
flag field with a new value: VIRF_DOOMED.

Reviewed by:	kib, jeff
Differential Revision:	https://reviews.freebsd.org/D22715
2019-12-08 21:30:04 +00:00
Mateusz Guzik
791a24c7ea vfs: clean up vputx a little
1. Replace hand-rolled macros for operation type with an enum.
2. Unlock the vnode in vput itself; there is no need to branch on it. The existence
of VPUTX_VPUT remains significant in that the inactive variant adds LK_NOWAIT
to the locking request.
3. Remove the useless v_usecount assertion. A few lines above the code checks if
v_usecount > 0 and leaves; should the value be negative, refcount would fail.
4. The CTR "return vnode %p to the freelist" is incorrect, as vdrop may find the
vnode with holdcnt > 1. If anything of the like should exist, it should be moved there.
5. No need to set error = 0 for everyone.

Reviewed by:	kib, jeff (previous version)
Differential Revision:	https://reviews.freebsd.org/D22718
2019-12-08 21:13:07 +00:00
Mateusz Guzik
fd6e0c43a6 vfs: factor out vnode destruction out of vdrop
Sponsored by:	The FreeBSD Foundation
2019-12-08 21:11:25 +00:00
Jeff Roberson
c3cccf95bf Handle multiple clock interrupts simultaneously in sched_clock().
Reviewed by:	kib, markj, mav
Differential Revision:	https://reviews.freebsd.org/D22625
2019-12-08 01:17:38 +00:00
Konstantin Belousov
0cc9fb7551 Only return EPERM from kill(-pid) when no process was signalled.
As mandated by POSIX.  Also clarify the kill(2) manpage.

While there, restructure the code in killpg1() to use helper which
keeps overall state of the process list iteration in the killpg1_ctx
structued, later used to infer the error returned.

Reported by:	amdmi3
Reviewed by:	jilles
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D22621
2019-12-07 18:07:49 +00:00
Mateusz Guzik
12e483e5f7 vfs: clean up delmntque similarly to vdrop r355414 2019-12-07 12:56:24 +00:00
Mateusz Guzik
4f4d9a086a vfs: catch vn_printf up with reality
- add the missing VV_VMSIZEVNLOCK and VV_READLINK flags
- add decoding of v_mflag

While here sort flags.
2019-12-07 12:55:58 +00:00
Brooks Davis
af796bfa71 sysent: Reduce duplication and improve readability.
Use the power of variables to avoid spelling out source and generated
files too many times.  The previous Makefiles were hard to read, hard to
edit, and badly formatted.

Reviewed by:	kevans, emaste
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D22714
2019-12-06 23:59:23 +00:00
Alexander Motin
cb847b8152 Make devstat_end_transaction_bio() count BIO_ORDERED.
MFC after:	2 weeks
2019-12-06 18:39:05 +00:00
Bjoern A. Zeeb
173c062a56 Improve EPOCH_TRACE
Two changes to EPOCH_TRACE:
(1) add a sysctl to suppress the backtrace from epoch_trace_report().
    Sometimes the log line for the recursion is enough and the
    backtrace massively spams the console.
(2) In order to be able to go without the backtrace do not only
    print where the previous occurrence happened, but also where
    the current one happens.  That way we have file:line information
    for both and can look at them without the need for getting line
    numbers from backtrace and a debugging tool.

Reviewed by:	glebius
Sponsored by:	Netflix (originally)
Differential Revision:	https://reviews.freebsd.org/D22641
2019-12-06 16:34:04 +00:00
Mateusz Guzik
befd3e35b3 sx: check for SX_LOCK_SHARED | SX_LOCK_WRITE_SPINNER when exclusive-locking
First, this removes a spurious difference compared to rw locks.
More importantly though this avoids a trip through sleepq code if the lock
happens to be caught in this state.
2019-12-05 13:43:44 +00:00
Mateusz Guzik
3eeb8a1fba vfs: remove 'active' variable from _vdrop
No functional changes.
2019-12-05 13:40:10 +00:00
Alexander Motin
61322a0a8a Mark some more hot global variables with __read_mostly.
MFC after:	1 week
2019-12-04 21:26:03 +00:00
Ryan Libby
30be9685a3 mbuf zones: take out the trash
The mbuf zones were explicitly specifying the uma trash procedures on
zcreate, conditionally on INVARIANTS, because that used to be necessary
in order to get use-after-free checking for uma zones with non-empty
constructors or destructors.  After r355137 uma automatically invokes
the trash constructor and destructor as long as no init and fini are
specified.  This now allows the mbuf zones to pass their constructors
and destructors without needing to add on the uma trash procedures
conditionally.

Reviewed by:	cem, jhb, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D22583
2019-12-04 18:21:29 +00:00
John Baldwin
31174518d2 Use uintptr_t instead of register_t * for the stack base.
- Use ustringp for the location of the argv and environment strings
  and allow destp to travel further down the stack for the stackgap
  and auxv regions.
- Update the Linux copyout_strings variants to move destp down the
  stack as was done for the native ABIs in r263349.
- Stop allocating a space for a stack gap in the Linux ABIs.  This
  used to hold translated system call arguments, but hasn't been used
  since r159992.

Reviewed by:	kib
Tested on:	amd64 (amd64, i386, linux64), i386 (i386, linux)
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D22501
2019-12-03 23:17:54 +00:00
Kirk McKusick
d00066a5f9 Currently the breadn_flags() and getblkx() interfaces are passed
the vnode, logical block number, and size of data block that is
being requested. They then use the VOP_BMAP function to calculate
the mapping from logical block number to physical block number from
which to access the data. This change expands the interface to also
pass the physical block number in cases where the VOP_BMAP function
may no longer work, for example when a file is being truncated.

No functional change.

Reviewed by:  kib
Tested by:    Peter Holm
Sponsored by: Netflix
2019-12-03 23:07:09 +00:00
Jeff Roberson
9b78b1f433 Use a precise bit count for the slab free items in UMA. This significantly
shrinks embedded slab structures.

Reviewed by:	markj, rlibby (prior version)
Differential Revision:	https://reviews.freebsd.org/D22584
2019-12-02 22:44:34 +00:00
Jeff Roberson
4504268a1b Fix the last few cases that grab without busy or valid. The grab functions must
return the page in some held state for consistency elsewhere.

Reviewed by:	alc, kib, markj
Differential Revision:	https://reviews.freebsd.org/D22610
2019-12-02 22:38:25 +00:00
Jeff Roberson
e15046952d Initialize the idle thread's lock sooner so it's not evaluated on every fork
exit and we can rely on it elsewhere.

Reviewed by:	mav, kib, jhb, markj
Differential Revision:	https://reviews.freebsd.org/D22624
2019-12-02 22:35:45 +00:00
Mateusz Guzik
5fe188b1e8 lockmgr: remove more remnants of adaptive spinning
Sponsored by:	The FreeBSD Foundation
2019-12-01 00:35:08 +00:00
Kyle Evans
1b50b999f9 tty: implement TIOCNOTTY
Generally, it's preferred that an application fork/setsid if it doesn't want
to keep its controlling TTY, but it could be that a debugger is trying to
steal it instead -- so it would hook in, drop the controlling TTY, then do
some magic to set things up again. In this case, TIOCNOTTY is quite handy
and still respected by at least OpenBSD, NetBSD, and Linux as far as I can
tell.

I've dropped the note about obsoletion, as I intend to support TIOCNOTTY as
long as it doesn't impose a major burden.

Reviewed by:	bcr (manpages), kib
Differential Revision:	https://reviews.freebsd.org/D22572
2019-11-30 20:10:50 +00:00
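A hedged userland sketch of the debugger-style use described above (error
handling trimmed):

    #include <sys/ioctl.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Drop the controlling terminal without fork()/setsid(). */
    static void
    drop_ctty(void)
    {
            int fd;

            fd = open("/dev/tty", O_RDWR);
            if (fd != -1) {
                    (void)ioctl(fd, TIOCNOTTY, NULL);
                    close(fd);
            }
    }
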
Mateusz Guzik
e0a1a1e6cb smp: cast the read in quiesce_all_critical through void *
Fixes compilation on some 32-bit arm platforms.

Sponsored by:	The FreeBSD Foundation
2019-11-30 19:33:02 +00:00
Mateusz Guzik
3ac2ac2e08 lockprof: use IPI-injected fences to fix hangs on stat dump and reset
The previously used quiesce_all_cpus walks all CPUs and waits until curthread
can run on them. Even on contemporary machines this becomes a significant
problem under load when it can literally take minutes for the operation to
complete. With the patch the stall is normally less than 1 second.

Reviewed by:	kib, jeff (previous version)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21740
2019-11-30 17:24:42 +00:00
Mateusz Guzik
5032fe17a2 Add a way to inject fences using IPIs
A variant of this facility was already used by rmlocks where IPIs would
enforce ordering.

This allows to elide fences where they are rarely needed and the cost of
IPI (should it be necessary) is cheaper.

Reviewed by:	kib, jeff (previous version)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21740
2019-11-30 17:22:10 +00:00
Mateusz Guzik
a02cab334c devfs: introduce a per-dev lock to protect ->si_devsw
This allows bumping threadcount without taking the global devmtx lock.

In particular this eliminates contention on said lock while using bhyve
with multiple vms.

Reviewed by:	kib
Tested by:	markj
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22548
2019-11-30 16:46:19 +00:00
Kyle Evans
9e387c3da2 tty_rel_gone: add locking assertion
We already assert the lock is held later during tty_rel_free(), but it is
arguably good form to clarify locking expectations here as well, at the
top-level entry point that other drivers use.
2019-11-29 14:46:13 +00:00
Konstantin Belousov
fdc6b10d44 Add a VN_OPEN_INVFS flag.
vn_open_cred() assumes that it is called from the top-level of a VFS
syscall.  Writers must call bwillwrite() before locking any VFS
resource to wait for cleanup of dirty buffers.

ZFS getextattr() and setextattr() VOPs do call vn_open_cred(), which
results in wait for unrelated buffers while owning ZFS vnode lock (and
ZFS does not use buffer cache).  VN_OPEN_INVFS allows caller to skip
bwillwrite.

Note that ZFS is still incorrect there, because it starts a write on an
mp and locks a vnode while holding another vnode lock.

Reported by:	Willem Jan Withagen <wjw@digiware.nl>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-11-29 14:02:32 +00:00
Ryan Libby
815db2f6f8 ktls_session zone: don't need to specify uma trash
The use of the uma trash procedures is automatic, there's no need to
pass them explicitly here.

Reviewed by:	markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D22582
2019-11-29 06:25:03 +00:00
Kyle Evans
cf29433090 tty_pts: don't rely on tty header pollution for sys/mutex.h
tty_pts.c relies on sys/tty.h for sys/mutex.h. Include it directly instead
of relying on this pollution to ease the diff for anyone that wants to try
converting the tty lock to anything other than a mutex.
2019-11-29 03:56:01 +00:00
Jeff Roberson
6d6a03d7a8 Handle large mallocs by going directly to kmem. Taking a detour through
UMA does not provide any additional value.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D22563
2019-11-29 03:14:10 +00:00
Jeff Roberson
b476ae7f52 Fix DEBUG_REDZONE build after r355169 2019-11-28 08:56:14 +00:00
Hans Petter Selasky
c2a8682ae8 Factor out check for mounted root file system.
Differential Revision:	https://reviews.freebsd.org/D22571
PR:		241639
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2019-11-28 08:47:36 +00:00