16958 Commits

Author SHA1 Message Date
Ryan Libby
59fb4a95c7 witness: sleepable rm locks are not sleepable in read mode
There are two classes of rm lock, one "sleepable" and one not.  But even
a "sleepable" rm lock is only sleepable in write mode, and is
non-sleepable when taken in read mode.

Warn about sleepable rm locks in read mode as non-sleepable locks.  Do
this by defining a new lock operation flag, LOP_NOSLEEP, to indicate
that a lock is non-sleepable despite what the LO_SLEEPABLE flag would
indicate, and defining a new witness lock instance flag, LI_SLEEPABLE,
to track the product of LO_SLEEPABLE and LOP_NOSLEEP on the lock
instance.

Reviewed by:	markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D22527
2019-11-27 01:54:39 +00:00
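As a rough illustration of 59fb4a95c7, the per-instance flag can be read as "the lock class is sleepable and this particular acquisition did not ask for non-sleepable treatment". The sketch below assumes made-up flag values and a hypothetical helper named demo_instance_flags(); only the flag names come from the commit message.

```c
#include <stdint.h>

/* Illustrative values only; the kernel's real definitions may differ. */
#define LO_SLEEPABLE	0x0001u	/* lock class may be held across a sleep */
#define LOP_NOSLEEP	0x0002u	/* this acquisition is non-sleepable anyway */
#define LI_SLEEPABLE	0x0004u	/* per-instance result tracked by witness */

/*
 * Hypothetical helper: derive the lock instance flags from the lock
 * object flags and the flags of the current lock operation.
 */
static uint32_t
demo_instance_flags(uint32_t lo_flags, uint32_t lop_flags)
{
	uint32_t li_flags = 0;

	if ((lo_flags & LO_SLEEPABLE) != 0 && (lop_flags & LOP_NOSLEEP) == 0)
		li_flags |= LI_SLEEPABLE;
	return (li_flags);
}
```

Under these assumptions, a "sleepable" rm lock taken in read mode would pass LOP_NOSLEEP and end up without LI_SLEEPABLE, so it is checked like any other non-sleepable lock.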
Mateusz Guzik
588e69e2fd cache: stop reusing .. entries on enter
It almost never happens in practice anyway. With this eliminated ->nc_vp
cannot change vnodes, removing an obstacle on the road to lockless
lookup.
2019-11-27 01:21:42 +00:00
Mateusz Guzik
2ac930e32c cache: fix numcache accounting on entry
. entries are never created and .. can reuse existing entries,
meaning the early count bump is both spurious and leading to
overcounting in certain cases.
2019-11-27 01:20:55 +00:00
Mateusz Guzik
36afce39ae cache: hide "doingcache" behind DEBUG_CACHE 2019-11-27 01:20:21 +00:00
Hans Petter Selasky
aa4612d133 Fix panic when loading kernel modules before root file system is mounted.
Make sure the rootvnode is always NULL checked.

Differential Revision:	https://reviews.freebsd.org/D22545
PR:		241639
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2019-11-26 12:20:44 +00:00
Mariusz Zaborski
8e49361164 procdesc: allow collecting status through wait(2) if process is traced
Debuggers like truss(1) depend on the wait(2) syscall. This syscall
waits for ALL children. When it is waiting for ALL children, the children
created by process descriptors are not returned. This behavior was
introduced because we want to implement libraries which may pdfork(2).

The behavior of process descriptors breaks truss(1) because it will
not be able to collect the status of processes with process descriptors.

To address this problem the status is returned to parent when the
child is traced. While the process is traced the debugger is the new parent.
In case the original parent and debugger are the same process it means the
debugger explicitly used pdfork() to create the child. In that case the debugger
should be using kqueue()/pdwait() instead of wait().

Add test case to verify that. The test case was implemented by markj@.

Reviewed by:	kib, markj
Discussed with:	jhb
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D20362
2019-11-25 18:33:21 +00:00
Ryan Libby
43cefe8b19 sysctl sysctls: wire old buf before output with sysctl lock
Several sysctl sysctls output to a user buffer while holding a
non-sleepable lock that protects the sysctl topology.  They need to wire
the output buffer, or else they may try to sleep on a page fault.

Reviewed by:	cem, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D22528
2019-11-25 07:38:27 +00:00
Konstantin Belousov
b631c36f0d Record part of the owner struct thread pointer into busy_lock.
Record as many bits from curthread into busy_lock as fit.  Low bits
of the struct thread * representation are zero due to struct and zone
alignment, and they leave space for busy flags (perhaps except
statically allocated thread0).  Upper bits are not very interesting
for assert, and in most practical situations the recorded value should
allow the owner to be identified manually with certainty.

Assert that unbusy is performed by the owner, except in a few places where
unbusy is done in an I/O completion handler.  For this case, add
_unchecked variants of asserts and unbusy primitives.

Reviewed by:	markj (previous version)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D22298
2019-11-24 19:12:23 +00:00
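The idea in b631c36f0d of packing part of the owner pointer next to the busy flags can be sketched as follows. This is a user-space illustration, not the kernel code: the 32-bit lock word, the 4-bit flag mask, and the demo_* names are all assumptions made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed for illustration: the low 4 bits of the word hold busy flags. */
#define DEMO_BUSY_FLAG_MASK	0xfu

/*
 * Build a lock word from an owner pointer value and flag bits.  The low
 * bits of an aligned pointer are zero, so they can carry the flags; the
 * upper pointer bits that do not fit in 32 bits are simply dropped.
 */
static uint32_t
demo_busy_word(uint64_t owner, uint32_t flags)
{
	return (((uint32_t)owner & ~DEMO_BUSY_FLAG_MASK) |
	    (flags & DEMO_BUSY_FLAG_MASK));
}

/* Check whether the recorded owner bits match the given thread pointer. */
static bool
demo_busy_owned_by(uint32_t word, uint64_t td)
{
	return ((word & ~DEMO_BUSY_FLAG_MASK) ==
	    ((uint32_t)td & ~DEMO_BUSY_FLAG_MASK));
}
```

Because the upper bits are discarded, two distinct threads could in principle compare equal; the commit message accepts this, noting that the recorded value is enough in most practical situations.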
Warner Losh
a921c2003f Add a warning about Giant Locked devices
Add a warning when a device registers with devfs and requests
D_NEEDGIANT. The warning says the device will go away before
13.0. This is needed to flush out the devices in the tree that are
still Giant locked. This warning, or some variant of it, should have
gone into the tree a long time ago...

The intention is to require all devices be converted to not use
automatic giant in this way, or remove any such devices that remain
that we don't have the hardware to test a conversion of.

kbd so far is the only device that can't leave the tree, yet needs
something sensible done to avoid the auto giant lock (even if it is
just doing the wrapping itself). There may be others added to this
list... Any discussions of this topic will take place on arch@.
2019-11-23 23:57:26 +00:00
Conrad Meyer
7993a104a1 Add explicit SI_SUB_EPOCH
Add explicit SI_SUB_EPOCH, after SI_SUB_TASKQ and before SI_SUB_SMP
(EARLY_AP_STARTUP).  Rename existing "SI_SUB_TASKQ + 1" to SI_SUB_EPOCH.

epoch(9) consumers cannot epoch_alloc() before SI_SUB_EPOCH:SI_ORDER_SECOND,
but likely should allocate before SI_SUB_SMP.  Prior to this change,
consumers (well, epoch itself, and net/if.c) just open-coded the
SI_SUB_TASKQ + 1 order to match epoch.c, but this was fragile.

Reviewed by:	mmacy
Differential Revision:	https://reviews.freebsd.org/D22503
2019-11-22 23:23:40 +00:00
Gleb Smirnoff
329377f44b cc_ktr_event_name is used only with KTR 2019-11-21 23:55:43 +00:00
Alexander Motin
130fffa2a3 Add variant of root_mount_hold() without allocation.
This allows the KPI to be used in non-sleepable contexts.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2019-11-21 21:59:35 +00:00
Andrew Turner
a27ac4644a Disable KCSAN within a panic.
The kernel is single threaded at this point and the panic is more
important.

Sponsored by:	DARPA, AFRL
2019-11-21 13:59:01 +00:00
Andrew Turner
68cad68149 Add kcsan_md_unsupported from NetBSD.
It's used to ignore virtual addresses that may have a different physical
address depending on the CPU.

Sponsored by:	DARPA, AFRL
2019-11-21 13:22:23 +00:00
Andrew Turner
bba0065f0d Fix the bus_space functions with KCSAN on arm64.
Arm64 doesn't define the bus_space_set_multi_stream and
bus_space_set_region_stream functions. Don't try to define them there.

Sponsored by:	DARPA, AFRL
2019-11-21 13:12:58 +00:00
Andrew Turner
849aef496d Port the NetBSD KCSAN runtime to FreeBSD.
Update the NetBSD Kernel Concurrency Sanitizer (KCSAN) runtime to work in
the FreeBSD kernel. It is a useful tool for finding data races between
threads executing on different CPUs.

This can be enabled by enabling KCSAN in the kernel config, or by using the
GENERIC-KCSAN amd64 kernel. It works on amd64 and arm64, however the latter
needs a compiler change to allow -fsanitize=thread that KCSAN uses.

Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D22315
2019-11-21 11:22:08 +00:00
Andrew Turner
0cb5357037 Import the NetBSD Kernel Concurrency Sanitizer (KCSAN) runtime.
KCSAN is a tool to find concurrent memory accesses that may race with each
other. After a set number of memory accesses, a cell is created that
describes the current access. It will then delay for a short period
to allow other CPUs a chance to race. If another CPU performs a memory
access to an overlapping region during this delay, the race is reported.

This is a straight import of the NetBSD code, it will be adapted to
FreeBSD in a future commit.

Sponsored by:	DARPA, AFRL
2019-11-20 14:37:48 +00:00
Mateusz Guzik
d578a4256e cache: minor stat cleanup
Remove duplicated stats and move numcachehv from debug to vfs.cache.
2019-11-20 12:08:32 +00:00
Mateusz Guzik
d957f3a4f0 vfs: perform a more racy check in vfs_notify_upper
Locking mp does not buy anything in terms of correctness and only contributes to
contention.
2019-11-20 12:07:54 +00:00
Mateusz Guzik
1fccb43c39 vfs: change si_usecount management to count used vnodes
Currently si_usecount is effectively a sum of usecounts from all associated
vnodes. This is maintained by special-casing for VCHR every time usecount is
modified. Apart from complicating the code a little bit, it has a scalability
impact since it forces a read from a cacheline shared with said count.

There are no consumers of the feature in the ports tree. In head there are only
2: revoke and devfs_close. Both can get away with a weaker requirement than the
exact usecount, namely just the count of active vnodes. Changing the meaning to
the latter means we only need to modify it on 0<->1 transitions, avoiding the
check plenty of times (and entirely in something like vrefact).

Reviewed by:	kib, jeff
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D22202
2019-11-20 12:05:59 +00:00
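The change of meaning described in 1fccb43c39 can be sketched in a few lines: the device-level counter is touched only when a vnode's own use count moves between 0 and 1. The demo_* structures and functions below are hypothetical, single-threaded stand-ins; the real code uses kernel locking and atomics that are not shown here.

```c
/* Hypothetical stand-ins for the device and vnode structures. */
struct demo_dev {
	int	si_usecount;	/* number of vnodes with a non-zero use count */
};

struct demo_vnode {
	int		 v_usecount;
	struct demo_dev	*v_dev;
};

/* Take a reference; bump the device count only on the 0 -> 1 transition. */
static void
demo_vref(struct demo_vnode *vp)
{
	if (vp->v_usecount++ == 0)
		vp->v_dev->si_usecount++;
}

/* Drop a reference; lower the device count only on the 1 -> 0 transition. */
static void
demo_vrele(struct demo_vnode *vp)
{
	if (--vp->v_usecount == 0)
		vp->v_dev->si_usecount--;
}
```

In this sketch a vnode that is already in use can gain and lose references without ever reading the shared device counter, which is the scalability point the commit message makes.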
Jeff Roberson
639676877b Simplify anonymous memory handling with an OBJ_ANON flag. This eliminates
redundant complicated checks and additional locking required only for
anonymous memory.  Introduce vm_object_allocate_anon() to create these
objects.  DEFAULT and SWAP objects now have the correct settings for
non-anonymous consumers and so individual consumers need not modify the
default flags to create super-pages and avoid ONEMAPPING/NOSPLIT.

Reviewed by:	alc, dougm, kib, markj
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D22119
2019-11-19 23:19:43 +00:00
Kyle Evans
4cc12fb848 sysent: regenerate after r354835
The lua-based makesyscalls produces slightly different output than its
makesyscalls.sh predecessor, all whitespace differences more closely
matching the source syscalls.master.
2019-11-18 23:31:12 +00:00
Kyle Evans
f22a592111 Convert in-tree sysent targets to use new makesyscalls.lua
flua is bootstrapped as part of the build for those on older
versions/revisions that don't yet have flua installed. Once upgraded past
r354833, "make sysent" will again naturally work as expected.

Reviewed by:	brooks
Differential Revision:	https://reviews.freebsd.org/D21894
2019-11-18 23:28:23 +00:00
John Baldwin
03b0d68c72 Check for errors from copyout() and suword*() in sv_copyout_args/strings.
Reviewed by:	brooks, kib
Tested on:	amd64 (amd64, i386, linux64), i386 (i386, linux)
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D22401
2019-11-18 20:07:43 +00:00
David Bright
2d5603fe65 Jail and capability mode for shm_rename; add audit support for shm_rename
Co-mingling two things here:

  * Addressing some feedback from Konstantin and Kyle re: jail,
    capability mode, and a few other things
  * Adding audit support as promised.

The audit support change includes a partial refresh of OpenBSM from
upstream, where the change to add shm_rename has already been
accepted. Matthew doesn't plan to work on refreshing anything else to
support audit for those new event types.

Submitted by:	Matthew Bryan <matthew.bryan@isilon.com>
Reviewed by:	kib
Relnotes:	Yes
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D22083
2019-11-18 13:31:16 +00:00
Konstantin Belousov
01a2b5679b kern_exec: p_osrel and p_fctl0 were obliterated by failed execve(2) attempt.
Zeroing of them is needed so that an image activator can update the
values as appropriate (or not set at all).

Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D22379
2019-11-17 14:52:45 +00:00
Scott Long
de890ea465 Create a new sysctl subtree, machdep.mitigations. Its purpose is to organize
knobs and indicators for code that mitigates functional and security issues
in the architecture/platform.  Controls for regular operational policy should
still go into places such as security, hw, kern, etc.

The machdep root node is inherently architecture dependent, but mitigations
tend to be architecture dependent as well.  Some cases like Spectre do cross
architectural boundaries, but the mitigation code for them tends to be
architecture dependent anyways, and multiple architectures won't be active
in the same image of the kernel.

Many mitigation knobs already exist in the system, and they will be moved
with compat naming in the future.  Going forward, mitigations should collect
in machdep.mitigations.

Reviewed by:	imp, brooks, rwatson, emaste, jhb
Sponsored by:	Intel
2019-11-15 23:27:17 +00:00
John Baldwin
e353233118 Add a sv_copyout_auxargs() hook in sysentvec.
Change the FreeBSD ELF ABIs to use this new hook to copyout ELF auxv
instead of doing it in the sv_fixup hook.  In particular, this new
hook allows the stack space to be allocated at the same time the auxv
values are copied out to userland.  This allows us to avoid wasting
space for unused auxv entries as well as not having to recalculate
where the auxv vector is by walking back up over the argv and
environment vectors.

Reviewed by:	brooks, emaste
Tested on:	amd64 (amd64 and i386 binaries), i386, mips, mips64
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D22355
2019-11-15 18:42:13 +00:00
Brooks Davis
96c914ee97 Tidy syscall declarations.
Pointer arguments should be of the form "<type> *..." and not "<type>* ...".

No functional change.

Reviewed by:	kevans
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D22373
2019-11-14 17:11:52 +00:00
Mark Johnston
1cbfe73da5 Fix handling of PIPE_EOF in the direct write path.
Suppose a writing thread has pinned its pages and gone to sleep with
pipe_map.cnt > 0.  Suppose that the thread is woken up by a signal (so
error != 0) and the other end of the pipe has simultaneously been
closed.  In this case, to satisfy the assertion about pipe_map.cnt in
pipe_destroy_write_buffer(), we must mark the buffer as empty.

Reported by:	syzbot+5cce271bf2cb1b1e1876@syzkaller.appspotmail.com
Reviewed by:	kib
Tested by:	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22261
2019-11-11 20:44:30 +00:00
Rick Macklem
48e4857859 Update copy_file_range(2) to be Linux5 compatible.
The current linux man page and testing done on a fairly recent linux5.n
kernel have identified two changes to the semantics of the linux
copy_file_range system call.
Since the copy_file_range(2) system call is intended to be linux compatible
and is only currently in head/current and not used by any commands,
it seems appropriate to update the system call to be compatible with
the current linux one.
The first of these semantic changes was changed to be compatible with
linux5.n by r354564.
For the second semantic change, the old linux man page stated that, if
infd and outfd referred to the same file, EBADF should be returned.
Now, the semantics is to allow infd and outfd to refer to the same file
so long as the byte ranges defined by the input file offset, output file offset
and len do not overlap. If the byte ranges do overlap, EINVAL should be
returned.
This patch modifies copy_file_range(2) to be linux5.n compatible for this
semantic change.
2019-11-10 01:08:14 +00:00
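The same-file rule described in 48e4857859 reduces to an interval overlap test. The following is a minimal user-space sketch of that check, not the kernel implementation; the demo_* function names and the same_file parameter are hypothetical.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Two ranges of equal length, [inoff, inoff + len) and
 * [outoff, outoff + len), overlap exactly when the distance between
 * their starting offsets is smaller than len.
 */
static bool
demo_ranges_overlap(uint64_t inoff, uint64_t outoff, uint64_t len)
{
	if (len == 0)
		return (false);
	if (inoff <= outoff)
		return (outoff - inoff < len);
	return (inoff - outoff < len);
}

/* Return EINVAL only when both descriptors name the same file and overlap. */
static int
demo_check_same_file(bool same_file, uint64_t inoff, uint64_t outoff,
    uint64_t len)
{
	if (same_file && demo_ranges_overlap(inoff, outoff, len))
		return (EINVAL);
	return (0);
}
```

Comparing the distance between offsets, rather than computing inoff + len, keeps the check free of unsigned overflow for offsets near the top of the range.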
Rick Macklem
15930ae180 Update copy_file_range(2) to be Linux5 compatible.
The current linux man page and testing done on a fairly recent linux5.n
kernel have identified two changes to the semantics of the linux
copy_file_range system call.
Since the copy_file_range(2) system call is intended to be linux compatible
and is only currently in head/current and not used by any commands,
it seems appropriate to update the system call to be compatible with
the current linux one.
The old linux man page stated that, if the
offset + len exceeded file_size for the input file, EINVAL should be returned.
Now, the semantics is to copy up to at most file_size bytes and return that
number of bytes copied. If the offset is at or beyond file_size, 0 bytes
are returned.
This patch modifies copy_file_range(2) to be linux compatible for this
semantic change.
A separate patch will change copy_file_range(2) for the other semantic
change, which allows the infd and outfd to refer to the same file, so
long as the byte ranges do not overlap.
2019-11-08 23:39:17 +00:00
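The length handling described in 15930ae180 can be sketched as a clamp against the input file size. This is an illustration under the stated semantics only; demo_clamp_copy_len() is a hypothetical name and the real system call does considerably more.

```c
#include <stdint.h>

/*
 * Hypothetical helper: given the input offset, the requested length and
 * the input file size, return how many bytes would be copied.
 */
static uint64_t
demo_clamp_copy_len(uint64_t offset, uint64_t len, uint64_t file_size)
{
	/* At or beyond end of file: nothing to copy, report 0 bytes. */
	if (offset >= file_size)
		return (0);
	/* Request runs past end of file: copy only what is there. */
	if (len > file_size - offset)
		return (file_size - offset);
	return (len);
}
```

For a 100-byte file, a request for 20 bytes at offset 90 would yield 10 in this sketch, where the older semantics called for EINVAL.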
Gleb Smirnoff
1a49612526 Mechanically convert INP_INFO_RLOCK() to NET_EPOCH_ENTER().
Remove a few outdated comments and extraneous assertions.  No
functional change here.
2019-11-07 00:08:34 +00:00
Gleb Smirnoff
b8c923032f If vm_pager_get_pages_async() returns an error synchronously we leak wired
and busy pages.  Add code that carefully cleans up the state in case
of a synchronous error return.  Cover the case when the first I/O went on
asynchronously, but the second or N-th returned an error synchronously.

In collaboration with:	chs
Reviewed by:		jtl, kib
2019-11-06 23:45:43 +00:00
Bjoern A. Zeeb
28d7601989 m_pulldown(): Change an if () panic() into a KASSERT().
If we pass in a NULL mbuf to m_pulldown() we are in a bad situation
already.  There is no point in doing that check for production code.
Change the if () panic() into a KASSERT.

MFC after:	3 weeks
Sponsored by:	Netflix
2019-11-06 22:40:19 +00:00
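The difference between an unconditional "if () panic()" and a KASSERT(), as in 28d7601989, is that the assertion is compiled out of production kernels. The user-space sketch below imitates that with a hypothetical DEMO_KASSERT macro gated on a hypothetical DEMO_INVARIANTS define; it is not the kernel's KASSERT and takes a plain string rather than a printf-style argument list.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for KASSERT: active only in debugging builds. */
#ifdef DEMO_INVARIANTS
#define	DEMO_KASSERT(exp, msg) do {					\
	if (!(exp)) {							\
		fprintf(stderr, "panic: %s\n", (msg));			\
		abort();						\
	}								\
} while (0)
#else
#define	DEMO_KASSERT(exp, msg) do { } while (0)
#endif

/*
 * The NULL check costs nothing when DEMO_INVARIANTS is not defined;
 * callers are expected never to pass NULL in the first place.
 */
static size_t
demo_len(const char *s)
{
	DEMO_KASSERT(s != NULL, "demo_len: NULL string");
	return (strlen(s));
}
```

Building with -DDEMO_INVARIANTS restores the check for testing, which mirrors the reasoning in the commit message that a NULL argument already indicates a bad situation.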
Brooks Davis
89f34d4611 libstats: Improve ABI assertion.
On platforms where pointers are larger than 64-bits, struct statsblob
may be harmlessly padded out such that opaque[] always has some included
space.  Make the assertion more general by comparing to the offset of
opaque rather than the size of struct statsblob.

Discussed with:	jhb, James Clarke
Reviewed by:	trasz, lstewart
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D22188
2019-11-06 19:44:44 +00:00
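The point made in 89f34d4611 is that sizeof() of a structure ending in a flexible array can include tail padding, while offsetof() of the array gives the true header length. The sketch below uses a hypothetical struct demo_blob, not the real struct statsblob, to show the more general form of such a check.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical blob header.  On ABIs with strict alignment for the
 * first member, the compiler may pad the structure so that sizeof()
 * is larger than the offset of opaque[].
 */
struct demo_blob {
	uint64_t	maxsz;
	uint8_t		abi;
	uint8_t		opaque[];
};

/* The general relation that holds regardless of padding. */
_Static_assert(offsetof(struct demo_blob, opaque) <= sizeof(struct demo_blob),
    "opaque[] must start within the structure");

/* Header length as seen by consumers of the opaque data. */
static size_t
demo_blob_hdr_size(void)
{
	return (offsetof(struct demo_blob, opaque));
}
```

Code that sizes or validates the header with offsetof() keeps working when padding grows, for example on platforms where pointers are wider than 64 bits.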
Alexander Motin
3db35ffa2a Some more taskqueue optimizations.
 - Optimize enqueue for two task priority values by adding new tq_hint
field, pointing to the last task inserted into the middle of the list.
In case of more than two priority values it should halve average search.
 - Move tq_active insert/remove out of the taskqueue_run_locked loop.
Instead of dirtying few shared cache lines per task introduce different
mechanism to drain active tasks, based on task sequence number counter,
that uses only cache lines already present in cache.  Since the new
mechanism does not need ordering, switch tq_active from TAILQ to LIST.
 - Move static and dynamic struct taskqueue fields into different cache
lines.  Move lock into its own cache line, so that heavy lock spinning
by multiple waiting threads would not affect the running thread.
 - While there, correct some TQ_SLEEP() wait messages.

This change fixes certain ZFS write workloads that cause huge congestion
on the taskqueue lock.  Those workloads combine some large block writes to
saturate the pool and trigger allocation throttling, which uses higher
priority tasks to requeue the delayed I/Os, with many small blocks to
generate deep queue of small tasks for taskqueue to sort.

MFC after:	1 week
Sponsored by:	iXsystems, Inc.
2019-11-01 22:49:44 +00:00
Ed Maste
2e5f9189bb avoid kernel stack data leak in core dump thrmisc note
bzero the entire thrmisc struct, not just the padding.  Other core dump
notes are already done this way.

Reported by:	Ilja Van Sprundel <ivansprundel@ioactive.com>
Reviewed by:	markj
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
2019-10-31 20:42:36 +00:00
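The fix in 2e5f9189bb follows a common pattern: zero the whole structure before filling it, so that neither padding nor unused tail bytes carry stale stack contents. The sketch below shows the pattern on a hypothetical struct demo_note; it is not the thrmisc structure.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical note layout; the compiler may pad between the members. */
struct demo_note {
	char		name[5];
	uint32_t	value;
};

/*
 * Clear every byte of the note first, then fill in the fields.  Zeroing
 * only individual members would leave padding bytes and the unused tail
 * of name[] with whatever was on the stack before.
 */
static void
demo_fill_note(struct demo_note *n, const char *name, uint32_t value)
{
	memset(n, 0, sizeof(*n));
	snprintf(n->name, sizeof(n->name), "%s", name);
	n->value = value;
}
```

A caller that later writes the structure out byte for byte, as a core dump note does, then exposes only zeros outside the fields it set.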
Jeff Roberson
67d0e29304 Replace OBJ_MIGHTBEDIRTY with a system using atomics. Remove the TMPFS_DIRTY
flag and use the same system.

This enables further fault locking improvements by allowing more faults to
proceed with a shared lock.

Reviewed by:	kib
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D22116
2019-10-29 21:06:34 +00:00
Jeff Roberson
6ee653cfeb Drop the object lock in vfs_bio and cluster where it is now safe to do so.
Recent changes to busy/valid/dirty have enabled page based synchronization
and the object lock is no longer required in many cases.

Reviewed by:	kib
Sponsored by:	Netflix, Intel
Differential Revision:	https://reviews.freebsd.org/D21597
2019-10-29 20:37:59 +00:00
Gleb Smirnoff
5757b59f3e Merge td_epochnest with td_no_sleeping.
Epoch itself doesn't rely on the counter and it is provided
merely for sleeping subsystems to check it.

- In functions that sleep use THREAD_CAN_SLEEP() to assert
  correctness.  With EPOCH_TRACE compiled print epoch info.
- _sleep() was the wrong place to put the assertion for epoch; the
  right place is sleepq_add(), as there are ways to call the
  latter bypassing _sleep().
- Do not increase td_no_sleeping in non-preemptible epochs.
  The critical section would trigger all possible safeguards,
  no sleeping counter is extraneous.

Reviewed by:	kib
2019-10-29 17:28:25 +00:00
Konstantin Belousov
5e921ff49e amd64: move pcb out of kstack to struct thread.
This saves 320 bytes of the precious stack space.

The only negative aspect of the change I can think of is that the
struct thread increased by 320 bytes obviously, and that 320 bytes are
not swapped out anymore. I believe the freed stack space is much more
important than that.  Also, current struct thread size is 1392 bytes
on amd64, so UMA will allocate two thread structures per (4KB) slab,
which leaves a space for pcb without increasing zone memory use.

Reviewed by:	alc, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D22138
2019-10-25 20:09:42 +00:00
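The slab arithmetic in 5e921ff49e can be checked with a short calculation. The sketch assumes that 1392 bytes is the struct thread size before the 320-byte pcb is added, and it ignores any per-slab bookkeeping overhead and alignment; demo_items_per_slab() is a hypothetical helper, not a UMA function.

```c
#include <stddef.h>

#define DEMO_SLAB_SIZE		4096u	/* one 4KB slab */
#define DEMO_THREAD_SIZE	1392u	/* struct thread, per the message */
#define DEMO_PCB_SIZE		320u	/* pcb moved into struct thread */

/* Whole items that fit in a slab, ignoring slab header and alignment. */
static size_t
demo_items_per_slab(size_t slab_size, size_t item_size)
{
	return (slab_size / item_size);
}
```

With these numbers, 4096 / 1392 and 4096 / (1392 + 320) = 4096 / 1712 both round down to 2, so the larger structure still fits two per slab and the zone's memory use stays the same under these assumptions.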
Gleb Smirnoff
ed9d69b5e8 Use THREAD_CAN_SLEEP() macro to check if thread can sleep. There is no
functional change.

Discussed with:	kib
2019-10-24 21:55:19 +00:00
John Baldwin
7d29eb9a91 Use a counter with a random base for explicit IVs in GCM.
This permits constructing the entire TLS header in ktls_frame() rather
than ktls_seq().  This also matches the approach used by OpenSSL which
uses an incrementing nonce as the explicit IV rather than the sequence
number.

Reviewed by:	gallatin
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D22117
2019-10-24 18:13:26 +00:00
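The scheme in 7d29eb9a91 can be pictured as a per-session 64-bit counter that starts at a random value and is incremented once per record. The sketch below is an assumption-laden illustration: the demo_* names are hypothetical, the caller is expected to seed next_iv from a suitable random source, and the big-endian byte order of the 8-byte explicit IV is chosen for the example rather than taken from the commit.

```c
#include <stdint.h>

/* Hypothetical per-session state; next_iv is seeded randomly by the caller. */
struct demo_gcm_state {
	uint64_t	next_iv;
};

/*
 * Write the explicit IV for the next record into out[] and advance the
 * counter, so that no two records of a session share an IV.
 */
static void
demo_next_explicit_iv(struct demo_gcm_state *st, uint8_t out[8])
{
	uint64_t iv = st->next_iv++;

	for (int i = 7; i >= 0; i--) {
		out[i] = (uint8_t)(iv & 0xff);
		iv >>= 8;
	}
}
```

Because the IV no longer depends on the record sequence number, it can be produced at the point where the header is framed, which is the benefit the commit message describes.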
Konstantin Belousov
c92f130498 Fix undefined behavior.
Create a sequence point by ending a full expression for call to
vspace() and use of the globals which are modified by vspace().

Reported and reviewed by:	imp
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D22126
2019-10-23 16:06:47 +00:00
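The kind of fix described in c92f130498 can be illustrated with a function that modifies a global and an expression that also reads that global. The names below are hypothetical and do not reflect what vspace() actually does; the sketch only shows how ending the full expression after the call makes the result well defined.

```c
/* Hypothetical global that the helper below modifies. */
static int demo_pos;

/* Advance the global position and return the step taken. */
static int
demo_advance(int n)
{
	demo_pos += n;
	return (n);
}

/*
 * Writing "demo_advance(3) + demo_pos" in one expression would leave it
 * up to the compiler whether demo_pos is read before or after the call.
 * Ending the full expression after the call forces the update first.
 */
static int
demo_fixed(void)
{
	int step = demo_advance(3);

	return (step + demo_pos);
}
```

Starting from demo_pos equal to 10, demo_fixed() always combines the step with the updated position in this sketch.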
Konstantin Belousov
8076c4e7d1 vn_printf(): Decode VI_TEXT_REF.
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2019-10-23 15:51:26 +00:00
Gleb Smirnoff
080e9496b8 Allow epoch tracker to use the very last byte of the stack. Not sure
this will help to avoid panic in this function, since it will also use
some stack, but makes code more strict.

Submitted by:	hselasky
2019-10-22 18:05:15 +00:00
Gleb Smirnoff
77d70e515f Assert that any epoch tracker belongs to the thread stack.
Reviewed by:	kib
2019-10-21 23:12:14 +00:00
Gleb Smirnoff
279b9aabe3 Remove epoch tracker from struct thread. It was an ugly crutch to emulate
locking semantics for if_addr_rlock() and if_maddr_rlock().
2019-10-21 18:19:32 +00:00
Andriy Gapon
3ad1ce46d3 debug.kassert.warnings is a statistic, not a tunable
MFC after:	1 week
2019-10-21 12:21:56 +00:00