Commit Graph

17154 Commits

Author SHA1 Message Date
Hans Petter Selasky
aa4612d133 Fix panic when loading kernel modules before root file system is mounted.
Make sure the rootvnode is always NULL checked.

Differential Revision:	https://reviews.freebsd.org/D22545
PR:		241639
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2019-11-26 12:20:44 +00:00
Mariusz Zaborski
8e49361164 procdesc: allow to collect status through wait(1) if process is traced
The debugger like truss(1) depends on the wait(2) syscall. This syscall
waits for ALL children. When it is waiting for ALL child's the children
created by process descriptors are not returned. This behavior was
introduced because we want to implement libraries which may pdfork(1).

The behavior of process descriptor brakes truss(1) because it will
not be able to collect the status of processes with process descriptors.

To address this problem the status is returned to parent when the
child is traced. While the process is traced the debugger is the new parent.
In case the original parent and debugger are the same process it means the
debugger explicitly used pdfork() to create the child. In that case the debugger
should be using kqueue()/pdwait() instead of wait().

Add test case to verify that. The test case was implemented by markj@.

Reviewed by:	kib, markj
Discussed with:	jhb
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D20362
2019-11-25 18:33:21 +00:00
Ryan Libby
43cefe8b19 sysctl sysctls: wire old buf before output with sysctl lock
Several sysctl sysctls output to a user buffer while holding a
non-sleepable lock that protects the sysctl topology.  They need to wire
the output buffer, or else they may try to sleep on a page fault.

Reviewed by:	cem, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D22528
2019-11-25 07:38:27 +00:00
Konstantin Belousov
b631c36f0d Record part of the owner struct thread pointer into busy_lock.
Record as much bits from curthread into busy_lock as fits.  Low bits
for struct thread * representation are zero due to struct and zone
alignment, and they leave space for busy flags (perhaps except
statically allocated thread0).  Upper bits are not very interesting
for assert, and in most practical situations recorded value should
allow to manually identify the owner with certainity.

Assert that unbusy is performed by the owner, except few places where
unbusy is done in io completion handler.  For this case, add
_unchecked variants of asserts and unbusy primitives.

Reviewed by:	markj (previous version)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D22298
2019-11-24 19:12:23 +00:00
Warner Losh
a921c2003f Add a warning about Giant Locked devices
Add a warning when a device registers with devfs and requests
D_NEEDGIANT. The warning says the device will go away before
13.0. This is needed to flush out the devices in the tree that are
still Giant locked. This warning, or some variant of it, should have
gone into the tree a long time ago...

The intention is to require all devices be converted to not use
automatic giant in this way, or remove any such devices that remain
that we don't have the hardware to test a conversion of.

kbd so far is the only device that can't leave the tree, yet needs
something sensible done to avoid the auto giant lock (even if it is
just doing the wrapping itself). There may be others added to this
list... Any discussions of this topic will take place on arch@.
2019-11-23 23:57:26 +00:00
Conrad Meyer
7993a104a1 Add explicit SI_SUB_EPOCH
Add explicit SI_SUB_EPOCH, after SI_SUB_TASKQ and before SI_SUB_SMP
(EARLY_AP_STARTUP).  Rename existing "SI_SUB_TASKQ + 1" to SI_SUB_EPOCH.

epoch(9) consumers cannot epoch_alloc() before SI_SUB_EPOCH:SI_ORDER_SECOND,
but likely should allocate before SI_SUB_SMP.  Prior to this change,
consumers (well, epoch itself, and net/if.c) just open-coded the
SI_SUB_TASKQ + 1 order to match epoch.c, but this was fragile.

Reviewed by:	mmacy
Differential Revision:	https://reviews.freebsd.org/D22503
2019-11-22 23:23:40 +00:00
Gleb Smirnoff
329377f44b cc_ktr_event_name is used only with KTR 2019-11-21 23:55:43 +00:00
Alexander Motin
130fffa2a3 Add variant of root_mount_hold() without allocation.
It allows to use this KPI in non-sleepable contexts.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2019-11-21 21:59:35 +00:00
Andrew Turner
a27ac4644a Disable KCSAN within a panic.
The kernel is single threaded at this point and the panic is more
important.

Sponsored by:	DARPA, AFRL
2019-11-21 13:59:01 +00:00
Andrew Turner
68cad68149 Add kcsan_md_unsupported from NetBSD.
It's used to ignore virtual addresses that may have a different physical
address depending on the CPU.

Sponsored by:	DARPA, AFRL
2019-11-21 13:22:23 +00:00
Andrew Turner
bba0065f0d Fix the bus_space functions with KCSAN on arm64.
Arm64 doesn't define the bus_space_set_multi_stream and
bus_space_set_region_stream functions. Don't try to define them there.

Sponsored by:	DARPA, AFRL
2019-11-21 13:12:58 +00:00
Andrew Turner
849aef496d Port the NetBSD KCSAN runtime to FreeBSD.
Update the NetBSD Kernel Concurrency Sanitizer (KCSAN) runtime to work in
the FreeBSD kernel. It is a useful tool for finding data races between
threads executing on different CPUs.

This can be enabled by enabling KCSAN in the kernel config, or by using the
GENERIC-KCSAN amd64 kernel. It works on amd64 and arm64, however the later
needs a compiler change to allow -fsanitize=thread that KCSAN uses.

Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D22315
2019-11-21 11:22:08 +00:00
Andrew Turner
0cb5357037 Import the NetBSD Kernel Concurrency Sanitizer (KCSAN) runtime.
KCSAN is a tool to find concurrent memory access that may race each other.
After a determined number of memory accesses a cell is created, this
describes the current access. It will then delay for a short period
to allow other CPUs a chance to race. If another CPU performs a memory
access to an overlapping region during this delay the race is reported.

This is a straight import of the NetBSD code, it will be adapted to
FreeBSD in a future commit.

Sponsored by:	DARPA, AFRL
2019-11-20 14:37:48 +00:00
Mateusz Guzik
d578a4256e cache: minor stat cleanup
Remove duplicated stats and move numcachehv from debug to vfs.cache.
2019-11-20 12:08:32 +00:00
Mateusz Guzik
d957f3a4f0 vfs: perform a more racy check in vfs_notify_upper
Locking mp does not buy anything interms of correctness and only contributes to
contention.
2019-11-20 12:07:54 +00:00
Mateusz Guzik
1fccb43c39 vfs: change si_usecount management to count used vnodes
Currently si_usecount is effectively a sum of usecounts from all associated
vnodes. This is maintained by special-casing for VCHR every time usecount is
modified. Apart from complicating the code a little bit, it has a scalability
impact since it forces a read from a cacheline shared with said count.

There are no consumers of the feature in the ports tree. In head there are only
2: revoke and devfs_close. Both can get away with a weaker requirement than the
exact usecount, namely just the count of active vnodes. Changing the meaning to
the latter means we only need to modify it on 0<->1 transitions, avoiding the
check plenty of times (and entirely in something like vrefact).

Reviewed by:	kib, jeff
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D22202
2019-11-20 12:05:59 +00:00
Jeff Roberson
639676877b Simplify anonymous memory handling with an OBJ_ANON flag. This eliminates
reudundant complicated checks and additional locking required only for
anonymous memory.  Introduce vm_object_allocate_anon() to create these
objects.  DEFAULT and SWAP objects now have the correct settings for
non-anonymous consumers and so individual consumers need not modify the
default flags to create super-pages and avoid ONEMAPPING/NOSPLIT.

Reviewed by:	alc, dougm, kib, markj
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D22119
2019-11-19 23:19:43 +00:00
Kyle Evans
4cc12fb848 sysent: regenerate after r354835
The lua-based makesyscalls produces slightly different output than its
makesyscalls.sh predecessor, all whitespace differences more closely
matching the source syscalls.master.
2019-11-18 23:31:12 +00:00
Kyle Evans
f22a592111 Convert in-tree sysent targets to use new makesyscalls.lua
flua is bootstrapped as part of the build for those on older
versions/revisions that don't yet have flua installed. Once upgraded past
r354833, "make sysent" will again naturally work as expected.

Reviewed by:	brooks
Differential Revision:	https://reviews.freebsd.org/D21894
2019-11-18 23:28:23 +00:00
John Baldwin
03b0d68c72 Check for errors from copyout() and suword*() in sv_copyout_args/strings.
Reviewed by:	brooks, kib
Tested on:	amd64 (amd64, i386, linux64), i386 (i386, linux)
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D22401
2019-11-18 20:07:43 +00:00
David Bright
2d5603fe65 Jail and capability mode for shm_rename; add audit support for shm_rename
Co-mingling two things here:

  * Addressing some feedback from Konstantin and Kyle re: jail,
    capability mode, and a few other things
  * Adding audit support as promised.

The audit support change includes a partial refresh of OpenBSM from
upstream, where the change to add shm_rename has already been
accepted. Matthew doesn't plan to work on refreshing anything else to
support audit for those new event types.

Submitted by:	Matthew Bryan <matthew.bryan@isilon.com>
Reviewed by:	kib
Relnotes:	Yes
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D22083
2019-11-18 13:31:16 +00:00
Konstantin Belousov
01a2b5679b kern_exec: p_osrel and p_fctl0 were obliterated by failed execve(2) attempt.
Zeroing of them is needed so that an image activator can update the
values as appropriate (or not set at all).

Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D22379
2019-11-17 14:52:45 +00:00
Scott Long
de890ea465 Create a new sysctl subtree, machdep.mitigations. Its purpose is to organize
knobs and indicators for code that mitigates functional and security issues
in the architecture/platform.  Controls for regular operational policy should
still go into places security, hw, kern, etc.

The machdep root node is inherently architecture dependent, but mitigations
tend to be architecture dependent as well.  Some cases like Spectre do cross
architectural boundaries, but the mitigation code for them tends to be
architecture dependent anyways, and multiple architectures won't be active
in the same image of the kernel.

Many mitigation knobs already exist in the system, and they will be moved
with compat naming in the future.  Going forward, mitigations should collect
in machdep.mitigations.

Reviewed by:	imp, brooks, rwatson, emaste, jhb
Sponsored by:	Intel
2019-11-15 23:27:17 +00:00
John Baldwin
e353233118 Add a sv_copyout_auxargs() hook in sysentvec.
Change the FreeBSD ELF ABIs to use this new hook to copyout ELF auxv
instead of doing it in the sv_fixup hook.  In particular, this new
hook allows the stack space to be allocated at the same time the auxv
values are copied out to userland.  This allows us to avoid wasting
space for unused auxv entries as well as not having to recalculate
where the auxv vector is by walking back up over the argv and
environment vectors.

Reviewed by:	brooks, emaste
Tested on:	amd64 (amd64 and i386 binaries), i386, mips, mips64
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D22355
2019-11-15 18:42:13 +00:00
Brooks Davis
96c914ee97 Tidy syscall declerations.
Pointer arguments should be of the form "<type> *..." and not "<type>* ...".

No functional change.

Reviewed by:	kevans
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D22373
2019-11-14 17:11:52 +00:00
Mark Johnston
1cbfe73da5 Fix handling of PIPE_EOF in the direct write path.
Suppose a writing thread has pinned its pages and gone to sleep with
pipe_map.cnt > 0.  Suppose that the thread is woken up by a signal (so
error != 0) and the other end of the pipe has simultaneously been
closed.  In this case, to satisfy the assertion about pipe_map.cnt in
pipe_destroy_write_buffer(), we must mark the buffer as empty.

Reported by:	syzbot+5cce271bf2cb1b1e1876@syzkaller.appspotmail.com
Reviewed by:	kib
Tested by:	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22261
2019-11-11 20:44:30 +00:00
Rick Macklem
48e4857859 Update copy_file_range(2) to be Linux5 compatible.
The current linux man page and testing done on a fairly recent linux5.n
kernel have identified two changes to the semantics of the linux
copy_file_range system call.
Since the copy_file_range(2) system call is intended to be linux compatible
and is only currently in head/current and not used by any commands,
it seems appropriate to update the system call to be compatible with
the current linux one.
The first of these semantic changes was changed to be compatible with
linux5.n by r354564.
For the second semantic change, the old linux man page stated that, if
infd and outfd referred to the same file, EBADF should be returned.
Now, the semantics is to allow infd and outfd to refer to the same file
so long as the byte ranges defined by the input file offset, output file offset
and len does not overlap. If the byte ranges do overlap, EINVAL should be
returned.
This patch modifies copy_file_range(2) to be linux5.n compatible for this
semantic change.
2019-11-10 01:08:14 +00:00
Rick Macklem
15930ae180 Update copy_file_range(2) to be Linux5 compatible.
The current linux man page and testing done on a fairly recent linux5.n
kernel have identified two changes to the semantics of the linux
copy_file_range system call.
Since the copy_file_range(2) system call is intended to be linux compatible
and is only currently in head/current and not used by any commands,
it seems appropriate to update the system call to be compatible with
the current linux one.
The old linux man page stated that, if the
offset + len exceeded file_size for the input file, EINVAL should be returned.
Now, the semantics is to copy up to at most file_size bytes and return that
number of bytes copied. If the offset is at or beyond file_size, a return
of 0 bytes is done.
This patch modifies copy_file_range(2) to be linux compatible for this
semantic change.
A separate patch will change copy_file_range(2) for the other semantic
change, which allows the infd and outfd to refer to the same file, so
long as the byte ranges do not overlap.
2019-11-08 23:39:17 +00:00
Gleb Smirnoff
1a49612526 Mechanically convert INP_INFO_RLOCK() to NET_EPOCH_ENTER().
Remove few outdated comments and extraneous assertions.  No
functional change here.
2019-11-07 00:08:34 +00:00
Gleb Smirnoff
b8c923032f If vm_pager_get_pages_async() returns an error synchronously we leak wired
and busy pages.  Add code that would carefully cleanups the state in case
of synchronous error return.  Cover a case when a first I/O went on
asynchronously, but second or N-th returned error synchronously.

In collaboration with:	chs
Reviewed by:		jtl, kib
2019-11-06 23:45:43 +00:00
Bjoern A. Zeeb
28d7601989 m_pulldown(): Change an if () panic() into a KASSERT().
If we pass in a NULL mbuf to m_pulldown() we are in a bad situation
already.  There is no point in doing that check for production code.
Change the if () panic() into a KASSERT.

MFC after:	3 weeks
Sponsored by:	Netflix
2019-11-06 22:40:19 +00:00
Brooks Davis
89f34d4611 libstats: Improve ABI assertion.
On platforms where pointers are larger than 64-bits, struct statsblob
may be harmlessly padded out such that opaque[] always has some included
space.  Make the assertion more general by comparing to the offset of
opaque rather than the size of struct statsblob.

Discussed with:	jhb, James Clarke
Reviewed by:	trasz, lstewart
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D22188
2019-11-06 19:44:44 +00:00
Alexander Motin
3db35ffa2a Some more taskqueue optimizations.
- Optimize enqueue for two task priority values by adding new tq_hint
field, pointing to the last task inserted into the middle of the list.
In case of more then two priority values it should halve average search.
 - Move tq_active insert/remove out of the taskqueue_run_locked loop.
Instead of dirtying few shared cache lines per task introduce different
mechanism to drain active tasks, based on task sequence number counter,
that uses only cache lines already present in cache.  Since the new
mechanism does not need ordering, switch tq_active from TAILQ to LIST.
 - Move static and dynamic struct taskqueue fields into different cache
lines.  Move lock into its own cache line, so that heavy lock spinning
by multiple waiting threads would not affect the running thread.
 - While there, correct some TQ_SLEEP() wait messages.

This change fixes certain ZFS write workloads, causing huge congestion
on taskqueue lock.  Those workloads combine some large block writes to
saturate the pool and trigger allocation throttling, which uses higher
priority tasks to requeue the delayed I/Os, with many small blocks to
generate deep queue of small tasks for taskqueue to sort.

MFC after:	1 week
Sponsored by:	iXsystems, Inc.
2019-11-01 22:49:44 +00:00
Ed Maste
2e5f9189bb avoid kernel stack data leak in core dump thrmisc note
bzero the entire thrmisc struct, not just the padding.  Other core dump
notes are already done this way.

Reported by:	Ilja Van Sprundel <ivansprundel@ioactive.com>
Reviewed by:	markj
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
2019-10-31 20:42:36 +00:00
Jeff Roberson
67d0e29304 Replace OBJ_MIGHTBEDIRTY with a system using atomics. Remove the TMPFS_DIRTY
flag and use the same system.

This enables further fault locking improvements by allowing more faults to
proceed with a shared lock.

Reviewed by:	kib
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D22116
2019-10-29 21:06:34 +00:00
Jeff Roberson
6ee653cfeb Drop the object lock in vfs_bio and cluster where it is now safe to do so.
Recent changes to busy/valid/dirty have enabled page based synchronization
and the object lock is no longer required in many cases.

Reviewed by:	kib
Sponsored by:	Netflix, Intel
Differential Revision:	https://reviews.freebsd.org/D21597
2019-10-29 20:37:59 +00:00
Gleb Smirnoff
5757b59f3e Merge td_epochnest with td_no_sleeping.
Epoch itself doesn't rely on the counter and it is provided
merely for sleeping subsystems to check it.

- In functions that sleep use THREAD_CAN_SLEEP() to assert
  correctness.  With EPOCH_TRACE compiled print epoch info.
- _sleep() was a wrong place to put the assertion for epoch,
  right place is sleepq_add(), as there ways to call the
  latter bypassing _sleep().
- Do not increase td_no_sleeping in non-preemptible epochs.
  The critical section would trigger all possible safeguards,
  no sleeping counter is extraneous.

Reviewed by:	kib
2019-10-29 17:28:25 +00:00
Konstantin Belousov
5e921ff49e amd64: move pcb out of kstack to struct thread.
This saves 320 bytes of the precious stack space.

The only negative aspect of the change I can think of is that the
struct thread increased by 320 bytes obviously, and that 320 bytes are
not swapped out anymore. I believe the freed stack space is much more
important than that.  Also, current struct thread size is 1392 bytes
on amd64, so UMA will allocate two thread structures per (4KB) slab,
which leaves a space for pcb without increasing zone memory use.

Reviewed by:	alc, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D22138
2019-10-25 20:09:42 +00:00
Gleb Smirnoff
ed9d69b5e8 Use THREAD_CAN_SLEEP() macro to check if thread can sleep. There is no
functional change.

Discussed with:	kib
2019-10-24 21:55:19 +00:00
John Baldwin
7d29eb9a91 Use a counter with a random base for explicit IVs in GCM.
This permits constructing the entire TLS header in ktls_frame() rather
than ktls_seq().  This also matches the approach used by OpenSSL which
uses an incrementing nonce as the explicit IV rather than the sequence
number.

Reviewed by:	gallatin
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D22117
2019-10-24 18:13:26 +00:00
Konstantin Belousov
c92f130498 Fix undefined behavior.
Create a sequence point by ending a full expression for call to
vspace() and use of the globals which are modified by vspace().

Reported and reviewed by:	imp
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D22126
2019-10-23 16:06:47 +00:00
Konstantin Belousov
8076c4e7d1 vn_printf(): Decode VI_TEXT_REF.
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2019-10-23 15:51:26 +00:00
Gleb Smirnoff
080e9496b8 Allow epoch tracker to use the very last byte of the stack. Not sure
this will help to avoid panic in this function, since it will also use
some stack, but makes code more strict.

Submitted by:	hselasky
2019-10-22 18:05:15 +00:00
Gleb Smirnoff
77d70e515f Assert that any epoch tracker belongs to the thread stack.
Reviewed by:	kib
2019-10-21 23:12:14 +00:00
Gleb Smirnoff
279b9aabe3 Remove epoch tracker from struct thread. It was an ugly crutch to emulate
locking semantics for if_addr_rlock() and if_maddr_rlock().
2019-10-21 18:19:32 +00:00
Andriy Gapon
3ad1ce46d3 debug,kassert.warnings is a statistic, not a tunable
MFC after:	1 week
2019-10-21 12:21:56 +00:00
Mark Johnston
f822c9e287 Apply mapping protections to preloaded kernel modules on amd64.
With an upcoming change the amd64 kernel will map preloaded files RW
instead of RWX, so the kernel linker must adjust protections
appropriately using pmap_change_prot().

Reviewed by:	kib
MFC after:	1 month
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21860
2019-10-18 13:56:45 +00:00
Mark Johnston
1d9eae9fb2 Apply mapping protections to .o kernel modules.
Use the section flags to derive mapping protections.  When multiple
sections overlap within a page, the union of their protections must be
applied.  With r353701 the .text and .rodata sections are padded to
ensure that this does not happen on amd64.

Reviewed by:	kib
MFC after:	1 month
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21896
2019-10-18 13:53:14 +00:00
Conrad Meyer
dda17b3672 Implement NetGDB(4)
NetGDB(4) is a component of a system using a panic-time network stack to
remotely debug crashed FreeBSD kernels over the network, instead of
traditional serial interfaces.

There are three pieces in the complete NetGDB system.

First, a dedicated proxy server must be running to accept connections from
both NetGDB and gdb(1), and pass bidirectional traffic between the two
protocols.

Second, the NetGDB client is activated much like ordinary 'gdb' and
similarly to 'netdump' in ddb(4) after a panic.  Like other debugnet(4)
clients (netdump(4)), the network interface on the route to the proxy server
must be online and support debugnet(4).

Finally, the remote (k)gdb(1) uses 'target remote <proxy>:<port>' (like any
other TCP remote) to connect to the proxy server.

The NetGDB v1 protocol speaks the literal GDB remote serial protocol, and
uses a 1:1 relationship between GDB packets and sequences of debugnet
packets (fragmented by MTU).  There is no encryption utilized to keep
debugging sessions private, so this is only appropriate for local
segments or trusted networks.

Submitted by:	John Reimer <john.reimer AT emc.com> (earlier version)
Discussed some with:	emaste, markj
Relnotes:	sure
Differential Revision:	https://reviews.freebsd.org/D21568
2019-10-17 21:33:01 +00:00
Mark Johnston
092bacb2c4 Clean up some nits in link_elf_(un)load_file().
- Remove a redundant assignment of ef->address.
- Don't return a Mach error number to the caller if vm_map_find() fails.
- Use ptoa() and fix style.

MFC after:	2 weeks
Sponsored by:	Netflix
2019-10-17 21:25:50 +00:00
Conrad Meyer
addccb8c51 Add a very limited DDB dumpon(8)-alike to MI dumper code
This allows ddb(4) commands to construct a static dumperinfo during
panic/debug and invoke doadump(false) using the provided dumper
configuration (always inserted first in the list).

The intended usecase is a ddb(4)-time netdump(4) command.

Reviewed by:	markj (earlier version)
Differential Revision:	https://reviews.freebsd.org/D21448
2019-10-17 18:29:44 +00:00
Conrad Meyer
7790c8c199 Split out a more generic debugnet(4) from netdump(4)
Debugnet is a simplistic and specialized panic- or debug-time reliable
datagram transport.  It can drive a single connection at a time and is
currently unidirectional (debug/panic machine transmit to remote server
only).

It is mostly a verbatim code lift from netdump(4).  Netdump(4) remains
the only consumer (until the rest of this patch series lands).

The INET-specific logic has been extracted somewhat more thoroughly than
previously in netdump(4), into debugnet_inet.c.  UDP-layer logic and up, as
much as possible as is protocol-independent, remains in debugnet.c.  The
separation is not perfect and future improvement is welcome.  Supporting
INET6 is a long-term goal.

Much of the diff is "gratuitous" renaming from 'netdump_' or 'nd_' to
'debugnet_' or 'dn_' -- sorry.  I thought keeping the netdump name on the
generic module would be more confusing than the refactoring.

The only functional change here is the mbuf allocation / tracking.  Instead
of initiating solely on netdump-configured interface(s) at dumpon(8)
configuration time, we watch for any debugnet-enabled NIC for link
activation and query it for mbuf parameters at that time.  If they exceed
the existing high-water mark allocation, we re-allocate and track the new
high-water mark.  Otherwise, we leave the pre-panic mbuf allocation alone.
In a future patch in this series, this will allow initiating netdump from
panic ddb(4) without pre-panic configuration.

No other functional change intended.

Reviewed by:	markj (earlier version)
Some discussion with:	emaste, jhb
Objection from:	marius
Differential Revision:	https://reviews.freebsd.org/D21421
2019-10-17 16:23:03 +00:00
Andriy Gapon
5fdc2c044e provide a way to assign taskqueue threads to a kernel process
This can be used to group all threads belonging to a single logical
entity under a common kernel process.
I am planning to use the new interface for ZFS threads.

MFC after:	4 weeks
2019-10-17 06:32:34 +00:00
Mark Johnston
6d775f0ba1 Use KOBJMETHOD_END in the kernel linker.
MFC after:	1 week
2019-10-16 22:06:19 +00:00
Mark Johnston
01cef4caa7 Remove page locking from pmap_mincore().
After r352110 the page lock no longer protects a page's identity, so
there is no purpose in locking the page in pmap_mincore().  Instead,
if vm.mincore_mapped is set to the non-default value of 0, re-lookup
the page after acquiring its object lock, which holds the page's
identity stable.

The change removes the last callers of vm_page_pa_tryrelock(), so
remove it.

Reviewed by:	kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21823
2019-10-16 22:03:27 +00:00
Andrew Turner
9bb37c03fb Stop leaking information from the kernel through timespec
The timespec struct holds a seconds value in a time_t and a nanoseconds
value in a long. On most architectures these are the same size, however
on 32-bit architectures other than i386 time_t is 8 bytes and long is
4 bytes.

Most ABIs will then pad a struct holding an 8 byte and 4 byte value to
16 bytes with 4 bytes of padding. When copying one of these structs the
compiler is free to copy the padding if it wishes.

In this case the padding may contain kernel data that is then leaked to
userspace. Fix this by copying the timespec elements rather than the
entire struct.

This doesn't affect Tier-1 architectures so no SA is expected.

admbugs:	651
MFC after:	1 week
Sponsored by:	DARPA, AFRL
2019-10-16 13:21:01 +00:00
Kristof Provost
1d95443818 Generalize ARM specific comments in devmap
The comments in devmap are very ARM specific, this generalizes them for other
architectures.

Submitted by:	Nicholas O'Brien <nickisobrien_gmail.com>
Reviewed by:	manu, philip
Sponsored by:	Axiado
Differential Revision:	https://reviews.freebsd.org/D22035
2019-10-15 23:21:52 +00:00
Gleb Smirnoff
4b25d1f2e3 Missing from r353596. 2019-10-15 21:32:38 +00:00
Gleb Smirnoff
bac060388f When assertion for a thread not being in an epoch fails also print all
entered epochs. Works with EPOCH_TRACE only.

Reviewed by:	hselasky
Differential Revision:	https://reviews.freebsd.org/D22017
2019-10-15 21:24:25 +00:00
Gleb Smirnoff
237c1f932b Remove pfctlinput2(). It came from KAME and had never ever been in use. 2019-10-15 15:40:03 +00:00
Jeff Roberson
0012f373e4 (4/6) Protect page valid with the busy lock.
Atomics are used for page busy and valid state when the shared busy is
held.  The details of the locking protocol and valid and dirty
synchronization are in the updated vm_page.h comments.

Reviewed by:    kib, markj
Tested by:      pho
Sponsored by:   Netflix, Intel
Differential Revision:        https://reviews.freebsd.org/D21594
2019-10-15 03:45:41 +00:00
Jeff Roberson
63e9755548 (1/6) Replace busy checks with acquires where it is trival to do so.
This is the first in a series of patches that promotes the page busy field
to a first class lock that no longer requires the object lock for
consistency.

Reviewed by:	kib, markj
Tested by:	pho
Sponsored by:	Netflix, Intel
Differential Revision:	https://reviews.freebsd.org/D21548
2019-10-15 03:35:11 +00:00
Leandro Lupori
0ecc478b74 [PPC64] Initial kernel minidump implementation
Based on POWER9BSD implementation, with all POWER9 specific code removed and
addition of new methods in PPC64 MMU interface, to isolate platform specific
code. Currently, the new methods are implemented on pseries and PowerNV
(D21643).

Reviewed by:	jhibbits
Differential Revision:	https://reviews.freebsd.org/D21551
2019-10-14 13:04:04 +00:00
Gleb Smirnoff
f6eccf96a0 Since EPOCH_TRACE had been moved to opt_global.h, we don't need to waste
extra space in struct thread.
2019-10-14 04:17:56 +00:00
Mateusz Guzik
d1cbf3eeea vfs: add MNTK_NOMSYNC
On many filesystems the traversal is effectively a no-op. Add a way to avoid
the overhead.

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22009
2019-10-13 15:40:34 +00:00
Mateusz Guzik
737241cd51 vfs: return free vnode batches in sync instead of vfs_msync
It is a more natural fit. vfs_msync only deals with active vnodes.

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22008
2019-10-13 15:39:11 +00:00
Alexander Motin
a89a562b60 Allocate device softc from the device domain.
Since we are trying to bind device interrupt threads to the device domain,
it should have sense to make memory often accessed by them local. If domain
is not known, fall back to round-robin.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2019-10-12 19:03:07 +00:00
Kristof Provost
85d1151f96 mountroot: run statfs after mounting devfs
The usual flow for mounting a file system is to VFS_MOUNT() and then
immediately VFS_STATFS().

That's not done in vfs_mountroot_devfs(), which means the
mp->mnt_stat.f_iosize field is not correctly populated, which in turn
causes us to mark valid aio operations as unsafe (because the io size is
set to 0), ultimately causing the aio_test:md_waitcomplete test to fail.

Reviewed by:	mckusick
MFC after:	1 week
Sponsored by:	Axiado
Differential Revision:	https://reviews.freebsd.org/D21897
2019-10-11 17:04:38 +00:00
Conrad Meyer
46d70077be ddb: Add CSV option, sorting to 'show (malloc|uma)'
Add /i option for machine-parseable CSV output.  This allows ready copy/
pasting into more sophisticated tooling outside of DDB.

Add total zone size ("Memory Use") as a new column for UMA.

For both, sort the displayed list on size (print the largest zones/types
first).  This is handy for quickly diagnosing "where has my memory gone?" at
a high level.

Submitted by:	Emily Pettigrew <Emily.Pettigrew AT isilon.com> (earlier version)
Sponsored by:	Dell EMC Isilon
2019-10-11 01:31:31 +00:00
John Baldwin
97ecf6efa0 Don't free the cursor boundary tag during vmem_destroy().
The cursor boundary tag is statically allocated in the vmem instead of
from the vmem_bt_zone.  Explicitly remove it from the vmem's segment
list in vmem_destroy before freeing all the segments from the vmem.

Reviewed by:	markj
MFC after:	1 week
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D21953
2019-10-09 21:20:39 +00:00
Gleb Smirnoff
975b8f8462 Cleanup unneeded includes that crept in with r353292. 2019-10-09 16:59:42 +00:00
Gleb Smirnoff
ff3cfc330e Enter network epoch in domain callouts. 2019-10-09 16:21:05 +00:00
Mark Johnston
4013d72684 Fix handling of empty SCM_RIGHTS messages.
As unp_internalize() processes the input control messages, it builds
an output mbuf chain containing the internalized representations of
those messages.  In one special case, that of an empty SCM_RIGHTS
message, the message is simply discarded.  However, the loop which
appends mbufs to the output chain assumed that each iteration would
produce an mbuf, resulting in a null pointer dereference if an empty
SCM_RIGHTS message was followed by a non-empty message.

Fix this by advancing the output mbuf chain tail pointer only if an
internalized control message was produced.

Reported by:	syzbot+1b5cced0f7fad26ae382@syzkaller.appspotmail.com
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2019-10-08 23:34:48 +00:00
John Baldwin
9e14430d46 Add a TOE KTLS mode and a TOE hook for allocating TLS sessions.
This adds the glue to allocate TLS sessions and invokes it from
the TLS enable socket option handler.  This also adds some counters
for active TOE sessions.

The TOE KTLS mode is returned by getsockopt(TLSTX_TLS_MODE) when
TOE KTLS is in use on a socket, but cannot be set via setsockopt().

To simplify various checks, a TLS session now includes an explicit
'mode' member set to the value returned by TLSTX_TLS_MODE.  Various
places that used to check 'sw_encrypt' against NULL to determine
software vs ifnet (NIC) TLS now check 'mode' instead.

Reviewed by:	np, gallatin
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D21891
2019-10-08 21:34:06 +00:00
Doug Moore
2288078c5e Define macro VM_MAP_ENTRY_FOREACH for enumerating the entries in a vm_map.
In case the implementation ever changes from using a chain of next pointers,
then changing the macro definition will be necessary, but changing all the
files that iterate over vm_map entries will not.

Drop a counter in vm_object.c that would have an effect only if the
vm_map entry count was wrong.

Discussed with: alc
Reviewed by: markj
Tested by: pho (earlier version)
Differential Revision:	https://reviews.freebsd.org/D21882
2019-10-08 07:14:21 +00:00
Gleb Smirnoff
b8a6e03fac Widen NET_EPOCH coverage.
When epoch(9) was introduced to network stack, it was basically
dropped in place of existing locking, which was mutexes and
rwlocks. For the sake of performance mutex covered areas were
as small as possible, so became epoch covered areas.

However, epoch doesn't introduce any contention, it just delays
memory reclaim. So, there is no point to minimise epoch covered
areas in sense of performance. Meanwhile entering/exiting epoch
also has non-zero CPU usage, so doing this less often is a win.

Not the least is also code maintainability. In the new paradigm
we can assume that at any stage of processing a packet, we are
inside network epoch. This makes coding both input and output
path way easier.

On output path we already enter epoch quite early - in the
ip_output(), in the ip6_output().

This patch does the same for the input path. All ISR processing,
network related callouts, other ways of packet injection to the
network stack shall be performed in net_epoch. Any leaf function
that walks network configuration now asserts epoch.

Tricky part is configuration code paths - ioctls, sysctls. They
also call into leaf functions, so some need to be changed.

This patch would introduce more epoch recursions (see EPOCH_TRACE)
than we had before. They will be cleaned up separately, as several
of them aren't trivial. Note, that unlike a lock recursion the
epoch recursion is safe and just wastes a bit of resources.

Reviewed by:	gallatin, hselasky, cy, adrian, kristof
Differential Revision:	https://reviews.freebsd.org/D19111
2019-10-07 22:40:05 +00:00
Edward Tomasz Napierala
1a13f2e6b4 Introduce stats(3), a flexible statistics gathering API.
This provides a framework to define a template describing
a set of "variables of interest" and the intended way for
the framework to maintain them (for example the maximum, sum,
t-digest, or a combination thereof).  Afterwards the user
code feeds in the raw data, and the framework maintains
these variables inside a user-provided, opaque stats blobs.
The framework also provides a way to selectively extract the
stats from the blobs.  The stats(3) framework can be used in
both userspace and the kernel.

See the stats(3) manual page for details.

This will be used by the upcoming TCP statistics gathering code,
https://reviews.freebsd.org/D20655.

The stats(3) framework is disabled by default for now, except
in the NOTES kernel (for QA); it is expected to be enabled
in amd64 GENERIC after a cool down period.

Reviewed by:	sef (earlier version)
Obtained from:	Netflix
Relnotes:	yes
Sponsored by:	Klara Inc, Netflix
Differential Revision:	https://reviews.freebsd.org/D20477
2019-10-07 19:05:05 +00:00
Mateusz Guzik
dc20b834ca vfs: add optional root vnode caching
Root vnodes looekd up all the time, e.g. when crossing a mount point.
Currently used routines always perform a costly lookup which can be
trivially avoided.

Reviewed by:	jeff (previous version), kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21646
2019-10-06 22:14:32 +00:00
Kyle Evans
e3f35d562f Remove the remnants of SI_CHEAPCLONE
SI_CHEAPCLONE was introduced in r66067 for use with cloned bpfs. It was
later also used in tty, tun, tap at points. The rough timeline for being
removed in each of these is as follows:

- r181690: bpf switched to use cdevpriv API by ed@
- r181905: ed@ rewrote the TTY later to be mpsafe
- r204464: kib@ removes it from tun/tap, declaring it unused

I've not yet been able to dig up any other consumers in the intervening 9
years. It is no longer set on any devices in the tree and leaves an
interesting situation in make_dev_sv where we're ok with the device already
being set SI_NAMED.
2019-10-05 21:52:06 +00:00
Kyle Evans
d42fecb5c1 kern_conf: fully initialize cloned devices with make_dev_args, too
Attempting to initialize si_drv{1,2} with mda_si_drv{1,2} does not work if
you are operating on cloned devices.

clone_create must be called prior to the make_dev* family to create/return
the device on the clonelist as needed. This device is later returned early
in newdev(), prior to si_drv{0,1,2} initialization.

This patch simply breaks out of the loop if we've found a device and
finishes init.

Reviewed by:	kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D21904
2019-10-05 21:44:18 +00:00
Mateusz Guzik
dfa8dae493 devfs: plug redundant bwillwrite avoidance
vn_write already checks for vnode type to see if bwillwrite should be called.

This effectively reverts r244643.

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21905
2019-10-05 17:44:33 +00:00
Eric van Gyzen
e61e783b83 Add CTLFLAG_STATS to some vfs sysctl OIDs
Add CTLFLAG_STATS to the following OIDs:

vfs.altbufferflushes
vfs.recursiveflushes
vfs.barrierwrites
vfs.flushwithdeps
vfs.reassignbufcalls

Refer to r353111.

MFC after:	2 weeks
Sponsored by:	Dell EMC Isilon
2019-10-04 21:43:43 +00:00
Ed Maste
f91dd6091b simplify path handling in sysctl_try_reclaim_vnode
MAXPATHLEN / PATH_MAX includes space for the terminating NUL, and namei
verifies the presence of the NUL.  Thus there is no need to increase the
buffer size here.

The sysctl passes the string excluding the NUL, so req->newlen equal to
PATH_MAX is too long.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21876
2019-10-02 21:01:23 +00:00
Mark Johnston
5131cba6d6 Use OBJT_PHYS VM objects for kernel modules.
OBJT_DEFAULT incurs some unnecessary overhead given that kernel module
pages cannot be paged out.

Reviewed by:	alc, kib
MFC after:	1 week
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21862
2019-10-02 16:34:42 +00:00
Mark Johnston
4a7b33ecf4 Disallow fcntl(F_READAHEAD) when the vnode is not a regular file.
The mountpoint may not have defined an iosize parameter, so an attempt
to configure readahead on a device file can lead to a divide-by-zero
crash.

The sequential heuristic is not applied to I/O to or from device files,
and posix_fadvise(2) returns an error when v_type != VREG, so perform
the same check here.

Reported by:	syzbot+e4b682208761aa5bc53a@syzkaller.appspotmail.com
Reviewed by:	kib
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21864
2019-10-02 15:45:49 +00:00
Kyle Evans
5a391b572b shm_open2(2): completely unbreak
kern_shm_open2(), since conception, completely fails to pass the mode along
to kern_shm_open(). This breaks most uses of it.

Add tests alongside this that actually check the mode of the returned
files.

PR:		240934 [pulseaudio breakage]
Reported by:	ler, Andrew Gierth [postgres breakage]
Diagnosed by:	Andrew Gierth (great catch)
Tested by:	ler, tmunro
Pointy hat to:	kevans
2019-10-02 02:37:34 +00:00
Ed Maste
f403831e6c sysalls.master: remove superfluous ellipsis in comment
A single period is sufficient in this comment, and making this change
lets us find references to varargs syscalls by searching for ...
2019-10-01 17:05:21 +00:00
Brooks Davis
3a94552174 Restore the ability to set capenabled directly in syscalls.conf.
This fixes generation of cloudabi syscall tables broken in r340424.

Reviewed by:	kevans, emaste
MFC after:	3 days
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D21821
2019-09-30 20:58:29 +00:00
Kyle Evans
11fd6a60e7 syscalls.master: consistency, move ); to newline (no functional change) 2019-09-30 13:26:16 +00:00
Mark Johnston
1aa696babc Fix some problems with the SPARSE_MAPPING option in the kernel linker.
- Ensure that the end of the mapping passed to vm_page_wire() is
  page-aligned.  vm_page_wire() expects this.
- Wire pages before reading data into them.
- Apply protections specified in the segment descriptor using
  vm_map_protect() once relocation processing is done.
- On amd64, ensure that we load KLDs above KERNBASE, since they
  are compiled with the "kernel" memory model by default.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21756
2019-09-28 01:42:59 +00:00
Andrew Gallatin
b2dba6634b kTLS: Fix a bug where we would not encrypt anon data inplace.
Software Kernel TLS needs to allocate a new destination crypto
buffer when encrypting data from the page cache, so as to avoid
overwriting shared clear-text file data with encrypted data
specific to a single socket. When the data is anonymous, eg, not
tied to a file, then we can encrypt in place and avoid allocating
a new page. This fixes a bug where the existing code always
assumes the data is private, and never encrypts in place. This
results in unneeded page allocations and potentially more memory
bandwidth consumption when doing socket writes.

When the code was written at Netflix, ktls_encrypt() looked at
private sendfile flags to determine if the pages being encrypted
where part of the page cache (coming from sendfile) or
anonymous (coming from sosend). This was broken internally at
Netflix when the sendfile flags were made private, and the
M_WRITABLE() check was added. Unfortunately, M_WRITABLE() will
always be false for M_NOMAP mbufs, since one cannot just mtod()
them.

This change introduces a new flags field to the mbuf_ext_pgs
struct by stealing a byte from the tls hdr. Note that the current
header is still 2 bytes larger than the largest header we
support: AES-CBC with explicit IV. We set MBUF_PEXT_FLAG_ANON
when creating an unmapped mbuf in m_uiotombuf_nomap() (which is
the path that socket writes take), and we check for that flag in
ktls_encrypt() when looking for anon pages.

Reviewed by:	jhb
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21796
2019-09-27 20:08:19 +00:00
Andrew Gallatin
6554362c66 kTLS support for TLS 1.3
TLS 1.3 requires a few changes because 1.3 pretends to be 1.2
with a record type of application data. The "real" record type is
then included at the end of the user-supplied plaintext
data. This required adding a field to the mbuf_ext_pgs struct to
save the record type, and passing the real record type to the
sw_encrypt() ktls backend functions.

Reviewed by:	jhb, hselasky
Sponsored by:	Netflix
Differential Revision:	D21801
2019-09-27 19:17:40 +00:00
Mateusz Guzik
708cf7eb6c cache: decrease ncnegfactor to 5
The current mechanism is bogus in several ways:
- the limit is a percentage of total entries added, which means negative
entries get evicted all the time even if there are plenty of resources
- evicting code is almost not concurrent, which makes it unable to
remove entries fast enough when doing something as simple as -j 104
buildworld
- there is no support for performing mass removal if necessary

Vast majority of negative entries never get any hits. Only evicting
them when the filesystem demands it results in a significant growth of
the namecache with almost no improvement in the hit ratio.

Sample result about afer 90 minutes of poudriere -j 104:

           current    no evict   % of the original
numneg     219737     2013157    916
numneghits 266711906  263544562  98 [1]

[1] this may look funny but there is a certain dose of variation to the
build

The number was chosen as something which mostly eliminates spurious
evictions during lighter workloads but still keeps the total at bay.

Sponsored by:	The FreeBSD Foundation
2019-09-27 19:14:03 +00:00
Mateusz Guzik
e643141838 cache: stop requeuing negative entries on the hot list
Turns out it does not improve hit ratio, but it does come with a cost
induces stemming from dirtying hit entries.

Sample result: hit counts of evicted entries after 2 buildworlds

before:

           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@                180865
               1 |@@@@@@@                                  49150
               2 |@@@                                      19067
               4 |@                                        9825
               8 |@                                        7340
              16 |@                                        5952
              32 |@                                        5243
              64 |@                                        4446
             128 |                                         3556
             256 |                                         3035
             512 |                                         1705
            1024 |                                         1078
            2048 |                                         365
            4096 |                                         95
            8192 |                                         34
           16384 |                                         26
           32768 |                                         23
           65536 |                                         8
          131072 |                                         6
          262144 |                                         0

after:
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@                184004
               1 |@@@@@@                                   47577
               2 |@@@                                      19446
               4 |@                                        10093
               8 |@                                        7470
              16 |@                                        5544
              32 |@                                        5475
              64 |@                                        5011
             128 |                                         3451
             256 |                                         3002
             512 |                                         1729
            1024 |                                         1086
            2048 |                                         363
            4096 |                                         86
            8192 |                                         26
           16384 |                                         25
           32768 |                                         24
           65536 |                                         7
          131072 |                                         5
          262144 |                                         0

Sponsored by:	The FreeBSD Foundation
2019-09-27 19:13:22 +00:00
Mateusz Guzik
312196df0f cache: make negative list shrinking a little bit concurrent
Continue protecting demotion from the hotlist and selection of the
target list with the ncneg_shrink_lock lock, but drop it before
relocking to zap the node.

While here count how many times we skipped shrinking due to the lock
being already taken.

Sponsored by:	The FreeBSD Foundation
2019-09-27 19:12:43 +00:00
Mateusz Guzik
95c6dd890a cache: stop recalculating upper limit each time a new entry is added
Sponsored by:	The FreeBSD Foundation
2019-09-27 19:12:20 +00:00
Konstantin Belousov
df08823d07 Improve MD page fault handlers.
Centralize calculation of signal and ucode delivered on unhandled page
fault in new function vm_fault_trap().  MD trap_pfault() now almost
always uses the signal numbers and error codes calculated in
consistent MI way.

This introduces the protection fault compatibility sysctls to all
non-x86 architectures which did not have that bug, but apparently they
were already much more wrong in selecting delivered signals on
protection violations.

Change the delivered signal for accesses to mapped area after the
backing object was truncated.  According to POSIX description for
mmap(2):
   The system shall always zero-fill any partial page at the end of an
   object. Further, the system shall never write out any modified
   portions of the last page of an object which are beyond its
   end. References within the address range starting at pa and
   continuing for len bytes to whole pages following the end of an
   object shall result in delivery of a SIGBUS signal.

   An implementation may generate SIGBUS signals when a reference
   would cause an error in the mapped object, such as out-of-space
   condition.
Adjust according to the description, keeping the existing
compatibility code for SIGSEGV/SIGBUS on protection failures.

For situations where kernel cannot handle page fault due to resource
limit enforcement, SIGBUS with a new error code BUS_OBJERR is
delivered.  Also, provide a new error code SEGV_PKUERR for SIGSEGV on
amd64 due to protection key access violation.

vm_fault_hold() is renamed to vm_fault().  Fixed some nits in
trap_pfault()s like mis-interpreting Mach errors as errnos.  Removed
unneeded truncations of the fault addresses reported by hardware.

PR:	211924
Reviewed by:	alc
Discussed with:	jilles, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D21566
2019-09-27 18:43:36 +00:00
Andrew Turner
50bb04b750 Check the vfs option length is valid before accessing through
When a VFS option passed to nmount is present but NULL the kernel will
place an empty option in its internal list. This will have a NULL
pointer and a length of 0. When we come to read one of these the kernel
will try to load from the last address of virtual memory. This is
normally invalid so will fault resulting in a kernel panic.

Fix this by checking if the length is valid before dereferencing.

MFC after:	3 days
Sponsored by:	DARPA, AFRL
2019-09-27 16:22:28 +00:00
David Bright
c4571256af sysent: regenerate after r352747.
Sponsored by:	Dell EMC Isilon
2019-09-26 15:41:10 +00:00
Mark Johnston
55248d32f2 Fix handling of invalid pages in exec_map_first_page().
exec_map_first_page() would unconditionally free an unbacked, invalid
page from the executable image.  However, it is possible that the page
is wired, in which case it is incorrect to free the page, so check for
additional wirings first.

Reported by:	syzkaller
Tested by:	pho
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21767
2019-09-26 15:35:35 +00:00
David Bright
9afb12bab4 Add an shm_rename syscall
Add an atomic shm rename operation, similar in spirit to a file
rename. Atomically unlink an shm from a source path and link it to a
destination path. If an existing shm is linked at the destination
path, unlink it as part of the same atomic operation. The caller needs
the same permissions as shm_unlink to the shm being renamed, and the
same permissions for the shm at the destination which is being
unlinked, if it exists. If those fail, EACCES is returned, as with the
other shm_* syscalls.

truss support is included; audit support will come later.

This commit includes only the implementation; the sysent-generated
bits will come in a follow-on commit.

Submitted by:	Matthew Bryan <matthew.bryan@isilon.com>
Reviewed by:	jilles (earlier revision)
Reviewed by:	brueffer (manpages, earlier revision)
Relnotes:	yes
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D21423
2019-09-26 15:32:28 +00:00
Toomas Soome
11fc80a098 kernel terminal should initialize fg and bg variables before calling TUNABLE_INT_FETCH
We have two ways to check if kenv variable exists - either we check return
value from TUNABLE_INT_FETCH, or we pre-initialize the variable and check
if this value did change. In terminal_init() it is more convinient to
use pre-initialized variables.

Problem was revealed by older loader.efi, which did not set teken.* variables.

Reported by:	tuexen
2019-09-26 07:19:26 +00:00
Alexander Motin
176dd236dc Microoptimize sched_pickcpu() CPU affinity on SMT.
Use of CPU_FFS() to implement CPUSET_FOREACH() allows to save up to ~0.5%
of CPU time on 72-thread SMT system doing 80K IOPS to NVMe from one thread.

MFC after:	1 month
Sponsored by:	iXsystems, Inc.
2019-09-26 00:35:06 +00:00
Alexander Motin
c55dc51c37 Microoptimize sched_pickcpu() after r352658.
I've noticed that I missed intr check at one more SCHED_AFFINITY(),
so instead of adding one more branching I prefer to remove few.

Profiler shows the function CPU time reduction from 0.24% to 0.16%.

MFC after:	1 month
Sponsored by:	iXsystems, Inc.
2019-09-25 19:29:09 +00:00
Kyle Evans
079c5b9ed8 rfork(2): add RFSPAWN flag
When RFSPAWN is passed, rfork exhibits vfork(2) semantics but also resets
signal handlers in the child during creation to avoid a point of corruption
of parent state from the child.

This flag will be used by posix_spawn(3) to handle potential signal issues.

Reviewed by:	jilles, kib
Differential Revision:	https://reviews.freebsd.org/D19058
2019-09-25 19:20:41 +00:00
Gleb Smirnoff
dd902d015a Add debugging facility EPOCH_TRACE that checks that epochs entered are
properly nested and warns about recursive entrances.  Unlike with locks,
there is nothing fundamentally wrong with such use, the intent of tracer
is to help to review complex epoch-protected code paths, and we mean the
network stack here.

Reviewed by:	hselasky
Sponsored by:	Netflix
Pull Request:	https://reviews.freebsd.org/D21610
2019-09-25 18:26:31 +00:00
Kyle Evans
a9ac5e1424 sysent: regenerate after r352705
This also implements it, fixes kdump, and removes no longer needed bits from
lib/libc/sys/shm_open.c for the interim.
2019-09-25 18:09:19 +00:00
Kyle Evans
234879a7e3 Mark shm_open(2) as COMPAT12, succeeded by shm_open2
Implementation and regenerated files will follow.
2019-09-25 18:06:48 +00:00
Kyle Evans
460211e730 sysent: regenerate after r352700 2019-09-25 17:59:58 +00:00
Kyle Evans
20f7057685 Add a shm_open2 syscall to support upcoming memfd_create
shm_open2 allows a little more flexibility than the original shm_open.
shm_open2 doesn't enforce CLOEXEC on its callers, and it has a separate
shmflag argument that can be expanded later. Currently the only shmflag is
to allow file sealing on the returned fd.

shm_open and memfd_create will both be implemented in libc to use this new
syscall.

__FreeBSD_version is bumped to indicate the presence.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D21393
2019-09-25 17:59:15 +00:00
Kyle Evans
0cd95859c8 [2/3] Add an initial seal argument to kern_shm_open()
Now that flags may be set on posixshm, add an argument to kern_shm_open()
for the initial seals. To maintain past behavior where callers of
shm_open(2) are guaranteed to not have any seals applied to the fd they're
given, apply F_SEAL_SEAL for existing callers of kern_shm_open. A special
flag could be opened later for shm_open(2) to indicate that sealing should
be allowed.

We currently restrict initial seals to F_SEAL_SEAL. We cannot error out if
F_SEAL_SEAL is re-applied, as this would easily break shm_open() twice to a
shmfd that already existed. A note's been added about the assumptions we've
made here as a hint towards anyone wanting to allow other seals to be
applied at creation.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D21392
2019-09-25 17:35:03 +00:00
Kyle Evans
af755d3e48 [1/3] Add mostly Linux-compatible file sealing support
File sealing applies protections against certain actions
(currently: write, growth, shrink) at the inode level. New fileops are added
to accommodate seals - EINVAL is returned by fcntl(2) if they are not
implemented.

Reviewed by:	markj, kib
Differential Revision:	https://reviews.freebsd.org/D21391
2019-09-25 17:32:43 +00:00
Kyle Evans
85c5f3cb57 Add COMPAT12 support to makesyscalls.sh
Reviewed by:	kib, imp, brooks (all without syscalls.master edits)
Differential Revision:	https://reviews.freebsd.org/D21366
2019-09-25 17:29:45 +00:00
Toomas Soome
3001e0c942 kernel: terminal_init() should check for teken colors from kenv
Check for teken.fg_color and teken.bg_color and prepare the color
attributes accordingly.

When white background is used, make it light to improve visibility.
When black background is used, make kernel messages light.
2019-09-25 13:21:07 +00:00
Alexander Motin
bb3dfc6ae9 Fix wrong assertion in r352658.
MFC after:	1 month
2019-09-25 11:58:54 +00:00
Alexander Motin
c9205e3500 Fix/improve interrupt threads scheduling.
Doing some tests with very high interrupt rates I've noticed that one of
conditions I added in r232207 to make interrupt threads in most cases
run on local CPU never worked as expected (worked only if previous time
it was executed on some other CPU, that is quite opposite).  It caused
additional CPU usage to run full CPU search and could schedule interrupt
threads to some other CPU.

This patch removes that code and instead reuses existing non-interrupt
code path with some tweaks for interrupt case:
 - On SMT systems, if current thread is idle, don't look on other threads.
Even if they are busy, it may take more time to do fill search and bounce
the interrupt thread to other core then execute it locally, even sharing
CPU resources.  It is other threads should migrate, not bound interrupts.
 - Try hard to keep interrupt threads within LLC of their original CPU.
This improves scheduling cost and supposedly cache and memory locality.

On a test system with 72 threads doing 2.2M IOPS to NVMe this saves few
percents of CPU time while adding few percents to IOPS.

MFC after:	1 month
Sponsored by:	iXsystems, Inc.
2019-09-24 20:01:20 +00:00
Randall Stewart
35c7bb3407 This commit adds BBR (Bottleneck Bandwidth and RTT) congestion control. This
is a completely separate TCP stack (tcp_bbr.ko) that will be built only if
you add the make options WITH_EXTRA_TCP_STACKS=1 and also include the option
TCPHPTS. You can also include the RATELIMIT option if you have a NIC interface that
supports hardware pacing, BBR understands how to use such a feature.

Note that this commit also adds in a general purpose time-filter which
allows you to have a min-filter or max-filter. A filter allows you to
have a low (or high) value for some period of time and degrade slowly
to another value has time passes. You can find out the details of
BBR by looking at the original paper at:

https://queue.acm.org/detail.cfm?id=3022184

or consult many other web resources you can find on the web
referenced by "BBR congestion control". It should be noted that
BBRv1 (which this is) does tend to unfairness in cases of small
buffered paths, and it will usually get less bandwidth in the case
of large BDP paths(when competing with new-reno or cubic flows). BBR
is still an active research area and we do plan on  implementing V2
of BBR to see if it is an improvement over V1.

Sponsored by:	Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D21582
2019-09-24 18:18:11 +00:00
Mateusz Guzik
93a85508ad cache: tidy up handling of negative entries
- track the total count of hot entries
- pre-read the lock when shrinking since it is typically already taken
- place the lock in its own cacheline
- shorten the hold time of hot lock list when zapping

Sponsored by:	The FreeBSD Foundation
2019-09-23 20:50:04 +00:00
Mark Johnston
38dae42c26 Use elf_relocaddr() when handling R_X86_64_RELATIVE relocations.
This is required for DPCPU and VNET data variable definitions to work when
KLDs are linked as DSOs.  R_X86_64_RELATIVE relocations should not appear
in object files, so assert this in elf_relocaddr().

Reviewed by:	kib
MFC after:	1 month
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21755
2019-09-23 14:14:43 +00:00
Mateusz Guzik
afe257e3ca cache: count evictions of negatve entries
Sponsored by:	The FreeBSD Foundation
2019-09-23 08:53:14 +00:00
Sean Eric Fagan
ba7a55d934 Add two options to allow mount to avoid covering up existing mount points.
The two options are

* nocover/cover:  Prevent/allow mounting over an existing root mountpoint.
E.g., "mount -t ufs -o nocover /dev/sd1a /usr/local" will fail if /usr/local
is already a mountpoint.
* emptydir/noemptydir:  Prevent/allow mounting on a non-empty directory.
E.g., "mount -t ufs -o emptydir /dev/sd1a /usr" will fail.

Neither of these options is intended to be a default, for historical and
compatibility reasons.

Reviewed by:	allanjude, kib
Differential Revision:	https://reviews.freebsd.org/D21458
2019-09-23 04:28:07 +00:00
Mateusz Guzik
7505cffa56 cache: try to avoid vhold if locks held
Sponsored by:	The FreeBSD Foundation
2019-09-22 20:50:24 +00:00
Mateusz Guzik
cd2112c305 cache: jump in negative success instead of positive
Sponsored by:	The FreeBSD Foundation
2019-09-22 20:49:17 +00:00
Mateusz Guzik
d2be3ef05c lockprof: move per-cpu data to dpcpu
Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21747
2019-09-22 20:44:24 +00:00
Konstantin Belousov
f33533da8c kern.elf{32,64}.pie_base sysctl: enforce page alignment.
Requested by:	rstone
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-09-21 20:03:17 +00:00
Mateusz Guzik
cbba2cb367 lockprof: use CPUFOREACH and drop always false lp_cpu NULL checks
Sponsored by:	The FreeBSD Foundation
2019-09-21 19:05:38 +00:00
Konstantin Belousov
95aafd6900 Make non-ASLR pie base tunable.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-09-21 18:00:23 +00:00
Alexander Motin
36d151a237 Allocate callout wheel from the respective memory domain.
MFC after:	1 week
2019-09-21 15:38:08 +00:00
Andrew Gallatin
61b8a4af71 remove redundant "ktls" in KTLS thr name
This reducesthe string width of the ktls thread name
and improves "ps" output.

Glanced at by: jhb
Event: EuroBSDCon hackathon
Sponsored by:	Netflix
2019-09-20 09:36:07 +00:00
Mateusz Guzik
b488246b45 vfs: group fields used for per-cpu ops in one cacheline
Sponsored by:	The FreeBSD Foundation
2019-09-19 21:23:14 +00:00
Konstantin Belousov
382e01c8dc sysctl: use names instead of magic numbers.
Replace magic numbers with symbols for internal sysctl operations.
Convert in-kernel and libc consumers.

Submitted by:	Pawel Biernacki
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D21693
2019-09-18 16:13:10 +00:00
Konstantin Belousov
55894117b1 Return EISDIR when directory is opened with O_CREAT without O_DIRECTORY.
Reviewed by:	bcr (man page), emaste (previous version)
PR:	240452
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
DIfferential revision:	https://reviews.freebsd.org/D21634
2019-09-17 18:32:18 +00:00
Kirk McKusick
100369071d The VFS-level clustering code collects together sequential blocks
by issuing delayed-writes (bdwrite()) until a non-sequential block
is written or the maximum cluster size is reached. At that point
it collects the delayed buffers together (using bread()) to write
them in a single operation. The assumption was that since we just
looked at them they will still be in memory so there is no need to
check for a read error from bread(). Very occationally (apparently
every 10-hours or so when being pounded by Peter Holm's tests)
this assumption is wrong.

The fix is to check for errors from bread() and fail the cluster
write thus falling back to the default individual flushing of any
still dirty buffers.

Reported by: Peter Holm and Chuck Silvers
Reviewed by: kib
MFC after:   3 days
2019-09-17 17:44:50 +00:00
Mateusz Guzik
d245aa1e72 vfs: apply r352437 to the fast path as well
This one is very hard to run into. If the filesystem is being unmounted or
the mount point is freed the vfs_op_thread_enter will fail. For it to
succeed the mount point itself would have to be reallocated in the time
window between the initial read and the attempt to enter.

Sponsored by:	The FreeBSD Foundation
2019-09-17 15:53:40 +00:00
Mateusz Guzik
7f65185940 vfs: fix braino resulting in NULL pointer deref in r352424
The breakage was added after all the testing and the testing which followed
was not sufficient to find it.

Reported by:	pho
Sponsored by:	The FreeBSD Foundation
2019-09-17 08:09:39 +00:00
Mateusz Guzik
4cace859c2 vfs: convert struct mount counters to per-cpu
There are 3 counters modified all the time in this structure - one for
keeping the structure alive, one for preventing unmount and one for
tracking active writers. Exact values of these counters are very rarely
needed, which makes them a prime candidate for conversion to a per-cpu
scheme, resulting in much better performance.

Sample benchmark performing fstatfs (modifying 2 out of 3 counters) on
a 104-way 2 socket Skylake system:
before:   852393 ops/s
after:  76682077 ops/s

Reviewed by:	kib, jeff
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21637
2019-09-16 21:37:47 +00:00
Mateusz Guzik
e87f3f72f1 vfs: manage mnt_writeopcount with atomics
See r352424.

Reviewed by:	kib, jeff
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21575
2019-09-16 21:33:16 +00:00
Mateusz Guzik
ee831b2543 vfs: manage mnt_lockref with atomics
See r352424.

Reviewed by:	kib, jeff
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21574
2019-09-16 21:32:21 +00:00
Mateusz Guzik
a8c8e44bf0 vfs: manage mnt_ref with atomics
New primitive is introduced to denote sections can operate locklessly
on aspects of struct mount, but which can also be disabled if necessary.
This provides an opportunity to start scaling common case modifications
while providing stable state of the struct when facing unmount, write
suspendion or other events.

mnt_ref is the first counter to start being managed in this manner with
the intent to make it per-cpu.

Reviewed by:	kib, jeff
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21425
2019-09-16 21:31:02 +00:00
Kyle Evans
3155f2f0e2 rangelock: add rangelock_cookie_assert
A future change to posixshm to add file sealing (in DIFF_21391[0] and child)
will move locking out of shm_dotruncate as kern_shm_open() will require the
lock to be held across the dotruncate until the seal is actually applied.
For this, the cookie is passed into shm_dotruncate_locked which asserts
RCA_WLOCKED.

[0] Name changed to protect the innocent, hopefully, from getting autoclosed
due to this reference...

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D21628
2019-09-15 02:59:53 +00:00
Mateusz Guzik
ce3ba63f67 vfs: release usecount using fetchadd
1. If we release the last usecount we take ownership of the hold count, which
means the vnode will remain allocated until we vdrop it.
2. If someone else vrefs they will find no usecount and will proceed to add
their own hold count.
3. No code has a problem with v_usecount transitioning to 0 without the
interlock

These facts combined mean we can fetchadd instead of having a cmpset loop.

Reviewed by:	kib (previous version)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21528
2019-09-13 15:49:04 +00:00
Mark Johnston
45cdd437ae Remove a redundant NULL pointer check in cpuset_modify_domain().
cpuset_getroot() is guaranteed to return a non-NULL pointer.

Reported by:	Mark Millard <marklmi@yahoo.com>
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2019-09-12 16:47:38 +00:00
Hans Petter Selasky
11b57401e6 Use REFCOUNT_COUNT() to obtain refcount where appropriate.
Refcount waiting will set some flag bits in the refcount value.
Make sure these bits get cleared by using the REFCOUNT_COUNT()
macro to obtain the actual refcount.

Differential Revision:	https://reviews.freebsd.org/D21620
Reviewed by:	kib@, markj@
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2019-09-12 16:26:59 +00:00
Kyle Evans
5163b1a75c Follow up r352244: kenv: tighten up assertions
As I like to forget: static kenv var formatting is actually such that an
empty environment would be double null bytes. We should make sure that a
non-zero buffer has at least enough for this, though most of the current
usage is with a 4k buffer.
2019-09-12 14:34:46 +00:00
Kyle Evans
436c46875d kenv: assert that an empty static buffer passed in is "empty"
Garbage in the passed-in buffer can cause problems if any attempts to read
the kenv are inadvertently made between init_static_kenv and the first
kern_setenv -- assuming there is one.

This is cheap and easy, so do it. This also helps rule out some class of
bugs as one tries to debug; tunables fetch from the static environment up
until SI_SUB_KMEM + 1, and many of these buffers are global ~4k buffers that
rely on BSS clearing while others just grab a page of free memory and use it
(e.g. xen).
2019-09-12 13:51:43 +00:00
Conrad Meyer
aaa3852435 buf: Add B_INVALONERR flag to discard data
Setting the B_INVALONERR flag before a synchronous write causes the buf
cache to forcibly invalidate contents if the write fails (BIO_ERROR).

This is intended to be used to allow layers above the buffer cache to make
more informed decisions about when discarding dirty buffers without
successful write is acceptable.

As a proof of concept, use in msdosfs to handle failures to mark the on-disk
'dirty' bit during rw mount or ro->rw update.

Extending this to other filesystems is left as future work.

PR:		210316
Reviewed by:	kib (with objections)
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D21539
2019-09-11 21:24:14 +00:00
Mateusz Guzik
b088a4d6f9 cache: avoid excessive relocking on entry removal during lookup
Due to lock ordering issues (bucket lock held, vnode locks wanted) the code
starts with trylocking which in face of contention often fails. Prior to
the change it would loop back with a possible yield.

Instead note we know what locks are needed and can take them in the right
order, avoiding retries. Then we can safely re-lookup and see if the entry
we are looking for is still there.

On a 104-way box poudriere would result in constant retries during an 11h
run as seen in the vfs.cache.zap_and_exit_bucket_fail counter.

before: 408866592
after :         0

However, a new stat reports:
vfs.cache.zap_and_exit_bucket_relock_success: 32638

Note this is only a bandaid over current design issues.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2019-09-10 20:19:29 +00:00
Mateusz Guzik
a6cacb0dca cache: change the formula for calculating lock array sizes
It used to be mp_ncpus * 64, but this gives unnecessarily big values for small
machines and at the same time constraints bigger ones. In particular this helps
on a 104-way box for which the count is now doubled.

While here make cache_purgevfs less likely. Currently it is not efficient in
face of contention due to lock ordering issues. These are fixable but not worth
it at the moment.

Sponsored by:	The FreeBSD Foundation
2019-09-10 20:11:00 +00:00
Mateusz Guzik
1214618c05 cache: assorted cleanups
Sponsored by:	The FreeBSD Foundation
2019-09-10 20:08:24 +00:00
Jeff Roberson
c75757481f Replace redundant code with a few new vm_page_grab facilities:
- VM_ALLOC_NOCREAT will grab without creating a page.
 - vm_page_grab_valid() will grab and page in if necessary.
 - vm_page_busy_acquire() automates some busy acquire loops.

Discussed with:	alc, kib, markj
Tested by:	pho (part of larger branch)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21546
2019-09-10 19:08:01 +00:00
Jeff Roberson
4cdea4a853 Use the sleepq lock rather than the page lock to protect against wakeup
races with page busy state.  The object lock is still used as an interlock
to ensure that the identity stays valid.  Most callers should use
vm_page_sleep_if_busy() to handle the locking particulars.

Reviewed by:	alc, kib, markj
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21255
2019-09-10 18:27:45 +00:00
Mark Johnston
fee2a2fa39 Change synchonization rules for vm_page reference counting.
There are several mechanisms by which a vm_page reference is held,
preventing the page from being freed back to the page allocator.  In
particular, holding the page's object lock is sufficient to prevent the
page from being freed; holding the busy lock or a wiring is sufficent as
well.  These references are protected by the page lock, which must
therefore be acquired for many per-page operations.  This results in
false sharing since the page locks are external to the vm_page
structures themselves and each lock protects multiple structures.

Transition to using an atomically updated per-page reference counter.
The object's reference is counted using a flag bit in the counter.  A
second flag bit is used to atomically block new references via
pmap_extract_and_hold() while removing managed mappings of a page.
Thus, the reference count of a page is guaranteed not to increase if the
page is unbusied, unmapped, and the object's write lock is held.  As
a consequence of this, the page lock no longer protects a page's
identity; operations which move pages between objects are now
synchronized solely by the objects' locks.

The vm_page_wire() and vm_page_unwire() KPIs are changed.  The former
requires that either the object lock or the busy lock is held.  The
latter no longer has a return value and may free the page if it releases
the last reference to that page.  vm_page_unwire_noq() behaves the same
as before; the caller is responsible for checking its return value and
freeing or enqueuing the page as appropriate.  vm_page_wire_mapped() is
introduced for use in pmap_extract_and_hold().  It fails if the page is
concurrently being unmapped, typically triggering a fallback to the
fault handler.  vm_page_wire() no longer requires the page lock and
vm_page_unwire() now internally acquires the page lock when releasing
the last wiring of a page (since the page lock still protects a page's
queue state).  In particular, synchronization details are no longer
leaked into the caller.

The change excises the page lock from several frequently executed code
paths.  In particular, vm_object_terminate() no longer bounces between
page locks as it releases an object's pages, and direct I/O and
sendfile(SF_NOCACHE) completions no longer require the page lock.  In
these latter cases we now get linear scalability in the common scenario
where different threads are operating on different files.

__FreeBSD_version is bumped.  The DRM ports have been updated to
accomodate the KPI changes.

Reviewed by:	jeff (earlier version)
Tested by:	gallatin (earlier version), pho
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20486
2019-09-09 21:32:42 +00:00
Konstantin Belousov
6c46ce7ea3 Initialize timehands linkage much earlier.
Reported and tested by:	trasz
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-09-09 12:42:48 +00:00
Konstantin Belousov
4b23dec4c2 Make timehands count selectable at boottime.
Tested by:	O'Connor, Daniel <darius@dons.net.au>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D21563
2019-09-09 11:29:58 +00:00
Konstantin Belousov
1040254b75 In do_execve(), use shared text vnode lock consistently.
Reviewed by:	markj
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D21560
2019-09-07 16:10:57 +00:00
Konstantin Belousov
1c36b72874 In do_execve(), clear imgp->textset when restarting for interpreter.
Otherwise, we might left the boolean set, which would affect cleanup
after an error on interpreter activation.

Reviewed by:	markj
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D21560
2019-09-07 16:05:17 +00:00
Konstantin Belousov
1073d17eeb When loading ELF interpreter, initialize whole nested image_params with zero.
Otherwise we could mishandle imgp->textset.

Reviewed by:	markj
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D21560
2019-09-07 16:03:26 +00:00
Philip Paeps
bdc786cc7c riscv: restore default HZ=1000, keep QEMU at HZ=100
This reverts r351918 and r351919.

Discussed with:	br, ian, imp
2019-09-07 05:13:31 +00:00
Philip Paeps
7f0851ab19 riscv: default to HZ=100
Most current RISC-V development platforms are not fast enough to benefit
from the increased granularity provided by HZ=1000.

Sponsored by:	Axiado
2019-09-06 01:19:31 +00:00
Conrad Meyer
a6935d085c Remove long-dead BUF_ASSERT_{,UN}HELD assertions
These were fully neutered in r177676 (2008), but not removed at the time for
unclear reasons.  They're totally dead code, so go ahead and yank them now.

No functional change.
2019-09-05 21:43:33 +00:00
Mateusz Guzik
68c3c1abe1 vfs: temporarily revert r351825
There are 2 problems:
- it introduces a funny bug where it can end up trylocking the same vnode [1]
- it exposes a pre-existing softdep deadlock [2]

Both are easier to run into that the bug which got fixed, so revert until
a complete solution is worked out.

Reported by:	cy [1], pho [2]
Sponsored by:	The FreeBSD Foundation
2019-09-05 18:19:51 +00:00
Stephen J. Kiernan
d57cd5ccd3 Bump up the low range of cpuset numbers to account for the kernel cpuset.
Reviewed by:	jeff
Obtained from:	Juniper Networks, Inc.
2019-09-05 17:48:39 +00:00
Mateusz Guzik
c07d4a0a68 vfs: fully hold vnodes in vnlru_free_locked
Currently the code only bumps holdcnt and clears the VI_FREE flag, not
performing actual vhold. Since the vnode is still visible elsewhere, a
potential new user can find it and incorrectly assume it is properly held.

Use vholdl instead to correctly hold the vnode. Another place recycling
(vlrureclaim) does this already.

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21522
2019-09-04 19:23:18 +00:00
Andriy Gapon
387df3b805 shutdown_halt: make sure that watchdog timer is stopped
The point of halt is to keep the machine in limbo.

Reviewed by:	kib
MFC after:	2 weeks
Differential Revision: https://reviews.freebsd.org/D21222
2019-09-04 13:26:59 +00:00
Kyle Evans
dca52ab480 posixshm: start counting writeable mappings
r351650 switched posixshm to using OBJT_SWAP for shm_object

r351795 added support to the swap_pager for tracking writeable mappings

Take advantage of this and start tracking writeable mappings; fd sealing
will use this to reject a seal on writing with EBUSY if any such mapping
exist.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D21456
2019-09-03 20:33:38 +00:00
Kyle Evans
fe7bcbaf50 vm pager: writemapping accounting for OBJT_SWAP
Currently writemapping accounting is only done for vnode_pager which does
some accounting on the underlying vnode.

Extend this to allow accounting to be possible for any of the pager types.
New pageops are added to update/release writecount that need to be
implemented for any pager wishing to do said accounting, and we implement
these methods now for both vnode_pager (unchanged) and swap_pager.

The primary motivation for this is to allow other systems with OBJT_SWAP
objects to check if their objects have any write mappings and reject
operations with EBUSY if so. posixshm will be the first to do so in order to
reject adding write seals to the shmfd if any writable mappings exist.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D21456
2019-09-03 20:31:48 +00:00
Konstantin Belousov
fe69291ff4 Add procctl(PROC_STACKGAP_CTL)
It allows a process to request that stack gap was not applied to its
stacks, retroactively.  Also it is possible to control the gaps in the
process after exec.

PR:	239894
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D21352
2019-09-03 18:56:25 +00:00
Mateusz Guzik
e3c3248cc7 vfs: implement usecount implying holdcnt
vnodes have 2 reference counts - holdcnt to keep the vnode itself from getting
freed and usecount to denote it is actively used.

Previously all operations bumping usecount would also bump holdcnt, which is
not necessary. We can detect if usecount is already > 1 (in which case holdcnt
is also > 1) and utilize it to avoid bumping holdcnt on our own. This saves
on atomic ops.

Reviewed by:	kib
Tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21471
2019-09-03 15:42:11 +00:00
Mateusz Guzik
d05b53e0ba Add sysctlbyname system call
Previously userspace would issue one syscall to resolve the sysctl and then
another one to actually use it. Do it all in one trip.

Fallback is provided in case newer libc happens to be running on an older
kernel.

Submitted by:	Pawel Biernacki
Reported by:	kib, brooks
Differential Revision:	https://reviews.freebsd.org/D17282
2019-09-03 04:16:30 +00:00
Mateusz Guzik
1874c61b90 vfs: restore mp null check in vop_stdgetwritemount
The initially read mount point can already be NULL.

Reported by:	markj
Fixes: r351656 ("vfs: stop refing freed mount points in vop_stdgetwritemount")
Sponsored by:	The FreeBSD Foundation
2019-09-02 15:24:25 +00:00
Mateusz Guzik
9576ff5803 proc: clear pid bitmap entry after dropping proctree lock
There is no correctness change here, but the procid lock is contended in
the fork path and taking it while holding proctree avoidably extends its
hold time.

Note that there are other ids which can end up getting cleared with the
lock.

Sponsored by:	The FreeBSD Foundation
2019-09-02 12:46:43 +00:00
Mark Johnston
08cfa56ea3 Extend uma_reclaim() to permit different reclamation targets.
The page daemon periodically invokes uma_reclaim() to reclaim cached
items from each zone when the system is under memory pressure.  This
is important since the size of these caches is unbounded by default.
However it also results in bursts of high latency when allocating from
heavily used zones as threads miss in the per-CPU caches and must
access the keg in order to allocate new items.

With r340405 we maintain an estimate of each zone's usage of its
(per-NUMA domain) cache of full buckets.  Start making use of this
estimate to avoid reclaiming the entire cache when under memory
pressure.  In particular, introduce TRIM, DRAIN and DRAIN_CPU
verbs for uma_reclaim() and uma_zone_reclaim().  When trimming, only
items in excess of the estimate are reclaimed.  Draining a zone
reclaims all of the cached full buckets (the previous behaviour of
uma_reclaim()), and may further drain the per-CPU caches in extreme
cases.

Now, when under memory pressure, the page daemon will trim zones
rather than draining them.  As a result, heavily used zones do not incur
bursts of bucket cache misses following reclamation, but large, unused
caches will be reclaimed as before.

Reviewed by:	jeff
Tested by:	pho (an earlier version)
MFC after:	2 months
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D16667
2019-09-01 22:22:43 +00:00
Mark Johnston
63cdd18e40 Restrict the input domain set in cpuset_setdomain(2) to all_domains.
To permit larger values of MAXMEMDOM, which is currently 8 on amd64,
cpuset_setdomain(2) accepts a mask of size 256.  In the kernel, domain
set masks are 64 bits wide, but can only represent a set of MAXMEMDOM
domains due to the use of the ds_order table.

Domain sets passed to cpuset_setdomain(2) are restricted to a subset
of their parent set, which is typically the root set, but before this
happens we modify the input set to exclude empty domains.
domainset_empty_vm() and other code which manipulates domain sets
expect the mask to be a subset of all_domains, so enforce that when
performing validation of cpuset_setdomain(2) parameters.

Reported and tested by:	pho
Reviewed by:	kib
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21477
2019-09-01 21:38:08 +00:00
Mateusz Guzik
2796c209b0 vfs: stop refing freed mount points in vop_stdgetwritemount
The code used blindly ref based on an unsafely red address and then would
backpedal if necessary. This was safe in terms of memory access since
mounts are type-stable, but made for a potential a bug where the mount
was reused and had the count reset to 0 before this code decreased it.

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21411
2019-09-01 14:01:09 +00:00
Kyle Evans
32287ea72b posixshm: switch to OBJT_SWAP in advance of other changes
Future changes to posixshm will start tracking writeable mappings in order
to support file sealing. Tracking writeable mappings for an OBJT_DEFAULT
object is complicated as it may be swapped out and converted to an
OBJT_SWAP. One may generically add this tracking for vm_object, but this is
difficult to do without increasing memory footprint of vm_object and blowing
up memory usage by a significant amount.

On the other hand, the swap pager can be expanded to track writeable
mappings without increasing vm_object size. This change is currently in
D21456. Switch over to OBJT_SWAP in advance of the other changes to the
swap pager and posixshm.
2019-09-01 00:33:16 +00:00
Mateusz Guzik
c2b600f98f vfs: add a missing VNODE_REFCOUNT_FENCE_REL to v_incr_usecount_locked
Sponsored by:	The FreeBSD Foundation
2019-08-30 21:54:45 +00:00
Mateusz Guzik
3bb8d8d8c9 vfs: tidy up assertions in vfs_subr
- assert unlocked vnode interlock in vref
- assert right counts in vputx
- print debug info for panic in vdrop

Sponsored by:	The FreeBSD Foundation
2019-08-30 00:45:53 +00:00
Konstantin Belousov
6470c8d3db Rework v_object lifecycle for vnodes.
Current implementation of vnode_create_vobject() and
vnode_destroy_vobject() is written so that it prepared to handle the
vm object destruction for live vnode.  Practically, no filesystems use
this, except for some remnants that were present in UFS till today.
One of the consequences of that model is that each filesystem must
call vnode_destroy_vobject() in VOP_RECLAIM() or earlier, as result
all of them get rid of the v_object in reclaim.

Move the call to vnode_destroy_vobject() to vgonel() before
VOP_RECLAIM().  This makes v_object stable: either the object is NULL,
or it is valid vm object till the vnode reclamation.  Remove code from
vnode_create_vobject() to handle races with the parallel destruction.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D21412
2019-08-29 07:50:25 +00:00
Mateusz Guzik
1e2f0ceb2f vfs: add VOP_NEED_INACTIVE
vnode usecount drops to 0 all the time (e.g. for directories during path lookup).
When that happens the kernel would always lock the exclusive lock for the vnode
in order to call vinactive(). This blocks other threads who want to use the vnode
for looukp.

vinactive is very rarely needed and can be tested for without the vnode lock held.

This patch gives filesytems an opportunity to do it, sample total wait time for
tmpfs over 500 minutes of poudriere -j 104:

before: 557563641706 (lockmgr:tmpfs)
after:   46309603301 (lockmgr:tmpfs)

Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21371
2019-08-28 20:34:24 +00:00
Mark Johnston
772dd133c6 Avoid direct accesses of the vm_page wire_count field.
No functional change intended.

Sponsored by:	Netflix
2019-08-28 18:01:54 +00:00
Mateusz Guzik
88cc62e5a5 proc: eliminate the zombproc list
It is not needed by anything in the kernel and it slightly drives up contention
on both proctree and allproc locks.

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21447
2019-08-28 16:18:23 +00:00
Mark Johnston
b5d239cb97 Wire pages in vm_page_grab() when appropriate.
uiomove_object_page() and exec_map_first_page() would previously wire a
page after having grabbed it.  Ask vm_page_grab() to perform the wiring
instead: this removes some redundant code, and is cheaper in the case
where the requested page is not resident since the page allocator can be
asked to initialize the page as wired, whereas a separate vm_page_wire()
call requires the page lock.

In vm_imgact_hold_page(), use vm_page_unwire_noq() instead of
vm_page_unwire(PQ_NONE).  The latter ensures that the page is dequeued
before returning, but this is unnecessary since vm_page_free() will
trigger a batched dequeue of the page.

Reviewed by:	alc, kib
Tested by:	pho (part of a larger patch)
MFC after:	1 week
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21440
2019-08-28 16:08:06 +00:00
Mateusz Guzik
2319489b6e proc: remove zpfind
It is not used by anything. If someone wants it back it should be reimplemented
to use the proc hash.

Sponsored by:	The FreeBSD Foundation
2019-08-28 01:22:21 +00:00
John Baldwin
818d755318 Only define the 'tls' member of sfio in KERN_TLS is defined.
This field was not initialized in the !KERN_TLS case triggering an
assertion failure when using sendfile(2).

Reported by:	pho, asomers
Sponsored by:	Netflix
2019-08-27 22:21:18 +00:00
Mateusz Guzik
368cabbcb5 vfs: stop passing LK_INTERLOCK to VOP_UNLOCK
The plan is to drop the flags argument. There is also a temporary bug
now that nullfs ignores the flag.

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21252
2019-08-27 20:30:56 +00:00
Mark Johnston
44e4def73b Remove an extraneous + 1 in _domainset_create().
DOMAINSET_FLS, like our fls(), is 1-indexed.

Reported by:	alc
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2019-08-27 15:42:08 +00:00
Mark Johnston
8e6975047e Fix several logic issues in domainset_empty_vm().
- Don't add 1 to the result of DOMAINSET_FLS.
- Do not modify domainsets containing only empty domains.
- Always flatten a _PREFER policy to _ROUNDROBIN if the preferred
  domain is empty.  Previously we were doing this only when ds_cnt > 1.

These bugs could cause hangs during boot if a VM domain is empty.

Tested by:	hselasky
Reviewed by:	hselasky, kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21420
2019-08-27 14:06:34 +00:00
Konstantin Belousov
95acb40caa vn_vget_ino_gen(): relock the lower vnode on error.
The function' interface assumes that the lower vnode is passed and
returned locked always.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-08-27 08:28:38 +00:00
John Baldwin
b2e60773c6 Add kernel-side support for in-kernel TLS.
KTLS adds support for in-kernel framing and encryption of Transport
Layer Security (1.0-1.2) data on TCP sockets.  KTLS only supports
offload of TLS for transmitted data.  Key negotation must still be
performed in userland.  Once completed, transmit session keys for a
connection are provided to the kernel via a new TCP_TXTLS_ENABLE
socket option.  All subsequent data transmitted on the socket is
placed into TLS frames and encrypted using the supplied keys.

Any data written to a KTLS-enabled socket via write(2), aio_write(2),
or sendfile(2) is assumed to be application data and is encoded in TLS
frames with an application data type.  Individual records can be sent
with a custom type (e.g. handshake messages) via sendmsg(2) with a new
control message (TLS_SET_RECORD_TYPE) specifying the record type.

At present, rekeying is not supported though the in-kernel framework
should support rekeying.

KTLS makes use of the recently added unmapped mbufs to store TLS
frames in the socket buffer.  Each TLS frame is described by a single
ext_pgs mbuf.  The ext_pgs structure contains the header of the TLS
record (and trailer for encrypted records) as well as references to
the associated TLS session.

KTLS supports two primary methods of encrypting TLS frames: software
TLS and ifnet TLS.

Software TLS marks mbufs holding socket data as not ready via
M_NOTREADY similar to sendfile(2) when TLS framing information is
added to an unmapped mbuf in ktls_frame().  ktls_enqueue() is then
called to schedule TLS frames for encryption.  In the case of
sendfile_iodone() calls ktls_enqueue() instead of pru_ready() leaving
the mbufs marked M_NOTREADY until encryption is completed.  For other
writes (vn_sendfile when pages are available, write(2), etc.), the
PRUS_NOTREADY is set when invoking pru_send() along with invoking
ktls_enqueue().

A pool of worker threads (the "KTLS" kernel process) encrypts TLS
frames queued via ktls_enqueue().  Each TLS frame is temporarily
mapped using the direct map and passed to a software encryption
backend to perform the actual encryption.

(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if
someone wished to make this work on architectures without a direct
map.)

KTLS supports pluggable software encryption backends.  Internally,
Netflix uses proprietary pure-software backends.  This commit includes
a simple backend in a new ktls_ocf.ko module that uses the kernel's
OpenCrypto framework to provide AES-GCM encryption of TLS frames.  As
a result, software TLS is now a bit of a misnomer as it can make use
of hardware crypto accelerators.

Once software encryption has finished, the TLS frame mbufs are marked
ready via pru_ready().  At this point, the encrypted data appears as
regular payload to the TCP stack stored in unmapped mbufs.

ifnet TLS permits a NIC to offload the TLS encryption and TCP
segmentation.  In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS)
is allocated on the interface a socket is routed over and associated
with a TLS session.  TLS records for a TLS session using ifnet TLS are
not marked M_NOTREADY but are passed down the stack unencrypted.  The
ip_output_send() and ip6_output_send() helper functions that apply
send tags to outbound IP packets verify that the send tag of the TLS
record matches the outbound interface.  If so, the packet is tagged
with the TLS send tag and sent to the interface.  The NIC device
driver must recognize packets with the TLS send tag and schedule them
for TLS encryption and TCP segmentation.  If the the outbound
interface does not match the interface in the TLS send tag, the packet
is dropped.  In addition, a task is scheduled to refresh the TLS send
tag for the TLS session.  If a new TLS send tag cannot be allocated,
the connection is dropped.  If a new TLS send tag is allocated,
however, subsequent packets will be tagged with the correct TLS send
tag.  (This latter case has been tested by configuring both ports of a
Chelsio T6 in a lagg and failing over from one port to another.  As
the connections migrated to the new port, new TLS send tags were
allocated for the new port and connections resumed without being
dropped.)

ifnet TLS can be enabled and disabled on supported network interfaces
via new '[-]txtls[46]' options to ifconfig(8).  ifnet TLS is supported
across both vlan devices and lagg interfaces using failover, lacp with
flowid enabled, or lacp with flowid enabled.

Applications may request the current KTLS mode of a connection via a
new TCP_TXTLS_MODE socket option.  They can also use this socket
option to toggle between software and ifnet TLS modes.

In addition, a testing tool is available in tools/tools/switch_tls.
This is modeled on tcpdrop and uses similar syntax.  However, instead
of dropping connections, -s is used to force KTLS connections to
switch to software TLS and -i is used to switch to ifnet TLS.

Various sysctls and counters are available under the kern.ipc.tls
sysctl node.  The kern.ipc.tls.enable node must be set to true to
enable KTLS (it is off by default).  The use of unmapped mbufs must
also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS.

KTLS is enabled via the KERN_TLS kernel option.

This patch is the culmination of years of work by several folks
including Scott Long and Randall Stewart for the original design and
implementation; Drew Gallatin for several optimizations including the
use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records
awaiting software encryption, and pluggable software crypto backends;
and John Baldwin for modifications to support hardware TLS offload.

Reviewed by:	gallatin, hselasky, rrs
Obtained from:	Netflix
Sponsored by:	Netflix, Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D21277
2019-08-27 00:01:56 +00:00
Xin LI
4e8671dd78 GZIO: Update to use zlib 1.2.11.
PR:		229763
Submitted by:	Yoshihiro Ota <ota j email ne jp>
Differential Revision:	https://reviews.freebsd.org/D21408
2019-08-25 07:50:44 +00:00
Mateusz Guzik
0256405e98 vfs: add vholdnz (for already held vnodes)
Reviewed by:	kib (previous version)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21358
2019-08-25 05:11:43 +00:00
Mateusz Guzik
5b596b9fa5 Remove the obsolete pcpu_zone_ptr zone.
It was only used by flowtable (removed in r321618).

Sponsored by:	The FreeBSD Foundation
2019-08-24 00:01:19 +00:00
Konstantin Belousov
e671edac06 De-commision the MNTK_NOINSMNTQ kernel mount flag.
After all the changes, its dynamic scope is same as for MNTK_UNMOUNT,
but to allow the syncer vnode to be re-installed on unmount failure.
But the case of syncer was already handled by using the VV_FORCEINSMQ
flag for quite some time.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-08-23 19:40:10 +00:00
Xin LI
a11bf9a49b INVARIANTS: treat LA_LOCKED as the same of LA_XLOCKED in mtx_assert.
The Linux lockdep API assumes LA_LOCKED semantic in lockdep_assert_held(),
meaning that either a shared lock or write lock is Ok.  On the other hand,
the timeout code uses lc_assert() with LA_XLOCKED, and we need both to
work.

For mutexes, because they can not be shared (this is unique among all lock
classes, and it is unlikely that we would add new lock class anytime soon),
it is easier to simply extend mtx_assert to handle LA_LOCKED there, despite
the change itself can be viewed as a slight abstraction violation.

Reviewed by:	mjg, cem, jhb
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D21362
2019-08-23 06:39:40 +00:00
Brooks Davis
075ac3b446 Reorganise conditionals to reduce duplication.
No functional change.

Obtained from:	CheriBSD
MFC after:	3 days
Sponsored by:	DARPA, AFRL
2019-08-22 10:21:07 +00:00
Rick Macklem
df9bc7df42 Map ENOTTY to EINVAL for lseek(SEEK_DATA/SEEK_HOLE).
Without this patch, when an application performed lseek(SEEK_DATA/SEEK_HOLE)
on a file in a file system that does not have its own VOP_IOCTL(), the
lseek(2) fails with errno ENOTTY. This didn't seem appropriate, since
ENOTTY is not listed as an error return by either the lseek(2) man page
nor the POSIX draft for lseek(2).
This was discussed on freebsd-current@ here:
http://docs.FreeBSD.org/cgi/mid.cgi?CAOtMX2iiQdv1+15e1N_r7V6aCx_VqAJCTP1AW+qs3Yg7sPg9wA

This trivial patch maps ENOTTY to EINVAL for lseek(SEEK_DATA/SEEK_HOLE).

Reviewed by:	markj
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D21300
2019-08-22 01:15:06 +00:00
Mark Johnston
5b699f1614 Add lockmgr(9) probes to the lockstat DTrace provider.
They follow the conventions set by rw and sx lock probes.  There is
an additional lockstat:::lockmgr-disown probe.

Update lockstat(1) to report on contention and hold events for
lockmgr locks.  Document the new probes in dtrace_lockstat.4, and
deduplicate some of the existing probe descriptions.

Reviewed by:	mjg
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21355
2019-08-21 23:43:58 +00:00
Mark Johnston
9fb7c918ef Remove manual wire_count adjustments from the unmapped mbuf code.
The original code came from a desire to minimize the number of updates
to v_wire_count, which prior to r329187 was updated using atomics.
However, there is no significant benefit to batching today, so simply
allocate pages using VM_ALLOC_WIRED and rely on system accounting.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D21323
2019-08-21 20:01:52 +00:00
Mark Johnston
6bc13e042f Modify pipe_poll() to properly check for pending direct writes.
With r349546, it is a responsibility of the writer to clear PIPE_DIRECTW
after pinned data has been read.  In particular, once a reader has
drained this data, there is a small window where the pipe is empty but
PIPE_DIRECTW is set.  pipe_poll() was using the presence of PIPE_DIRECTW
to determine whether to return POLLIN, so in this window it would
claim that data was available to read when this was not the case.

Fix this by modifying several checks for PIPE_DIRECTW to instead look
at the number of residual bytes in data pinned by a direct writer.  In
some cases we really do want to check for PIPE_DIRECTW, since the
presence of this flag indicates that any attempt to write to the pipe
will block on the existing direct writer.

Bisected and test case provided by:	mav
Tested by:	pho
Reviewed by:	kib
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21333
2019-08-21 19:35:04 +00:00
Ed Maste
f37192064a mqueuefs: fix compat32 struct file leak
In a compat32 error case we previously leaked a struct file.

Submitted by:	Karsten König, Secfault Security
Security:	CVE-2019-5603
2019-08-20 17:44:03 +00:00
Jeff Roberson
cf27e0d125 Use an atomic reference count for paging in progress so that callers do not
require the object lock.

Reviewed by:	markj
Tested by:	pho (as part of a larger branch)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21311
2019-08-19 23:09:38 +00:00
Mateusz Guzik
4b3f767340 vfs: fix up r351193 ("stop always overwriting ->mnt_stat in VFS_STATFS")
fs-specific part of vfs_statfs routines only fill in small portion of the
structure. Previous code was always copying everything at a higher layer to
acoomodate it and this patch does the same.

'df' (no arguments) worked fine because the caller uses mnt_stat itself as the
target buffer, making all the copying a no-op for its own case.
'df /' and similar use a different consumer which passes its own buffer and
this is where you can run into trouble.

Reported by:	cy
Fixes: r351193
Sponsored by:	The FreeBSD Foundation
2019-08-19 14:11:54 +00:00
Andrey V. Elsukov
75697b16b6 Use TAILQ_FOREACH_SAFE() macro to avoid use after free in soclose().
PR:		239893
MFC after:	1 week
2019-08-19 12:42:03 +00:00
Andriy Gapon
0db7afd0ae assert that td_lk_slocks is not leaked upon return from kernel
This is similar to checks for td_sx_slocks and td_rw_rlocks.
Although td_lk_slocks is an implementation detail, it still makes sense
to validate it.

MFC after:	1 week
Sponsored by:	Panzura
2019-08-19 11:18:36 +00:00
Rick Macklem
2e1b32c0e3 Add a vop_stdioctl() that performs a trivial FIOSEEKDATA/FIOSEEKHOLE.
Without this patch, when an application performed lseek(SEEK_DATA/SEEK_HOLE)
on a file in a file system that does not have its own VOP_IOCTL(), the
lseek(2) fails with errno ENOTTY. This didn't seem appropriate, since
ENOTTY is not listed as an error return by either the lseek(2) man page
nor the POSIX draft for lseek(2).
A discussion on freebsd-current@ seemed to indicate that implementing
a trivial algorithm that returns the offset argument for FIOSEEKDATA and
returns the file's size for FIOSEEKHOLE was the preferred fix.
http://docs.FreeBSD.org/cgi/mid.cgi?CAOtMX2iiQdv1+15e1N_r7V6aCx_VqAJCTP1AW+qs3Yg7sPg9wA
The Linux kernel appears to implement this trivial algorithm as well.

This patch adds a vop_stdioctl() that implements this trivial algorithm.
It returns errors consistent with vn_bmap_seekhole() and, as such, will
still return ENOTTY for non-regular files.

I have proposed a separate patch that maps errors not described by the
lseek(2) man page nor POSIX draft to EINVAL. This patch is under separate
review.

Reviewed by:	kib
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D21299
2019-08-19 00:29:05 +00:00
Konstantin Belousov
de4e1aeb21 Fix an issue with executing tmpfs binary.
Suppose that a binary was executed from tmpfs mount, and the text
vnode was reclaimed while the binary was still running.  It is
possible during even the normal operations since tmpfs vnode'
vm_object has swap type, and no references on the vnode is held.  Also
assume that the text vnode was revived for some reason.  Then, on the
process exit or exec, unmapping of the text mapping tries to remove
the text reference from the vnode, but since it went from
recycle/instantiation cycle, there is no reference kept, and assertion
in VOP_UNSET_TEXT_CHECKED() triggers.

Fix this by keeping a use reference on the tmpfs vnode for each exec
reference.  This prevents the vnode reclamation while executable map
entry is active.

Do it by adding per-mount flag MNTK_TEXT_REFS that directs
vop_stdset_text() to add use ref on first vnode text use, and
per-vnode VI_TEXT_REF flag, to record the need on unref in
vop_stdunset_text() on last vnode text use going away.  Set
MNTK_TEXT_REFS for tmpfs mounts.

Reported by:	bdrewery
Tested by:	sbruno, pho (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-08-18 20:36:11 +00:00
Konstantin Belousov
bb9e2184f0 Change locking requirements for VOP_UNSET_TEXT().
Require the vnode to be locked for the VOP_UNSET_TEXT() call.  This
will be used by the following bug fix for a tmpfs issue.

Tested by:	sbruno, pho (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-08-18 20:24:52 +00:00
Mateusz Guzik
e7c1709aaf vfs: stop always overwriting ->mnt_stat in VFS_STATFS
The struct is already populated on each mount (and remount). Fields are either
constant or not used by filesystem in the first place.

Some infrequently used functions use it to avoid having to allocate a new buffer
and are left alone.

The current code results in an avoidable copying single-threaded and significant
cache line bouncing multithreaded

While here deduplicate initial filling of the struct.

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21317
2019-08-18 18:40:12 +00:00
Jeff Roberson
33205c60e7 Add a blocking wait bit to refcount. This allows refs to be used as a simple
barrier.

Reviewed by:	markj, kib
Discussed with:	jhb
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21254
2019-08-18 11:43:58 +00:00
Mateusz Guzik
50c7615fb0 fork: rework locking around do_fork
- move allproc lock into the func, it is of no use prior to it
- the code would lock p1 and p2 while holding allproc to partially
construct it after it gets added to the list. instead we can do the
work prior to adding anything.
- protect lastpid with procid_lock

As a side effect we do less work with allproc held.

Sponsored by:	The FreeBSD Foundation
2019-08-17 18:19:49 +00:00
Mateusz Guzik
60cdcb644d fork: bump process count before checking for permission to cross the limit
The limit is almost never reached. Do the check only on failure to see if
we can override it.

No change in user-visible behavior.

Sponsored by:	The FreeBSD Foundation
2019-08-17 17:56:43 +00:00
Mateusz Guzik
b05641b6bd fork: stop skipping < 100 ids on wrap around
Code doing this is commented with a claim that these IDs are occupied by
daemons, but that's demonstrably false. To an extent the range is used by init
and kernel processes (and on sufficiently big machines it indeed is fully
populated).

On a sample box 40-way box the highest id in the range is 63. On a different one
it is 23. Just use the range.

Sponsored by:	The FreeBSD Foundation
2019-08-17 17:42:01 +00:00
Alexander Motin
3a60f3dad0 Add support for 'j', 't' and 'z' flags to kernel sscanf().
MFC after:	2 weeks
2019-08-16 19:46:22 +00:00
Jeff Roberson
2194393787 Move phys_avail definition into MI code. It is consumed in the MI layer and
doing so adds more flexibility with less redundant code.

Reviewed by:	jhb, markj, kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21250
2019-08-16 00:45:14 +00:00
Rick Macklem
c61b14315f Fix copy_file_range(2) so that unneeded blocks are not allocated to the output file.
When the byte range for copy_file_range(2) doesn't go to EOF on the
output file and there is a hole in the input file, a hole must be
"punched" in the output file. This is done by writing a block of bytes
all set to 0.
Without this patch, the write is done unconditionally which means that,
if the output file already has a hole in that byte range, a unneeded data block
of all 0 bytes would be allocated.
This patch adds code to check for a hole in the output file, so that it can
skip doing the write if there is already a hole in that byte range of
the output file. This avoids unnecessary allocation of blocks to the
output file.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D21155
2019-08-15 23:21:41 +00:00
Jeff Roberson
018ff6860f Move scheduler state into the per-cpu area where it can be allocated on the
correct NUMA domain.

Reviewed by:	markj, gallatin
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D19315
2019-08-13 04:54:02 +00:00
Konstantin Belousov
7e097daa93 Only enable COMPAT_43 changes for syscalls ABI for a.out processes.
Reviewed by:	imp, jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D21200
2019-08-11 19:16:07 +00:00
Jonathan T. Looney
afd959f332 In m_pulldown(), before trying to prepend bytes to the subsequent mbuf,
ensure that the subsequent mbuf contains the remainder of the bytes
the caller sought. If this is not the case, fall through to the code
which gathers the bytes in a new mbuf.

This fixes a bug where m_pulldown() could fail to gather all the desired
bytes into consecutive memory.

PR:		238787
Reported by:	A reddit user
Discussed with:	emaste
Obtained from:	NetBSD
MFC after:	3 days
2019-08-09 05:18:59 +00:00
Rick Macklem
6b1bc6f7dd Remove some harmless cruft from vn_generic_copy_file_range().
An earlier version of the patch had code that set "error" between
line#s 2797-2799. When that code was moved, the second check for "error != 0"
could never be true and the check became harmless cruft.
This patch removes the cruft, mainly to make Coverity happy.

Reported by:	asomers, cem
2019-08-08 20:07:38 +00:00
Rick Macklem
614633146f Fix copy_file_range(2) for an unlikely race during hole finding.
Since the VOP_IOCTL(FIOSEEKDATA/FIOSEEKHOLE) calls are done with the
vnode unlocked, it is possible for another thread to do:
- truncate(), lseek(), write()
between the two calls and create a hole where FIOSEEKDATA returned the
start of data.
For this case, VOP_IOCTL(FIOSEEKHOLE) will return the same offset for
the hole location. This could result in an infinite loop in the copy
code, since copylen is set to 0 and the copy doesn't advance.
Usually, this race is avoided because of the use of rangelocks, but the
NFS server does not do range locking and could do a sequence like the
above to create the hole.

This patch checks for this case and makes the hole search fail, to avoid
the infinite loop.

At this time, it is an open question as to whether or not the NFS server
should do range locking to avoid this race.
2019-08-08 19:53:07 +00:00
Konstantin Belousov
b706be23b4 Update comment explaining create_init().
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2019-08-08 16:42:53 +00:00
Xin LI
22bbc4b242 Convert DDB_CTF to use newer version of ZLIB.
PR:		229763
Submitted by:	Yoshihiro Ota <ota j email ne jp>
Differential Revision:	https://reviews.freebsd.org/D21176
2019-08-08 07:27:49 +00:00
Conrad Meyer
7d0658ad55 Fix !DDB kernel configurations after r350713
KDB is standard and the kdb_active variable is always available.  So,
de-conditionalize inclusion of sys/kdb.h in kern_sysctl.c.

Reported by:	Michael Butler <imb AT protected-networks.net>
X-MFC-With:	r350713
Sponsored by:	Dell EMC Isilon
2019-08-08 01:37:41 +00:00
Conrad Meyer
088c17b46b ddb(4): Add 'sysctl' command
Implement `sysctl` in `ddb` by overriding `SYSCTL_OUT`.  When handling the
req, we install custom ddb in/out handlers.  The out handler prints straight
to the debugger, while the in handler ignores all input.  This is intended
to allow us to print just about any sysctl.

There is a known issue when used from ddb(4) entered via 'sysctl
debug.kdb.enter=1'.  The DDB mode does not quite prevent all lock
interactions, and it is possible for the recursive Giant lock to be unlocked
when the ddb(4) 'sysctl' command is used.  This may result in a panic on
return from ddb(4) via 'c' (continue).  Obviously, this is not a problem
when debugging already-paniced systems.

Submitted by:	Travis Lane (formerly: <travis.lane AT isilon.com>)
Reviewed by:	vangyzen (earlier version), Don Morris <dgmorris AT earthlink.net>
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D20219
2019-08-08 00:42:29 +00:00
Conrad Meyer
76cb1112da sbuf(9): Add sbuf_nl_terminate() API
The API is used to gracefully terminate text line(s) with a single \n.  If
the formatted buffer was empty or already ended in \n, it is unmodified.
Otherwise, a newline character is appended to it.  The API, like other
sbuf-modifying routines, is only valid while the sbuf is not FINISHED.

Reviewed by:	rlibby
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D21030
2019-08-07 19:27:14 +00:00
Conrad Meyer
d23813cdb9 sbuf(9): Refactor sbuf_newbuf into sbuf_new
Code flow was somewhat difficult to read due to the combination of
multiple return sites and the 4x possible dynamic constructions of an
sbuf.  (Future consideration: do we need all 4?)  Refactored slightly to
improve legibility.

No functional change.

Sponsored by:	Dell EMC Isilon
2019-08-07 19:25:56 +00:00
Conrad Meyer
71db411eb6 sbuf(9): Add NOWAIT dynamic buffer extension mode
The goal is to avoid some kinds of low-memory deadlock when formatting
heap-allocated buffers.

Reviewed by:	vangyzen
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D21015
2019-08-07 19:23:07 +00:00
Gleb Smirnoff
814f33aafb Since r350426 this KASSERT doesn't serve any useful purpose. 2019-08-06 16:11:00 +00:00
Mariusz Zaborski
c878d1eb45 procdesc: fix the function name
I changed name of the function r350429 and forgot to update
the r350612 patch.

Reported by:	jenkins
MFC after:	1 month
2019-08-05 20:31:17 +00:00
Mariusz Zaborski
9f5103abab process: style
We don't need to check if the parent is already set.
This is done already in the proc_reparent.

No functional behaviour changes intended.

MFC after:	1 month
2019-08-05 20:26:01 +00:00
Mariusz Zaborski
a05cfdf479 exit1: fix style nits
MFC after:	1 month
2019-08-05 20:20:14 +00:00
Mariusz Zaborski
fd631bcd95 procdesc: fix reparenting when the debugger is attached
The process is reparented to the debugger while it is attached.
  B          B
 /   ---->   |
A          A D

Every time when the process is reparented, it is added to the orphan list
of the previous parent:

A->orphan = B
D->orphan = NULL

When the A process will close the process descriptor to the B process,
the B process will be reparented to the init process.
  B            B - init
  |   ---->
A D          A   D

A->orphan = B
D->orphan = B

In this scenario, the B process is in the orphan list of A and D.

When the last process descriptor is closed instead of reparenting
it to the reaper let it stay with the debugger process and set
our previews parent to the reaper.

Add test case for this situation.
Notice that without this patch the kernel will crash with this test case:
panic: orphan 0xfffff8000e990530 of 0xfffff8000e990000 has unexpected oppid 1

Reviewed by:	markj, kib
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D20361
2019-08-05 20:15:46 +00:00
Mariusz Zaborski
799d92ab78 proc: introduce the proc_add_orphan function
This API allows adding the process to its parent orphan list.

Reviewed by:	kib, markj
MFC after:	1 month
2019-08-05 20:11:57 +00:00
Mariusz Zaborski
41fadb3fca exit1: postpone clearing P_TRACED flag until the proctree lock is acquired
In case of the process being debugged. The P_TRACED is cleared very early,
which would make procdesc_close() not calling proc_clear_orphan().
That would result in the debugged process can not be able to collect
status of the process with process descriptor.

Reviewed by:	markj, kib
Tested by:	pho
MFC after:	1 month
2019-08-05 19:59:23 +00:00
Konstantin Belousov
a1549acbaf Fix mis-merge.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-08-05 19:19:25 +00:00
Konstantin Belousov
01c3ba9752 Fix mis-merge
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-08-05 19:16:33 +00:00
Justin Hibbits
937a05ba81 Add necessary bits for Linux KPI to work correctly on powerpc
PowerPC, and possibly other architectures, use different address ranges for
PCI space vs physical address space, which is only mapped at resource
activation time, when the BAR gets written.  The DRM kernel modules do not
activate the rman resources, soas not to waste KVA, instead only mapping
parts of the PCI memory at a time.  This introduces a
BUS_TRANSLATE_RESOURCE() method, implemented in the Open Firmware/FDT PCI
driver, to perform this necessary translation without activating the
resource.

In addition to system KPI changes, LinuxKPI is updated to handle a
big-endian host, by adding proper endian swaps to the I/O functions.

Submitted by:	mmacy
Reported by:	hselasky
Differential Revision:	https://reviews.freebsd.org/D21096
2019-08-04 19:28:10 +00:00
John Baldwin
f422bc3092 Set ISOPEN in namei flags when opening executable interpreters.
These vnodes are explicitly opened via VOP_OPEN via
exec_check_permissions identical to the main exectuable image.
Setting ISOPEN allows filesystems to perform suitable checks in
VOP_LOOKUP (e.g. close-to-open consistency in the NFS client).

Reviewed by:	kib
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D21129
2019-08-03 01:02:52 +00:00
Mark Johnston
8675f5f776 Only check the blessings table for known LORs.
Previously we would check for blessings before marking a given lock
pair as reversed, so each "reversed" lock acquisition would require
a linear scan of the table.  Instead, check the table after marking
the pair as reversed but before generating a report.

Reviewed by:	jhb
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21135
2019-08-02 18:01:47 +00:00
Konstantin Belousov
5cbdd18fd4 Make umtxq_check_susp() to correctly handle thread exit requests.
The check for P_SINGLE_EXIT was shadowed by the (P_SHOULDSTOP || traced) check.

Reported by:	bdrewery (might be)
Reviewed by:	markj
Tested by:	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D21124
2019-08-01 14:34:27 +00:00
Konstantin Belousov
fc83c5a7d0 Make randomized stack gap between strings and pointers to argv/envs.
This effectively makes the stack base on the csu _start entry
randomized.

The gap is enabled if ASLR is for the ABI is enabled, and then
kern.elf{64,32}.aslr.stack_gap specify the max percentage of the
initial stack size that can be wasted for gap.  Setting it to zero
disables the gap, and max is capped at 50%.

Only amd64 for now.

Reviewed by:	cem, markj
Discussed with:	emaste
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D21081
2019-07-31 20:23:10 +00:00
Konstantin Belousov
fd336e2ac0 Fix handling of transient casueword(9) failures in do_sem_wait().
In particular, restart should be only done when the failure is
transient.  For this, recheck the count1 value after the operation.

Note that do_sem_wait() is older usem interface.

Reported and tested by:	bdrewery
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-07-31 19:16:49 +00:00
Kyle Evans
b5a7ac997f kern_shm_open: push O_CLOEXEC into caller control
The motivation for this change is to allow wrappers around shm to be written
that don't set CLOEXEC. kern_shm_open currently accepts O_CLOEXEC but sets
it unconditionally. kern_shm_open is used by the shm_open(2) syscall, which
is mandated by POSIX to set CLOEXEC, and CloudABI's sys_fd_create1().
Presumably O_CLOEXEC is intended in the latter caller, but it's unclear from
the context.

sys_shm_open() now unconditionally sets O_CLOEXEC to meet POSIX
requirements, and a comment has been dropped in to kern_fd_open() to explain
the situation and add a pointer to where O_CLOEXEC setting is maintained for
shm_open(2) correctness. CloudABI's sys_fd_create1() also unconditionally
sets O_CLOEXEC to match previous behavior.

This also has the side-effect of making flags correctly reflect the
O_CLOEXEC status on this fd for the rest of kern_shm_open(), but a
glance-over leads me to believe that it didn't really matter.

Reviewed by:	kib, markj
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D21119
2019-07-31 15:16:51 +00:00
Mark Johnston
49c3e8c8d1 Enable witness(4) blessings.
witness has long had a facility to "bless" designated lock pairs.  Lock
order reversals between a pair of blessed locks are not reported upon.
We have a number of long-standing false positive LOR reports; start
marking well-understood LORs as blessed.

This change hides reports about UFS vnode locks and the UFS dirhash
lock, and UFS vnode locks and buffer locks, since those are the two that
I observe most often.  In the long term it would be preferable to be
able to limit blessings to a specific site where a lock is acquired,
and/or extend witness to understand why some lock order reversals are
valid (for example, if code paths with conflicting lock orders are
serialized by a third lock), but in the meantime the false positives
frequently confuse users and generate bug reports.

Reviewed by:	cem, kib, mckusick
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21039
2019-07-30 17:09:58 +00:00
Mark Johnston
ed13ff4549 Regenerate after r350447. 2019-07-30 16:01:16 +00:00
Mark Johnston
f30f7b9870 Enable copy_file_range(2) in capability mode.
copy_file_range() operates on a pair of file descriptors; it requires
CAP_READ for the source descriptor and CAP_WRITE for the destination
descriptor.

Reviewed by:	kevans, oshogbo
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21113
2019-07-30 15:59:44 +00:00
Xin LI
d4565741c6 Remove gzip'ed a.out support.
The current implementation of gzipped a.out support was based
on a very old version of InfoZIP which ships with an ancient
modified version of zlib, and was removed from the GENERIC
kernel in 1999 when we moved to an ELF world.

PR:		205822
Reviewed by:	imp, kib, emaste, Yoshihiro Ota <ota at j.email.ne.jp>
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D21099
2019-07-30 05:13:16 +00:00
Mark Johnston
98549e2dc6 Centralize the logic in vfs_vmio_unwire() and sendfile_free_page().
Both of these functions atomically unwire a page, optionally attempt
to free the page, and enqueue or requeue the page.  Add functions
vm_page_release() and vm_page_release_locked() to perform the same task.
The latter must be called with the page's object lock held.

As a side effect of this refactoring, the buffer cache will no longer
attempt to free mapped pages when completing direct I/O.  This is
consistent with the handling of pages by sendfile(SF_NOCACHE).

Reviewed by:	alc, kib
MFC after:	2 weeks
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20986
2019-07-29 22:01:28 +00:00
Mariusz Zaborski
9db97ca0bd proc: make clear_orphan an public API
This will be useful for other patches with process descriptors.
Change its name as well.

Reviewed by:	markj, kib
2019-07-29 21:42:57 +00:00
Alan Somers
0367bca479 sendfile: don't panic when VOP_GETPAGES_ASYNC returns an error
This is a partial merge of 350144 from projects/fuse2

PR:		236466
Reviewed by:	markj
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21095
2019-07-29 20:50:26 +00:00