17154 Commits

Author SHA1 Message Date
Edward Tomasz Napierala
be2cfdbc86 Add kern_getsid() and use it in Linuxulator; no functional changes.
Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22647
2019-12-13 18:39:36 +00:00
Ryan Libby
9825eadf2c bitset: rename confusing macro NAND to ANDNOT
s/BIT_NAND/BIT_ANDNOT/, and for CPU and DOMAINSET too.  The actual
implementation is "and not" (or "but not"), i.e. A but not B.
Fortunately this does appear to be what all existing callers want.

Don't supply a NAND (not (A and B)) operation at this time.

Discussed with:	jeff
Reviewed by:	cem
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D22791
2019-12-13 09:32:16 +00:00
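The distinction drawn in the bitset rename above can be sketched on a single machine word. This is a minimal illustration, not the BIT_ANDNOT macro itself; the helper names `word_andnot` and `word_nand` are hypothetical.

```c
#include <stdint.h>

/* "and not": keep the bits of a that are not set in b (A but not B). */
static inline uint64_t
word_andnot(uint64_t a, uint64_t b)
{
	return (a & ~b);
}

/* A true NAND, not (A and B), which the commit chose not to supply. */
static inline uint64_t
word_nand(uint64_t a, uint64_t b)
{
	return (~(a & b));
}
```

With a = 0xC and b = 0xA, the "and not" result is 0x4, while NAND sets every bit except bit 3.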
Conrad Meyer
cd5650407e kern/subr_unit: Rip srandomdev, random(3) out of dead code
The simulation cannot be reproduced, so the value of using a deterministic PRNG
like random(3) is dubious.  The number of repetitions used in the sample isn't a
problem for the Chacha implementation of arc4random we have today.  (Also, no
one actually runs this code; it was provided as an example of the work the
author did validating the implementation.  It's not even test code.)
2019-12-13 04:48:20 +00:00
Rick Macklem
ea9a16b252 r355677 requires that vop_stdioctl() be global so it can be called from NFS.
r355677 modified the NFS client so that it does lseek(SEEK_DATA/SEEK_HOLE)
for NFSv4.2, but calls vop_stdioctl() otherwise. As such, vop_stdioctl()
needs to be a global function.

Missed during the code merge for r355677.
2019-12-13 00:14:12 +00:00
Edward Tomasz Napierala
d6fee74a0c Add kern_sync(9), and make kernel code call it instead of going
via sys_sync(2).  Minor cleanup, no functional changes.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D19366
2019-12-12 18:45:31 +00:00
Mark Johnston
7789ab32b3 Rename tdq_ipipending and clear it in sched_switch().
This fixes a regression after r355311.  Specifically, sched_preempt()
may trigger a context switch by calling thread_lock(), since
thread_lock() calls critical_exit() in its slow path and the interrupted
thread may have already been marked for preemption.  This would happen
before tdq_ipipending is cleared, blocking further preemption IPIs.  The
CPU can be left in this state indefinitely if the interrupted thread
migrates.

Rename tdq_ipipending to tdq_owepreempt.  Any switch satisfies a remote
preemption request, so clear tdq_owepreempt in sched_switch() instead of
sched_preempt() to avoid subtle problems of the sort described above.

Reviewed by:	jeff, kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22758
2019-12-12 02:43:24 +00:00
Mateusz Guzik
c8b29d1212 vfs: locking primitives which elide ->v_vnlock and shared locking disablement
Both of these features are not needed by many consumers and result in avoidable
reads which in turn put them on profiles due to cache-line ping ponging.

On top of that the current lockmgr entry point is slower than necessary
single-threaded. As an attempted clean up preparing for other changes,
provide new routines which don't support any of the aforementioned features.

With these patches in place vop_stdlock and vop_stdunlock disappear from
flamegraphs during -j 104 buildkernel.

Reviewed by:	jeff (previous version)
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D22665
2019-12-11 23:11:21 +00:00
Mateusz Guzik
55eb92db8d fd: static-ize and devolatile openfiles
Almost all access is using atomics. The only read is sysctl which should use
a whole-int-at-a-time friendly read internally.
2019-12-11 23:09:12 +00:00
Andriy Gapon
64ebbdd54d add a sanity check to the system call registration code
A system call number should be at least reserved.
We do not expect an attempt to register a fixed number system call
when nothing at all is known about it.

MFC after:	3 weeks
Sponsored by:	Panzura
2019-12-11 15:52:29 +00:00
John Baldwin
a8a03706fb Add a callout_func_t typedef for functions used with callout_*().
This typedef is the same as timeout_t except that it is in the callout
namespace and header.

Use this typedef in various places of the callout implementation that
were either using the raw type or timeout_t.

While here, add <sys/callout.h> to the manpage.

Reviewed by:	kib, imp
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D22751
2019-12-10 21:58:30 +00:00
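A function typedef like the one described above might look as follows. The commit only says it matches timeout_t; the `void (void *)` shape and the `callout_sketch` structure with its helper are assumptions made for illustration, not the contents of <sys/callout.h>.

```c
/* Assumed shape of the typedef: a function taking one opaque argument. */
typedef void callout_func_t(void *);

/* Hypothetical, stripped-down stand-in for a callout record. */
struct callout_sketch {
	callout_func_t	*c_func;	/* handler to run */
	void		*c_arg;		/* argument passed to the handler */
};

/* Invoke the stored handler with its stored argument. */
static void
callout_sketch_fire(struct callout_sketch *c)
{
	c->c_func(c->c_arg);
}

/* Example handler matching callout_func_t: increments an int counter. */
static void
bump_counter(void *arg)
{
	(*(int *)arg)++;
}
```

Declaring handlers and structure members through one typedef keeps the prototype in a single place instead of repeating the raw type.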
Mateusz Guzik
ff4486e827 vfs: refactor vhold and vdrop
No functional changes.
2019-12-10 00:08:05 +00:00
John Baldwin
d8010b1175 Copy out aux args after the argument and environment vectors.
Partially revert r354741 and r354754 and go back to allocating a
fixed-size chunk of stack space for the auxiliary vector.  Keep
sv_copyout_auxargs but change it to accept the address at the end of
the environment vector as an input stack address and no longer
allocate room on the stack.  It is now called at the end of
copyout_strings after the argv and environment vectors have been
copied out.

This should fix a regression in r354754 that broke the stack alignment
for newer Linux amd64 binaries (and probably broke Linux arm64 as
well).

Reviewed by:	kib
Tested on:	amd64 (native, linux64 (only linux-base-c7), and i386)
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D22695
2019-12-09 19:17:28 +00:00
Mateusz Guzik
abd80ddb94 vfs: introduce v_irflag and make v_type smaller
The current vnode layout is not smp-friendly by having frequently read data
avoidably sharing cachelines with very frequently modified fields. In
particular v_iflag inspected for VI_DOOMED can be found in the same line with
v_usecount. Instead make it available in the same cacheline as the v_op, v_data
and v_type which all get read all the time.

v_type is avoidably 4 bytes while the necessary data will easily fit in 1.
Shrinking it frees up 3 bytes, 2 of which get used here to introduce a new
flag field with a new value: VIRF_DOOMED.

Reviewed by:	kib, jeff
Differential Revision:	https://reviews.freebsd.org/D22715
2019-12-08 21:30:04 +00:00
Mateusz Guzik
791a24c7ea vfs: clean up vputx a little
1. replace hand-rolled macros for operation type with enum
2. unlock the vnode in vput itself, there is no need to branch on it. existence
of VPUTX_VPUT remains significant in that the inactive variant adds LK_NOWAIT
to locking request.
3. remove the useless v_usecount assertion. a few lines above, the code checks if
v_usecount > 0 and leaves. should the value be negative, refcount would fail.
4. the CTR return vnode %p to the freelist is incorrect as vdrop may find the
vnode with holdcnt > 1. if the like should exist, it should be moved there
5. no need to error = 0 for everyone

Reviewed by:	kib, jeff (previous version)
Differential Revision:	https://reviews.freebsd.org/D22718
2019-12-08 21:13:07 +00:00
Mateusz Guzik
fd6e0c43a6 vfs: factor out vnode destruction out of vdrop
Sponsored by:	The FreeBSD Foundation
2019-12-08 21:11:25 +00:00
Jeff Roberson
c3cccf95bf Handle multiple clock interrupts simultaneously in sched_clock().
Reviewed by:	kib, markj, mav
Differential Revision:	https://reviews.freebsd.org/D22625
2019-12-08 01:17:38 +00:00
Konstantin Belousov
0cc9fb7551 Only return EPERM from kill(-pid) when no process was signalled.
As mandated by POSIX.  Also clarify the kill(2) manpage.

While there, restructure the code in killpg1() to use helper which
keeps overall state of the process list iteration in the killpg1_ctx
structure, later used to infer the error returned.

Reported by:	amdmi3
Reviewed by:	jilles
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D22621
2019-12-07 18:07:49 +00:00
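The idea of carrying iteration state in a context structure and inferring the error afterwards can be sketched as below. This is not the killpg1_ctx code: the structure `kill_ctx_sketch`, its fields, and both helpers are hypothetical, and the ESRCH case for "no process matched" is taken from POSIX kill() rather than from the commit message.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical state accumulated while walking the process list. */
struct kill_ctx_sketch {
	bool	found;	/* at least one candidate process was seen */
	bool	sent;	/* at least one signal was delivered */
	int	ret;	/* last permission error encountered */
};

/* Record the outcome for one candidate; perm_error is 0 on success. */
static void
kill_ctx_visit(struct kill_ctx_sketch *ctx, int perm_error)
{
	ctx->found = true;
	if (perm_error == 0)
		ctx->sent = true;
	else
		ctx->ret = perm_error;
}

/* Infer the final error once the iteration is finished. */
static int
kill_ctx_result(const struct kill_ctx_sketch *ctx)
{
	if (!ctx->found)
		return (ESRCH);
	if (ctx->sent)
		return (0);	/* one success hides earlier EPERMs */
	return (ctx->ret);
}
```

EPERM is reported only when every candidate was refused, which is the behavior the commit title describes.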
Mateusz Guzik
12e483e5f7 vfs: clean up delmntque similarly to vdrop r355414
2019-12-07 12:56:24 +00:00
Mateusz Guzik
4f4d9a086a vfs: catch vn_printf up with reality
- add the missing VV_VMSIZEVNLOCK and VV_READLINK flags
- add decoding v_mflag

While here sort flags.
2019-12-07 12:55:58 +00:00
Brooks Davis
af796bfa71 sysent: Reduce duplication and improve readability.
Use the power of variables to avoid spelling out source and generated
files too many times.  The previous Makefiles were hard to read, hard to
edit, and badly formatted.

Reviewed by:	kevans, emaste
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D22714
2019-12-06 23:59:23 +00:00
Alexander Motin
cb847b8152 Make devstat_end_transaction_bio() count BIO_ORDERED.
MFC after:	2 weeks
2019-12-06 18:39:05 +00:00
Bjoern A. Zeeb
173c062a56 Improve EPOCH_TRACE
Two changes to EPOCH_TRACE:
(1) add a sysctl to suppress the backtrace from epoch_trace_report().
    Sometimes the log line for the recursion is enough and the
    backtrace massively spams the console.
(2) In order to be able to go without the backtrace do not only
    print where the previous occurrence happened, but also where
    the current one happens.  That way we have file:line information
    for both and can look at them without the need for getting line
    numbers from backtrace and a debugging tool.

Reviewed by:	glebius
Sponsored by:	Netflix (originally)
Differential Revision:	https://reviews.freebsd.org/D22641
2019-12-06 16:34:04 +00:00
Mateusz Guzik
befd3e35b3 sx: check for SX_LOCK_SHARED | SX_LOCK_WRITE_SPINNER when exclusive-locking
First, this removes a spurious difference compared to rw locks.
More importantly though this avoids a trip through sleepq code if the lock
happens to be caught in this state.
2019-12-05 13:43:44 +00:00
Mateusz Guzik
3eeb8a1fba vfs: remove 'active' variable from _vdrop
No functional changes.
2019-12-05 13:40:10 +00:00
Alexander Motin
61322a0a8a Mark some more hot global variables with __read_mostly.
MFC after:	1 week
2019-12-04 21:26:03 +00:00
Ryan Libby
30be9685a3 mbuf zones: take out the trash
The mbuf zones were explicitly specifying the uma trash procedures on
zcreate, conditionally on INVARIANTS, because that used to be necessary
in order to get use-after-free checking for uma zones with non-empty
constructors or destructors.  After r355137 uma automatically invokes
the trash constructor and destructor as long as no init and fini are
specified.  This now allows the mbuf zones to pass their constructors
and destructors without needing to add on the uma trash procedures
conditionally.

Reviewed by:	cem, jhb, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D22583
2019-12-04 18:21:29 +00:00
John Baldwin
31174518d2 Use uintptr_t instead of register_t * for the stack base.
- Use ustringp for the location of the argv and environment strings
  and allow destp to travel further down the stack for the stackgap
  and auxv regions.
- Update the Linux copyout_strings variants to move destp down the
  stack as was done for the native ABIs in r263349.
- Stop allocating a space for a stack gap in the Linux ABIs.  This
  used to hold translated system call arguments, but hasn't been used
  since r159992.

Reviewed by:	kib
Tested on:	amd64 (amd64, i386, linux64), i386 (i386, linux)
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D22501
2019-12-03 23:17:54 +00:00
Kirk McKusick
d00066a5f9 Currently the breadn_flags() and getblkx() interfaces are passed
the vnode, logical block number, and size of data block that is
being requested. They then use the VOP_BMAP function to calculate
the mapping from logical block number to physical block number from
which to access the data. This change expands the interface to also
pass the physical block number in cases where the VOP_BMAP function
may no longer work, for example when a file is being truncated.

No functional change.

Reviewed by:  kib
Tested by:    Peter Holm
Sponsored by: Netflix
2019-12-03 23:07:09 +00:00
Jeff Roberson
9b78b1f433 Use a precise bit count for the slab free items in UMA. This significantly
shrinks embedded slab structures.

Reviewed by:	markj, rlibby (prior version)
Differential Revision:	https://reviews.freebsd.org/D22584
2019-12-02 22:44:34 +00:00
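The saving from a precise bit count can be illustrated by sizing a bitset for exactly the number of items in a slab instead of a fixed maximum. This is a generic sketch, not UMA code; the helper names are hypothetical and `unsigned long` words are an assumption.

```c
#include <limits.h>
#include <stddef.h>

/* Number of unsigned long words needed to hold nitems bits. */
static size_t
bitset_words(size_t nitems)
{
	const size_t bits_per_word = sizeof(unsigned long) * CHAR_BIT;

	return ((nitems + bits_per_word - 1) / bits_per_word);
}

/* Bytes occupied by a bitset sized for exactly nitems bits. */
static size_t
bitset_bytes(size_t nitems)
{
	return (bitset_words(nitems) * sizeof(unsigned long));
}
```

A slab holding only a handful of items then needs a single word of free-item bits rather than a bitset sized for the largest possible slab.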
Jeff Roberson
4504268a1b Fix the last few cases that grab without busy or valid. The grab functions must
return the page in some held state for consistency elsewhere.

Reviewed by:	alc, kib, markj
Differential Revision:	https://reviews.freebsd.org/D22610
2019-12-02 22:38:25 +00:00
Jeff Roberson
e15046952d Initialize the idle thread's lock sooner so it's not evaluated on every fork
exit and we can rely on it elsewhere.

Reviewed by:	mav, kib, jhb, markj
Differential Revision:	https://reviews.freebsd.org/D22624
2019-12-02 22:35:45 +00:00
Mateusz Guzik
5fe188b1e8 lockmgr: remove more remnants of adaptive spinning
Sponsored by:	The FreeBSD Foundation
2019-12-01 00:35:08 +00:00
Kyle Evans
1b50b999f9 tty: implement TIOCNOTTY
Generally, it's preferred that an application fork/setsid if it doesn't want
to keep its controlling TTY, but it could be that a debugger is trying to
steal it instead -- so it would hook in, drop the controlling TTY, then do
some magic to set things up again. In this case, TIOCNOTTY is quite handy
and still respected by at least OpenBSD, NetBSD, and Linux as far as I can
tell.

I've dropped the note about obsoletion, as I intend to support TIOCNOTTY as
long as it doesn't impose a major burden.

Reviewed by:	bcr (manpages), kib
Differential Revision:	https://reviews.freebsd.org/D22572
2019-11-30 20:10:50 +00:00
Mateusz Guzik
e0a1a1e6cb smp: cast the read in quiesce_all_critical through void *
Fixes compilation on some 32-bit arm platforms.

Sponsored by:	The FreeBSD Foundation
2019-11-30 19:33:02 +00:00
Mateusz Guzik
3ac2ac2e08 lockprof: use IPI-injected fences to fix hangs on stat dump and reset
The previously used quiesce_all_cpus walks all CPUs and waits until curthread
can run on them. Even on contemporary machines this becomes a significant
problem under load when it can literally take minutes for the operation to
complete. With the patch the stall is normally less than 1 second.

Reviewed by:	kib, jeff (previous version)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21740
2019-11-30 17:24:42 +00:00
Mateusz Guzik
5032fe17a2 Add a way to inject fences using IPIs
A variant of this facility was already used by rmlocks where IPIs would
enforce ordering.

This allows to elide fences where they are rarely needed and the cost of
IPI (should it be necessary) is cheaper.

Reviewed by:	kib, jeff (previous version)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21740
2019-11-30 17:22:10 +00:00
Mateusz Guzik
a02cab334c devfs: introduce a per-dev lock to protect ->si_devsw
This allows bumping threadcount without taking the global devmtx lock.

In particular this eliminates contention on said lock while using bhyve
with multiple vms.

Reviewed by:	kib
Tested by:	markj
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22548
2019-11-30 16:46:19 +00:00
Kyle Evans
9e387c3da2 tty_rel_gone: add locking assertion
We already assert the lock is held later during tty_rel_free(), but it is
arguably good form to clarify locking expectations here as well at the
top-level that other drivers use.
2019-11-29 14:46:13 +00:00
Konstantin Belousov
fdc6b10d44 Add a VN_OPEN_INVFS flag.
vn_open_cred() assumes that it is called from the top-level of a VFS
syscall.  Writers must call bwillwrite() before locking any VFS
resource to wait for cleanup of dirty buffers.

ZFS getextattr() and setextattr() VOPs do call vn_open_cred(), which
results in wait for unrelated buffers while owning ZFS vnode lock (and
ZFS does not use buffer cache).  VN_OPEN_INVFS allows caller to skip
bwillwrite.

Note that ZFS is still incorrect there, because it starts write on an
mp and locks a vnode while holding another vnode lock.

Reported by:	Willem Jan Withagen <wjw@digiware.nl>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-11-29 14:02:32 +00:00
Ryan Libby
815db2f6f8 ktls_session zone: don't need to specify uma trash
The use of the uma trash procedures is automatic, there's no need to
pass them explicitly here.

Reviewed by:	markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D22582
2019-11-29 06:25:03 +00:00
Kyle Evans
cf29433090 tty_pts: don't rely on tty header pollution for sys/mutex.h
tty_pts.c relies on sys/tty.h for sys/mutex.h. Include it directly instead
of relying on this pollution to ease the diff for anyone that wants to try
converting the tty lock to anything other than a mutex.
2019-11-29 03:56:01 +00:00
Jeff Roberson
6d6a03d7a8 Handle large mallocs by going directly to kmem. Taking a detour through
UMA does not provide any additional value.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D22563
2019-11-29 03:14:10 +00:00
Jeff Roberson
b476ae7f52 Fix DEBUG_REDZONE build after r355169
2019-11-28 08:56:14 +00:00
Hans Petter Selasky
c2a8682ae8 Factor out check for mounted root file system.
Differential Revision:	https://reviews.freebsd.org/D22571
PR:		241639
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2019-11-28 08:47:36 +00:00
Jeff Roberson
584061b480 Garbage collect the mostly unused us_keg field. Use appropriately named
union members in vm_page.h to store the zone and slab.  Remove some nearby
dead code.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D22564
2019-11-28 07:49:25 +00:00
Konstantin Belousov
ef401a8558 Requested and tested by: kevans
Reviewed by:	kevans (previous version), markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D22546
2019-11-27 20:33:53 +00:00
Ryan Libby
59fb4a95c7 witness: sleepable rm locks are not sleepable in read mode
There are two classes of rm lock, one "sleepable" and one not.  But even
a "sleepable" rm lock is only sleepable in write mode, and is
non-sleepable when taken in read mode.

Warn about sleepable rm locks in read mode as non-sleepable locks.  Do
this by defining a new lock operation flag, LOP_NOSLEEP, to indicate
that a lock is non-sleepable despite what the LO_SLEEPABLE flag would
indicate, and defining a new witness lock instance flag, LI_SLEEPABLE,
to track the product of LO_SLEEPABLE and LOP_NOSLEEP on the lock
instance.

Reviewed by:	markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D22527
2019-11-27 01:54:39 +00:00
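The flag combination described in the witness commit above can be sketched as follows. The flag names come from the commit message, but the numeric values and the helper `witness_instance_flags` are hypothetical and chosen only for illustration.

```c
/* Hypothetical bit values; the real definitions live in kernel headers. */
#define	LO_SLEEPABLE	0x1u	/* lock class allows sleeping */
#define	LOP_NOSLEEP	0x2u	/* this operation must not sleep */
#define	LI_SLEEPABLE	0x4u	/* this lock instance allows sleeping */

/*
 * A lock instance is sleepable only if the lock is sleepable and the
 * operation did not ask for non-sleepable treatment, as with an rm lock
 * taken in read mode.
 */
static unsigned int
witness_instance_flags(unsigned int lo_flags, unsigned int lop_flags)
{
	unsigned int li_flags = 0;

	if ((lo_flags & LO_SLEEPABLE) != 0 && (lop_flags & LOP_NOSLEEP) == 0)
		li_flags |= LI_SLEEPABLE;
	return (li_flags);
}
```

A "sleepable" rm lock acquired for reading would pass LOP_NOSLEEP and so be tracked as a non-sleepable instance.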
Mateusz Guzik
588e69e2fd cache: stop reusing .. entries on enter
It almost never happens in practice anyway. With this eliminated ->nc_vp
cannot change vnodes, removing an obstacle on the road to lockless
lookup.
2019-11-27 01:21:42 +00:00
Mateusz Guzik
2ac930e32c cache: fix numcache accounting on entry
. entries are never created and .. can reuse existing entries,
meaning the early count bump is both spurious and leading to
overcounting in certain cases.
2019-11-27 01:20:55 +00:00
Mateusz Guzik
36afce39ae cache: hide "doingcache" behind DEBUG_CACHE
2019-11-27 01:20:21 +00:00
Hans Petter Selasky
aa4612d133 Fix panic when loading kernel modules before root file system is mounted.
Make sure the rootvnode is always NULL checked.

Differential Revision:	https://reviews.freebsd.org/D22545
PR:		241639
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2019-11-26 12:20:44 +00:00
Mariusz Zaborski
8e49361164 procdesc: allow to collect status through wait(2) if process is traced
A debugger like truss(1) depends on the wait(2) syscall. This syscall
waits for ALL children. When it is waiting for ALL children, the children
created by process descriptors are not returned. This behavior was
introduced because we want to implement libraries which may pdfork(2).

The behavior of process descriptor breaks truss(1) because it will
not be able to collect the status of processes with process descriptors.

To address this problem the status is returned to parent when the
child is traced. While the process is traced the debugger is the new parent.
In case the original parent and debugger are the same process it means the
debugger explicitly used pdfork() to create the child. In that case the debugger
should be using kqueue()/pdwait() instead of wait().

Add test case to verify that. The test case was implemented by markj@.

Reviewed by:	kib, markj
Discussed with:	jhb
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D20362
2019-11-25 18:33:21 +00:00
Ryan Libby
43cefe8b19 sysctl sysctls: wire old buf before output with sysctl lock
Several sysctl sysctls output to a user buffer while holding a
non-sleepable lock that protects the sysctl topology.  They need to wire
the output buffer, or else they may try to sleep on a page fault.

Reviewed by:	cem, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D22528
2019-11-25 07:38:27 +00:00
Konstantin Belousov
b631c36f0d Record part of the owner struct thread pointer into busy_lock.
Record as much bits from curthread into busy_lock as fits.  Low bits
for struct thread * representation are zero due to struct and zone
alignment, and they leave space for busy flags (perhaps except
statically allocated thread0).  Upper bits are not very interesting
for assert, and in most practical situations recorded value should
allow to manually identify the owner with certainty.

Assert that unbusy is performed by the owner, except few places where
unbusy is done in io completion handler.  For this case, add
_unchecked variants of asserts and unbusy primitives.

Reviewed by:	markj (previous version)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D22298
2019-11-24 19:12:23 +00:00
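Packing part of an aligned owner pointer next to a few flag bits, as the busy_lock commit above describes, can be sketched like this. The 32-bit lock word, the number of flag bits, and all names here are assumptions for illustration, not the vm_page definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: the low 4 bits hold flags, the rest hold owner bits. */
#define	BUSY_FLAG_BITS	4
#define	BUSY_FLAG_MASK	((1u << BUSY_FLAG_BITS) - 1u)

/*
 * Store as many owner pointer bits as fit in a 32-bit word.  The low bits
 * of an aligned pointer are zero, so they are free to carry flags; upper
 * bits that do not fit are dropped.
 */
static uint32_t
busy_encode(uintptr_t owner, uint32_t flags)
{
	return (((uint32_t)owner & ~BUSY_FLAG_MASK) | (flags & BUSY_FLAG_MASK));
}

/* Check whether the recorded owner bits are consistent with thread td. */
static bool
busy_owner_matches(uint32_t busy, uintptr_t td)
{
	return ((busy & ~BUSY_FLAG_MASK) == ((uint32_t)td & ~BUSY_FLAG_MASK));
}
```

Because the upper pointer bits are discarded, a match is strong evidence of ownership rather than proof, which is why the commit message speaks of identifying the owner "in most practical situations".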
Warner Losh
a921c2003f Add a warning about Giant Locked devices
Add a warning when a device registers with devfs and requests
D_NEEDGIANT. The warning says the device will go away before
13.0. This is needed to flush out the devices in the tree that are
still Giant locked. This warning, or some variant of it, should have
gone into the tree a long time ago...

The intention is to require all devices be converted to not use
automatic giant in this way, or remove any such devices that remain
that we don't have the hardware to test a conversion of.

kbd so far is the only device that can't leave the tree, yet needs
something sensible done to avoid the auto giant lock (even if it is
just doing the wrapping itself). There may be others added to this
list... Any discussions of this topic will take place on arch@.
2019-11-23 23:57:26 +00:00
Conrad Meyer
7993a104a1 Add explicit SI_SUB_EPOCH
Add explicit SI_SUB_EPOCH, after SI_SUB_TASKQ and before SI_SUB_SMP
(EARLY_AP_STARTUP).  Rename existing "SI_SUB_TASKQ + 1" to SI_SUB_EPOCH.

epoch(9) consumers cannot epoch_alloc() before SI_SUB_EPOCH:SI_ORDER_SECOND,
but likely should allocate before SI_SUB_SMP.  Prior to this change,
consumers (well, epoch itself, and net/if.c) just open-coded the
SI_SUB_TASKQ + 1 order to match epoch.c, but this was fragile.

Reviewed by:	mmacy
Differential Revision:	https://reviews.freebsd.org/D22503
2019-11-22 23:23:40 +00:00
Gleb Smirnoff
329377f44b cc_ktr_event_name is used only with KTR
2019-11-21 23:55:43 +00:00
Alexander Motin
130fffa2a3 Add variant of root_mount_hold() without allocation.
It allows to use this KPI in non-sleepable contexts.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2019-11-21 21:59:35 +00:00
Andrew Turner
a27ac4644a Disable KCSAN within a panic.
The kernel is single threaded at this point and the panic is more
important.

Sponsored by:	DARPA, AFRL
2019-11-21 13:59:01 +00:00
Andrew Turner
68cad68149 Add kcsan_md_unsupported from NetBSD.
It's used to ignore virtual addresses that may have a different physical
address depending on the CPU.

Sponsored by:	DARPA, AFRL
2019-11-21 13:22:23 +00:00
Andrew Turner
bba0065f0d Fix the bus_space functions with KCSAN on arm64.
Arm64 doesn't define the bus_space_set_multi_stream and
bus_space_set_region_stream functions. Don't try to define them there.

Sponsored by:	DARPA, AFRL
2019-11-21 13:12:58 +00:00
Andrew Turner
849aef496d Port the NetBSD KCSAN runtime to FreeBSD.
Update the NetBSD Kernel Concurrency Sanitizer (KCSAN) runtime to work in
the FreeBSD kernel. It is a useful tool for finding data races between
threads executing on different CPUs.

This can be enabled by enabling KCSAN in the kernel config, or by using the
GENERIC-KCSAN amd64 kernel. It works on amd64 and arm64, however the latter
needs a compiler change to allow -fsanitize=thread that KCSAN uses.

Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D22315
2019-11-21 11:22:08 +00:00
Andrew Turner
0cb5357037 Import the NetBSD Kernel Concurrency Sanitizer (KCSAN) runtime.
KCSAN is a tool to find concurrent memory access that may race each other.
After a determined number of memory accesses a cell is created, this
describes the current access. It will then delay for a short period
to allow other CPUs a chance to race. If another CPU performs a memory
access to an overlapping region during this delay the race is reported.

This is a straight import of the NetBSD code, it will be adapted to
FreeBSD in a future commit.

Sponsored by:	DARPA, AFRL
2019-11-20 14:37:48 +00:00
Mateusz Guzik
d578a4256e cache: minor stat cleanup
Remove duplicated stats and move numcachehv from debug to vfs.cache.
2019-11-20 12:08:32 +00:00
Mateusz Guzik
d957f3a4f0 vfs: perform a more racy check in vfs_notify_upper
Locking mp does not buy anything in terms of correctness and only contributes to
contention.
2019-11-20 12:07:54 +00:00
Mateusz Guzik
1fccb43c39 vfs: change si_usecount management to count used vnodes
Currently si_usecount is effectively a sum of usecounts from all associated
vnodes. This is maintained by special-casing for VCHR every time usecount is
modified. Apart from complicating the code a little bit, it has a scalability
impact since it forces a read from a cacheline shared with said count.

There are no consumers of the feature in the ports tree. In head there are only
2: revoke and devfs_close. Both can get away with a weaker requirement than the
exact usecount, namely just the count of active vnodes. Changing the meaning to
the latter means we only need to modify it on 0<->1 transitions, avoiding the
check plenty of times (and entirely in something like vrefact).

Reviewed by:	kib, jeff
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D22202
2019-11-20 12:05:59 +00:00
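The change in meaning for si_usecount, from a sum of use counts to a count of vnodes in use, can be sketched as below. This is a single-threaded illustration with hypothetical `_sketch` types and helpers; the kernel code uses atomics and locking that are omitted here.

```c
/* Hypothetical stand-ins for the device and vnode structures. */
struct dev_sketch {
	int	si_usecount;	/* number of associated vnodes in use */
};

struct vnode_sketch {
	int			v_usecount;	/* references to this vnode */
	struct dev_sketch	*v_dev;		/* backing device */
};

/* Take a reference; touch the device count only on the 0 -> 1 transition. */
static void
vref_sketch(struct vnode_sketch *vp)
{
	if (vp->v_usecount++ == 0)
		vp->v_dev->si_usecount++;
}

/* Drop a reference; touch the device count only on the 1 -> 0 transition. */
static void
vrele_sketch(struct vnode_sketch *vp)
{
	if (--vp->v_usecount == 0)
		vp->v_dev->si_usecount--;
}
```

Repeated references to an already active vnode never write to the shared device counter, which is the scalability gain the commit message points to.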
Jeff Roberson
639676877b Simplify anonymous memory handling with an OBJ_ANON flag. This eliminates
redundant complicated checks and additional locking required only for
anonymous memory.  Introduce vm_object_allocate_anon() to create these
objects.  DEFAULT and SWAP objects now have the correct settings for
non-anonymous consumers and so individual consumers need not modify the
default flags to create super-pages and avoid ONEMAPPING/NOSPLIT.

Reviewed by:	alc, dougm, kib, markj
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D22119
2019-11-19 23:19:43 +00:00
Kyle Evans
4cc12fb848 sysent: regenerate after r354835
The lua-based makesyscalls produces slightly different output than its
makesyscalls.sh predecessor, all whitespace differences more closely
matching the source syscalls.master.
2019-11-18 23:31:12 +00:00
Kyle Evans
f22a592111 Convert in-tree sysent targets to use new makesyscalls.lua
flua is bootstrapped as part of the build for those on older
versions/revisions that don't yet have flua installed. Once upgraded past
r354833, "make sysent" will again naturally work as expected.

Reviewed by:	brooks
Differential Revision:	https://reviews.freebsd.org/D21894
2019-11-18 23:28:23 +00:00
John Baldwin
03b0d68c72 Check for errors from copyout() and suword*() in sv_copyout_args/strings.
Reviewed by:	brooks, kib
Tested on:	amd64 (amd64, i386, linux64), i386 (i386, linux)
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D22401
2019-11-18 20:07:43 +00:00
David Bright
2d5603fe65 Jail and capability mode for shm_rename; add audit support for shm_rename
Co-mingling two things here:

  * Addressing some feedback from Konstantin and Kyle re: jail,
    capability mode, and a few other things
  * Adding audit support as promised.

The audit support change includes a partial refresh of OpenBSM from
upstream, where the change to add shm_rename has already been
accepted. Matthew doesn't plan to work on refreshing anything else to
support audit for those new event types.

Submitted by:	Matthew Bryan <matthew.bryan@isilon.com>
Reviewed by:	kib
Relnotes:	Yes
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D22083
2019-11-18 13:31:16 +00:00
Konstantin Belousov
01a2b5679b kern_exec: p_osrel and p_fctl0 were obliterated by failed execve(2) attempt.
Zeroing of them is needed so that an image activator can update the
values as appropriate (or not set at all).

Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D22379
2019-11-17 14:52:45 +00:00
Scott Long
de890ea465 Create a new sysctl subtree, machdep.mitigations. Its purpose is to organize
knobs and indicators for code that mitigates functional and security issues
in the architecture/platform.  Controls for regular operational policy should
still go into places like security, hw, kern, etc.

The machdep root node is inherently architecture dependent, but mitigations
tend to be architecture dependent as well.  Some cases like Spectre do cross
architectural boundaries, but the mitigation code for them tends to be
architecture dependent anyways, and multiple architectures won't be active
in the same image of the kernel.

Many mitigation knobs already exist in the system, and they will be moved
with compat naming in the future.  Going forward, mitigations should collect
in machdep.mitigations.

Reviewed by:	imp, brooks, rwatson, emaste, jhb
Sponsored by:	Intel
2019-11-15 23:27:17 +00:00
John Baldwin
e353233118 Add a sv_copyout_auxargs() hook in sysentvec.
Change the FreeBSD ELF ABIs to use this new hook to copyout ELF auxv
instead of doing it in the sv_fixup hook.  In particular, this new
hook allows the stack space to be allocated at the same time the auxv
values are copied out to userland.  This allows us to avoid wasting
space for unused auxv entries as well as not having to recalculate
where the auxv vector is by walking back up over the argv and
environment vectors.

Reviewed by:	brooks, emaste
Tested on:	amd64 (amd64 and i386 binaries), i386, mips, mips64
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D22355
2019-11-15 18:42:13 +00:00
Brooks Davis
96c914ee97 Tidy syscall declarations.
Pointer arguments should be of the form "<type> *..." and not "<type>* ...".

No functional change.

Reviewed by:	kevans
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D22373
2019-11-14 17:11:52 +00:00
Mark Johnston
1cbfe73da5 Fix handling of PIPE_EOF in the direct write path.
Suppose a writing thread has pinned its pages and gone to sleep with
pipe_map.cnt > 0.  Suppose that the thread is woken up by a signal (so
error != 0) and the other end of the pipe has simultaneously been
closed.  In this case, to satisfy the assertion about pipe_map.cnt in
pipe_destroy_write_buffer(), we must mark the buffer as empty.

Reported by:	syzbot+5cce271bf2cb1b1e1876@syzkaller.appspotmail.com
Reviewed by:	kib
Tested by:	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22261
2019-11-11 20:44:30 +00:00
Rick Macklem
48e4857859 Update copy_file_range(2) to be Linux5 compatible.
The current linux man page and testing done on a fairly recent linux5.n
kernel have identified two changes to the semantics of the linux
copy_file_range system call.
Since the copy_file_range(2) system call is intended to be linux compatible
and is only currently in head/current and not used by any commands,
it seems appropriate to update the system call to be compatible with
the current linux one.
The first of these semantic changes was changed to be compatible with
linux5.n by r354564.
For the second semantic change, the old linux man page stated that, if
infd and outfd referred to the same file, EBADF should be returned.
Now, the semantics is to allow infd and outfd to refer to the same file
so long as the byte ranges defined by the input file offset, output file offset
and len do not overlap. If the byte ranges do overlap, EINVAL should be
returned.
This patch modifies copy_file_range(2) to be linux5.n compatible for this
semantic change.
2019-11-10 01:08:14 +00:00
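The same-file rule in the commit above can be sketched as a small range check. This is a minimal illustration, not the kernel's code: the helper names are hypothetical, and arithmetic overflow of offset + len is assumed to have been rejected earlier.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: the half-open byte ranges [inoff, inoff + len) and
 * [outoff, outoff + len) overlap when each starts before the other ends.
 * Assumes the caller has already ruled out offset + len overflow.
 */
static bool
ranges_overlap(uint64_t inoff, uint64_t outoff, uint64_t len)
{
	if (len == 0)
		return (false);
	return (inoff < outoff + len && outoff < inoff + len);
}

/* Returns 0 if the copy may proceed, EINVAL if it must be rejected. */
static int
check_same_file_copy(bool same_file, uint64_t inoff, uint64_t outoff,
    uint64_t len)
{
	if (same_file && ranges_overlap(inoff, outoff, len))
		return (EINVAL);
	return (0);
}
```

Under this sketch, copying 100 bytes from offset 0 to offset 100 of the same file is allowed, while copying them to offset 50 is rejected.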
Rick Macklem
15930ae180 Update copy_file_range(2) to be Linux5 compatible.
The current linux man page and testing done on a fairly recent linux5.n
kernel have identified two changes to the semantics of the linux
copy_file_range system call.
Since the copy_file_range(2) system call is intended to be linux compatible
and is only currently in head/current and not used by any commands,
it seems appropriate to update the system call to be compatible with
the current linux one.
The old linux man page stated that, if the
offset + len exceeded file_size for the input file, EINVAL should be returned.
Now, the semantics is to copy up to at most file_size bytes and return that
number of bytes copied. If the offset is at or beyond file_size, a return
of 0 bytes is done.
This patch modifies copy_file_range(2) to be linux compatible for this
semantic change.
A separate patch will change copy_file_range(2) for the other semantic
change, which allows the infd and outfd to refer to the same file, so
long as the byte ranges do not overlap.
2019-11-08 23:39:17 +00:00
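The end-of-file rule in the commit above amounts to clamping the requested length. A minimal sketch follows, assuming unsigned 64-bit offsets; clamp_copy_len() is a hypothetical name, not a function from the tree.

```c
#include <stdint.h>

/*
 * Hypothetical helper: how many bytes a copy starting at inoff may
 * transfer from a file of file_size bytes when len bytes were requested.
 */
static uint64_t
clamp_copy_len(uint64_t inoff, uint64_t len, uint64_t file_size)
{
	if (inoff >= file_size)
		return (0);		/* at or beyond EOF: copy nothing */
	if (len > file_size - inoff)
		return (file_size - inoff);	/* stop at EOF */
	return (len);
}
```

Comparing len against file_size - inoff, rather than inoff + len against file_size, avoids overflow in the sum.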
Gleb Smirnoff
1a49612526 Mechanically convert INP_INFO_RLOCK() to NET_EPOCH_ENTER().
Remove a few outdated comments and extraneous assertions.  No
functional change here.
2019-11-07 00:08:34 +00:00
Gleb Smirnoff
b8c923032f If vm_pager_get_pages_async() returns an error synchronously we leak wired
and busy pages.  Add code that carefully cleans up the state in case
of synchronous error return.  Cover a case when a first I/O went on
asynchronously, but second or N-th returned error synchronously.

In collaboration with:	chs
Reviewed by:		jtl, kib
2019-11-06 23:45:43 +00:00
Bjoern A. Zeeb
28d7601989 m_pulldown(): Change an if () panic() into a KASSERT().
If we pass in a NULL mbuf to m_pulldown() we are in a bad situation
already.  There is no point in doing that check for production code.
Change the if () panic() into a KASSERT.

MFC after:	3 weeks
Sponsored by:	Netflix
2019-11-06 22:40:19 +00:00
Brooks Davis
89f34d4611 libstats: Improve ABI assertion.
On platforms where pointers are larger than 64-bits, struct statsblob
may be harmlessly padded out such that opaque[] always has some included
space.  Make the assertion more general by comparing to the offset of
opaque rather than the size of struct statsblob.

Discussed with:	jhb, James Clarke
Reviewed by:	trasz, lstewart
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D22188
2019-11-06 19:44:44 +00:00
Alexander Motin
3db35ffa2a Some more taskqueue optimizations.
- Optimize enqueue for two task priority values by adding new tq_hint
field, pointing to the last task inserted into the middle of the list.
In case of more than two priority values it should halve the average search.
 - Move tq_active insert/remove out of the taskqueue_run_locked loop.
Instead of dirtying a few shared cache lines per task, introduce a different
mechanism to drain active tasks, based on task sequence number counter,
that uses only cache lines already present in cache.  Since the new
mechanism does not need ordering, switch tq_active from TAILQ to LIST.
 - Move static and dynamic struct taskqueue fields into different cache
lines.  Move lock into its own cache line, so that heavy lock spinning
by multiple waiting threads would not affect the running thread.
 - While there, correct some TQ_SLEEP() wait messages.

This change fixes certain ZFS write workloads, causing huge congestion
on taskqueue lock.  Those workloads combine some large block writes to
saturate the pool and trigger allocation throttling, which uses higher
priority tasks to requeue the delayed I/Os, with many small blocks to
generate deep queue of small tasks for taskqueue to sort.

MFC after:	1 week
Sponsored by:	iXsystems, Inc.
2019-11-01 22:49:44 +00:00
Ed Maste
2e5f9189bb avoid kernel stack data leak in core dump thrmisc note
bzero the entire thrmisc struct, not just the padding.  Other core dump
notes are already done this way.

Reported by:	Ilja Van Sprundel <ivansprundel@ioactive.com>
Reviewed by:	markj
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
2019-10-31 20:42:36 +00:00
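The pattern behind the fix above, zeroing the whole structure before filling it in, can be sketched in userland C. The struct layout here is hypothetical and much simpler than the real thrmisc note.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical note layout; the real thrmisc struct differs. */
struct demo_note {
	char		name[20];
	unsigned int	pad;
};

static void
fill_note(struct demo_note *note, const char *name)
{
	size_t n;

	/* Zero everything first: unused name bytes and padding included. */
	memset(note, 0, sizeof(*note));

	n = strlen(name);
	if (n > sizeof(note->name) - 1)
		n = sizeof(note->name) - 1;
	memcpy(note->name, name, n);	/* tail stays zero from memset */
}
```

Zeroing only the pad member would leave the bytes of name past the terminating NUL holding whatever was on the stack before.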
Jeff Roberson
67d0e29304 Replace OBJ_MIGHTBEDIRTY with a system using atomics. Remove the TMPFS_DIRTY
flag and use the same system.

This enables further fault locking improvements by allowing more faults to
proceed with a shared lock.

Reviewed by:	kib
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D22116
2019-10-29 21:06:34 +00:00
Jeff Roberson
6ee653cfeb Drop the object lock in vfs_bio and cluster where it is now safe to do so.
Recent changes to busy/valid/dirty have enabled page based synchronization
and the object lock is no longer required in many cases.

Reviewed by:	kib
Sponsored by:	Netflix, Intel
Differential Revision:	https://reviews.freebsd.org/D21597
2019-10-29 20:37:59 +00:00
Gleb Smirnoff
5757b59f3e Merge td_epochnest with td_no_sleeping.
Epoch itself doesn't rely on the counter and it is provided
merely for sleeping subsystems to check it.

- In functions that sleep use THREAD_CAN_SLEEP() to assert
  correctness.  With EPOCH_TRACE compiled print epoch info.
- _sleep() was the wrong place to put the assertion for epoch; the
  right place is sleepq_add(), as there are ways to call the
  latter bypassing _sleep().
- Do not increase td_no_sleeping in non-preemptible epochs.
  The critical section would trigger all possible safeguards,
  no sleeping counter is extraneous.

Reviewed by:	kib
2019-10-29 17:28:25 +00:00
Konstantin Belousov
5e921ff49e amd64: move pcb out of kstack to struct thread.
This saves 320 bytes of the precious stack space.

The only negative aspect of the change I can think of is that the
struct thread increased by 320 bytes obviously, and that 320 bytes are
not swapped out anymore. I believe the freed stack space is much more
important than that.  Also, current struct thread size is 1392 bytes
on amd64, so UMA will allocate two thread structures per (4KB) slab,
which leaves a space for pcb without increasing zone memory use.

Reviewed by:	alc, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D22138
2019-10-25 20:09:42 +00:00
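The slab arithmetic in the commit above can be checked with integer division. This sketch assumes that 1392 bytes is the size before the 320-byte pcb is added, and it ignores any per-slab header or alignment overhead that UMA may apply.

```c
#include <stddef.h>

/* Hypothetical helper: whole items that fit in one slab, ignoring overhead. */
static size_t
items_per_slab(size_t slab_size, size_t item_size)
{
	return (slab_size / item_size);
}
```

With these assumptions, items_per_slab(4096, 1392) and items_per_slab(4096, 1392 + 320) both give 2, so the larger structure fits in space the slab was already leaving unused.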
Gleb Smirnoff
ed9d69b5e8 Use THREAD_CAN_SLEEP() macro to check if thread can sleep. There is no
functional change.

Discussed with:	kib
2019-10-24 21:55:19 +00:00
John Baldwin
7d29eb9a91 Use a counter with a random base for explicit IVs in GCM.
This permits constructing the entire TLS header in ktls_frame() rather
than ktls_seq().  This also matches the approach used by OpenSSL which
uses an incrementing nonce as the explicit IV rather than the sequence
number.

Reviewed by:	gallatin
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D22117
2019-10-24 18:13:26 +00:00
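A counter-based explicit IV of the kind described above might look like the following sketch. The struct and function names are hypothetical, the random base is passed in by the caller rather than generated here, and the big-endian encoding is an assumption; the 8-byte size matches the explicit nonce used by AES-GCM in TLS 1.2.

```c
#include <stdint.h>

/* Hypothetical per-session state, not the kernel's TLS session struct. */
struct demo_tls_session {
	uint64_t iv_counter;	/* seeded once with a random value */
};

static void
demo_session_init(struct demo_tls_session *s, uint64_t random_base)
{
	s->iv_counter = random_base;
}

/* Write the next 8-byte explicit IV (big-endian) and advance the counter. */
static void
demo_next_explicit_iv(struct demo_tls_session *s, uint8_t out[8])
{
	uint64_t v = s->iv_counter++;

	for (int i = 7; i >= 0; i--) {
		out[i] = (uint8_t)(v & 0xff);
		v >>= 8;
	}
}
```

Because the IV depends only on per-session state and not on the TCP sequence number, the whole record header can be produced in one place when the record is framed.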
Konstantin Belousov
c92f130498 Fix undefined behavior.
Create a sequence point by ending a full expression for call to
vspace() and use of the globals which are modified by vspace().

Reported and reviewed by:	imp
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D22126
2019-10-23 16:06:47 +00:00
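The shape of the fix above can be illustrated with a hypothetical function that modifies a global. The names advance() and cursor are made up for this sketch and are not the code touched by the commit.

```c
/* Hypothetical global that the helper below modifies. */
static int cursor;

static int
advance(int n)
{
	cursor += n;
	return (cursor);
}

/*
 * Writing `advance(n) + cursor` in one expression leaves the order of the
 * call and the read of cursor up to the compiler.  Ending the full
 * expression after the call forces the read to happen afterwards.
 */
static int
total_after_advance(int n)
{
	int moved;

	moved = advance(n);		/* full expression ends here */
	return (moved + cursor);	/* cursor is read after the update */
}
```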
Konstantin Belousov
8076c4e7d1 vn_printf(): Decode VI_TEXT_REF.
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2019-10-23 15:51:26 +00:00
Gleb Smirnoff
080e9496b8 Allow epoch tracker to use the very last byte of the stack. Not sure
this will help to avoid panic in this function, since it will also use
some stack, but makes code more strict.

Submitted by:	hselasky
2019-10-22 18:05:15 +00:00
Gleb Smirnoff
77d70e515f Assert that any epoch tracker belongs to the thread stack.
Reviewed by:	kib
2019-10-21 23:12:14 +00:00
Gleb Smirnoff
279b9aabe3 Remove epoch tracker from struct thread. It was an ugly crutch to emulate
locking semantics for if_addr_rlock() and if_maddr_rlock().
2019-10-21 18:19:32 +00:00
Andriy Gapon
3ad1ce46d3 debug.kassert.warnings is a statistic, not a tunable
MFC after:	1 week
2019-10-21 12:21:56 +00:00
Mark Johnston
f822c9e287 Apply mapping protections to preloaded kernel modules on amd64.
With an upcoming change the amd64 kernel will map preloaded files RW
instead of RWX, so the kernel linker must adjust protections
appropriately using pmap_change_prot().

Reviewed by:	kib
MFC after:	1 month
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21860
2019-10-18 13:56:45 +00:00
Mark Johnston
1d9eae9fb2 Apply mapping protections to .o kernel modules.
Use the section flags to derive mapping protections.  When multiple
sections overlap within a page, the union of their protections must be
applied.  With r353701 the .text and .rodata sections are padded to
ensure that this does not happen on amd64.

Reviewed by:	kib
MFC after:	1 month
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21896
2019-10-18 13:53:14 +00:00
Conrad Meyer
dda17b3672 Implement NetGDB(4)
NetGDB(4) is a component of a system using a panic-time network stack to
remotely debug crashed FreeBSD kernels over the network, instead of
traditional serial interfaces.

There are three pieces in the complete NetGDB system.

First, a dedicated proxy server must be running to accept connections from
both NetGDB and gdb(1), and pass bidirectional traffic between the two
protocols.

Second, the NetGDB client is activated much like ordinary 'gdb' and
similarly to 'netdump' in ddb(4) after a panic.  Like other debugnet(4)
clients (netdump(4)), the network interface on the route to the proxy server
must be online and support debugnet(4).

Finally, the remote (k)gdb(1) uses 'target remote <proxy>:<port>' (like any
other TCP remote) to connect to the proxy server.

The NetGDB v1 protocol speaks the literal GDB remote serial protocol, and
uses a 1:1 relationship between GDB packets and sequences of debugnet
packets (fragmented by MTU).  There is no encryption utilized to keep
debugging sessions private, so this is only appropriate for local
segments or trusted networks.

Submitted by:	John Reimer <john.reimer AT emc.com> (earlier version)
Discussed some with:	emaste, markj
Relnotes:	sure
Differential Revision:	https://reviews.freebsd.org/D21568
2019-10-17 21:33:01 +00:00
Mark Johnston
092bacb2c4 Clean up some nits in link_elf_(un)load_file().
- Remove a redundant assignment of ef->address.
- Don't return a Mach error number to the caller if vm_map_find() fails.
- Use ptoa() and fix style.

MFC after:	2 weeks
Sponsored by:	Netflix
2019-10-17 21:25:50 +00:00
Conrad Meyer
addccb8c51 Add a very limited DDB dumpon(8)-alike to MI dumper code
This allows ddb(4) commands to construct a static dumperinfo during
panic/debug and invoke doadump(false) using the provided dumper
configuration (always inserted first in the list).

The intended usecase is a ddb(4)-time netdump(4) command.

Reviewed by:	markj (earlier version)
Differential Revision:	https://reviews.freebsd.org/D21448
2019-10-17 18:29:44 +00:00
Conrad Meyer
7790c8c199 Split out a more generic debugnet(4) from netdump(4)
Debugnet is a simplistic and specialized panic- or debug-time reliable
datagram transport.  It can drive a single connection at a time and is
currently unidirectional (debug/panic machine transmit to remote server
only).

It is mostly a verbatim code lift from netdump(4).  Netdump(4) remains
the only consumer (until the rest of this patch series lands).

The INET-specific logic has been extracted somewhat more thoroughly than
previously in netdump(4), into debugnet_inet.c.  UDP-layer logic and up, as
much as possible as is protocol-independent, remains in debugnet.c.  The
separation is not perfect and future improvement is welcome.  Supporting
INET6 is a long-term goal.

Much of the diff is "gratuitous" renaming from 'netdump_' or 'nd_' to
'debugnet_' or 'dn_' -- sorry.  I thought keeping the netdump name on the
generic module would be more confusing than the refactoring.

The only functional change here is the mbuf allocation / tracking.  Instead
of initiating solely on netdump-configured interface(s) at dumpon(8)
configuration time, we watch for any debugnet-enabled NIC for link
activation and query it for mbuf parameters at that time.  If they exceed
the existing high-water mark allocation, we re-allocate and track the new
high-water mark.  Otherwise, we leave the pre-panic mbuf allocation alone.
In a future patch in this series, this will allow initiating netdump from
panic ddb(4) without pre-panic configuration.

No other functional change intended.

Reviewed by:	markj (earlier version)
Some discussion with:	emaste, jhb
Objection from:	marius
Differential Revision:	https://reviews.freebsd.org/D21421
2019-10-17 16:23:03 +00:00
Andriy Gapon
5fdc2c044e provide a way to assign taskqueue threads to a kernel process
This can be used to group all threads belonging to a single logical
entity under a common kernel process.
I am planning to use the new interface for ZFS threads.

MFC after:	4 weeks
2019-10-17 06:32:34 +00:00
Mark Johnston
6d775f0ba1 Use KOBJMETHOD_END in the kernel linker.
MFC after:	1 week
2019-10-16 22:06:19 +00:00
Mark Johnston
01cef4caa7 Remove page locking from pmap_mincore().
After r352110 the page lock no longer protects a page's identity, so
there is no purpose in locking the page in pmap_mincore().  Instead,
if vm.mincore_mapped is set to the non-default value of 0, re-lookup
the page after acquiring its object lock, which holds the page's
identity stable.

The change removes the last callers of vm_page_pa_tryrelock(), so
remove it.

Reviewed by:	kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21823
2019-10-16 22:03:27 +00:00
Andrew Turner
9bb37c03fb Stop leaking information from the kernel through timespec
The timespec struct holds a seconds value in a time_t and a nanoseconds
value in a long. On most architectures these are the same size, however
on 32-bit architectures other than i386 time_t is 8 bytes and long is
4 bytes.

Most ABIs will then pad a struct holding an 8 byte and 4 byte value to
16 bytes with 4 bytes of padding. When copying one of these structs the
compiler is free to copy the padding if it wishes.

In this case the padding may contain kernel data that is then leaked to
userspace. Fix this by copying the timespec elements rather than the
entire struct.

This doesn't affect Tier-1 architectures so no SA is expected.

admbugs:	651
MFC after:	1 week
Sponsored by:	DARPA, AFRL
2019-10-16 13:21:01 +00:00
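The field-by-field copy described above can be sketched as follows. The helper name is hypothetical, and zeroing the destination first is an assumption of this sketch; the commit only states that the elements are copied individually.

```c
#include <string.h>
#include <time.h>

/*
 * Hypothetical helper: copy a timespec without carrying over whatever
 * happens to sit in the source's padding bytes.
 */
static void
copy_timespec_fields(struct timespec *dst, const struct timespec *src)
{
	memset(dst, 0, sizeof(*dst));	/* padding, if any, becomes zero */
	dst->tv_sec = src->tv_sec;
	dst->tv_nsec = src->tv_nsec;
}
```

A plain struct assignment (*dst = *src) is allowed to copy padding along with the members, which is how stale kernel bytes could reach userspace.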
Kristof Provost
1d95443818 Generalize ARM specific comments in devmap
The comments in devmap are very ARM specific, this generalizes them for other
architectures.

Submitted by:	Nicholas O'Brien <nickisobrien_gmail.com>
Reviewed by:	manu, philip
Sponsored by:	Axiado
Differential Revision:	https://reviews.freebsd.org/D22035
2019-10-15 23:21:52 +00:00
Gleb Smirnoff
4b25d1f2e3 Missing from r353596.
2019-10-15 21:32:38 +00:00
Gleb Smirnoff
bac060388f When assertion for a thread not being in an epoch fails also print all
entered epochs. Works with EPOCH_TRACE only.

Reviewed by:	hselasky
Differential Revision:	https://reviews.freebsd.org/D22017
2019-10-15 21:24:25 +00:00
Gleb Smirnoff
237c1f932b Remove pfctlinput2(). It came from KAME and had never ever been in use.
2019-10-15 15:40:03 +00:00
Jeff Roberson
0012f373e4 (4/6) Protect page valid with the busy lock.
Atomics are used for page busy and valid state when the shared busy is
held.  The details of the locking protocol and valid and dirty
synchronization are in the updated vm_page.h comments.

Reviewed by:    kib, markj
Tested by:      pho
Sponsored by:   Netflix, Intel
Differential Revision:        https://reviews.freebsd.org/D21594
2019-10-15 03:45:41 +00:00
Jeff Roberson
63e9755548 (1/6) Replace busy checks with acquires where it is trivial to do so.
This is the first in a series of patches that promotes the page busy field
to a first class lock that no longer requires the object lock for
consistency.

Reviewed by:	kib, markj
Tested by:	pho
Sponsored by:	Netflix, Intel
Differential Revision:	https://reviews.freebsd.org/D21548
2019-10-15 03:35:11 +00:00
Leandro Lupori
0ecc478b74 [PPC64] Initial kernel minidump implementation
Based on POWER9BSD implementation, with all POWER9 specific code removed and
addition of new methods in PPC64 MMU interface, to isolate platform specific
code. Currently, the new methods are implemented on pseries and PowerNV
(D21643).

Reviewed by:	jhibbits
Differential Revision:	https://reviews.freebsd.org/D21551
2019-10-14 13:04:04 +00:00
Gleb Smirnoff
f6eccf96a0 Since EPOCH_TRACE had been moved to opt_global.h, we don't need to waste
extra space in struct thread.
2019-10-14 04:17:56 +00:00
Mateusz Guzik
d1cbf3eeea vfs: add MNTK_NOMSYNC
On many filesystems the traversal is effectively a no-op. Add a way to avoid
the overhead.

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22009
2019-10-13 15:40:34 +00:00
Mateusz Guzik
737241cd51 vfs: return free vnode batches in sync instead of vfs_msync
It is a more natural fit. vfs_msync only deals with active vnodes.

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22008
2019-10-13 15:39:11 +00:00
Alexander Motin
a89a562b60 Allocate device softc from the device domain.
Since we are trying to bind device interrupt threads to the device domain,
it should make sense to make memory often accessed by them local. If the domain
is not known, fall back to round-robin.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2019-10-12 19:03:07 +00:00
Kristof Provost
85d1151f96 mountroot: run statfs after mounting devfs
The usual flow for mounting a file system is to VFS_MOUNT() and then
immediately VFS_STATFS().

That's not done in vfs_mountroot_devfs(), which means the
mp->mnt_stat.f_iosize field is not correctly populated, which in turn
causes us to mark valid aio operations as unsafe (because the io size is
set to 0), ultimately causing the aio_test:md_waitcomplete test to fail.

Reviewed by:	mckusick
MFC after:	1 week
Sponsored by:	Axiado
Differential Revision:	https://reviews.freebsd.org/D21897
2019-10-11 17:04:38 +00:00
Conrad Meyer
46d70077be ddb: Add CSV option, sorting to 'show (malloc|uma)'
Add /i option for machine-parseable CSV output.  This allows ready copy/
pasting into more sophisticated tooling outside of DDB.

Add total zone size ("Memory Use") as a new column for UMA.

For both, sort the displayed list on size (print the largest zones/types
first).  This is handy for quickly diagnosing "where has my memory gone?" at
a high level.

Submitted by:	Emily Pettigrew <Emily.Pettigrew AT isilon.com> (earlier version)
Sponsored by:	Dell EMC Isilon
2019-10-11 01:31:31 +00:00
John Baldwin
97ecf6efa0 Don't free the cursor boundary tag during vmem_destroy().
The cursor boundary tag is statically allocated in the vmem instead of
from the vmem_bt_zone.  Explicitly remove it from the vmem's segment
list in vmem_destroy before freeing all the segments from the vmem.

Reviewed by:	markj
MFC after:	1 week
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D21953
2019-10-09 21:20:39 +00:00
Gleb Smirnoff
975b8f8462 Cleanup unneeded includes that crept in with r353292.
2019-10-09 16:59:42 +00:00
Gleb Smirnoff
ff3cfc330e Enter network epoch in domain callouts.
2019-10-09 16:21:05 +00:00
Mark Johnston
4013d72684 Fix handling of empty SCM_RIGHTS messages.
As unp_internalize() processes the input control messages, it builds
an output mbuf chain containing the internalized representations of
those messages.  In one special case, that of an empty SCM_RIGHTS
message, the message is simply discarded.  However, the loop which
appends mbufs to the output chain assumed that each iteration would
produce an mbuf, resulting in a null pointer dereference if an empty
SCM_RIGHTS message was followed by a non-empty message.

Fix this by advancing the output mbuf chain tail pointer only if an
internalized control message was produced.

Reported by:	syzbot+1b5cced0f7fad26ae382@syzkaller.appspotmail.com
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2019-10-08 23:34:48 +00:00
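The tail-pointer rule in the commit above can be shown with a plain linked list. This is a sketch with hypothetical types, where a zero payload_len stands in for an empty SCM_RIGHTS message; it does not use mbufs.

```c
#include <stddef.h>

/* Hypothetical stand-in for a control message / output mbuf. */
struct demo_msg {
	int		 payload_len;
	struct demo_msg	*next;
};

/*
 * Build an output chain from in[0..n).  Entries with payload_len == 0
 * produce no output; the tail pointer only moves when one was produced.
 */
static struct demo_msg *
build_chain(struct demo_msg *in, size_t n)
{
	struct demo_msg *head = NULL, **tailp = &head;

	for (size_t i = 0; i < n; i++) {
		struct demo_msg *out = in[i].payload_len > 0 ? &in[i] : NULL;

		if (out == NULL)
			continue;	/* nothing produced: leave the tail alone */
		out->next = NULL;
		*tailp = out;
		tailp = &out->next;
	}
	return (head);
}
```

Advancing tailp unconditionally would dereference a NULL out on the iteration after an empty message, which matches the crash the commit describes.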
John Baldwin
9e14430d46 Add a TOE KTLS mode and a TOE hook for allocating TLS sessions.
This adds the glue to allocate TLS sessions and invokes it from
the TLS enable socket option handler.  This also adds some counters
for active TOE sessions.

The TOE KTLS mode is returned by getsockopt(TLSTX_TLS_MODE) when
TOE KTLS is in use on a socket, but cannot be set via setsockopt().

To simplify various checks, a TLS session now includes an explicit
'mode' member set to the value returned by TLSTX_TLS_MODE.  Various
places that used to check 'sw_encrypt' against NULL to determine
software vs ifnet (NIC) TLS now check 'mode' instead.

Reviewed by:	np, gallatin
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D21891
2019-10-08 21:34:06 +00:00
Doug Moore
2288078c5e Define macro VM_MAP_ENTRY_FOREACH for enumerating the entries in a vm_map.
In case the implementation ever changes from using a chain of next pointers,
then changing the macro definition will be necessary, but changing all the
files that iterate over vm_map entries will not.

Drop a counter in vm_object.c that would have an effect only if the
vm_map entry count was wrong.

Discussed with: alc
Reviewed by: markj
Tested by: pho (earlier version)
Differential Revision:	https://reviews.freebsd.org/D21882
2019-10-08 07:14:21 +00:00
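An iteration macro of the kind introduced above can be sketched over a toy list. The DEMO_ names and the circular list with a sentinel header are assumptions of this sketch and are not copied from vm_map.h.

```c
#include <stddef.h>

/* Hypothetical toy map: a circular singly linked list with a sentinel. */
struct demo_entry {
	int			 start;
	struct demo_entry	*next;
};

struct demo_map {
	struct demo_entry header;
};

/* Callers iterate through the macro and never touch ->next directly. */
#define	DEMO_MAP_ENTRY_FOREACH(it, map)			\
	for ((it) = (map)->header.next;			\
	    (it) != &(map)->header;			\
	    (it) = (it)->next)

static int
count_entries(struct demo_map *map)
{
	struct demo_entry *it;
	int n = 0;

	DEMO_MAP_ENTRY_FOREACH(it, map)
		n++;
	return (n);
}
```

If the list were later replaced by another structure, only the macro body would need to change; count_entries() and similar callers would stay as they are.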
Gleb Smirnoff
b8a6e03fac Widen NET_EPOCH coverage.
When epoch(9) was introduced to network stack, it was basically
dropped in place of existing locking, which was mutexes and
rwlocks. For the sake of performance mutex covered areas were
as small as possible, so became epoch covered areas.

However, epoch doesn't introduce any contention, it just delays
memory reclaim. So, there is no point to minimise epoch covered
areas in sense of performance. Meanwhile entering/exiting epoch
also has non-zero CPU usage, so doing this less often is a win.

Not the least is also code maintainability. In the new paradigm
we can assume that at any stage of processing a packet, we are
inside network epoch. This makes coding both input and output
path way easier.

On output path we already enter epoch quite early - in the
ip_output(), in the ip6_output().

This patch does the same for the input path. All ISR processing,
network related callouts, other ways of packet injection to the
network stack shall be performed in net_epoch. Any leaf function
that walks network configuration now asserts epoch.

Tricky part is configuration code paths - ioctls, sysctls. They
also call into leaf functions, so some need to be changed.

This patch would introduce more epoch recursions (see EPOCH_TRACE)
than we had before. They will be cleaned up separately, as several
of them aren't trivial. Note, that unlike a lock recursion the
epoch recursion is safe and just wastes a bit of resources.

Reviewed by:	gallatin, hselasky, cy, adrian, kristof
Differential Revision:	https://reviews.freebsd.org/D19111
2019-10-07 22:40:05 +00:00
Edward Tomasz Napierala
1a13f2e6b4 Introduce stats(3), a flexible statistics gathering API.
This provides a framework to define a template describing
a set of "variables of interest" and the intended way for
the framework to maintain them (for example the maximum, sum,
t-digest, or a combination thereof).  Afterwards the user
code feeds in the raw data, and the framework maintains
these variables inside a user-provided, opaque stats blobs.
The framework also provides a way to selectively extract the
stats from the blobs.  The stats(3) framework can be used in
both userspace and the kernel.

See the stats(3) manual page for details.

This will be used by the upcoming TCP statistics gathering code,
https://reviews.freebsd.org/D20655.

The stats(3) framework is disabled by default for now, except
in the NOTES kernel (for QA); it is expected to be enabled
in amd64 GENERIC after a cool down period.

Reviewed by:	sef (earlier version)
Obtained from:	Netflix
Relnotes:	yes
Sponsored by:	Klara Inc, Netflix
Differential Revision:	https://reviews.freebsd.org/D20477
2019-10-07 19:05:05 +00:00
Mateusz Guzik
dc20b834ca vfs: add optional root vnode caching
Root vnodes are looked up all the time, e.g. when crossing a mount point.
Currently used routines always perform a costly lookup which can be
trivially avoided.

Reviewed by:	jeff (previous version), kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21646
2019-10-06 22:14:32 +00:00
Kyle Evans
e3f35d562f Remove the remnants of SI_CHEAPCLONE
SI_CHEAPCLONE was introduced in r66067 for use with cloned bpfs. It was
later also used in tty, tun, tap at points. The rough timeline for being
removed in each of these is as follows:

- r181690: bpf switched to use cdevpriv API by ed@
- r181905: ed@ rewrote the TTY layer to be mpsafe
- r204464: kib@ removes it from tun/tap, declaring it unused

I've not yet been able to dig up any other consumers in the intervening 9
years. It is no longer set on any devices in the tree and leaves an
interesting situation in make_dev_sv where we're ok with the device already
being set SI_NAMED.
2019-10-05 21:52:06 +00:00
Kyle Evans
d42fecb5c1 kern_conf: fully initialize cloned devices with make_dev_args, too
Attempting to initialize si_drv{1,2} with mda_si_drv{1,2} does not work if
you are operating on cloned devices.

clone_create must be called prior to the make_dev* family to create/return
the device on the clonelist as needed. This device is later returned early
in newdev(), prior to si_drv{0,1,2} initialization.

This patch simply breaks out of the loop if we've found a device and
finishes init.

Reviewed by:	kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D21904
2019-10-05 21:44:18 +00:00
Mateusz Guzik
dfa8dae493 devfs: plug redundant bwillwrite avoidance
vn_write already checks for vnode type to see if bwillwrite should be called.

This effectively reverts r244643.

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21905
2019-10-05 17:44:33 +00:00
Eric van Gyzen
e61e783b83 Add CTLFLAG_STATS to some vfs sysctl OIDs
Add CTLFLAG_STATS to the following OIDs:

vfs.altbufferflushes
vfs.recursiveflushes
vfs.barrierwrites
vfs.flushwithdeps
vfs.reassignbufcalls

Refer to r353111.

MFC after:	2 weeks
Sponsored by:	Dell EMC Isilon
2019-10-04 21:43:43 +00:00
Ed Maste
f91dd6091b simplify path handling in sysctl_try_reclaim_vnode
MAXPATHLEN / PATH_MAX includes space for the terminating NUL, and namei
verifies the presence of the NUL.  Thus there is no need to increase the
buffer size here.

The sysctl passes the string excluding the NUL, so req->newlen equal to
PATH_MAX is too long.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21876
2019-10-02 21:01:23 +00:00
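The length rule described above can be sketched as follows. The helper is hypothetical, and returning ENAMETOOLONG is an assumption of this sketch; the commit does not say which error the sysctl handler returns.

```c
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

#ifndef PATH_MAX
#define	PATH_MAX 1024	/* fallback for systems that do not define it */
#endif

/*
 * Hypothetical helper: src holds len bytes with no terminating NUL.
 * PATH_MAX already counts the NUL, so len must be strictly smaller.
 */
static int
copy_path_arg(char buf[PATH_MAX], const char *src, size_t len)
{
	if (len >= PATH_MAX)
		return (ENAMETOOLONG);
	memcpy(buf, src, len);
	buf[len] = '\0';
	return (0);
}
```

A buffer of PATH_MAX bytes is enough: the longest accepted input is PATH_MAX - 1 bytes plus the NUL added here.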
Mark Johnston
5131cba6d6 Use OBJT_PHYS VM objects for kernel modules.
OBJT_DEFAULT incurs some unnecessary overhead given that kernel module
pages cannot be paged out.

Reviewed by:	alc, kib
MFC after:	1 week
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21862
2019-10-02 16:34:42 +00:00
Mark Johnston
4a7b33ecf4 Disallow fcntl(F_READAHEAD) when the vnode is not a regular file.
The mountpoint may not have defined an iosize parameter, so an attempt
to configure readahead on a device file can lead to a divide-by-zero
crash.

The sequential heuristic is not applied to I/O to or from device files,
and posix_fadvise(2) returns an error when v_type != VREG, so perform
the same check here.

Reported by:	syzbot+e4b682208761aa5bc53a@syzkaller.appspotmail.com
Reviewed by:	kib
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21864
2019-10-02 15:45:49 +00:00
Kyle Evans
5a391b572b shm_open2(2): completely unbreak
kern_shm_open2(), since conception, completely fails to pass the mode along
to kern_shm_open(). This breaks most uses of it.

Add tests alongside this that actually check the mode of the returned
files.

PR:		240934 [pulseaudio breakage]
Reported by:	ler, Andrew Gierth [postgres breakage]
Diagnosed by:	Andrew Gierth (great catch)
Tested by:	ler, tmunro
Pointy hat to:	kevans
2019-10-02 02:37:34 +00:00
Ed Maste
f403831e6c syscalls.master: remove superfluous ellipsis in comment
A single period is sufficient in this comment, and making this change
lets us find references to varargs syscalls by searching for ...
2019-10-01 17:05:21 +00:00
Brooks Davis
3a94552174 Restore the ability to set capenabled directly in syscalls.conf.
This fixes generation of cloudabi syscall tables broken in r340424.

Reviewed by:	kevans, emaste
MFC after:	3 days
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D21821
2019-09-30 20:58:29 +00:00
Kyle Evans
11fd6a60e7 syscalls.master: consistency, move ); to newline (no functional change)
2019-09-30 13:26:16 +00:00
Mark Johnston
1aa696babc Fix some problems with the SPARSE_MAPPING option in the kernel linker.
- Ensure that the end of the mapping passed to vm_page_wire() is
  page-aligned.  vm_page_wire() expects this.
- Wire pages before reading data into them.
- Apply protections specified in the segment descriptor using
  vm_map_protect() once relocation processing is done.
- On amd64, ensure that we load KLDs above KERNBASE, since they
  are compiled with the "kernel" memory model by default.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21756
2019-09-28 01:42:59 +00:00
Andrew Gallatin
b2dba6634b kTLS: Fix a bug where we would not encrypt anon data inplace.
Software Kernel TLS needs to allocate a new destination crypto
buffer when encrypting data from the page cache, so as to avoid
overwriting shared clear-text file data with encrypted data
specific to a single socket. When the data is anonymous, eg, not
tied to a file, then we can encrypt in place and avoid allocating
a new page. This fixes a bug where the existing code always
assumes the data is private, and never encrypts in place. This
results in unneeded page allocations and potentially more memory
bandwidth consumption when doing socket writes.

When the code was written at Netflix, ktls_encrypt() looked at
private sendfile flags to determine if the pages being encrypted
were part of the page cache (coming from sendfile) or
anonymous (coming from sosend). This was broken internally at
Netflix when the sendfile flags were made private, and the
M_WRITABLE() check was added. Unfortunately, M_WRITABLE() will
always be false for M_NOMAP mbufs, since one cannot just mtod()
them.

This change introduces a new flags field to the mbuf_ext_pgs
struct by stealing a byte from the tls hdr. Note that the current
header is still 2 bytes larger than the largest header we
support: AES-CBC with explicit IV. We set MBUF_PEXT_FLAG_ANON
when creating an unmapped mbuf in m_uiotombuf_nomap() (which is
the path that socket writes take), and we check for that flag in
ktls_encrypt() when looking for anon pages.

Reviewed by:	jhb
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21796
2019-09-27 20:08:19 +00:00
Andrew Gallatin
6554362c66 kTLS support for TLS 1.3
TLS 1.3 requires a few changes because 1.3 pretends to be 1.2
with a record type of application data. The "real" record type is
then included at the end of the user-supplied plaintext
data. This required adding a field to the mbuf_ext_pgs struct to
save the record type, and passing the real record type to the
sw_encrypt() ktls backend functions.
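
To make the record-type handling concrete, here is a minimal sketch of
the TLS 1.3 framing as specified in RFC 8446, not of the kernel code.
The function names are hypothetical, and padding and the actual AEAD
step are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_TLS_APP_DATA	23	/* application_data content type */

/*
 * Build the TLS 1.3 inner plaintext: the payload followed by one byte
 * holding the real record type.  Returns the inner length, or 0 if the
 * output buffer is too small.
 */
static size_t
sketch_tls13_inner(uint8_t *out, size_t outlen, const uint8_t *payload,
    size_t len, uint8_t real_type)
{
	if (len >= outlen)
		return (0);
	memcpy(out, payload, len);
	out[len] = real_type;
	return (len + 1);
}

/*
 * Fill the 5-byte outer header.  It always claims to be TLS 1.2
 * application data, whatever the real record type is.
 */
static void
sketch_tls13_outer_hdr(uint8_t hdr[5], size_t ciphertext_len)
{
	assert(ciphertext_len <= 0xffff);
	hdr[0] = SKETCH_TLS_APP_DATA;
	hdr[1] = 0x03;				/* legacy version 0x0303 */
	hdr[2] = 0x03;
	hdr[3] = (uint8_t)(ciphertext_len >> 8);
	hdr[4] = (uint8_t)(ciphertext_len & 0xff);
}
```

This shows why the encrypt path has to be told the real type
separately: the outer header no longer carries it.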

Reviewed by:	jhb, hselasky
Sponsored by:	Netflix
Differential Revision:	D21801
2019-09-27 19:17:40 +00:00
Mateusz Guzik
708cf7eb6c cache: decrease ncnegfactor to 5
The current mechanism is bogus in several ways:
- the limit is a percentage of total entries added, which means negative
entries get evicted all the time even if there are plenty of resources
- evicting code is almost not concurrent, which makes it unable to
remove entries fast enough when doing something as simple as -j 104
buildworld
- there is no support for performing mass removal if necessary

Vast majority of negative entries never get any hits. Only evicting
them when the filesystem demands it results in a significant growth of
the namecache with almost no improvement in the hit ratio.

Sample result after about 90 minutes of poudriere -j 104:

           current    no evict   % of the original
numneg     219737     2013157    916
numneghits 266711906  263544562  98 [1]

[1] this may look funny but there is a certain dose of variation to the
build

The number was chosen as something which mostly eliminates spurious
evictions during lighter workloads but still keeps the total at bay.

Sponsored by:	The FreeBSD Foundation
2019-09-27 19:14:03 +00:00
Mateusz Guzik
e643141838 cache: stop requeuing negative entries on the hot list
Turns out it does not improve hit ratio, but it does come with a cost
stemming from dirtying hit entries.

Sample result: hit counts of evicted entries after 2 buildworlds

before:

           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@                180865
               1 |@@@@@@@                                  49150
               2 |@@@                                      19067
               4 |@                                        9825
               8 |@                                        7340
              16 |@                                        5952
              32 |@                                        5243
              64 |@                                        4446
             128 |                                         3556
             256 |                                         3035
             512 |                                         1705
            1024 |                                         1078
            2048 |                                         365
            4096 |                                         95
            8192 |                                         34
           16384 |                                         26
           32768 |                                         23
           65536 |                                         8
          131072 |                                         6
          262144 |                                         0

after:
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@                184004
               1 |@@@@@@                                   47577
               2 |@@@                                      19446
               4 |@                                        10093
               8 |@                                        7470
              16 |@                                        5544
              32 |@                                        5475
              64 |@                                        5011
             128 |                                         3451
             256 |                                         3002
             512 |                                         1729
            1024 |                                         1086
            2048 |                                         363
            4096 |                                         86
            8192 |                                         26
           16384 |                                         25
           32768 |                                         24
           65536 |                                         7
          131072 |                                         5
          262144 |                                         0

Sponsored by:	The FreeBSD Foundation
2019-09-27 19:13:22 +00:00
Mateusz Guzik
312196df0f cache: make negative list shrinking a little bit concurrent
Continue protecting demotion from the hotlist and selection of the
target list with the ncneg_shrink_lock lock, but drop it before
relocking to zap the node.

While here count how many times we skipped shrinking due to the lock
being already taken.

Sponsored by:	The FreeBSD Foundation
2019-09-27 19:12:43 +00:00
Mateusz Guzik
95c6dd890a cache: stop recalculating upper limit each time a new entry is added
Sponsored by:	The FreeBSD Foundation
2019-09-27 19:12:20 +00:00
Konstantin Belousov
df08823d07 Improve MD page fault handlers.
Centralize calculation of signal and ucode delivered on unhandled page
fault in new function vm_fault_trap().  MD trap_pfault() now almost
always uses the signal numbers and error codes calculated in
consistent MI way.

This introduces the protection fault compatibility sysctls to all
non-x86 architectures which did not have that bug, but apparently they
were already much more wrong in selecting delivered signals on
protection violations.

Change the delivered signal for accesses to mapped area after the
backing object was truncated.  According to POSIX description for
mmap(2):
   The system shall always zero-fill any partial page at the end of an
   object. Further, the system shall never write out any modified
   portions of the last page of an object which are beyond its
   end. References within the address range starting at pa and
   continuing for len bytes to whole pages following the end of an
   object shall result in delivery of a SIGBUS signal.

   An implementation may generate SIGBUS signals when a reference
   would cause an error in the mapped object, such as out-of-space
   condition.
Adjust according to the description, keeping the existing
compatibility code for SIGSEGV/SIGBUS on protection failures.

For situations where kernel cannot handle page fault due to resource
limit enforcement, SIGBUS with a new error code BUS_OBJERR is
delivered.  Also, provide a new error code SEGV_PKUERR for SIGSEGV on
amd64 due to protection key access violation.

vm_fault_hold() is renamed to vm_fault().  Fixed some nits in
trap_pfault()s like mis-interpreting Mach errors as errnos.  Removed
unneeded truncations of the fault addresses reported by hardware.

PR:	211924
Reviewed by:	alc
Discussed with:	jilles, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D21566
2019-09-27 18:43:36 +00:00
Andrew Turner
50bb04b750 Check the vfs option length is valid before accessing through
When a VFS option passed to nmount is present but NULL the kernel will
place an empty option in its internal list. This will have a NULL
pointer and a length of 0. When we come to read one of these the kernel
will try to load from the last address of virtual memory. This is
normally invalid so will fault resulting in a kernel panic.

Fix this by checking if the length is valid before dereferencing.
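
A minimal sketch of the failure mode and the check, assuming a
hypothetical option struct rather than the kernel's own: with a NULL
value and a length of 0, indexing value[len - 1] wraps around to the
top of the address space, so the length must be tested first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one entry of the option list. */
struct sketch_opt {
	const char *value;
	size_t len;
};

/* Return true only if the option value is a NUL-terminated string. */
static bool
sketch_opt_is_string(const struct sketch_opt *opt)
{
	assert(opt != NULL);
	/*
	 * Check the length before touching the buffer: for len == 0,
	 * value[len - 1] would be value[SIZE_MAX].
	 */
	if (opt->value == NULL || opt->len == 0)
		return (false);
	return (opt->value[opt->len - 1] == '\0');
}
```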

MFC after:	3 days
Sponsored by:	DARPA, AFRL
2019-09-27 16:22:28 +00:00
David Bright
c4571256af sysent: regenerate after r352747.
Sponsored by:	Dell EMC Isilon
2019-09-26 15:41:10 +00:00
Mark Johnston
55248d32f2 Fix handling of invalid pages in exec_map_first_page().
exec_map_first_page() would unconditionally free an unbacked, invalid
page from the executable image.  However, it is possible that the page
is wired, in which case it is incorrect to free the page, so check for
additional wirings first.

Reported by:	syzkaller
Tested by:	pho
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21767
2019-09-26 15:35:35 +00:00
David Bright
9afb12bab4 Add an shm_rename syscall
Add an atomic shm rename operation, similar in spirit to a file
rename. Atomically unlink an shm from a source path and link it to a
destination path. If an existing shm is linked at the destination
path, unlink it as part of the same atomic operation. The caller needs
the same permissions as shm_unlink to the shm being renamed, and the
same permissions for the shm at the destination which is being
unlinked, if it exists. If those fail, EACCES is returned, as with the
other shm_* syscalls.

truss support is included; audit support will come later.

This commit includes only the implementation; the sysent-generated
bits will come in a follow-on commit.

Submitted by:	Matthew Bryan <matthew.bryan@isilon.com>
Reviewed by:	jilles (earlier revision)
Reviewed by:	brueffer (manpages, earlier revision)
Relnotes:	yes
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D21423
2019-09-26 15:32:28 +00:00
Toomas Soome
11fc80a098 kernel terminal should initialize fg and bg variables before calling TUNABLE_INT_FETCH
We have two ways to check if kenv variable exists - either we check return
value from TUNABLE_INT_FETCH, or we pre-initialize the variable and check
if this value did change. In terminal_init() it is more convenient to
use pre-initialized variables.
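
The pre-initialization pattern can be sketched as follows. The fetch
function here is a hypothetical userland stand-in that behaves as the
message describes (it leaves the variable untouched when the name is
not set); it is not TUNABLE_INT_FETCH itself, and the sentinel value is
an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_UNSET	(-1)	/* hypothetical "not set" sentinel */

/*
 * Hypothetical fetcher simulating an environment that only defines
 * teken.fg_color.  Returns 1 and stores the value when the name exists,
 * otherwise returns 0 and leaves *val alone.
 */
static int
sketch_fetch_int(const char *name, int *val)
{
	assert(name != NULL && val != NULL);
	if (strcmp(name, "teken.fg_color") == 0) {
		*val = 2;
		return (1);
	}
	return (0);
}

/* Pre-initialize, fetch, then test whether the value changed. */
static int
sketch_color_or_default(const char *name, int def)
{
	int v = SKETCH_UNSET;

	(void)sketch_fetch_int(name, &v);
	return (v == SKETCH_UNSET ? def : v);
}
```

Without the initialization, a missing variable would leave v holding
stack garbage, which matches the problem seen with an older loader
that did not set the teken.* variables.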

Problem was revealed by older loader.efi, which did not set teken.* variables.

Reported by:	tuexen
2019-09-26 07:19:26 +00:00
Alexander Motin
176dd236dc Microoptimize sched_pickcpu() CPU affinity on SMT.
Use of CPU_FFS() to implement CPUSET_FOREACH() allows to save up to ~0.5%
of CPU time on 72-thread SMT system doing 80K IOPS to NVMe from one thread.
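
The idea of driving a set iteration with find-first-set can be
sketched in userland with POSIX ffs() on a 32-bit mask. This is an
illustration of the technique only; the function name is hypothetical
and the kernel macros operate on full cpuset_t bitsets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <strings.h>

/*
 * Visit only the set bits of mask, lowest first, storing their indices
 * in out[].  Returns the number of indices stored (at most max).
 */
static int
sketch_foreach_set(uint32_t mask, int *out, int max)
{
	int n = 0;

	assert(out != NULL);
	while (mask != 0 && n < max) {
		out[n++] = ffs((int)mask) - 1;	/* ffs() is 1-based */
		mask &= mask - 1;		/* clear the lowest set bit */
	}
	return (n);
}
```

The loop runs once per set bit instead of once per possible CPU, which
is where the saving comes from when the set is sparse.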

MFC after:	1 month
Sponsored by:	iXsystems, Inc.
2019-09-26 00:35:06 +00:00
Alexander Motin
c55dc51c37 Microoptimize sched_pickcpu() after r352658.
I've noticed that I missed intr check at one more SCHED_AFFINITY(),
so instead of adding one more branching I prefer to remove few.

Profiler shows the function CPU time reduction from 0.24% to 0.16%.

MFC after:	1 month
Sponsored by:	iXsystems, Inc.
2019-09-25 19:29:09 +00:00
Kyle Evans
079c5b9ed8 rfork(2): add RFSPAWN flag
When RFSPAWN is passed, rfork exhibits vfork(2) semantics but also resets
signal handlers in the child during creation to avoid a point of corruption
of parent state from the child.

This flag will be used by posix_spawn(3) to handle potential signal issues.

Reviewed by:	jilles, kib
Differential Revision:	https://reviews.freebsd.org/D19058
2019-09-25 19:20:41 +00:00
Gleb Smirnoff
dd902d015a Add debugging facility EPOCH_TRACE that checks that epochs entered are
properly nested and warns about recursive entrances.  Unlike with locks,
there is nothing fundamentally wrong with such use, the intent of tracer
is to help to review complex epoch-protected code paths, and we mean the
network stack here.

Reviewed by:	hselasky
Sponsored by:	Netflix
Pull Request:	https://reviews.freebsd.org/D21610
2019-09-25 18:26:31 +00:00
Kyle Evans
a9ac5e1424 sysent: regenerate after r352705
This also implements it, fixes kdump, and removes no longer needed bits from
lib/libc/sys/shm_open.c for the interim.
2019-09-25 18:09:19 +00:00
Kyle Evans
234879a7e3 Mark shm_open(2) as COMPAT12, succeeded by shm_open2
Implementation and regenerated files will follow.
2019-09-25 18:06:48 +00:00
Kyle Evans
460211e730 sysent: regenerate after r352700
2019-09-25 17:59:58 +00:00
Kyle Evans
20f7057685 Add a shm_open2 syscall to support upcoming memfd_create
shm_open2 allows a little more flexibility than the original shm_open.
shm_open2 doesn't enforce CLOEXEC on its callers, and it has a separate
shmflag argument that can be expanded later. Currently the only shmflag is
to allow file sealing on the returned fd.

shm_open and memfd_create will both be implemented in libc to use this new
syscall.

__FreeBSD_version is bumped to indicate the presence.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D21393
2019-09-25 17:59:15 +00:00
Kyle Evans
0cd95859c8 [2/3] Add an initial seal argument to kern_shm_open()
Now that flags may be set on posixshm, add an argument to kern_shm_open()
for the initial seals. To maintain past behavior where callers of
shm_open(2) are guaranteed to not have any seals applied to the fd they're
given, apply F_SEAL_SEAL for existing callers of kern_shm_open. A special
flag could be opened later for shm_open(2) to indicate that sealing should
be allowed.

We currently restrict initial seals to F_SEAL_SEAL. We cannot error out if
F_SEAL_SEAL is re-applied, as this would easily break shm_open() twice to a
shmfd that already existed. A note's been added about the assumptions we've
made here as a hint towards anyone wanting to allow other seals to be
applied at creation.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D21392
2019-09-25 17:35:03 +00:00
Kyle Evans
af755d3e48 [1/3] Add mostly Linux-compatible file sealing support
File sealing applies protections against certain actions
(currently: write, growth, shrink) at the inode level. New fileops are added
to accommodate seals - EINVAL is returned by fcntl(2) if they are not
implemented.

Reviewed by:	markj, kib
Differential Revision:	https://reviews.freebsd.org/D21391
2019-09-25 17:32:43 +00:00
Kyle Evans
85c5f3cb57 Add COMPAT12 support to makesyscalls.sh
Reviewed by:	kib, imp, brooks (all without syscalls.master edits)
Differential Revision:	https://reviews.freebsd.org/D21366
2019-09-25 17:29:45 +00:00
Toomas Soome
3001e0c942 kernel: terminal_init() should check for teken colors from kenv
Check for teken.fg_color and teken.bg_color and prepare the color
attributes accordingly.

When white background is used, make it light to improve visibility.
When black background is used, make kernel messages light.
2019-09-25 13:21:07 +00:00
Alexander Motin
bb3dfc6ae9 Fix wrong assertion in r352658.
MFC after:	1 month
2019-09-25 11:58:54 +00:00
Alexander Motin
c9205e3500 Fix/improve interrupt threads scheduling.
Doing some tests with very high interrupt rates I've noticed that one of
conditions I added in r232207 to make interrupt threads in most cases
run on local CPU never worked as expected (worked only if previous time
it was executed on some other CPU, that is quite opposite).  It caused
additional CPU usage to run full CPU search and could schedule interrupt
threads to some other CPU.

This patch removes that code and instead reuses existing non-interrupt
code path with some tweaks for interrupt case:
 - On SMT systems, if current thread is idle, don't look on other threads.
Even if they are busy, it may take more time to do full search and bounce
the interrupt thread to other core than execute it locally, even sharing
CPU resources.  It is other threads that should migrate, not bound interrupts.
 - Try hard to keep interrupt threads within LLC of their original CPU.
This improves scheduling cost and supposedly cache and memory locality.

On a test system with 72 threads doing 2.2M IOPS to NVMe this saves few
percents of CPU time while adding few percents to IOPS.

MFC after:	1 month
Sponsored by:	iXsystems, Inc.
2019-09-24 20:01:20 +00:00
Randall Stewart
35c7bb3407 This commit adds BBR (Bottleneck Bandwidth and RTT) congestion control. This
is a completely separate TCP stack (tcp_bbr.ko) that will be built only if
you add the make options WITH_EXTRA_TCP_STACKS=1 and also include the option
TCPHPTS. You can also include the RATELIMIT option if you have a NIC interface that
supports hardware pacing, BBR understands how to use such a feature.

Note that this commit also adds in a general purpose time-filter which
allows you to have a min-filter or max-filter. A filter allows you to
have a low (or high) value for some period of time and degrade slowly
to another value as time passes. You can find out the details of
BBR by looking at the original paper at:

https://queue.acm.org/detail.cfm?id=3022184

or consult many other web resources you can find on the web
referenced by "BBR congestion control". It should be noted that
BBRv1 (which this is) does tend to unfairness in cases of small
buffered paths, and it will usually get less bandwidth in the case
of large BDP paths (when competing with new-reno or cubic flows). BBR
is still an active research area and we do plan on implementing V2
of BBR to see if it is an improvement over V1.
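
As a rough illustration of the min-filter idea mentioned above, here is
a deliberately simplified sketch that keeps a single sample. It is not
the time-filter added by this commit; the struct and function names
are hypothetical, and a real filter would keep several candidates so
the value degrades gradually rather than jumping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical single-sample windowed minimum. */
struct sketch_minfilter {
	uint32_t value;		/* current minimum */
	uint32_t stamp;		/* time the minimum was recorded */
	uint32_t window;	/* how long a minimum stays valid */
	int valid;
};

/*
 * Feed one sample taken at time now and return the filtered minimum.
 * A lower sample replaces the minimum at once; a higher one replaces it
 * only after the stored minimum has aged out of the window.
 */
static uint32_t
sketch_minfilter_update(struct sketch_minfilter *f, uint32_t sample,
    uint32_t now)
{
	assert(f != NULL);
	if (!f->valid || sample <= f->value ||
	    now - f->stamp > f->window) {
		f->value = sample;
		f->stamp = now;
		f->valid = 1;
	}
	return (f->value);
}
```

A max-filter is the same sketch with the comparison reversed.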

Sponsored by:	Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D21582
2019-09-24 18:18:11 +00:00
Mateusz Guzik
93a85508ad cache: tidy up handling of negative entries
- track the total count of hot entries
- pre-read the lock when shrinking since it is typically already taken
- place the lock in its own cacheline
- shorten the hold time of hot lock list when zapping

Sponsored by:	The FreeBSD Foundation
2019-09-23 20:50:04 +00:00
Mark Johnston
38dae42c26 Use elf_relocaddr() when handling R_X86_64_RELATIVE relocations.
This is required for DPCPU and VNET data variable definitions to work when
KLDs are linked as DSOs.  R_X86_64_RELATIVE relocations should not appear
in object files, so assert this in elf_relocaddr().

Reviewed by:	kib
MFC after:	1 month
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21755
2019-09-23 14:14:43 +00:00
Mateusz Guzik
afe257e3ca cache: count evictions of negative entries
Sponsored by:	The FreeBSD Foundation
2019-09-23 08:53:14 +00:00
Sean Eric Fagan
ba7a55d934 Add two options to allow mount to avoid covering up existing mount points.
The two options are

* nocover/cover:  Prevent/allow mounting over an existing root mountpoint.
E.g., "mount -t ufs -o nocover /dev/sd1a /usr/local" will fail if /usr/local
is already a mountpoint.
* emptydir/noemptydir:  Prevent/allow mounting on a non-empty directory.
E.g., "mount -t ufs -o emptydir /dev/sd1a /usr" will fail.

Neither of these options is intended to be a default, for historical and
compatibility reasons.

Reviewed by:	allanjude, kib
Differential Revision:	https://reviews.freebsd.org/D21458
2019-09-23 04:28:07 +00:00
Mateusz Guzik
7505cffa56 cache: try to avoid vhold if locks held
Sponsored by:	The FreeBSD Foundation
2019-09-22 20:50:24 +00:00
Mateusz Guzik
cd2112c305 cache: jump in negative success instead of positive
Sponsored by:	The FreeBSD Foundation
2019-09-22 20:49:17 +00:00
Mateusz Guzik
d2be3ef05c lockprof: move per-cpu data to dpcpu
Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21747
2019-09-22 20:44:24 +00:00
Konstantin Belousov
f33533da8c kern.elf{32,64}.pie_base sysctl: enforce page alignment.
Requested by:	rstone
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-09-21 20:03:17 +00:00
Mateusz Guzik
cbba2cb367 lockprof: use CPUFOREACH and drop always false lp_cpu NULL checks
Sponsored by:	The FreeBSD Foundation
2019-09-21 19:05:38 +00:00
Konstantin Belousov
95aafd6900 Make non-ASLR pie base tunable.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-09-21 18:00:23 +00:00
Alexander Motin
36d151a237 Allocate callout wheel from the respective memory domain.
MFC after:	1 week
2019-09-21 15:38:08 +00:00
Andrew Gallatin
61b8a4af71 remove redundant "ktls" in KTLS thr name
This reduces the string width of the ktls thread name
and improves "ps" output.

Glanced at by: jhb
Event: EuroBSDCon hackathon
Sponsored by:	Netflix
2019-09-20 09:36:07 +00:00
Mateusz Guzik
b488246b45 vfs: group fields used for per-cpu ops in one cacheline
Sponsored by:	The FreeBSD Foundation
2019-09-19 21:23:14 +00:00
Konstantin Belousov
382e01c8dc sysctl: use names instead of magic numbers.
Replace magic numbers with symbols for internal sysctl operations.
Convert in-kernel and libc consumers.

Submitted by:	Pawel Biernacki
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D21693
2019-09-18 16:13:10 +00:00
Konstantin Belousov
55894117b1 Return EISDIR when directory is opened with O_CREAT without O_DIRECTORY.
Reviewed by:	bcr (man page), emaste (previous version)
PR:	240452
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D21634
2019-09-17 18:32:18 +00:00
Kirk McKusick
100369071d The VFS-level clustering code collects together sequential blocks
by issuing delayed-writes (bdwrite()) until a non-sequential block
is written or the maximum cluster size is reached. At that point
it collects the delayed buffers together (using bread()) to write
them in a single operation. The assumption was that since we just
looked at them they will still be in memory so there is no need to
check for a read error from bread(). Very occasionally (apparently
every 10-hours or so when being pounded by Peter Holm's tests)
this assumption is wrong.

The fix is to check for errors from bread() and fail the cluster
write thus falling back to the default individual flushing of any
still dirty buffers.

Reported by: Peter Holm and Chuck Silvers
Reviewed by: kib
MFC after:   3 days
2019-09-17 17:44:50 +00:00
Mateusz Guzik
d245aa1e72 vfs: apply r352437 to the fast path as well
This one is very hard to run into. If the filesystem is being unmounted or
the mount point is freed the vfs_op_thread_enter will fail. For it to
succeed the mount point itself would have to be reallocated in the time
window between the initial read and the attempt to enter.

Sponsored by:	The FreeBSD Foundation
2019-09-17 15:53:40 +00:00
Mateusz Guzik
7f65185940 vfs: fix braino resulting in NULL pointer deref in r352424
The breakage was added after all the testing and the testing which followed
was not sufficient to find it.

Reported by:	pho
Sponsored by:	The FreeBSD Foundation
2019-09-17 08:09:39 +00:00
Mateusz Guzik
4cace859c2 vfs: convert struct mount counters to per-cpu
There are 3 counters modified all the time in this structure - one for
keeping the structure alive, one for preventing unmount and one for
tracking active writers. Exact values of these counters are very rarely
needed, which makes them a prime candidate for conversion to a per-cpu
scheme, resulting in much better performance.

Sample benchmark performing fstatfs (modifying 2 out of 3 counters) on
a 104-way 2 socket Skylake system:
before:   852393 ops/s
after:  76682077 ops/s
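
Why exact values become more expensive while updates become cheap can
be seen in a small single-threaded sketch. The names and the fixed CPU
count are hypothetical, and the real code also has to handle
concurrency and the transition between per-cpu and exact modes, which
this does not attempt.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NCPU	4	/* hypothetical CPU count */

/* One slot per CPU; each CPU only ever touches its own slot. */
struct sketch_pcpu_counter {
	long slot[SKETCH_NCPU];
};

/* Cheap update: no shared cacheline is written. */
static void
sketch_counter_add(struct sketch_pcpu_counter *c, int cpu, long delta)
{
	assert(c != NULL && cpu >= 0 && cpu < SKETCH_NCPU);
	c->slot[cpu] += delta;
}

/* Rare exact read: sum every slot. */
static long
sketch_counter_fetch(const struct sketch_pcpu_counter *c)
{
	long sum = 0;

	assert(c != NULL);
	for (int i = 0; i < SKETCH_NCPU; i++)
		sum += c->slot[i];
	return (sum);
}
```

An individual slot may go negative, for example when a reference is
taken on one CPU and dropped on another; only the sum is meaningful.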

Reviewed by:	kib, jeff
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21637
2019-09-16 21:37:47 +00:00
Mateusz Guzik
e87f3f72f1 vfs: manage mnt_writeopcount with atomics
See r352424.

Reviewed by:	kib, jeff
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21575
2019-09-16 21:33:16 +00:00
Mateusz Guzik
ee831b2543 vfs: manage mnt_lockref with atomics
See r352424.

Reviewed by:	kib, jeff
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21574
2019-09-16 21:32:21 +00:00
Mateusz Guzik
a8c8e44bf0 vfs: manage mnt_ref with atomics
New primitive is introduced to denote sections which can operate locklessly
on aspects of struct mount, but which can also be disabled if necessary.
This provides an opportunity to start scaling common case modifications
while providing stable state of the struct when facing unmount, write
suspension or other events.

mnt_ref is the first counter to start being managed in this manner with
the intent to make it per-cpu.

Reviewed by:	kib, jeff
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21425
2019-09-16 21:31:02 +00:00
Kyle Evans
3155f2f0e2 rangelock: add rangelock_cookie_assert
A future change to posixshm to add file sealing (in DIFF_21391[0] and child)
will move locking out of shm_dotruncate as kern_shm_open() will require the
lock to be held across the dotruncate until the seal is actually applied.
For this, the cookie is passed into shm_dotruncate_locked which asserts
RCA_WLOCKED.

[0] Name changed to protect the innocent, hopefully, from getting autoclosed
due to this reference...

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D21628
2019-09-15 02:59:53 +00:00
Mateusz Guzik
ce3ba63f67 vfs: release usecount using fetchadd
1. If we release the last usecount we take ownership of the hold count, which
means the vnode will remain allocated until we vdrop it.
2. If someone else vrefs they will find no usecount and will proceed to add
their own hold count.
3. No code has a problem with v_usecount transitioning to 0 without the
interlock

These facts combined mean we can fetchadd instead of having a cmpset loop.
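
The difference between the two release styles can be sketched with C11
atomics. This is a generic illustration under the assumption that the
count is positive on entry; the function names are hypothetical and
none of the vnode hold-count handling is modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Release with a compare-and-set loop: retry until our decrement wins.
 * Returns true if this call dropped the last reference.
 */
static bool
sketch_release_cmpset(atomic_uint *cnt)
{
	unsigned int old;

	assert(cnt != NULL);
	old = atomic_load(cnt);
	/* On failure, old is reloaded with the current value. */
	while (!atomic_compare_exchange_weak(cnt, &old, old - 1))
		continue;
	return (old == 1);
}

/*
 * Release with a single fetch-and-add style operation: no retry loop.
 * Returns true if this call dropped the last reference.
 */
static bool
sketch_release_fetchadd(atomic_uint *cnt)
{
	assert(cnt != NULL);
	return (atomic_fetch_sub(cnt, 1) == 1);
}
```

Both report whether the caller released the last reference; the second
does so with one unconditional atomic operation.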

Reviewed by:	kib (previous version)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21528
2019-09-13 15:49:04 +00:00
Mark Johnston
45cdd437ae Remove a redundant NULL pointer check in cpuset_modify_domain().
cpuset_getroot() is guaranteed to return a non-NULL pointer.

Reported by:	Mark Millard <marklmi@yahoo.com>
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2019-09-12 16:47:38 +00:00
Hans Petter Selasky
11b57401e6 Use REFCOUNT_COUNT() to obtain refcount where appropriate.
Refcount waiting will set some flag bits in the refcount value.
Make sure these bits get cleared by using the REFCOUNT_COUNT()
macro to obtain the actual refcount.
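
A minimal sketch of the masking idea, assuming a single hypothetical
flag in the top bit; the actual bit layout is whatever the kernel's
refcount header defines and is not reproduced here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag bit stored alongside the count. */
#define SKETCH_REFCOUNT_WAITER	(UINT32_C(1) << 31)

/* Strip the flag bit so that only the reference count remains. */
static uint32_t
sketch_refcount_count(uint32_t raw)
{
	return (raw & ~SKETCH_REFCOUNT_WAITER);
}
```

Comparing the raw value against an expected count would fail whenever
the flag is set, which is the class of bug the macro avoids.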

Differential Revision:	https://reviews.freebsd.org/D21620
Reviewed by:	kib@, markj@
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2019-09-12 16:26:59 +00:00
Kyle Evans
5163b1a75c Follow up r352244: kenv: tighten up assertions
As I like to forget: static kenv var formatting is actually such that an
empty environment would be double null bytes. We should make sure that a
non-zero buffer has at least enough for this, though most of the current
usage is with a 4k buffer.
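
The format described above (NUL-terminated "name=value" strings with
an extra NUL closing the list) can be walked as in this sketch. The
function name is hypothetical and the code is a userland illustration,
not the kernel's parser.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Count the variables in a static environment buffer.  An empty
 * environment starts with a NUL and so yields 0.  An entry that is not
 * terminated within buflen ends the walk.
 */
static size_t
sketch_kenv_count(const char *buf, size_t buflen)
{
	size_t n = 0, off = 0;

	assert(buf != NULL);
	while (off < buflen && buf[off] != '\0') {
		const char *end = memchr(buf + off, '\0', buflen - off);

		if (end == NULL)
			break;
		n++;
		off = (size_t)(end - buf) + 1;
	}
	return (n);
}
```

With this layout, a buffer that is meant to be empty needs at least
two NUL bytes, which is what the tightened assertion checks for.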
2019-09-12 14:34:46 +00:00
Kyle Evans
436c46875d kenv: assert that an empty static buffer passed in is "empty"
Garbage in the passed-in buffer can cause problems if any attempts to read
the kenv are inadvertently made between init_static_kenv and the first
kern_setenv -- assuming there is one.

This is cheap and easy, so do it. This also helps rule out some class of
bugs as one tries to debug; tunables fetch from the static environment up
until SI_SUB_KMEM + 1, and many of these buffers are global ~4k buffers that
rely on BSS clearing while others just grab a page of free memory and use it
(e.g. xen).
2019-09-12 13:51:43 +00:00
Conrad Meyer
aaa3852435 buf: Add B_INVALONERR flag to discard data
Setting the B_INVALONERR flag before a synchronous write causes the buf
cache to forcibly invalidate contents if the write fails (BIO_ERROR).

This is intended to be used to allow layers above the buffer cache to make
more informed decisions about when discarding dirty buffers without
successful write is acceptable.

As a proof of concept, use in msdosfs to handle failures to mark the on-disk
'dirty' bit during rw mount or ro->rw update.

Extending this to other filesystems is left as future work.

PR:		210316
Reviewed by:	kib (with objections)
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D21539
2019-09-11 21:24:14 +00:00
Mateusz Guzik
b088a4d6f9 cache: avoid excessive relocking on entry removal during lookup
Due to lock ordering issues (bucket lock held, vnode locks wanted) the code
starts with trylocking which in face of contention often fails. Prior to
the change it would loop back with a possible yield.

Instead note we know what locks are needed and can take them in the right
order, avoiding retries. Then we can safely re-lookup and see if the entry
we are looking for is still there.

On a 104-way box poudriere would result in constant retries during an 11h
run as seen in the vfs.cache.zap_and_exit_bucket_fail counter.

before: 408866592
after :         0

However, a new stat reports:
vfs.cache.zap_and_exit_bucket_relock_success: 32638

Note this is only a bandaid over current design issues.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2019-09-10 20:19:29 +00:00
Mateusz Guzik
a6cacb0dca cache: change the formula for calculating lock array sizes
It used to be mp_ncpus * 64, but this gives unnecessarily big values for small
machines and at the same time constrains bigger ones. In particular this helps
on a 104-way box for which the count is now doubled.

While here make cache_purgevfs less likely. Currently it is not efficient in
face of contention due to lock ordering issues. These are fixable but not worth
it at the moment.

Sponsored by:	The FreeBSD Foundation
2019-09-10 20:11:00 +00:00
Mateusz Guzik
1214618c05 cache: assorted cleanups
Sponsored by:	The FreeBSD Foundation
2019-09-10 20:08:24 +00:00
Jeff Roberson
c75757481f Replace redundant code with a few new vm_page_grab facilities:
- VM_ALLOC_NOCREAT will grab without creating a page.
- vm_page_grab_valid() will grab and page in if necessary.
- vm_page_busy_acquire() automates some busy acquire loops.

Discussed with:	alc, kib, markj
Tested by:	pho (part of larger branch)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21546
2019-09-10 19:08:01 +00:00