Commit Graph

16916 Commits

Author SHA1 Message Date
Gleb Smirnoff
ed9d69b5e8 Use THREAD_CAN_SLEEP() macro to check if thread can sleep. There is no
functional change.

Discussed with:	kib
2019-10-24 21:55:19 +00:00
John Baldwin
7d29eb9a91 Use a counter with a random base for explicit IVs in GCM.
This permits constructing the entire TLS header in ktls_frame() rather
than ktls_seq().  This also matches the approach used by OpenSSL which
uses an incrementing nonce as the explicit IV rather than the sequence
number.

Reviewed by:	gallatin
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D22117
2019-10-24 18:13:26 +00:00
Konstantin Belousov
c92f130498 Fix undefined behavior.
Create a sequence point by ending a full expression for call to
vspace() and use of the globals which are modified by vspace().

Reported and reviewed by:	imp
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D22126
2019-10-23 16:06:47 +00:00
Konstantin Belousov
8076c4e7d1 vn_printf(): Decode VI_TEXT_REF.
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2019-10-23 15:51:26 +00:00
Gleb Smirnoff
080e9496b8 Allow epoch tracker to use the very last byte of the stack. Not sure
this will help to avoid panic in this function, since it will also use
some stack, but makes code more strict.

Submitted by:	hselasky
2019-10-22 18:05:15 +00:00
Gleb Smirnoff
77d70e515f Assert that any epoch tracker belongs to the thread stack.
Reviewed by:	kib
2019-10-21 23:12:14 +00:00
Gleb Smirnoff
279b9aabe3 Remove epoch tracker from struct thread. It was an ugly crutch to emulate
locking semantics for if_addr_rlock() and if_maddr_rlock().
2019-10-21 18:19:32 +00:00
Andriy Gapon
3ad1ce46d3 debug,kassert.warnings is a statistic, not a tunable
MFC after:	1 week
2019-10-21 12:21:56 +00:00
Mark Johnston
f822c9e287 Apply mapping protections to preloaded kernel modules on amd64.
With an upcoming change the amd64 kernel will map preloaded files RW
instead of RWX, so the kernel linker must adjust protections
appropriately using pmap_change_prot().

Reviewed by:	kib
MFC after:	1 month
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21860
2019-10-18 13:56:45 +00:00
Mark Johnston
1d9eae9fb2 Apply mapping protections to .o kernel modules.
Use the section flags to derive mapping protections.  When multiple
sections overlap within a page, the union of their protections must be
applied.  With r353701 the .text and .rodata sections are padded to
ensure that this does not happen on amd64.

Reviewed by:	kib
MFC after:	1 month
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21896
2019-10-18 13:53:14 +00:00
Conrad Meyer
dda17b3672 Implement NetGDB(4)
NetGDB(4) is a component of a system using a panic-time network stack to
remotely debug crashed FreeBSD kernels over the network, instead of
traditional serial interfaces.

There are three pieces in the complete NetGDB system.

First, a dedicated proxy server must be running to accept connections from
both NetGDB and gdb(1), and pass bidirectional traffic between the two
protocols.

Second, the NetGDB client is activated much like ordinary 'gdb' and
similarly to 'netdump' in ddb(4) after a panic.  Like other debugnet(4)
clients (netdump(4)), the network interface on the route to the proxy server
must be online and support debugnet(4).

Finally, the remote (k)gdb(1) uses 'target remote <proxy>:<port>' (like any
other TCP remote) to connect to the proxy server.

The NetGDB v1 protocol speaks the literal GDB remote serial protocol, and
uses a 1:1 relationship between GDB packets and sequences of debugnet
packets (fragmented by MTU).  There is no encryption utilized to keep
debugging sessions private, so this is only appropriate for local
segments or trusted networks.

Submitted by:	John Reimer <john.reimer AT emc.com> (earlier version)
Discussed some with:	emaste, markj
Relnotes:	sure
Differential Revision:	https://reviews.freebsd.org/D21568
2019-10-17 21:33:01 +00:00
Mark Johnston
092bacb2c4 Clean up some nits in link_elf_(un)load_file().
- Remove a redundant assignment of ef->address.
- Don't return a Mach error number to the caller if vm_map_find() fails.
- Use ptoa() and fix style.

MFC after:	2 weeks
Sponsored by:	Netflix
2019-10-17 21:25:50 +00:00
Conrad Meyer
addccb8c51 Add a very limited DDB dumpon(8)-alike to MI dumper code
This allows ddb(4) commands to construct a static dumperinfo during
panic/debug and invoke doadump(false) using the provided dumper
configuration (always inserted first in the list).

The intended usecase is a ddb(4)-time netdump(4) command.

Reviewed by:	markj (earlier version)
Differential Revision:	https://reviews.freebsd.org/D21448
2019-10-17 18:29:44 +00:00
Conrad Meyer
7790c8c199 Split out a more generic debugnet(4) from netdump(4)
Debugnet is a simplistic and specialized panic- or debug-time reliable
datagram transport.  It can drive a single connection at a time and is
currently unidirectional (debug/panic machine transmit to remote server
only).

It is mostly a verbatim code lift from netdump(4).  Netdump(4) remains
the only consumer (until the rest of this patch series lands).

The INET-specific logic has been extracted somewhat more thoroughly than
previously in netdump(4), into debugnet_inet.c.  UDP-layer logic and up, as
much as possible as is protocol-independent, remains in debugnet.c.  The
separation is not perfect and future improvement is welcome.  Supporting
INET6 is a long-term goal.

Much of the diff is "gratuitous" renaming from 'netdump_' or 'nd_' to
'debugnet_' or 'dn_' -- sorry.  I thought keeping the netdump name on the
generic module would be more confusing than the refactoring.

The only functional change here is the mbuf allocation / tracking.  Instead
of initiating solely on netdump-configured interface(s) at dumpon(8)
configuration time, we watch for any debugnet-enabled NIC for link
activation and query it for mbuf parameters at that time.  If they exceed
the existing high-water mark allocation, we re-allocate and track the new
high-water mark.  Otherwise, we leave the pre-panic mbuf allocation alone.
In a future patch in this series, this will allow initiating netdump from
panic ddb(4) without pre-panic configuration.

No other functional change intended.

Reviewed by:	markj (earlier version)
Some discussion with:	emaste, jhb
Objection from:	marius
Differential Revision:	https://reviews.freebsd.org/D21421
2019-10-17 16:23:03 +00:00
Andriy Gapon
5fdc2c044e provide a way to assign taskqueue threads to a kernel process
This can be used to group all threads belonging to a single logical
entity under a common kernel process.
I am planning to use the new interface for ZFS threads.

MFC after:	4 weeks
2019-10-17 06:32:34 +00:00
Mark Johnston
6d775f0ba1 Use KOBJMETHOD_END in the kernel linker.
MFC after:	1 week
2019-10-16 22:06:19 +00:00
Mark Johnston
01cef4caa7 Remove page locking from pmap_mincore().
After r352110 the page lock no longer protects a page's identity, so
there is no purpose in locking the page in pmap_mincore().  Instead,
if vm.mincore_mapped is set to the non-default value of 0, re-lookup
the page after acquiring its object lock, which holds the page's
identity stable.

The change removes the last callers of vm_page_pa_tryrelock(), so
remove it.

Reviewed by:	kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21823
2019-10-16 22:03:27 +00:00
Andrew Turner
9bb37c03fb Stop leaking information from the kernel through timespec
The timespec struct holds a seconds value in a time_t and a nanoseconds
value in a long. On most architectures these are the same size, however
on 32-bit architectures other than i386 time_t is 8 bytes and long is
4 bytes.

Most ABIs will then pad a struct holding an 8 byte and 4 byte value to
16 bytes with 4 bytes of padding. When copying one of these structs the
compiler is free to copy the padding if it wishes.

In this case the padding may contain kernel data that is then leaked to
userspace. Fix this by copying the timespec elements rather than the
entire struct.

This doesn't affect Tier-1 architectures so no SA is expected.

admbugs:	651
MFC after:	1 week
Sponsored by:	DARPA, AFRL
2019-10-16 13:21:01 +00:00
Kristof Provost
1d95443818 Generalize ARM specific comments in devmap
The comments in devmap are very ARM specific, this generalizes them for other
architectures.

Submitted by:	Nicholas O'Brien <nickisobrien_gmail.com>
Reviewed by:	manu, philip
Sponsored by:	Axiado
Differential Revision:	https://reviews.freebsd.org/D22035
2019-10-15 23:21:52 +00:00
Gleb Smirnoff
4b25d1f2e3 Missing from r353596. 2019-10-15 21:32:38 +00:00
Gleb Smirnoff
bac060388f When assertion for a thread not being in an epoch fails also print all
entered epochs. Works with EPOCH_TRACE only.

Reviewed by:	hselasky
Differential Revision:	https://reviews.freebsd.org/D22017
2019-10-15 21:24:25 +00:00
Gleb Smirnoff
237c1f932b Remove pfctlinput2(). It came from KAME and had never ever been in use. 2019-10-15 15:40:03 +00:00
Jeff Roberson
0012f373e4 (4/6) Protect page valid with the busy lock.
Atomics are used for page busy and valid state when the shared busy is
held.  The details of the locking protocol and valid and dirty
synchronization are in the updated vm_page.h comments.

Reviewed by:    kib, markj
Tested by:      pho
Sponsored by:   Netflix, Intel
Differential Revision:        https://reviews.freebsd.org/D21594
2019-10-15 03:45:41 +00:00
Jeff Roberson
63e9755548 (1/6) Replace busy checks with acquires where it is trival to do so.
This is the first in a series of patches that promotes the page busy field
to a first class lock that no longer requires the object lock for
consistency.

Reviewed by:	kib, markj
Tested by:	pho
Sponsored by:	Netflix, Intel
Differential Revision:	https://reviews.freebsd.org/D21548
2019-10-15 03:35:11 +00:00
Leandro Lupori
0ecc478b74 [PPC64] Initial kernel minidump implementation
Based on POWER9BSD implementation, with all POWER9 specific code removed and
addition of new methods in PPC64 MMU interface, to isolate platform specific
code. Currently, the new methods are implemented on pseries and PowerNV
(D21643).

Reviewed by:	jhibbits
Differential Revision:	https://reviews.freebsd.org/D21551
2019-10-14 13:04:04 +00:00
Gleb Smirnoff
f6eccf96a0 Since EPOCH_TRACE had been moved to opt_global.h, we don't need to waste
extra space in struct thread.
2019-10-14 04:17:56 +00:00
Mateusz Guzik
d1cbf3eeea vfs: add MNTK_NOMSYNC
On many filesystems the traversal is effectively a no-op. Add a way to avoid
the overhead.

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22009
2019-10-13 15:40:34 +00:00
Mateusz Guzik
737241cd51 vfs: return free vnode batches in sync instead of vfs_msync
It is a more natural fit. vfs_msync only deals with active vnodes.

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22008
2019-10-13 15:39:11 +00:00
Alexander Motin
a89a562b60 Allocate device softc from the device domain.
Since we are trying to bind device interrupt threads to the device domain,
it should have sense to make memory often accessed by them local. If domain
is not known, fall back to round-robin.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2019-10-12 19:03:07 +00:00
Kristof Provost
85d1151f96 mountroot: run statfs after mounting devfs
The usual flow for mounting a file system is to VFS_MOUNT() and then
immediately VFS_STATFS().

That's not done in vfs_mountroot_devfs(), which means the
mp->mnt_stat.f_iosize field is not correctly populated, which in turn
causes us to mark valid aio operations as unsafe (because the io size is
set to 0), ultimately causing the aio_test:md_waitcomplete test to fail.

Reviewed by:	mckusick
MFC after:	1 week
Sponsored by:	Axiado
Differential Revision:	https://reviews.freebsd.org/D21897
2019-10-11 17:04:38 +00:00
Conrad Meyer
46d70077be ddb: Add CSV option, sorting to 'show (malloc|uma)'
Add /i option for machine-parseable CSV output.  This allows ready copy/
pasting into more sophisticated tooling outside of DDB.

Add total zone size ("Memory Use") as a new column for UMA.

For both, sort the displayed list on size (print the largest zones/types
first).  This is handy for quickly diagnosing "where has my memory gone?" at
a high level.

Submitted by:	Emily Pettigrew <Emily.Pettigrew AT isilon.com> (earlier version)
Sponsored by:	Dell EMC Isilon
2019-10-11 01:31:31 +00:00
John Baldwin
97ecf6efa0 Don't free the cursor boundary tag during vmem_destroy().
The cursor boundary tag is statically allocated in the vmem instead of
from the vmem_bt_zone.  Explicitly remove it from the vmem's segment
list in vmem_destroy before freeing all the segments from the vmem.

Reviewed by:	markj
MFC after:	1 week
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D21953
2019-10-09 21:20:39 +00:00
Gleb Smirnoff
975b8f8462 Cleanup unneeded includes that crept in with r353292. 2019-10-09 16:59:42 +00:00
Gleb Smirnoff
ff3cfc330e Enter network epoch in domain callouts. 2019-10-09 16:21:05 +00:00
Mark Johnston
4013d72684 Fix handling of empty SCM_RIGHTS messages.
As unp_internalize() processes the input control messages, it builds
an output mbuf chain containing the internalized representations of
those messages.  In one special case, that of an empty SCM_RIGHTS
message, the message is simply discarded.  However, the loop which
appends mbufs to the output chain assumed that each iteration would
produce an mbuf, resulting in a null pointer dereference if an empty
SCM_RIGHTS message was followed by a non-empty message.

Fix this by advancing the output mbuf chain tail pointer only if an
internalized control message was produced.

Reported by:	syzbot+1b5cced0f7fad26ae382@syzkaller.appspotmail.com
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2019-10-08 23:34:48 +00:00
John Baldwin
9e14430d46 Add a TOE KTLS mode and a TOE hook for allocating TLS sessions.
This adds the glue to allocate TLS sessions and invokes it from
the TLS enable socket option handler.  This also adds some counters
for active TOE sessions.

The TOE KTLS mode is returned by getsockopt(TLSTX_TLS_MODE) when
TOE KTLS is in use on a socket, but cannot be set via setsockopt().

To simplify various checks, a TLS session now includes an explicit
'mode' member set to the value returned by TLSTX_TLS_MODE.  Various
places that used to check 'sw_encrypt' against NULL to determine
software vs ifnet (NIC) TLS now check 'mode' instead.

Reviewed by:	np, gallatin
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D21891
2019-10-08 21:34:06 +00:00
Doug Moore
2288078c5e Define macro VM_MAP_ENTRY_FOREACH for enumerating the entries in a vm_map.
In case the implementation ever changes from using a chain of next pointers,
then changing the macro definition will be necessary, but changing all the
files that iterate over vm_map entries will not.

Drop a counter in vm_object.c that would have an effect only if the
vm_map entry count was wrong.

Discussed with: alc
Reviewed by: markj
Tested by: pho (earlier version)
Differential Revision:	https://reviews.freebsd.org/D21882
2019-10-08 07:14:21 +00:00
Gleb Smirnoff
b8a6e03fac Widen NET_EPOCH coverage.
When epoch(9) was introduced to network stack, it was basically
dropped in place of existing locking, which was mutexes and
rwlocks. For the sake of performance mutex covered areas were
as small as possible, so became epoch covered areas.

However, epoch doesn't introduce any contention, it just delays
memory reclaim. So, there is no point to minimise epoch covered
areas in sense of performance. Meanwhile entering/exiting epoch
also has non-zero CPU usage, so doing this less often is a win.

Not the least is also code maintainability. In the new paradigm
we can assume that at any stage of processing a packet, we are
inside network epoch. This makes coding both input and output
path way easier.

On output path we already enter epoch quite early - in the
ip_output(), in the ip6_output().

This patch does the same for the input path. All ISR processing,
network related callouts, other ways of packet injection to the
network stack shall be performed in net_epoch. Any leaf function
that walks network configuration now asserts epoch.

Tricky part is configuration code paths - ioctls, sysctls. They
also call into leaf functions, so some need to be changed.

This patch would introduce more epoch recursions (see EPOCH_TRACE)
than we had before. They will be cleaned up separately, as several
of them aren't trivial. Note, that unlike a lock recursion the
epoch recursion is safe and just wastes a bit of resources.

Reviewed by:	gallatin, hselasky, cy, adrian, kristof
Differential Revision:	https://reviews.freebsd.org/D19111
2019-10-07 22:40:05 +00:00
Edward Tomasz Napierala
1a13f2e6b4 Introduce stats(3), a flexible statistics gathering API.
This provides a framework to define a template describing
a set of "variables of interest" and the intended way for
the framework to maintain them (for example the maximum, sum,
t-digest, or a combination thereof).  Afterwards the user
code feeds in the raw data, and the framework maintains
these variables inside a user-provided, opaque stats blobs.
The framework also provides a way to selectively extract the
stats from the blobs.  The stats(3) framework can be used in
both userspace and the kernel.

See the stats(3) manual page for details.

This will be used by the upcoming TCP statistics gathering code,
https://reviews.freebsd.org/D20655.

The stats(3) framework is disabled by default for now, except
in the NOTES kernel (for QA); it is expected to be enabled
in amd64 GENERIC after a cool down period.

Reviewed by:	sef (earlier version)
Obtained from:	Netflix
Relnotes:	yes
Sponsored by:	Klara Inc, Netflix
Differential Revision:	https://reviews.freebsd.org/D20477
2019-10-07 19:05:05 +00:00
Mateusz Guzik
dc20b834ca vfs: add optional root vnode caching
Root vnodes looekd up all the time, e.g. when crossing a mount point.
Currently used routines always perform a costly lookup which can be
trivially avoided.

Reviewed by:	jeff (previous version), kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21646
2019-10-06 22:14:32 +00:00
Kyle Evans
e3f35d562f Remove the remnants of SI_CHEAPCLONE
SI_CHEAPCLONE was introduced in r66067 for use with cloned bpfs. It was
later also used in tty, tun, tap at points. The rough timeline for being
removed in each of these is as follows:

- r181690: bpf switched to use cdevpriv API by ed@
- r181905: ed@ rewrote the TTY later to be mpsafe
- r204464: kib@ removes it from tun/tap, declaring it unused

I've not yet been able to dig up any other consumers in the intervening 9
years. It is no longer set on any devices in the tree and leaves an
interesting situation in make_dev_sv where we're ok with the device already
being set SI_NAMED.
2019-10-05 21:52:06 +00:00
Kyle Evans
d42fecb5c1 kern_conf: fully initialize cloned devices with make_dev_args, too
Attempting to initialize si_drv{1,2} with mda_si_drv{1,2} does not work if
you are operating on cloned devices.

clone_create must be called prior to the make_dev* family to create/return
the device on the clonelist as needed. This device is later returned early
in newdev(), prior to si_drv{0,1,2} initialization.

This patch simply breaks out of the loop if we've found a device and
finishes init.

Reviewed by:	kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D21904
2019-10-05 21:44:18 +00:00
Mateusz Guzik
dfa8dae493 devfs: plug redundant bwillwrite avoidance
vn_write already checks for vnode type to see if bwillwrite should be called.

This effectively reverts r244643.

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21905
2019-10-05 17:44:33 +00:00
Eric van Gyzen
e61e783b83 Add CTLFLAG_STATS to some vfs sysctl OIDs
Add CTLFLAG_STATS to the following OIDs:

vfs.altbufferflushes
vfs.recursiveflushes
vfs.barrierwrites
vfs.flushwithdeps
vfs.reassignbufcalls

Refer to r353111.

MFC after:	2 weeks
Sponsored by:	Dell EMC Isilon
2019-10-04 21:43:43 +00:00
Ed Maste
f91dd6091b simplify path handling in sysctl_try_reclaim_vnode
MAXPATHLEN / PATH_MAX includes space for the terminating NUL, and namei
verifies the presence of the NUL.  Thus there is no need to increase the
buffer size here.

The sysctl passes the string excluding the NUL, so req->newlen equal to
PATH_MAX is too long.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21876
2019-10-02 21:01:23 +00:00
Mark Johnston
5131cba6d6 Use OBJT_PHYS VM objects for kernel modules.
OBJT_DEFAULT incurs some unnecessary overhead given that kernel module
pages cannot be paged out.

Reviewed by:	alc, kib
MFC after:	1 week
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21862
2019-10-02 16:34:42 +00:00
Mark Johnston
4a7b33ecf4 Disallow fcntl(F_READAHEAD) when the vnode is not a regular file.
The mountpoint may not have defined an iosize parameter, so an attempt
to configure readahead on a device file can lead to a divide-by-zero
crash.

The sequential heuristic is not applied to I/O to or from device files,
and posix_fadvise(2) returns an error when v_type != VREG, so perform
the same check here.

Reported by:	syzbot+e4b682208761aa5bc53a@syzkaller.appspotmail.com
Reviewed by:	kib
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21864
2019-10-02 15:45:49 +00:00
Kyle Evans
5a391b572b shm_open2(2): completely unbreak
kern_shm_open2(), since conception, completely fails to pass the mode along
to kern_shm_open(). This breaks most uses of it.

Add tests alongside this that actually check the mode of the returned
files.

PR:		240934 [pulseaudio breakage]
Reported by:	ler, Andrew Gierth [postgres breakage]
Diagnosed by:	Andrew Gierth (great catch)
Tested by:	ler, tmunro
Pointy hat to:	kevans
2019-10-02 02:37:34 +00:00
Ed Maste
f403831e6c sysalls.master: remove superfluous ellipsis in comment
A single period is sufficient in this comment, and making this change
lets us find references to varargs syscalls by searching for ...
2019-10-01 17:05:21 +00:00
Brooks Davis
3a94552174 Restore the ability to set capenabled directly in syscalls.conf.
This fixes generation of cloudabi syscall tables broken in r340424.

Reviewed by:	kevans, emaste
MFC after:	3 days
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D21821
2019-09-30 20:58:29 +00:00