Commit Graph

18365 Commits

Author SHA1 Message Date
Alex Richardson
fa2528ac64 Use atomic loads/stores when updating td->td_state
KCSAN complains about racy accesses in the locking code. Those races are
fine since they are inside a TD_SET_RUNNING() loop that expects the value
to be changed by another CPU.

Use relaxed atomic stores/loads to indicate that this variable can be
written/read by multiple CPUs at the same time. This will also prevent
the compiler from doing unexpected re-ordering.

Reported by:	GENERIC-KCSAN
Test Plan:	KCSAN no longer complains, kernel still runs fine.
Reviewed By:	markj, mjg (earlier version)
Differential Revision: https://reviews.freebsd.org/D28569
2021-02-18 14:02:48 +00:00
Konstantin Belousov
fa3bd463ce lockf: ensure atomicity of lockf for open(O_CREAT|O_EXCL|O_EXLOCK)
or EX_SHLOCK.  Do it by setting a vnode iflag indicating that the locking
exclusive open is in progress, and not allowing F_LOCK request to make
a progress until the first open finishes.

Requested by:	mckusick
Reviewed by:	markj, mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D28697
2021-02-18 01:22:05 +02:00
Warner Losh
00065c7630 Giant: move back Giant removal until 14
Update the Giant Lock warning message to FreeBSD 14. It's growing increasling
clear that this won't be done before 13.0.

MFC: Insta (re@'s request)
2021-02-17 14:33:09 -07:00
Jamie Gritton
cc7b730653 jail: Handle a possible race between jail_remove(2) and fork(2)
jail_remove(2) includes a loop that sends SIGKILL to all processes
in a jail, but skips processes in PRS_NEW state.  Thus it is possible
the a process in mid-fork(2) during jail removal can survive the jail
being removed.

Add a prison flag PR_REMOVE, which is checked before the new process
returns.  If the jail is being removed, the process will then exit.
Also check this flag in jail_attach(2) which has a similar issue.

Reported by:    trasz
Approved by:    kib
MFC after:      3 days
2021-02-16 11:19:13 -08:00
Konstantin Belousov
c61fae1475 pgcache read: protect against reads past end of the vm object size
If uio_offset is past end of the object size, calculated resid is negative.
Delegate handling this case to the locked read, as any other non-trivial
situation.

PR:	253158
Reported by:	Harald Schmalzbauer <bugzilla.freebsd@omnilan.de>
Tested by:	cy
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2021-02-16 07:09:37 +02:00
Alex Richardson
0482d7c9e9 Fix fget_only_user() to return ENOTCAPABLE on a failed capsicum check
After eaad8d1303 four additional
capsicum-test tests started failing. It turns out this is because
fget_only_user() was returning EBADF on a failed capsicum check instead
of forwarding the return value of cap_check_inline() like
fget_unlocked_seq().

capsicum-test failures before this:
```
[  FAILED  ] 7 tests, listed below:
[  FAILED  ] Capability.OperationsForked
[  FAILED  ] Capability.NoBypassDAC
[  FAILED  ] Pdfork.OtherUserForked
[  FAILED  ] PipePdfork.WildcardWait
[  FAILED  ] OpenatTest.WithFlag
[  FAILED  ] ForkedOpenatTest_WithFlagInCapabilityMode._
[  FAILED  ] Select.LotsOFileDescriptorsForked
```
After:
```
[  FAILED  ] 3 tests, listed below:
[  FAILED  ] Capability.NoBypassDAC
[  FAILED  ] Pdfork.OtherUserForked
[  FAILED  ] PipePdfork.WildcardWait
```

Reviewed By:	mjg
MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D28691
2021-02-15 22:55:12 +00:00
Jason A. Harmening
41032835dc Fix divide-by-zero panic when ASLR is enabled and superpages disabled
When locating the anonymous memory region for a vm_map with ASLR
enabled, we try to keep the slid base address aligned on a superpage
boundary to minimize pagetable fragmentation and maximize the potential
usage of superpage mappings.  We can't (portably) do this if superpages
have been disabled by loader tunable and pagesizes[1] is 0, and it
would be less beneficial in that case anyway.

PR:		253511
Reported by:	johannes@jo-t.de
MFC after:	1 week
Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D28678
2021-02-15 10:38:04 -08:00
Mateusz Guzik
eac22dd480 lockmgr: shrink struct lock by 8 bytes on LP64
Currently the struct has a 4 byte padding stemming from 3 ints.

1. prio comfortably fits in short, unfortunately there is no dedicated
   type for it and plumbing it throughout the codebase is not worth it
   right now, instead an assert is added which covers also flags for
   safety
2. lk_exslpfail can in principle exceed u_short, but the count is
   already not considered reliable and it only ever gets modified
   straight to 0. In other words it can be incrementing with an upper
   bound of USHRT_MAX

With these in place struct lock shrinks from 48 to 40 bytes.

Reviewed by:	kib (previous version)
Differential Revision:	https://reviews.freebsd.org/D28680
2021-02-15 13:57:25 +00:00
Konstantin Belousov
25c6318c79 procstat: distinguish vm map guards in procstat vm output.
Requested and reviewed by:	rwatson (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D28658
2021-02-14 03:24:58 +02:00
Konstantin Belousov
adf28ab456 fifo: minor comment and assert improvements.
In particular, replace a note that reload through vget() is obsoleted,
with explanation why this code is required.

Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:22 +02:00
Konstantin Belousov
b59a8e63d6 Stop ignoring ERELOOKUP from VOP_INACTIVE()
When possible, relock the vnode and retry inactivation.  Only vunref() is
required not to drop the vnode lock, so handle it specially by not retrying.

This is a part of the efforts to ensure that unlinked not referenced vnode
does not prevent inode from reusing.

Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:21 +02:00
Konstantin Belousov
3b2aa36024 Use VOP_VPUT_PAIR() for eligible VFS syscalls.
The current list is limited to the cases where UFS needs to handle
vput(dvp) specially. Which means VOP_CREATE(), VOP_MKDIR(), VOP_MKNOD(),
VOP_LINK(), and VOP_SYMLINK().

Reviewed by:	chs, mkcusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:20 +02:00
Konstantin Belousov
49c117c193 Add VOP_VPUT_PAIR() with trivial default implementation.
The VOP is intended to be used in situations where VFS has two
referenced locked vnodes, typically a directory vnode dvp and a vnode
vp that is linked from the directory, and at least dvp is vput(9)ed.
The child vnode can be also vput-ed, but optionally left referenced and
locked.

There, at least UFS may need to do some actions with dvp which cannot be
done while vp is also locked, so its lock might be dropped temporary.
For instance, in some cases UFS needs to sync dvp to avoid filesystem
state that is currently not handled by either kernel nor fsck. Having
such VOP provides the neccessary context for filesystem which can do
correct locking and handle potential reclamation of vp after relock.

Trivial implementation does vput(dvp) and optionally vput(vp).

Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:20 +02:00
Konstantin Belousov
ee965dfa64 vn_open(): If the vnode is reclaimed during open(2), do not return error.
Most future operations on the returned file descriptor will fail
anyway, and application should be ready to handle that failures.  Not
forcing it to understand the transient failure mode on open, which is
implementation-specific, should make us less special without loss of
reporting of errors.

Suggested by: chs
Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:20 +02:00
Konstantin Belousov
bf0db19339 buf SU hooks: track buf_start() calls with B_IOSTARTED flag
and only call buf_complete() if previously started.  Some error paths,
like CoW failire, might skip buf_start() and do bufdone(), which itself
call buf_complete().

Various SU handle_written_XXX() functions check that io was started
and incomplete parts of the buffer data reverted before restoring them.
This is a useful invariant that B_IO_STARTED on buffer layer allows to
keep instead of changing check and panic into check and return.

Reported by:	pho
Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundations
2021-02-12 03:02:19 +02:00
Mateusz Guzik
39e0c3f686 cache: assorted comment fixups 2021-02-09 17:09:44 +01:00
Kyle Evans
504ebd612e kern: sonewconn: set so_options before pru_attach()
Protocol attachment has historically been able to observe and modify
so->so_options as needed, and it still can for newly created sockets.
779f106aa1 moved this to after pru_attach() when we re-acquire the
lock on the listening socket.

Restore the historical behavior so that pru_attach implementations can
consistently use it. Note that some pru_attach() do currently rely on
this, though that may change in the future. D28265 contains a change to
remove the use in TCP and IB/SDP bits, as resetting the requested linger
time on incoming connections seems questionable at best.

This does move the assignment out from under the head's listen lock, but
glebius notes that head won't be going away and applications cannot
assume any specific ordering with a race between a connection coming in
and the application changing socket options anyways.

Discussed-with:	glebius
MFC-after:	1 week
2021-02-08 21:44:43 -06:00
Alexander V. Chernikov
924d1c9a05 Revert "SO_RERROR indicates that receive buffer overflows should be handled as errors."
Wrong version of the change was pushed inadvertenly.

This reverts commit 4a01b854ca.
2021-02-08 22:32:32 +00:00
Alexander V. Chernikov
4a01b854ca SO_RERROR indicates that receive buffer overflows should be handled as errors.
Historically receive buffer overflows have been ignored and programs
could not tell if they missed messages or messages had been truncated
because of overflows. Since programs historically do not expect to get
receive overflow errors, this behavior is not the default.

This is really really important for programs that use route(4) to keep in sync
with the system. If we loose a message then we need to reload the full system
state, otherwise the behaviour from that point is undefined and can lead
to chasing bogus bug reports.
2021-02-08 21:42:20 +00:00
Mark Johnston
b5aa9ad43a ktls: Make configuration sysctls available as tunables
Reviewed by:	gallatin, jhb
Sponsored by:	Ampere Computing
Submitted by:	Klara, Inc.
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D28499
2021-02-08 09:19:02 -05:00
Mark Johnston
1755b2b989 ktls: Use COUNTER_U64_DEFINE_EARLY
This makes it a bit more straightforward to add new counters when
debugging.  No functional change intended.

Reviewed by:	jhb
Sponsored by:	Ampere Computing
Submitted by:	Klara, Inc.
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D28498
2021-02-08 09:18:51 -05:00
Mateusz Guzik
2f8a844635 cache: remove the largely obsolete general description
Examples of inconsistencies with the current state:
- references LRU of all entries, removed years ago
- references a non-existent lock (neglist)
- claims negative entries have a NULL target

It will be replaced with a more accurate and more informative
description.

In the meantime take it out so it stops misleading.
2021-02-06 00:28:40 +01:00
Mateusz Guzik
0e1594e60e cache: fix vfs:namecache:lookup:miss probe call sites 2021-02-06 00:28:40 +01:00
Mateusz Guzik
2e96132a7d cache: drop spurious arg from panic in cache_validate
vp is already reported when noting mismatch
2021-02-06 00:28:39 +01:00
Mateusz Guzik
b54ed778fe cache: comment on FNV 2021-02-06 00:13:57 +01:00
Ed Maste
edc374e7c4 Correct description for kern.proc.proc_td
kern.proc.proc_td returns the process table with an entry for each
thread.  Previously the description included "no threads", presumably
a cut-and-pasteo in 2648efa621.

Description suggested by PauAmma.

PR:		253146
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
2021-02-02 17:00:05 -05:00
Mateusz Guzik
45456abc4c cache: fix trailing slash support in face of permission problems
Reported by:	Johan Hendriks <joh.hendriks gmail.com>
Tested by:	kevans
2021-02-02 18:13:51 +00:00
Mateusz Guzik
6f19dc2124 cache: add delayed degenerate path handling 2021-02-01 04:53:23 +00:00
Mateusz Guzik
bbfb1edd70 cache: move hash computation into the parsing loop 2021-02-01 04:36:45 +00:00
Mateusz Guzik
e027e24bfa cache: add trailing slash support
Tested by:	pho
2021-01-31 12:02:46 +00:00
Mateusz Guzik
8cbd164a17 cache: handle NOFOLLOW requests for symlinks
Tested by:	pho
2021-01-31 12:02:46 +00:00
Gleb Smirnoff
3f43ada98c Catch up with 6edfd179c8: mechanically rename IFCAP_NOMAP to IFCAP_MEXTPG.
Originally IFCAP_NOMAP meant that the mbuf has external storage pointer
that points to unmapped address.  Then, this was extended to array of
such pointers.  Then, such mbufs were augmented with header/trailer.
Basically, extended mbufs are extended, and set of features is subject
to change.  The new name should be generic enough to avoid further
renaming.
2021-01-29 11:46:24 -08:00
Mateusz Guzik
45e1f85414 poll: use fget_unlocked or fget_only_user when feasible
This follows select by eleminating the use of filedesc lock.
This is a win for single-threaded processes and a mixed bag for others
as at small concurrency it is faster to take the lock instead of
refing/unrefing each file descriptor.

Nonetheless, removal of shared lock usage is a step towards a
mtx-protected fd table.
2021-01-29 11:23:44 +00:00
Mateusz Guzik
6affe1b712 select: employ fget_only_user
Since most select users are single-threaded this avoid a lot of work
in the common case.

For example select of 16 fds (ops/s):
before:	2114536
after:	2991010
2021-01-29 11:23:44 +00:00
Mateusz Guzik
eaad8d1303 fd: add fget_only_user
This can be used by single-threaded processes which don't share a file
descriptor table to access their file objects without having to
reference them.

For example select consumers tend to match the requirement and have
several file descriptors to inspect.
2021-01-29 11:23:43 +00:00
Jamie Gritton
c050ea803e jail: Handle a parent jail when a child is added to it
It's possible when adding a jail that its dying parent comes back to
life.  Only allow that to happen when JAIL_DYING is specified.  And if
it does happen, call PR_METHOD_CREATE on it.
2021-01-28 21:51:09 -08:00
Bryan Drewery
c926114f2f Fix getblk() with GB_NOCREAT returning false-negatives.
It is possible for a buf to be reassigned between the dirty and clean
lists while gbincore_unlocked() looks in each list.  Avoid creating
a buffer in that case and fallback to a locked lookup.

This fixes a regression from r363482.

More discussion on potential improvements to the clean and dirty lists
handling is in the review.

Reviewed by:	cem, kib, markj, vangyzen, rlibby
Reported by:	Suraj.Raju at dell.com
Submitted by:	Suraj.Raju at dell.com, cem, [based on both]
MFC after:	2 weeks
Sponsored by:	Dell EMC
Differential Revision:	https://reviews.freebsd.org/D28375
2021-01-28 11:24:24 -08:00
Mateusz Guzik
5c325977b1 cache: add missing MNT_NOSYMFOLLOW check to symlink traversal 2021-01-27 15:08:38 +00:00
Mateusz Guzik
5fc384d181 cache: fallback when encountering a mount point during .. lookup
The current abort is overzealous.
2021-01-27 16:00:31 +01:00
Bjoern A. Zeeb
6f65b50546 firmware(9): extend firmware_get() by a "no warn" flag.
With the upcoming usage from LinuxKPI but also from drivers
ported natively we are seeing more probing of various
firmware (names).

Add the ability to firmware(9) to silence the
"firmware image loading/registering errors" by adding a new
firmware_get_flags() functions extending firmware_get() and
taking a flags argument as firmware_put() already does.

Requested-by:	zeising (for future LinuxKPI/DRM)
Sponsored-by:	The FreeBSD Foundation
Sponsored-by:	Rubicon Communications, LLC ("Netgate")
MFC after:	3 days
Reviewed-by:	markj
Differential Revision:	https://reviews.freebsd.org/D27413
2021-01-27 13:51:26 +00:00
Mateusz Guzik
a098a831a1 cache: tidy up handling of foo/bar lookups where foo is not a directory
The code was performing an avoidable check for doomed state to account
for foo being a VDIR but turning VBAD. Now that dooming puts a vnode
in a permanent "modify" state this is no longer necessary as the final
status check will catch it.
2021-01-26 20:42:53 +00:00
Mateusz Guzik
a51eca7936 cache: stop referring to removing entries as invalidating them
Said use is a remnant from the old code and clashes with the NCF_INVALID
flag.
2021-01-26 20:42:53 +00:00
Brooks Davis
d89c1c461c Reserve gaps in syscall numbers for local use
It is best for auditing of syscalls.master if we only append to the
file.  Reserving unimplemented system call numbers for local use makes
this policy and provides a large set of syscall numbers FreeBSD
derivatives can use without risk of conflict.

Reviewed by:	jhb, kevans, kib
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D27988
2021-01-26 18:27:45 +00:00
Brooks Davis
119fa6ee8a syscalls.master: Add a new syscall type: RESERVED
RESERVED syscall number are reserved for local/vendor use.  RESERVED is
identical to UNIMPL except that comments are ignored.

Reviewed by:	kevans
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D27988
2021-01-26 18:27:44 +00:00
Brooks Davis
65a524b499 Remove documentation of unimplemented syscalls
We have not been able to run binaries from other BSDs well over a
decade.  There is no need to document their allocation decisions here.

We also don't need to reserve syscall numbers of never-implemented
syscalls.

Reviewed by:	jhb, kib
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D27988
2021-01-26 18:27:44 +00:00
Mateusz Guzik
6943671b48 cache: convert cache_fplookup_parse to void now that it always succeeds 2021-01-26 13:24:03 +01:00
Mateusz Guzik
e7cf562a40 cache: change ->v_cache_dd synchronisation rules
Instead of resorting to seqc modification take advantage of immutability
of entries and check if the entry still matches after everything got
prepared.
2021-01-25 22:41:13 +00:00
Mateusz Guzik
6f08427649 cache: make ->v_cache_dd accesses atomic-clean for lockless usage 2021-01-25 22:41:13 +00:00
Mateusz Guzik
6ef8fede86 cache: make ->nc_flag accesses atomic-clean for lockless usage 2021-01-25 22:41:13 +00:00
Mateusz Guzik
ffcf8f97f8 cache: store vnodes in local vars in cache_zap_locked 2021-01-25 22:41:13 +00:00
Mateusz Guzik
868643e722 cache: assorted cleanups 2021-01-25 19:45:24 +00:00
Mateusz Guzik
1c7a65adb0 cache: track calls to cache_symlink_alloc with unsupported size
While here assert on size passed to free.
2021-01-25 19:45:23 +00:00
Mateusz Guzik
02ec31bdf6 cache: add back target entry on rename 2021-01-23 18:10:16 +00:00
Mateusz Guzik
739ecbcf1c cache: add symlink support to lockless lookup
Reviewed by:	kib (previous version)
Tested by:	pho (previous version)
Differential Revision:	https://reviews.freebsd.org/D27488
2021-01-23 15:04:43 +00:00
Jamie Gritton
195cd6ae24 jail: fix dangling reference bug from 6754ae2572
The change to use refcounts for pr_uref was mishandled in
prison_proc_free, so killing a jail's last process could add
an extra reference, leaving it an unkillable zombie.
2021-01-22 10:56:24 -08:00
Jamie Gritton
39c8ef90f6 jail: A jail could be removed without calling OSD methods
Fix a long-standing bug where setting nopersist on a process-less jail
would remove it without calling the the OSD PR_METHOD_REMOVE methods.
2021-01-22 10:50:10 -08:00
Marius Strobl
679e4cdabd kvprintf(9): add missing FALLTHROUGH
Reported by:	Coverity
CID:		1005166
2021-01-22 00:18:40 +01:00
Konstantin Belousov
1ac7c34486 malloc_aligned: roundup allocation size up to next power of two
to make it use the right aligned zone.

Reported by:	melifaro
Reviewed by:	alc, markj (previous version)
Discussed with:	jrtc27
Tested by:	pho (previous version)
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D28219
2021-01-21 23:34:10 +02:00
Konstantin Belousov
0781c79d48 Restrict supported alignment for malloc_domainset_aligned(9) to PAGE_SIZE.
UMA page_alloc() does not take an alignment, so UMA can only handle
alignment less then page size.

Noted by:	alc
Reviewed by:	alc, markj (previous version)
Discussed with:	jrtc27
Tested by:	pho (previous version)
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D28219
2021-01-21 23:34:10 +02:00
Jamie Gritton
6754ae2572 jail: Use refcount(9) for prison references.
Use refcount(9) for both pr_ref and pr_uref in struct prison.  This
allows prisons to held and freed without requiring the prison mutex.
An exception to this is that dropping the last reference will still
lock the prison, to keep the guarantee that a locked prison remains
valid and alive (provided it was at the time it was locked).

Among other things, this honors the promise made in a comment in
crcopy(9), that it will not block, which hasn't been true for two
decades.
2021-01-20 15:08:27 -08:00
Vladimir Kondratyev
e3dd8ed77b devinfo sysctl handler: Do not write zero-length strings in to sbuf twice
This fixes missing PnPinfo and location strings in devinfo(8) output
for devices with no attached drivers.
2021-01-21 02:06:16 +03:00
Alan Somers
2247f48941 aio: micro-optimize the lio_opcode assignments
This allows slightly more efficient opcode testing in-kernel.  It is
transparent to userland, except to applications that sneakily submit
aio fsync or aio mlock operations via lio_listio, which has never been
documented, requires the use of deliberately undefined constants
(LIO_SYNC and LIO_MLOCK), and is arguably a bug.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D27942
2021-01-20 09:02:25 -07:00
Alex Richardson
7e99c034f7 Emit uprintf() output for initproc if there is no controlling terminal
This patch helped me debug why /sbin/init was not being loaded after
making changes to the image activator in CheriBSD.

Reviewed By:	jhb (earlier version), kib
Differential Revision: https://reviews.freebsd.org/D28121
2021-01-20 09:54:46 +00:00
Mateusz Guzik
2171b8e8a2 cache: augment sdt probe in cache_fplookup_dot
Same as 6d386b4c ("cache: save a branch in cache_fplookup_next")
2021-01-20 07:23:14 +00:00
Mateusz Guzik
aae03cfe64 cache: whitespace nit in cache_fplookup_modifying 2021-01-20 07:22:04 +00:00
Mark Johnston
4dc1b17dbb ktls: Improve handling of the bind_threads tunable a bit
- Only check for empty domains if we actually tried to configure domain
  affinity in the first place.  Otherwise setting bind_threads=1 will
  always cause the sysctl value to be reported as zero.  This is
  harmless since the threads end up being bound, but it's confusing.
- Try to improve the sysctl description a bit.

Reviewed by:	gallatin, jhb
Submitted by:	Klara, Inc.
Sponsored by:	Ampere Computing
Differential Revision:	https://reviews.freebsd.org/D28161
2021-01-19 21:32:33 -05:00
Mateusz Guzik
38baca17e0 lockmgr: fix upgrade
TRYUPGRADE requests kept failing when they should not have due to wrong
macro used to count readers.

Fixes:	f6b091fbbd ("lockmgr: rewrite upgrade to stop always dropping the lock")
Noted by:	asomers
Differential Revision:	https://reviews.freebsd.org/D27947
2021-01-19 12:21:38 +00:00
Mateusz Guzik
57dab0292a cache: fix some typos 2021-01-19 10:17:14 +01:00
Mateusz Guzik
84ab77ad27 cache: drop-write only var from cache_fplookup_preparse 2021-01-19 10:13:30 +01:00
Mateusz Guzik
6d386b4c8a cache: save a branch in cache_fplookup_next
Previously the code would branch on top find out whether it should
branch on SDT probe and bumping the numposhits counter, depending
on cache_fplookup_cross_mount.

Arguably it should be done regardless of what said function returns.
2021-01-19 10:08:24 +01:00
Jamie Gritton
effad35ed1 jail: Clean up some function placement and improve comments.
Move prison_hold, prison_hold_locked ,prison_proc_hold, and
prison_proc_free to a more intuitive part of the file (together with
with prison_free and prison_free_locked), and add or improve comments
to these and others, to better describe what's going in the prison
reference cycle.

No functional changes.
2021-01-18 17:23:51 -08:00
Oleksandr Tymoshenko
248f0cabca make maximum interrupt number tunable on ARM, ARM64, MIPS, and RISC-V
Use a machdep.nirq tunable intead of compile-time constant NIRQ
as a value for maximum number of interrupts. It allows keep a system
footprint small by default with an option to increase the limit
for large systems like server-grade ARM64

Reviewd by:	mhorne
Differential Revision:	https://reviews.freebsd.org/D27844
Submitted by:	Klara, Inc.
Sponsored by:	Ampere Computing
2021-01-18 16:36:39 -08:00
Jamie Gritton
83bc72a04e jail: Fix a stray mutex from 76ad42abf9. 2021-01-18 15:47:09 -08:00
Jamie Gritton
76ad42abf9 jail: Add prison_isvalid() and prison_isalive()
prison_isvalid() checks if a prison record can be used at all, i.e.
pr_ref > 0.  This filters out prisons that aren't fully created, and
those that are either in the process of being dismantled, or will be
at the next opportunity.  While the check for pr_ref > 0 is simple
enough to make without a convenience function, this prepares the way
for other measures of prison validity.

prison_isalive() checks not only validity as far as the useablity of
the prison structure, but also whether the prison is visible to user
space.  It replaces a test for pr_uref > 0, which is currently only
used within kern_jail.c, and not often there.

Both of these functions also assert that either the prison mutex or
allprison_lock is held, since it's generally the case that unlocked
prisons aren't guaranteed to remain useable for any length of time.
This isn't entirely true, for example a thread can assume its own
prison is good, but most exceptions will exist inside of kern_jail.c.
2021-01-18 10:56:20 -08:00
Konstantin Belousov
36bcc44e2c Add ddb 'show timecounter' command.
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2021-01-18 09:51:48 +02:00
Jamie Gritton
25c2c952e3 jail: Add proper prison locking in mqfs_prison_remove. 2021-01-17 17:41:09 -08:00
Konstantin Belousov
3b15beb30b Implement malloc_domainset_aligned(9).
Change the power-of-two malloc zones to require alignment equal to the
size [*].  Current uma allocator already provides such alignment, so in
fact this change does not change anything except providing future-proof
setup.

Suggested by:	markj [*]
Reviewed by:	andrew, jah, markj
Tested by:	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D28147
2021-01-17 19:29:05 +02:00
Mateusz Guzik
fe258f23ef Save on getpid in setproctitle by supporting -1 as curproc. 2021-01-16 09:36:54 +01:00
Kirk McKusick
79a5c790bd Eliminate a locking panic when cleaning up UFS snapshots after a
disk failure.

Each vnode has an embedded lock that controls access to its contents.
However vnodes describing a UFS snapshot all share a single snapshot
lock to coordinate their access and update. As part of mounting a
UFS filesystem with snapshots, each of the vnodes describing a
snapshot has its individual lock replaced with the snapshot lock.
When the filesystem is unmounted the vnode's original lock is
returned replacing the snapshot lock.

When a disk fails while the UFS filesystem it contains is still
mounted (for example when a thumb drive is removed) UFS forcibly
unmounts the filesystem. The loss of the drive causes the GEOM
subsystem to orphan the provider, but the consumer remains until
the filesystem has finished with the unmount. Information describing
the snapshot locks was being prematurely cleared during the orphaning
causing the return of the snapshot vnode's original locks to fail.
The fix is to not clear the needed information prematurely.

Sponsored by: Netflix
2021-01-15 16:36:42 -08:00
Mitchell Horne
818390ce0c arm64: fix early devmap assertion
The purpose of this KASSERT is to ensure that we do not run out of space
in the early devmap. However, the devmap grew beyond its initial size of
2MB in r336519, and this assertion did not grow with it.

A devmap mapping of a 1080p framebuffer requires 1920x1080 bytes, or
1.977 MB, so it is just barely able to fit without triggering the
assertion, provided no other devices are mapped before it. With the
addition of `options GDB` in GENERIC by bbfa199cbc, the uart is now
mapped for the purposes of a debug port, before mapping the framebuffer.
The presence of both these conditions pushes the selected virtual
address just below the threshold, triggering the assertion.

To fix this, use the correct size of the devmap, defined by
PMAP_MAPDEV_EARLY_SIZE. Since this code is shared with RISC-V, define
it for that platform as well (although it is a different size).

PR:		25241
Reported by:	gbe
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
2021-01-13 17:27:44 -04:00
Mateusz Guzik
ef23df1354 vfs: set NC_KEEPPOSENTRY alongside NOCACHE when creating a file
Arguably the entire NOCACHE logic should get retired, in the meantime
at least prevent the code from evicting existing entries.
2021-01-13 15:29:34 +00:00
Mateusz Guzik
5753be8e43 fd: add refcount argument to falloc_noinstall
This lets callers avoid atomic ops by initializing the count to required
value from the get go.

While here add falloc_abort to backpedal from this without having to
fdrop.
2021-01-13 15:29:34 +00:00
Mateusz Guzik
5171310e66 vfs: use finstall_refed in openat
This avoids 2 atomic ops in the common case: 1 to grab an extra
reference and 1 to release it.
2021-01-13 03:30:38 +00:00
Mateusz Guzik
530b699a62 fd: add finstall_refed
Can be used to consume an already existing reference and consequently
avoid atomic ops.
2021-01-13 03:27:03 +01:00
Mateusz Guzik
4faa375cdd fd: provide a dedicated closef variant for unix socket code
This avoids testing for td != NULL.
2021-01-13 03:27:03 +01:00
Konstantin Belousov
0659df6fad vm_map_protect: allow to set prot and max_prot in one go.
This prevents a situation where other thread modifies map entries
permissions between setting max_prot, then relocking, then setting prot,
confusing the operation outcome.  E.g. you can get an error that is not
possible if operation is performed atomic.

Also enable setting rwx for max_prot even if map does not allow to set
effective rwx protection.

Reviewed by:	brooks, markj (previous version)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D28117
2021-01-13 01:35:22 +02:00
Mateusz Guzik
70ba77706d vfs: extend vfs:namei:lookup:return probe with nameidata 2021-01-12 13:35:27 +00:00
Mateusz Guzik
cdb62ab74e vfs: add NDFREE_NOTHING and convert several NDFREE_PNBUF callers
Check the comment above the routine for reasoning.
2021-01-12 13:16:10 +00:00
Mateusz Guzik
6b3a9a0f3d Convert remaining cap_rights_init users to cap_rights_init_one
semantic patch:

@@

expression rights, r;

@@

- cap_rights_init(&rights, r)
+ cap_rights_init_one(&rights, r)
2021-01-12 13:16:10 +00:00
Konstantin Belousov
57f22c828e sigfastblock: do not skip cursig/postsig loop in ast()
Even if sigfastblock block is non-zero, non-blockable signals must be
checked on ast and delivered now.  This also affects debugger ability
to attach, because issignal() also calls ptracestop() if there is
a pending stop for debugee.

Instead of checking for sigfastblock, and either setting PENDING flag
for usermode or doing signal delivery loop, always do the loop after
checking, and then handle PENDING bit. issignal() already does the right
thing for fast-blocked case, allowing only STOPs and SIGKILL delivery to
happen.

Reported by:	Vasily Postnicov <shamaz.mazum@gmail.com>, markj
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D28089
2021-01-12 12:45:26 +02:00
Konstantin Belousov
513320c0f1 sigfastblock_setpend(): do not set PEND user flag unless TDP_SIGFASTPENDING is set.
User pending bit should not be set if kernel did not noted a pending signal.

Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D28089
2021-01-12 12:43:34 +02:00
Alan Somers
ff1a307801 lio_listio: validate aio_lio_opcode
Previously, we would accept any kind of LIO_* opcode, including ones
that were intended for in-kernel use only like LIO_SYNC (which is not
defined in userland).  The situation became more serious with
022ca2fc7f.  After that revision, setting
aio_lio_opcode to LIO_WRITEV or LIO_READV would trigger an assertion.

Note that POSIX does not specify what should happen if aio_lio_opcode is
invalid.

MFC-with:	022ca2fc7f
Reviewed by:	jhb, tmunro, 0mp
Differential Revision:	<https://reviews.freebsd.org/D28078
2021-01-11 19:53:01 -07:00
Jason A. Harmening
e8a5a1ad71 rctl(4): support throttling resource usage to 0
For rate-based resources that support throttling (e.g.
readiops/writeips), this fixes a divide-by-zero panic when rctl(8)
passes 0 as the throttle value.  For these resources, treat
zero-throttle requests as requests to suspend forward progress as long
as possible using the duration specified in
kern.racct.rctl.throttle_max.

PR:		251803
Reported by:	chris@cretaforce.gr
Reviewed by:	kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D27858
2021-01-11 15:36:57 -08:00
Konstantin Belousov
4ea65707d3 exec_new_vmspace: print useful error message on ctty if stack cannot be mapped.
After old vmspace is destroyed during execve(2), but before the new space
is fully constructed, an error during image activation cannot be returned
because there is no executing program to receive it.

In the relatively common case of failure to map stack, print some hints
on the control terminal.  Note that user has enough knobs to cause stack
mapping error, and this is the most common reason for execve(2) aborting
the process.

Requested by:	jhb
Reviewed by:	emaste, jhb
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D28050
2021-01-12 01:15:43 +02:00
Konstantin Belousov
2e1c94aa1f Implement enforcing write XOR execute mapping policy.
It is checked in vm_map_insert() and vm_map_protect() that PROT_WRITE |
PROT_EXEC are never specified together, if vm_map has MAP_WX flag set.
FreeBSD control flag allows specific binary to request WX exempt, and
there are per ABI boolean sysctls kern.elf{32,64}.allow_wx to enable/
disable globally.

Reviewed by:	emaste, jhb
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D28050
2021-01-12 01:15:43 +02:00
Robert Watson
30b68ecda8 Changes that improve DTrace FBT reliability on freebsd/arm64:
- Implement a dtrace_getnanouptime(), matching the existing
  dtrace_getnanotime(), to avoid DTrace calling out to a potentially
  instrumentable function.

  (These should probably both be under KDTRACE_HOOKS.  Also, it's not clear
  to me that they are correct implementations for the DTrace thread time
  functions they are used in .. fixes for another commit.)

- Don't allow FBT to instrument functions involved in EL1 exception handling
  that are involved in FBT trap processing: handle_el1h_sync() and
  do_el1h_sync().

- Don't allow FBT to instrument DDB and KDB functions, as that makes it
  rather harder to debug FBT problems.

Prior to these changes, use of FBT on FreeBSD/arm64 rapidly led to kernel
panics due to recursion in DTrace.

Reliable FBT on FreeBSD/arm64 is reliant on another change from @andrew to
have the aarch64 instrumentor more carefully check that instructions it
replaces are against the stack pointer, which can otherwise lead to memory
corruption.  That change remains under review.

MFC after:	2 weeks
Reviewed by:	andrew, kp, markj (earlier version), jrtc27 (earlier version)
Differential revision:	https://reviews.freebsd.org/D27766
2021-01-11 15:42:22 +00:00
Robert Watson
4f2cbaf3cd Track pipe(2) reads and writes as rusage message receives and sends, a
feature misplaced during the transition from BSD 4.4's socket implementation
to the optimised FreeBSD pipe implementation.

MFC after:		1 week
Reviewed by:		arichardson, imp
Differential Revision:	https://reviews.freebsd.org/D27878
2021-01-10 12:16:39 +00:00
Jamie Gritton
2a4b225146 jail: Simplify handling of prison_deref()
Track the the current lock/reference state in a single variable,
rather than deducing the proper prison_deref() flags from a
combination of equations and hard-coded values.
2021-01-09 21:05:06 -08:00
Konstantin Belousov
5844bd058a jobc: rework detection of orphaned groups.
Instead of trying to maintain pg_jobc counter on each process group
update (and sometimes before), just calculate the counter when needed.
Still, for the benefit of the signal delivery code, explicitly mark
orphaned groups as such with the new process group flag.

This way we prevent bugs in the corner cases where updates to the counter
were missed due to complicated configuration of p_pptr/p_opptr/real_parent
(debugger).

Since we need to iterate over all children of the process on exit, this
change mostly affects the process group entry and leave, where we need
to iterate all process group members to detect orpaned status.

(For MFC, keep pg_jobc around but unused).

Reported by:	jhb
Reviewed by:	jilles
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27871
2021-01-10 04:41:20 +02:00
Konstantin Belousov
cf4f802e77 kinfo_proc: move job-control related data collection into a new helper.
This improves code structure and allows to put the lock asserts right
into place where the locks are needed.

Also move zeroing of the kinfo_proc structure from fill_kinfo_proc_only()
to fill_kinfo_proc(), this looks more symmetrical.

Reviewed by:	jilles
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27871
2021-01-10 04:41:20 +02:00
Konstantin Belousov
4daea93813 Lock proctree in around fill_kinfo_proc().
Proctree lock is needed for correct calculation and collection of the
job-control related data in kinfo_proc.  There was even an XXX comment
about it.

Satisfy locking and lock ordering requirements by taking proctree lock
around pass over each bucket in proc_iterate(), and in sysctl_kern_proc()
and note_procstat_proc() for individual process reporting.

Reviewed by:	jilles
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27871
2021-01-10 04:41:20 +02:00
Konstantin Belousov
a008bdeda3 tty_wait_background: improve locking.
Increase the scope of the process group lock ownership.  This ensures that
we are consistent in returning EIO for tty write from an orphan and delivery
of TTYOUT signals.

Reviewed by:	jilles
Tested by:	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27871
2021-01-10 04:41:20 +02:00
Konstantin Belousov
ef739c7373 pgrp: Prevent use after free.
Often, we have a process locked and need to get locked process group.
In this case, because progress group lock is before process lock,
unlocking process allows the group to be freed.  See for instance
tty_wait_background().

Make pgrp structures allocated from nofree zone, and ensure type stability
of the pgrp mutex.

Reviewed by:	jilles
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27871
2021-01-10 04:41:19 +02:00
Konstantin Belousov
e0d83cd3e4 issignal(): when handling STOP-like signals, drop sigacts mutex earlier.
Reviewed by:	jilles
Tested by:	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27871
2021-01-10 04:41:19 +02:00
Konstantin Belousov
993a1699b1 Style. Improve some KASSERTs messages.
Reviewed by:	jilles
Tested by:	pho
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27871
2021-01-10 04:41:19 +02:00
Michael Tuexen
6685e259e3 tcp: don't use KTLS socket option on listening sockets
KTLS socket options make use of socket buffers, which are not
available for listening sockets.

Reported by:		syzbot+a8829e888a93a4a04619@syzkaller.appspotmail.com
Reviewed by:		jhb@
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D27948
2021-01-08 08:57:11 +01:00
Jan Kokemüller
4d0c33be63 kevent(2): Bugfix for wrong EVFILT_TIMER timeouts
When using NOTE_NSECONDS in the kevent(2) API, US_TO_SBT should be
used instead of NS_TO_SBT, otherwise the timeout results are
misleading.

PR:		252539
Reviewed by:	kevans, kib
Approved by:	kevans
MFC after:	3 weeks
2021-01-09 20:00:25 +01:00
Warner Losh
40e6e2c2f7 sysctl: improve debug.kdb.panic_str description
Improve the wording for this sysctl.

Submitted by: rpokala@
2021-01-09 11:10:42 -07:00
Warner Losh
936440560b sysctl: implement debug.kdb.panic_str
This is just like debug.kdb.panic, except the string that's passed in
is reported in the panic message. This allows people with automated
systems to collect kernel panics over a large fleet of machines to
flag panics better. Strings like "Warner look at this hang" or "see
JIRA ABC-1234 for details" allow these automated systems to route the
forced panic to the appropriate engineers like you can with other
types of panics. Other users are likely possible.

Relnotes: Yes
Sponsored by: Netflix
Reviewed by: allanjude (earlier version)
Suggestions from review folded in by: 0mp, emaste, lwhsu
Differential Revision: https://reviews.freebsd.org/D28041
2021-01-08 14:30:28 -07:00
Andrew Gallatin
52cd25eb1a mbuf: enable ext_pgs ("unmapped") mbufs by default
Ext_pg mbufs allow carrying multiple pages per mbuf. This
reduces mbuf linked list traversals, especially in socket
buffers, thereby reducing cache misses and CPU use for
applications using sendfile.  Note that ext_pages use
unmapped pages, eliminating KVA mapping costs on 32-bit
platforms.

Ext_pg mbufs are also required for ktls (KERN_TLS), and having
them disabled by default is a stumbling block for those
wishing to enable ktls.

Reviewed-by:	jhb, glebius
Sponsored by:	Netfix
2021-01-08 13:43:30 -05:00
Mateusz Guzik
8ddea0b127 cache: just assign ni_resflags = NIRES_ABS
It is guaranteed to be 0 on entry.
2021-01-08 13:57:10 +00:00
Toomas Soome
742653ebd5 sysctl debug.dump_modinfo should recognize font module
Add MODINFOMD_FONT to dump list.
2021-01-08 09:24:49 +02:00
Alan Somers
20321e6225 Regenerate syscall files after reallocation of aio_writev/aio_readv 2021-01-07 19:50:32 -07:00
Alan Somers
b3286afae3 Reallocate syscall numbers for aio_writev and aio_readv
The originally chosen numbers interfere with downstream projects'
syscalls.  Move them to the end of the syscall table instead.

Reported by:	jrtc27
Reviewed by:	brooks
MFC-With:	022ca2fc7f
Differential Revision:	022ca2fc7f
2021-01-07 19:49:27 -07:00
Thomas Munro
801ac943ea aio_fsync(2): Support O_DSYNC.
aio_fsync(O_DSYNC, ...) is the asynchronous version of fdatasync(2).

Reviewed by: kib, asomers, jhb
Differential Review: https://reviews.freebsd.org/D25071
2021-01-08 13:15:56 +13:00
Thomas Munro
a5e284038e open(2): Add O_DSYNC flag.
POSIX O_DSYNC means that writes include an implicit fdatasync(2), just
as O_SYNC implies fsync(2).

VOP_WRITE() functions that understand the new IO_DATASYNC flag can act
accordingly, but we'll still pass down IO_SYNC so that file systems that
don't understand it will continue to provide the stronger O_SYNC
behaviour.

Flag also applies to fcntl(2).

Reviewed by: kib, delphij
Differential Revision: https://reviews.freebsd.org/D25090
2021-01-08 13:15:56 +13:00
Mateusz Guzik
71bd18d373 fd: use seqc_read_notmodify when translating fds 2021-01-07 23:30:04 +00:00
Mateusz Guzik
20ac5cda96 fd: make fd/fp mandatory
They are both always passed anyway.
2021-01-07 23:30:04 +00:00
Mateusz Guzik
fee405e057 cache: stop checkpointing cn_flags
They are only modified, if ever, for the last component.
2021-01-07 23:29:52 +00:00
Mateusz Guzik
ac7715471c cache: stop checkpointing cn_nameptr
For aborts cn_nameptr is the same as cn_pnbuf. For partial results
the same cn_nameptr is to be used.
2021-01-07 23:29:38 +00:00
Mateusz Guzik
0f1fc3a31f cache: stop manipulating pathlen
It is a copy-pasto from regular lookup. Add debug to ensure the result
is the same.
2021-01-07 23:26:53 +00:00
Chuck Silvers
11403bdeb4 vfs: fix rangelock range in vn_rdwr() for IO_APPEND
vn_rdwr() must lock the entire file range for IO_APPEND
just like vn_io_fault() does for O_APPEND.

Reviewed by:	kib, imp, mckusick
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D28008
2021-01-07 13:37:35 -08:00
Mateusz Guzik
f2b794e1e9 cache: unengrish the comment in previous commit
Reported by:	rpokala, brd
2021-01-06 23:46:05 +00:00
Mateusz Guzik
deabdc6868 cache: stop pre-checking seqc when starting the lookup
Tested by:	pho
2021-01-06 07:28:07 +00:00
Mateusz Guzik
71a6a0b545 cache: skip checking for spurious slashes if possible
Tested by:	pho
2021-01-06 07:28:06 +00:00
Mateusz Guzik
33f3e81df5 cache: combine fast path enabled status into one flag
Tested by:	pho
2021-01-06 07:28:06 +00:00
Mateusz Guzik
dbbbc07cc3 cache: split handling of 0 and non-0 error codes
Tested by:	pho
2021-01-06 07:07:24 +01:00
Mateusz Guzik
a1a8f8ada1 cache: deinline state handling
The intent is to reduce branchfest when finishing the lookup.

Tested by:	pho
2021-01-06 07:05:22 +01:00
Mateusz Guzik
05803be000 cache: stop setting cn_nameptr on entry as matches cn_pnbuf already
While here tidy up other asserts.
2021-01-06 07:03:41 +01:00
Mateusz Guzik
3814bea00a cache: drop the now spurious doomed check when crossing a mount point 2021-01-03 21:22:16 +00:00
Mateusz Guzik
33a195baf3 vfs: keep seqc unchanged as long as the vnode is accessible via SMR 2021-01-03 21:22:16 +00:00
Mark Johnston
214257da3a sendfile: Clear page pointers when handling a pager error
When INVARIANTS is configred, the sendfile_iodone() callback verifies
that pages attached to the sendfile header are wired, but we unwire all
such pages after a synchronous pager error, before calling
sendfile_iodone().

Reported by:	pho
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2021-01-03 11:50:31 -05:00
Mark Johnston
90f580b954 Ensure that dirent's d_off field is initialized
We have the d_off field in struct dirent for providing the seek offset
of the next directory entry.  Several filesystems were not initializing
the field, which ends up being copied out to userland.

Reported by:	Syed Faraz Abrar <faraz@elttam.com>
Reviewed by:	kib
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27792
2021-01-03 11:50:31 -05:00
Mateusz Guzik
82397d7919 vfs: denote vnode being a mount point with VIRF_MOUNTPOINT
Reviewed by:	kib (previous version)
Differential Revision:	https://reviews.freebsd.org/D27794
2021-01-03 06:50:06 +00:00
Mateusz Guzik
3e506a67bb vfs: add v_irflag accessors
Reviewed by:	kib (previous version)
Differential Revision:	https://reviews.freebsd.org/D27793
2021-01-03 06:50:06 +00:00
Mateusz Guzik
51bf55fa6c cache: stop checkpointing cn_namelen
The variable is recomputed by regular lookup from the get go.
2021-01-03 06:50:06 +00:00
Mateusz Guzik
7220a10b5b cache: predict on no spurious slashes in cache_fpl_handle_root
This is a step towards speculatively not handling them.
2021-01-03 06:50:06 +00:00
Mateusz Guzik
30a2fc91fa cache: postpone NAME_MAX check as it may be unnecessary 2021-01-03 06:50:06 +00:00
Mateusz Guzik
eca899bd5d cache: remove spurious null check in sdt probe 2021-01-03 06:50:06 +00:00
Alan Somers
1868a91fac Regenerate syscall files after addition of aio_writev/aio_readv 2021-01-02 19:57:58 -07:00
Alan Somers
022ca2fc7f Add aio_writev and aio_readv
POSIX AIO is great, but it lacks vectored I/O functions. This commit
fixes that shortcoming by adding aio_writev and aio_readv. They aren't
part of the standard, but they're an obvious extension. They work just
like their synchronous equivalents pwritev and preadv.

It isn't yet possible to use vectored aiocbs with lio_listio, but that
could be added in the future.

Reviewed by:    jhb, kib, bcr
Relnotes:       yes
Differential Revision: https://reviews.freebsd.org/D27743
2021-01-02 19:57:58 -07:00
Jamie Gritton
b58a46347c jail: revert the attachment part of b4e87a6329
The change to kern_jail_set that was supposed to "also properly clean
up when attachment fails" didn't fix a memory leak but actually caused
a double free.  Back that part out, and leave the part that manages
allprison_lock state.
2020-12-31 19:55:49 -08:00
Mateusz Guzik
1365b5f86f cache: fold NCF_WHITE check into the rest
Tested by:	pho
2021-01-01 00:10:43 +00:00
Mateusz Guzik
d7c62d98c9 cache: call cache_fplookup_modifying in neg
Tested by:	pho
2021-01-01 00:10:43 +00:00
Mateusz Guzik
6fe7de1a25 cache: refactor cache_fpl_handle_root to fit the rest of the code better
Tested by:	pho
2021-01-01 00:10:43 +00:00
Mateusz Guzik
e17e01bd0e cache: refactor dot handling
Tested by:	pho
2021-01-01 00:10:43 +00:00
Mateusz Guzik
4651db56c7 cache: remove a branch from mount point checking
Tested by:	pho
2021-01-01 00:10:42 +00:00
Mateusz Guzik
0b5bd1afd8 cache: support lockless lookup of degenerate paths
Tested by:	pho
2021-01-01 00:10:42 +00:00
Mateusz Guzik
1d6eb97677 cache: save on branching when parsing the path by inserting a sentinel
Tested by:	pho
2021-01-01 00:10:42 +00:00
Mateusz Guzik
67297766b5 cache: hoist trailing slash and degenerate path handling out of the loop
Tested by:	pho
2021-01-01 00:10:42 +00:00
Mateusz Guzik
bb3a12f0e5 fd: inline pwd_get_smr
Tested by:	pho
2021-01-01 00:10:42 +00:00
John Baldwin
825d234144 Don't check P_INMEM in kdb_thr_*().
Not all debugger operations that enumerate threads require thread
stacks to be resident in memory to be useful.  Instead, push P_INMEM
checks (if needed) into callers.

Reviewed by:	kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D27827
2020-12-31 16:01:12 -08:00
John Baldwin
9acce1c992 Enumerate processes via the pid hash table in kdb_thr_*().
Processes part way through exit1() are not included in allproc.  Using
allproc to enumerate processes prevented getting the stack trace of a
thread in this part of exit1() via ddb.

Reviewed by:	kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D27826
2020-12-31 16:00:54 -08:00
John Baldwin
4e7d1b527c Add a proc_off_p_hash helper variable.
This is used by kernel debuggers to enumerate processes via the pid
hash table.

Reviewed by:	kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D27825
2020-12-31 16:00:33 -08:00
John Baldwin
47877889f2 ddb ps: Use the pidhash to enumerate processes not in allproc.
Exiting processes that have been removed from allproc but are still
executing are not yet marked PRS_ZOMBIE, so they were not listed (for
example, if a thread panics during exit1()).  To detect these
processes, clear p_list.le_prev to NULL explicitly after removing a
process from the allproc list and check for this sentinel rather than
PRS_ZOMBIE when walking the pidhash.

While here, simplify the pidhash walk to use a single outer loop.

Reviewed by:	kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D27824
2020-12-31 16:00:05 -08:00
Jamie Gritton
b4e87a6329 jail: Clean up allprison_lock handing in kern_jail_set
Keep explicit track of the allprison_lock state during the final part
of kern_jail_set, instead of deducing it from the JAIL_ATTACH flag.

Also properly clean up when the attachment fails, fixing a long-
standing (though minor) memory leak.
2020-12-31 15:18:43 -08:00
Mateusz Guzik
0c09f4b0cc cache: work around corner case of dvp == tvp in cache_fplookup_final_modifying
Fixes a panic where the kernel would unlock an unheld lock coming from
rename looking up "foo/." as the source.

Reported by:	markj (syzkaller)
2020-12-28 21:38:20 +00:00
Mateusz Guzik
4ab7d9f484 cache: reduce engrish in previous commit 2020-12-28 02:05:30 +00:00
Mateusz Guzik
0714f921cd cache: save on some branching in common case mount point traversal 2020-12-28 01:53:28 +00:00
Mateusz Guzik
8c9d74634a vfs: stop open-coding setting WILLBEDIR flag 2020-12-28 01:53:27 +00:00
Mateusz Guzik
002e18eb7f vfs: add FAILIFEXISTS flag
Both FreeBSD and Linux mkdir -p walk the tree up ignoring any EEXIST on
the way and both are used a lot when building respective kernels.

This poses a problem as spurious locking avoidably interferes with
concurrent operations like getdirentries on affected directories.

Work around the problem by adding FAILIFEXISTS flag. In case of lockless
lookup this manages to avoid any work to begin with, there is no speed
up for the locked case but perhaps this can be augmented later on.

For simplicity the only supported semantics are as used by mkdir.

Reviewed by:	kib (previous version)
Differential Revision:	https://reviews.freebsd.org/D27789
2020-12-28 01:53:27 +00:00
Mateusz Guzik
ff97bc034f cache: simplify lockless dot lookups 2020-12-28 01:53:27 +00:00
Mateusz Guzik
abd7ded451 cache: modification and last entry filling support in lockless lookup v2
The previous patch failed to set the ISDOTDOT flag when appropriate,
which in turn fail to properly handle degenerate lookups.

While here sprinkle some extra assertions.

Tested by:	pho (previous version)
2020-12-27 21:03:18 +00:00
Mateusz Guzik
623daa69f9 cache: assert internal flags are not passed by namei 2020-12-27 19:49:24 +00:00
Mateusz Guzik
a1fc1f10c6 Revert "cache: modification and last entry filling support in lockless lookup"
This reverts commit 6dbb07ed68.

Some ports unreliably fail to build with rmdir getting ENOTEMPTY.
2020-12-27 19:02:29 +00:00
Mateusz Guzik
6dbb07ed68 cache: modification and last entry filling support in lockless lookup
Tested by:	pho (previous version)
2020-12-27 17:22:25 +00:00
Konstantin Belousov
9dd48b87e6 Regen. 2020-12-27 12:57:27 +02:00
Konstantin Belousov
7a202823aa Expose eventfd in the native API/ABI using a new __specialfd syscall
eventfd is a Linux system call that produces special file descriptors
for event notification. When porting Linux software, it is currently
usually emulated by epoll-shim on top of kqueues.  Unfortunately, kqueues
are not passable between processes.  And, as noted by the author of
epoll-shim, even if they were, the library state would also have to be
passed somehow.  This came up when debugging strange HW video decode
failures in Firefox.  A native implementation would avoid these problems
and help with porting Linux software.

Since we now already have an eventfd implementation in the kernel (for
the Linuxulator), it's pretty easy to expose it natively, which is what
this patch does.

Submitted by:   greg@unrelenting.technology
Reviewed by:    markj (previous version)
MFC after:      2 weeks
Differential Revision:  https://reviews.freebsd.org/D26668
2020-12-27 12:57:26 +02:00
Jamie Gritton
7f4e724829 jail: add a missing lock around an osd_jail_call().
allprison_lock should be at least held shared when jail OSD methods
are called.  Add a shared lock around one such call where that wasn't
the case.

In another such call, change an exclusive lock grab to be shared in
what is likely the more common case.
2020-12-26 20:49:30 -08:00
Jamie Gritton
0fe74ae624 jail: Consistently handle the pr_allow bitmask
Return a boolean (i.e. 0 or 1) from prison_allow, instead of the flag
value itself, which is what sysctl expects.

Add prison_set_allow(), which can set or clear a permission bit, and
propagates cleared bits down to child jails.

Use prison_allow() and prison_set_allow() in the various jail.allow.*
sysctls, and others that depend on thoe permissions.

Add locking around checking both pr_allow and pr_enforce_statfs in
prison_priv_check().
2020-12-26 20:25:02 -08:00
Mark Johnston
26b23f07fb sendfile: Ensure that sfio->npages is initialized
We initialize sfio->npages only when some I/O is required to satisfy the
request.  However, sendfile_iodone() contains an INVARIANTS-only check
that references sfio->npages, and this check is executed even if no I/O
is performed, so the check may use an uninitialized value.

Fix the problem by initializing sfio->npages earlier.  Note that
sendfile_swapin() always initializes the page array.  In some rare cases
we need to trim the page array so ensure that sfio->npages gets updated
accordingly.

Reported by:		syzkaller (with KASAN)
Reviewed by:		kib
Sponsored by:		The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27726
2020-12-26 16:07:40 -05:00
Jamie Gritton
5d58f959d3 jail: Fix lock-free access to dynamic pr.allow flags
Use atomic access and a memory barrier to ensure that the flag parameter
in pr_flag_allow is indeed set after the rest of the structure is valid.

Simplify adding flag bits with pr_allow_all, a dynamic version of
PR_ALLOW_ALL_STATIC.
2020-12-26 12:53:28 -08:00
Jamie Gritton
7de883c82f jail: Fix an O(n^2) loop when adding jails
When a jail is added using the default (system-chosen) JID, and
non-default-JID jails already exist, a loop through the allprison
list could restart and result in unnecessary O(n^2) behaviour.
There should never be more than two list passes required.

Also clean up inefficient (though still O(n)) allprison list traversal
when finding jails by ID, or when adding jails in the common case of
all default JIDs.
2020-12-26 10:39:34 -08:00
Alan Somers
0120603891 AIO: remove the kaiocb->bio linkage
Vectored aio will require each aiocb to be associated with multiple
bios, so we can't store a link to the latter from the former.  But we
don't really need to.  aio_biowakeup already knows the bio it's using,
and the other fields can be stored within the bio and/or buf itself.

Also, remove the unused kaiocb.backend2 field.

Reviewed By:	kib
Differential Revision: https://reviews.freebsd.org/D27682
2020-12-23 16:06:15 +00:00
Mateusz Guzik
906a73e791 cache: fix up cache_hold_vnode comment 2020-12-23 07:24:29 +00:00
Andrew Gallatin
02bc3865aa Optionally bind ktls threads to NUMA domains
When ktls_bind_thread is 2, we pick a ktls worker thread that is
bound to the same domain as the TCP connection associated with
the socket. We use roughly the same code as netinet/tcp_hpts.c to
do this. This allows crypto to run on the same domain as the TCP
connection is associated with. Assuming TCP_REUSPORT_LB_NUMA
(D21636) is in place & in use, this ensures that the crypto source
and destination buffers are local to the same NUMA domain as we're
running crypto on.

This change (when TCP_REUSPORT_LB_NUMA, D21636, is used) reduces
cross-domain traffic from over 37% down to about 13% as measured
by pcm.x on a dual-socket Xeon using nginx and a Netflix workload.

Reviewed by:	jhb
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21648
2020-12-19 21:46:09 +00:00
Kyle Evans
54a837c8cc kern: cpuset: allow jails to modify child jails' roots
This partially lifts a restriction imposed by r191639 ("Prevent a superuser
inside a jail from modifying the dedicated root cpuset of that jail") that's
perhaps beneficial after r192895 ("Add hierarchical jails."). Jails still
cannot modify their own cpuset, but they can modify child jails' roots to
further restrict them or widen them back to the modifying jails' own mask.

As a side effect of this, the system root may once again widen the mask of
jails as long as they're still using a subset of the parent jails' mask.
This was previously prevented by the fact that cpuset_getroot of a root set
will return that root, rather than the root's parent -- cpuset_modify uses
cpuset_getroot since it was introduced in r327895, previously it was just
validating against set->cs_parent which allowed the system root to widen
jail masks.

Reviewed by:	jamie
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D27352
2020-12-19 03:30:06 +00:00
Konstantin Belousov
673e2dd652 Add ELF flag to disable ASLR stack gap.
Also centralize and unify checks to enable ASLR stack gap in a new
helper exec_stackgap().

PR:	239873
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2020-12-18 23:14:39 +00:00
John Baldwin
a095390344 Use a template assembly file for firmware object files.
Similar to r366897, this uses the .incbin directive to pull in a
firmware file's contents into a .fwo file.  The same scheme for
computing symbol names from the filename is used as before to maximize
compatiblity and not require rebuilding existing .fwo files for
NO_CLEAN builds.  Using ld -o binary requires extra hacks in linkers
to either specify ABI options (e.g. soft- vs hard-float) or to ignore
ABI incompatiblities when linking certain objects (e.g.  object files
with only data).  Using the compiler driver avoids the need for these
hacks as the compiler driver is able to set all the appropriate ABI
options.

Reviewed by:	imp, markj
Obtained from:	CheriBSD
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D27579
2020-12-17 20:31:17 +00:00
Konstantin Belousov
551e205f6d Fix a race in tty_signal_sessleader() with unlocked read of s_leader.
Since we do not own the session lock, a parallel killjobc() might
reset s_leader to NULL after we checked it.  Read s_leader only once
and ensure that compiler is not allowed to reload.

While there, make access to t_session somewhat more pretty by using
local variable.

PR:	251915
Submitted by:	Jakub Piecuch <j.piecuch96@gmail.com>
MFC after:	1 week
2020-12-17 19:51:39 +00:00
Mateusz Guzik
57efe26bcb fd: reimplement close_range to avoid spurious relocking 2020-12-17 18:52:30 +00:00
Mateusz Guzik
08a5615cfe audit: rework AUDIT_SYSCLOSE
This in particular avoids spurious lookups on close.
2020-12-17 18:52:04 +00:00
Mateusz Guzik
1e71e7c4f6 fd: refactor closefp in preparation for close_range rework 2020-12-17 18:51:09 +00:00
Mateusz Guzik
08241fedc4 fd: remove redundant saturation check from fget_unlocked_seq
refcount_acquire_if_not_zero returns true on saturation.
The case of 0 is handled by looping again, after which the originally
found pointer will no longer be there.

Noted by:	kib
2020-12-16 18:01:41 +00:00
Mateusz Guzik
6404d7ffc1 uipc: disable prediction in unp_pcb_lock_peer
The branch is not very predictable one way or the other, at least during
buildkernel where it only correctly matched 57% of calls.
2020-12-13 21:32:19 +00:00
Mateusz Guzik
8ab96e265d cache: fix ups bad predicts
- last level fallback normally sees CREATE; the code should be optimized to not
get there for said case
- fast path commonly fails with ENOENT
2020-12-13 21:29:39 +00:00
Mateusz Guzik
d48c2b8d29 vfs: correctly predict last fdrop on failed open
Arguably since the count is guaranteed to be 1 the code should be modified
to avoid the work.
2020-12-13 21:28:15 +00:00
Konstantin Belousov
203affb291 Fix TDP_WAKEUP/thr_wake(curthread->td_tid) after r366428.
Reported by:	arichardson
Reviewed by:	arichardson, markj
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D27597
2020-12-13 19:45:42 +00:00
Konstantin Belousov
0b459854bc Correct indent.
Sponsored by:	The FreeBSD Foundation
2020-12-13 19:43:45 +00:00
Mateusz Guzik
edcdcefb88 fd: fix fdrop prediction when closing a fd
Most of the time this is the last reference, contrary to typical fdrop use.
2020-12-13 18:06:24 +00:00
Ryan Libby
d3bbf8af68 cache_fplookup: quiet gcc -Wreturn-type
Reviewed by:	markj, mjg
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D27555
2020-12-11 22:51:44 +00:00
Mateusz Guzik
0ecce93dca fd: make serialization in fdescfree_fds conditional on hold count
p_fd nullification in fdescfree serializes against new threads transitioning
the count 1 -> 2, meaning that fdescfree_fds observing the count of 1 can
safely assume there is nobody else using the table. Losing the race and
observing > 1 is harmless.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D27522
2020-12-10 17:17:22 +00:00
Mark Johnston
3309fa7403 Plug a race between fd table teardown and several loops
To export information from fd tables we have several loops which do
this:

FILDESC_SLOCK(fdp);
for (i = 0; fdp->fd_refcount > 0 && i <= lastfile; i++)
	<export info for fd i>;
FILDESC_SUNLOCK(fdp);

Before r367777, fdescfree() acquired the fd table exclusive lock between
decrementing fdp->fd_refcount and freeing table entries.  This
serialized with the loop above, so the file at descriptor i would remain
valid until the lock is dropped.  Now there is no serialization, so the
loops may race with teardown of file descriptor tables.

Acquire the exclusive fdtable lock after releasing the final table
reference to provide a barrier synchronizing with these loops.

Reported by:	pho
Reviewed by:	kib (previous version), mjg
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27513
2020-12-09 14:05:08 +00:00
Mark Johnston
4c1c90ea95 Use refcount_load(9) to load fd table reference counts
No functional change intended.

Reviewed by:	kib, mjg
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27512
2020-12-09 14:04:54 +00:00
Kyle Evans
f1b18a668d cpuset_set{affinity,domain}: do not allow empty masks
cpuset_modify() would not currently catch this, because it only checks that
the new mask is a subset of the root set and circumvents the EDEADLK check
in cpuset_testupdate().

This change both directly validates the mask coming in since we can
trivially detect an empty mask, and it updates cpuset_testupdate to catch
stuff like this going forward by always ensuring we don't end up with an
empty mask.

The check_mask argument has been renamed because the 'check' verbiage does
not imply to me that it's actually doing a different operation. We're either
augmenting the existing mask, or we are replacing it entirely.

Reported by:	syzbot+4e3b1009de98d2fabcda@syzkaller.appspotmail.com
Discussed with:	andrew
Reviewed by:	andrew, markj
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D27511
2020-12-08 18:47:22 +00:00
Kyle Evans
b2780e8537 kern: cpuset: resolve race between cpuset_lookup/cpuset_rel
The race plays out like so between threads A and B:

1. A ref's cpuset 10
2. B does a lookup of cpuset 10, grabs the cpuset lock and searches
   cpuset_ids
3. A rel's cpuset 10 and observes the last ref, waits on the cpuset lock
   while B is still searching and not yet ref'd
4. B ref's cpuset 10 and drops the cpuset lock
5. A proceeds to free the cpuset out from underneath B

Resolve the race by only releasing the last reference under the cpuset lock.
Thread A now picks up the spinlock and observes that the cpuset has been
revived, returning immediately for B to deal with later.

Reported by:	syzbot+92dff413e201164c796b@syzkaller.appspotmail.com
Reviewed by:	markj
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D27498
2020-12-08 18:45:47 +00:00
Kyle Evans
9c83dab96c kern: cpuset: plug a unr leak
cpuset_rel_defer() is supposed to be functionally equivalent to
cpuset_rel() but with anything that might sleep deferred until
cpuset_rel_complete -- this setup is used specifically for cpuset_setproc.

Add in the missing unr free to match cpuset_rel. This fixes a leak that
was observed when I wrote a small userland application to try and debug
another issue, which effectively did:

cpuset(&newid);
cpuset(&scratch);

newid gets leaked when scratch is created; it's off the list, so there's
no mechanism for anything else to relinquish it. A more realistic reproducer
would likely be a process that inherits some cpuset that it's the only ref
for, but it creates a new one to modify. Alternatively, administratively
reassigning a process' cpuset that it's the last ref for will have the same
effect.

Discovered through D27498.

MFC after:	1 week
2020-12-08 18:44:06 +00:00
Mateusz Guzik
8fcfd0e222 vfs: add cleanup on error missed in r368375
Noted by:	jrtc27
2020-12-06 19:24:38 +00:00
Mateusz Guzik
60e2a0d9a4 vfs: factor buffer allocation/copyin out of namei 2020-12-06 04:59:24 +00:00
Mateusz Guzik
0c23d26230 vfs: keep bad ops on vnode reclaim
They were only modified to accomodate a redundant assertion.

This runs into problems as lockless lookup can still try to use the vnode
and crash instead of getting an error.

The bug was only present in kernels with INVARIANTS.

Reported by:	kevans
2020-12-05 05:56:23 +00:00
Konstantin Belousov
be2535b0a6 Add kern_ntp_adjtime(9).
Reviewed by:	brooks, cy
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D27471
2020-12-04 18:56:44 +00:00
Kyle Evans
34af05ead3 kern: soclose: don't sleep on SO_LINGER w/ timeout=0
This is a valid scenario that's handled in the various protocol layers where
it makes sense (e.g., tcp_disconnect and sctp_disconnect). Given that it
indicates we should immediately drop the connection, it makes little sense
to sleep on it.

This could lead to panics with INVARIANTS. On non-INVARIANTS kernels, this
could result in the thread hanging until a signal interrupts it if the
protocol does not mark the socket as disconnected for whatever reason.

Reported by:	syzbot+e625d92c1dd74e402c81@syzkaller.appspotmail.com
Reviewed by:	glebius, markj
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D27407
2020-12-04 04:39:48 +00:00
Mark Johnston
b957b18594 Always use 64-bit physical addresses for dump_avail[] in minidumps
As of r365978, minidumps include a copy of dump_avail[].  This is an
array of vm_paddr_t ranges.  libkvm walks the array assuming that
sizeof(vm_paddr_t) is equal to the platform "word size", but that's not
correct on some platforms.  For instance, i386 uses a 64-bit vm_paddr_t.

Fix the problem by always dumping 64-bit addresses.  On platforms where
vm_paddr_t is 32 bits wide, namely arm and mips (sometimes), translate
dump_avail[] to an array of uint64_t ranges.  With this change, libkvm
no longer needs to maintain a notion of the target word size, so get rid
of it.

This is a no-op on platforms where sizeof(vm_paddr_t) == 8.

Reviewed by:	alc, kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27082
2020-12-03 17:12:31 +00:00
Oleksandr Tymoshenko
18ce865a4f Add support for hw.physmem tunable for ARM/ARM64/RISC-V platforms
hw.physmem tunable allows to limit number of physical memory available to the
system. It's handled in machdep files for x86 and PowerPC. This patch adds
required logic to the consolidated physmem management interface that is used by
ARM, ARM64, and RISC-V.

Submitted by:	Klara, Inc.
Reviewed by:	mhorne
Sponsored by:	Ampere Computing
Differential Revision:	https://reviews.freebsd.org/D27152
2020-12-03 05:39:27 +00:00
Mateusz Guzik
10e64782ed select: make sure there are no wakeup attempts after selfdfree returns
Prior to the patch returning selfdfree could still be racing against doselwakeup
which set sf_si = NULL and now locks stp to wake up the other thread.

A sufficiently unlucky pair can end up going all the way down to freeing
select-related structures before the lock/wakeup/unlock finishes.

This started manifesting itself as crashes since select data started getting
freed in r367714.
2020-12-02 00:48:15 +00:00
Konstantin Belousov
6814c2dac5 lio_listio(2): send signal even if number of jobs is zero.
Right now, if lio registered zero jobs, syscall frees lio job
structure, cleaning up queued ksi.  As result, the realtime signal is
dequeued and never delivered.

Fix it by allowing sendsig() to copy ksi when job count is zero.

PR: 220398
Reported and reviewed by:	asomers
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D27421
2020-12-01 22:53:33 +00:00
Konstantin Belousov
2933165666 vfs_aio.c: style.
Mostly re-wrap conditions to split after binary ops.

Reviewed by:	asomers
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D27421
2020-12-01 22:46:51 +00:00
Konstantin Belousov
5c5005ec20 vfs_aio.c: correct comment.
Reviewed by:	asomers
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D27421
2020-12-01 22:30:32 +00:00
Mark Johnston
dad22308a1 vmem: Revert r364744
A pair of bugs are believed to have caused the hangs described in the
commit log message for r364744:

1. uma_reclaim() could trigger reclamation of the reserve of boundary
   tags used to avoid deadlock.  This was fixed by r366840.
2. The loop in vmem_xalloc() would in some cases try to allocate more
   boundary tags than the expected upper bound of BT_MAXALLOC.  The
   reserve is sized based on the value BT_MAXMALLOC, so this behaviour
   could deplete the reserve without guaranteeing a successful
   allocation, resulting in a hang.  This was fixed by r366838.

PR:		248008
Tested by:	rmacklem
2020-12-01 16:06:31 +00:00
Alexander V. Chernikov
8db8bebf1f Move inner loop logic out of sysctl_sysctl_next_ls().
Refactor sysctl_sysctl_next_ls():
* Move huge inner loop out of sysctl_sysctl_next_ls() into a separate
 non-recursive function, returning the next step to be taken.
* Update resulting node oid parts only on successful lookup
* Make sysctl_sysctl_next_ls() return boolean success/failure instead of errno,
 slightly simplifying logic

Reviewed by:	freqlabs
Differential Revision:	https://reviews.freebsd.org/D27029
2020-11-30 21:59:52 +00:00
Toomas Soome
93b18e3730 vt: if loader did pass the font via metadata, use it
The built in 8x16 font may be way too small with large framebuffer
resolutions, to improve readability, use loader provied font.
2020-11-30 11:45:47 +00:00
Toomas Soome
a4a10b37d4 Add VT driver for VBE framebuffer device
Implement vt_vbefb to support Vesa Bios Extensions (VBE) framebuffer with VT.
vt_vbefb is built based on vt_efifb and is assuming similar data for
initialization, use MODINFOMD_VBE_FB to identify the structure vbe_fb
in kernel metadata.

struct vbe_fb, is populated by boot loader, and is passed to kernel via
metadata payload.

Differential Revision:	https://reviews.freebsd.org/D27373
2020-11-30 08:22:40 +00:00
Matt Macy
2338da0373 Import kernel WireGuard support
Data path largely shared with the OpenBSD implementation by
Matt Dunwoodie <ncon@nconroy.net>

Reviewed by:	grehan@freebsd.org
MFC after:	1 month
Sponsored by:	Rubicon LLC, (Netgate)
Differential Revision:	https://reviews.freebsd.org/D26137
2020-11-29 19:38:03 +00:00
Konstantin Belousov
a9d4fe977a bio aio: Destroy ephemeral mapping before unwiring page.
Apparently some architectures, like ppc in its hashed page tables
variants, account mappings by pmap_qenter() in the response from
pmap_is_page_mapped().

While there, eliminate useless userp variable.

Noted and reviewed by:	alc (previous version)
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D27409
2020-11-29 10:30:56 +00:00
Alexander Motin
83f6b50123 Remove alignment requirements for KVA buffer mapping.
After r368124 pbuf_zone has extra page to handle this particular case.
2020-11-29 01:30:17 +00:00
Konstantin Belousov
cd85379104 Make MAXPHYS tunable. Bump MAXPHYS to 1M.
Replace MAXPHYS by runtime variable maxphys. It is initialized from
MAXPHYS by default, but can be also adjusted with the tunable kern.maxphys.

Make b_pages[] array in struct buf flexible.  Size b_pages[] for buffer
cache buffers exactly to atop(maxbcachebuf) (currently it is sized to
atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1.
The +1 for pbufs allow several pbuf consumers, among them vmapbuf(),
to use unaligned buffers still sized to maxphys, esp. when such
buffers come from userspace (*).  Overall, we save significant amount
of otherwise wasted memory in b_pages[] for buffer cache buffers,
while bumping MAXPHYS to desired high value.

Eliminate all direct uses of the MAXPHYS constant in kernel and driver
sources, except a place which initialize maxphys.  Some random (and
arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted
straight.  Some drivers, which use MAXPHYS to size embeded structures,
get private MAXPHYS-like constant; their convertion is out of scope
for this work.

Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs,
dev/siis, where either submitted by, or based on changes by mav.

Suggested by: mav (*)
Reviewed by:	imp, mav, imp, mckusick, scottl (intermediate versions)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D27225
2020-11-28 12:12:51 +00:00
Kyle Evans
e07e3fa3c9 kern: cpuset: drop the lock to allocate domainsets
Restructure the loop a little bit to make it a little more clear how it
really operates: we never allocate any domains at the beginning of the first
iteration, and it will run until we've satisfied the amount we need or we
encounter an error.

The lock is now taken outside of the loop to make stuff inside the loop
easier to evaluate w.r.t. locking.

This fixes it to not try and allocate any domains for the freelist under the
spinlock, which would have happened before if we needed any new domains.

Reported by:	syzbot+6743fa07b9b7528dc561@syzkaller.appspotmail.com
Reviewed by:	markj
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D27371
2020-11-28 01:21:11 +00:00
Mark Johnston
0c56925bc2 callout(9): Remove some leftover APM BIOS support
This code is obsolete since r366546.

Reviewed by:	imp
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27267
2020-11-27 20:46:02 +00:00
Konstantin Belousov
99c66d3acf vn_read_from_obj(): fix handling of doomed vnodes.
There is no reason why vp->v_object cannot be NULL. If it is, it's
fine, handle it by delegating to VOP_READ().

Tested by:	pho
Reviewed by:	markj, mjg
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D27327
2020-11-26 18:13:33 +00:00
Konstantin Belousov
164438a7b9 More careful handling of the mount failure.
- VFS_UNMOUNT() requires vn_start_write() around it [*].
- call VFS_PURGE() before unmount.
- do not destroy mp if cleanup unmount did not succeed.
- set MNTK_UNMOUNT, and indicate forced unmount with MNTK_UNMOUNTF
  for VFS_UNMOUNT() in cleanup.

PR:	251320 [*]
Reported by:	Tong Zhang <ztong0001@gmail.com>
Reviewed by:	markj, mjg
Discussed with:	rmacklem
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D27327
2020-11-26 18:08:42 +00:00
Konstantin Belousov
3b1f974bfb Make max ticks for pause in vn_lock_pair() adjustable at runtime.
Reduce default value from hz / 10 to hz / 100.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2020-11-26 18:00:26 +00:00
Mateusz Guzik
b83e94be53 thread: staticize thread_reap and move td_allocdomain
thread_init is a much better fit as the the value is constant after
initialization.
2020-11-26 06:59:27 +00:00
Mateusz Guzik
2e51c2bfd1 pipe: follow up cleanup to previous
The commited patch was incomplete.

- add back missing goto retry, noted by jhb
- 'if (error)'  -> 'if (error != 0)'
- consistently do:

if (error != 0)
    break;
continue;

instead of:

if (error != 0)
    break;
else
    continue;

This adds some 'continue' uses which are not needed, but line up with the
rest of pipe_write.
2020-11-25 22:53:21 +00:00
Mateusz Guzik
c8df8543fd pipe: drop spurious pipeunlock/pipelock cycle on write 2020-11-25 21:41:23 +00:00
Kyle Evans
d431dea5ac kern: cpuset: properly rebase when attaching to a jail
The current logic is a fine choice for a system administrator modifying
process cpusets or a process creating a new cpuset(2), but not ideal for
processes attaching to a jail.

Currently, when a process attaches to a jail, it does exactly what any other
process does and loses any mask it might have applied in the process of
doing so because cpuset_setproc() is entirely based around the assumption
that non-anonymous cpusets in the process can be replaced with the new
parent set.

This approach slightly improves the jail attach integration by modifying
cpuset_setproc() callers to indicate if they should rebase their cpuset to
the indicated set or not (i.e. cpuset_setproc_update_set).

If we're rebasing and the process currently has a cpuset assigned that is
not the containing jail's root set, then we will now create a new base set
for it hanging off the jail's root with the existing mask applied instead of
using the jail's root set as the new base set.

Note that the common case will be that the process doesn't have a cpuset
within the jail root, but the system root can freely assign a cpuset from
a jail to a process outside of the jail with no restriction. We assume that
that may have happened or that it could happen due to a race when we drop
the proc lock, so we must recheck both within the loop to gather up
sufficient freed cpusets and after the loop.

To recap, here's how it worked before in all cases:

0     4 <-- jail              0      4 <-- jail / process
|                             |
1                 ->          1
|
3 <-- process

Here's how it works now:

0     4 <-- jail             0       4 <-- jail
|                            |       |
1                 ->         1       5 <-- process
|
3 <-- process

or

0     4 <-- jail             0       4 <-- jail / process
|                            |
1 <-- process     ->         1

More importantly, in both cases, the attaching process still retains the
mask it had prior to attaching or the attach fails with EDEADLK if it's
left with no CPUs to run on or the domain policy is incompatible. The
author of this patch considers this almost a security feature, because a MAC
policy could grant PRIV_JAIL_ATTACH to an unprivileged user that's
restricted to some subset of available CPUs the ability to attach to a jail,
which might lift the user's restrictions if they attach to a jail with a
wider mask.

In most cases, it's anticipated that admins will use this to be able to,
for example, `cpuset -c -l 1 jail -c path=/ command=/long/running/cmd`,
and avoid the need for contortions to spawn a command inside a jail with a
more limited cpuset than the jail.

Reviewed by:	jamie
MFC after:	1 month (maybe)
Differential Revision:	https://reviews.freebsd.org/D27298
2020-11-25 03:14:25 +00:00
Kyle Evans
30b7c6f977 kern: cpuset: rename _cpuset_create() to cpuset_init()
cpuset_init() is better descriptor for what the function actually does. The
name was previously taken by a sysinit that setup cpuset_zero's mask
from all_cpus, it was removed in r331698 before stable/12 branched.

A comment referencing the removed sysinit has now also been removed, since
the setup previously done was moved into cpuset_thread0().

Suggested by:	markj
MFC after:	1 week
2020-11-25 02:12:24 +00:00
Kyle Evans
29d04ea8c3 kern: cpuset: allow cpuset_create() to take an allocated *setp
Currently, it must always allocate a new set to be used for passing to
_cpuset_create, but it doesn't have to. This is purely kern_cpuset.c
internal and it's sparsely used, so just change it to use *setp if it's
not-NULL and modify the two consumers to pass in the address of a NULL
cpuset.

This paves the way for consumers that want the unr allocation without the
possibility of sleeping as long as they've done their due diligence to
ensure that the mask will properly apply atop the supplied parent
(i.e. avoiding the free_unr() in the last failure path).

Reviewed by:	jamie, markj
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D27297
2020-11-25 01:42:32 +00:00
Kyle Evans
c7ef3490e2 kern: never restart syscalls calling closefp(), e.g. close(2)
All paths leading into closefp() will either replace or remove the fd from
the filedesc table, and closefp() will call fo_close methods that can and do
currently sleep without regard for the possibility of an ERESTART. This can
be dangerous in multithreaded applications as another thread could have
opened another file in its place that is subsequently operated on upon
restart.

The following are seemingly the only ones that will pass back ERESTART
in-tree:
- sockets (SO_LINGER)
- fusefs
- nfsclient

Reviewed by:	jilles, kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D27310
2020-11-25 01:08:57 +00:00
Cy Schubert
e5a307c6ac Fix a typo in a comment.
MFC after:	3 days
2020-11-24 06:42:32 +00:00
Mateusz Guzik
f90d57b808 locks: push lock_delay_arg_init calls down
Minor cleanup to skip doing them when recursing on locks and so that
they can act on found lock value if need be.
2020-11-24 03:49:37 +00:00
Mateusz Guzik
094c148b7a sx: drop spurious volatile keyword 2020-11-24 03:48:44 +00:00
Mateusz Guzik
598f2b8116 dtrace: stop using eventhandlers for the part compiled into the kernel
Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D27311
2020-11-23 18:27:21 +00:00
Mateusz Guzik
a9568cd2bc thread: stash domain id to work around vtophys problems on ppc64
Adding to zombie list can be perfomed by idle threads, which on ppc64 leads to
panics as it requires a sleepable lock.

Reported by:	alfredo
Reviewed by:	kib, markj
Fixes:	r367842 ("thread: numa-aware zombie reaping")
Differential Revision:	https://reviews.freebsd.org/D27288
2020-11-23 18:26:47 +00:00
Konstantin Belousov
87a9b18d22 Provide ABI modules hooks for process exec/exit and thread exit.
Exec and exit are same as corresponding eventhandler hooks.

Thread exit hook is called somewhat earlier, while thread is still
owned by the process and enough context is available.  Note that the
process lock is owned when the hook is called.

Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D27309
2020-11-23 17:29:25 +00:00
Edward Tomasz Napierala
9c8c797c1a Remove the 'wantparent' variable, unused since r145004.
Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D27193
2020-11-23 12:47:23 +00:00
Kyle Evans
dac521ebcf cpuset_setproc: use the appropriate parent for new anonymous sets
As far as I can tell, this has been the case since initially committed in
2008.  cpuset_setproc is the executor of cpuset reassignment; note this
excerpt from the description:

* 1) Set is non-null.  This reparents all anonymous sets to the provided
*    set and replaces all non-anonymous td_cpusets with the provided set.

However, reviewing cpuset_setproc_setthread() for some jail related work
unearthed the error: if tdset was not anonymous, we were replacing it with
`set`. If it was anonymous, then we'd rebase it onto `set` (i.e. copy the
thread's mask over and AND it with `set`) but give the new anonymous set
the original tdset as the parent (i.e. the base of the set we're supposed to
be leaving behind).

The primary visible consequences were that:

1.) cpuset_getid() following such assignment returns the wrong result, the
    setid that we left behind rather than the one we joined.
2.) When a process attached to the jail, the base set of any anonymous
    threads was a set outside of the jail.

This was initially bundled in D27298, but it's a minor fix that's fairly
easy to verify the correctness of.

A test is included in D27307 ("badparent"), which demonstrates the issue
with, effectively:

osetid = cpuset_getid()
newsetid = cpuset()
cpuset_setaffinity(thread)
cpuset_setid(osetid)
cpuset_getid(thread) -> observe that it matches newsetid instead of osetid.

MFC after:	1 week
2020-11-23 02:49:53 +00:00
Kyle Evans
60e60e73fd freebsd32: take the _umtx_op struct definitions back
Providing these in freebsd32.h facilitates local testing/measuring of the
structs rather than forcing one to locally recreate them. Sanity checking
offsets/sizes remains in kern_umtx.c where these are typically used.
2020-11-23 00:58:14 +00:00
Kyle Evans
f96078b8fe kern: dup: do not assume oldfde is valid
oldfde may be invalidated if the table has grown due to the operation that
we're performing, either via fdalloc() or a direct fdgrowtable_exp().

This was technically OK before rS367927 because the old table remained valid
until the filedesc became unused, but now it may be freed immediately if
it's an unshared table in a single-threaded process, so it is no longer a
good assumption to make.

This fixes dup/dup2 invocations that grow the file table; in the initial
report, it manifested as a kernel panic in devel/gmake's configure script.

Reported by:	Guy Yur <guyyur gmail com>
Reviewed by:	rew
Differential Revision:	https://reviews.freebsd.org/D27319
2020-11-23 00:33:06 +00:00
Kyle Evans
e0cb5b2a77 [2/2] _umtx_op: introduce 32-bit/i386 flags for operations
This patch takes advantage of the consolidation that happened to provide two
flags that can be used with the native _umtx_op(2): UMTX_OP___32BIT and
UMTX_OP__I386.

UMTX_OP__32BIT iindicates that we are being provided with 32-bit structures.
Note that this flag alone indicates a 64bit time_t, since this is the
majority case.

UMTX_OP__I386 has been provided so that we can emulate i386 as well,
regardless of whether the host is amd64 or not.

Both imply a different set of copyops in sysumtx_op. freebsd32__umtx_op
simply ignores the flags, since it's already doing a 32-bit operation and
it's unlikely we'll be running an emulator under compat32. Future work
could consider it, but the author sees little benefit.

This will be used by qemu-bsd-user to pass on all _umtx_op calls to the
native interface as long as the host/target endianness matches, effectively
eliminating most if not all of the remaining unresolved deadlocks for most.

This version changed a fair amount from what was under review, mostly in
response to refactoring of the prereq reorganization and battle-testing
it with qemu-bsd-user.  The main changes are as follows:

1.) The i386 flag got renamed to omit '32BIT' since this is redundant.
2.) The flags are now properly handled on 32-bit platforms to emulate other
    32-bit platforms.
3.) Robust list handling was fixed, and the 32-bit functionality that was
    previously gated by COMPAT_FREEBSD32 is now unconditional.
4.) Robust list handling was also improved, including the error reported
    when a process has already registered 32-bit ABI lists and also
    detecting if native robust lists have already been registered. Both
    scenarios now return EBUSY rather than EINVAL, because the input is
    technically valid but we're too busy with another ABI's lists.

libsysdecode/kdump/truss support will go into review soon-ish, along with
the associated manpage update.

Reviewed by:	kib (earlier version)
MFC after:	3 weeks
2020-11-22 05:47:45 +00:00
Kyle Evans
15eaec6a5c _umtx_op: move compat32 definitions back in
These are reasonably compact, and a future commit will blur the compat32
lines by supporting 32-bit operations with the native _umtx_op.
2020-11-22 05:34:51 +00:00
Robert Wing
3c85ca21d1 fd: free old file descriptor tables when not shared
During the life of a process, new file descriptor tables may be allocated. When
a new table is allocated, the old table is placed in a free list and held onto
until all processes referencing them exit.

When a new file descriptor table is allocated, the old file descriptor table
can be freed when the current process has a single-thread and the file
descriptor table is not being shared with any other processes.

Reviewed by:    kevans
Approved by:    kevans (mentor)
Differential Revision:  https://reviews.freebsd.org/D18617
2020-11-22 05:00:28 +00:00
Konstantin Belousov
e68c619144 Stop using eventhandlers for itimers subsystem exec and exit hooks.
While there, do some minor cleanup for kclocks.  They are only
registered from kern_time.c, make registration function static.
Remove event hooks, they are not used by both registered kclocks.
Add some consts.

Perhaps we can stop registering kclocks at all and statically
initialize them.

Reviewed by:	mjg
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D27305
2020-11-21 21:43:36 +00:00
Konstantin Belousov
5a2a4551f5 Remove unused prototype.
Missed part of r367918.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2020-11-21 10:58:19 +00:00
Konstantin Belousov
74a093eb98 Stop using eventhandler to invoke umtx_exec hook.
There is no point in dynamic registration, umtx hook is there always.

Reviewed by:	mjg
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D27303
2020-11-21 10:32:40 +00:00
Kirk McKusick
e75f0f2b48 Only attempt a VOP_UNLOCK() when the vn_lock() has been successful.
No MFC as this code is not present in 12-stable.

Reported by:  Peter Holm
Reviewed by:  Mateusz Guzik
Tested by:    Peter Holm
Sponsored by: Netflix
2020-11-20 20:22:01 +00:00
Michal Meloun
d9de80d614 Also pass interrupt binding request to non-root interrupt controllers.
There are message based controllers that can bind interrupts even if they are
not implemented as root controllers (such as the ITS subblock of GIC).

MFC after:	3 weeks
2020-11-20 09:05:36 +00:00
Mateusz Guzik
f9fe7b28bc pipe: thundering herd problem in pipelock
All reads and writes are serialized with a hand-rolled lock, but unlocking it
always wakes up all waiters. Existing flag fields get resized to make room for
introduction of waiter counter without growing the struct.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D27273
2020-11-19 19:25:47 +00:00
Mark Johnston
a33fef5e25 callout(9): Fix a race between CPU migration and callout_drain()
Suppose a running callout re-arms itself, and before the callout
finishes running another CPU calls callout_drain() and goes to sleep.
softclock_call_cc() will wake up the draining thread, which may not run
immediately if there is a lot of CPU load.  Furthermore, the callout is
still in the callout wheel so it can continue to run and re-arm itself.
Then, suppose that the callout migrates to another CPU before the
draining thread gets a chance to run.  The draining thread is in this
loop in _callout_stop_safe():

	while (cc_exec_curr(cc) == c) {
		CC_UNLOCK(cc);
		sleep();
		CC_LOCK(cc);
	}

but after the migration, cc points to the wrong CPU's callout state.
Then the draining thread goes off and removes the callout from the
wheel, but does so using the wrong lock and per-CPU callout state.

Fix the problem by doing a re-lookup of the callout CPU after sleeping.

Reported by:	syzbot+79569cd4d76636b2cc1c@syzkaller.appspotmail.com
Reported by:	syzbot+1b27e0237aa22d8adffa@syzkaller.appspotmail.com
Reported by:	syzbot+e21aa5b85a9aff90ef3e@syzkaller.appspotmail.com
Reviewed by:	emaste, hselasky
Tested by:	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27266
2020-11-19 18:37:28 +00:00
Mitchell Horne
c8a96cdcd9 Add an option for entering KDB on recursive panics
There are many cases where one would choose avoid entering the debugger
on a normal panic, opting instead to reboot and possibly save a kernel
dump. However, recursive kernel panics are an unusual case that might
warrant attention from a human, so provide a secondary tunable,
debug.debugger_on_recursive_panic, to allow entering the debugger only
when this occurs.

For for simplicity in maintaining existing behaviour, the tunable
defaults to zero.

Reviewed by:	cem, markj
Sponsored by:	NetApp, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D27271
2020-11-19 18:03:40 +00:00
Mateusz Guzik
d116b9f1ad thread: numa-aware zombie reaping
The current global list is a significant problem, in particular induces a lot
of cross-domain thread frees. When running poudriere on a 2 domain box about
half of all frees were of that nature.

Patch below introduces per-domain thread data containing zombie lists and
domain-aware reaping. By default it only reaps from the current domain, only
reaping from others if there is free TID shortage.

A dedicated callout is introduced to reap lingering threads if there happens
to be no activity.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D27185
2020-11-19 10:00:48 +00:00