18200 Commits

Author SHA1 Message Date
Mark Johnston
4dc1b17dbb ktls: Improve handling of the bind_threads tunable a bit
- Only check for empty domains if we actually tried to configure domain
  affinity in the first place.  Otherwise setting bind_threads=1 will
  always cause the sysctl value to be reported as zero.  This is
  harmless since the threads end up being bound, but it's confusing.
- Try to improve the sysctl description a bit.

Reviewed by:	gallatin, jhb
Submitted by:	Klara, Inc.
Sponsored by:	Ampere Computing
Differential Revision:	https://reviews.freebsd.org/D28161
2021-01-19 21:32:33 -05:00
Mateusz Guzik
38baca17e0 lockmgr: fix upgrade
TRYUPGRADE requests kept failing when they should not have, because the
wrong macro was used to count readers.

Fixes:	f6b091fbbd ("lockmgr: rewrite upgrade to stop always dropping the lock")
Noted by:	asomers
Differential Revision:	https://reviews.freebsd.org/D27947
2021-01-19 12:21:38 +00:00
Mateusz Guzik
57dab0292a cache: fix some typos 2021-01-19 10:17:14 +01:00
Mateusz Guzik
84ab77ad27 cache: drop write-only var from cache_fplookup_preparse 2021-01-19 10:13:30 +01:00
Mateusz Guzik
6d386b4c8a cache: save a branch in cache_fplookup_next
Previously the code would branch to find out whether it should
fire the SDT probe and bump the numposhits counter, depending
on cache_fplookup_cross_mount.

Arguably it should be done regardless of what said function returns.
2021-01-19 10:08:24 +01:00
Jamie Gritton
effad35ed1 jail: Clean up some function placement and improve comments.
Move prison_hold, prison_hold_locked, prison_proc_hold, and
prison_proc_free to a more intuitive part of the file (together
with prison_free and prison_free_locked), and add or improve comments
to these and others, to better describe what's going on in the prison
reference cycle.

No functional changes.
2021-01-18 17:23:51 -08:00
Oleksandr Tymoshenko
248f0cabca make maximum interrupt number tunable on ARM, ARM64, MIPS, and RISC-V
Use a machdep.nirq tunable instead of the compile-time constant NIRQ
as the maximum number of interrupts. This keeps the system
footprint small by default, with an option to increase the limit
for large systems like server-grade ARM64.

Reviewed by:	mhorne
Differential Revision:	https://reviews.freebsd.org/D27844
Submitted by:	Klara, Inc.
Sponsored by:	Ampere Computing
2021-01-18 16:36:39 -08:00
Jamie Gritton
83bc72a04e jail: Fix a stray mutex from 76ad42abf9. 2021-01-18 15:47:09 -08:00
Jamie Gritton
76ad42abf9 jail: Add prison_isvalid() and prison_isalive()
prison_isvalid() checks if a prison record can be used at all, i.e.
pr_ref > 0.  This filters out prisons that aren't fully created, and
those that are either in the process of being dismantled, or will be
at the next opportunity.  While the check for pr_ref > 0 is simple
enough to make without a convenience function, this prepares the way
for other measures of prison validity.

prison_isalive() checks not only validity as far as the usability of
the prison structure, but also whether the prison is visible to user
space.  It replaces a test for pr_uref > 0, which is currently only
used within kern_jail.c, and not often there.

Both of these functions also assert that either the prison mutex or
allprison_lock is held, since it's generally the case that unlocked
prisons aren't guaranteed to remain useable for any length of time.
This isn't entirely true, for example a thread can assume its own
prison is good, but most exceptions will exist inside of kern_jail.c.
2021-01-18 10:56:20 -08:00
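The two checks in the commit above can be pictured with a small stand-alone sketch. The struct and the `_sketch` names are hypothetical stand-ins rather than the kernel's definitions. Only the `pr_ref > 0` and `pr_uref > 0` tests come from the commit message; combining both tests in the "alive" check is an assumption, and the lock assertion is reduced to a plain flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, trimmed-down stand-in for struct prison. */
struct prison_sketch {
	int	pr_ref;		/* all references to the record */
	int	pr_uref;	/* references that keep it visible to user space */
	bool	locked;		/* stands in for "prison mutex or allprison_lock held" */
};

/* Usable at all: fully created and not being dismantled. */
static bool
prison_isvalid_sketch(const struct prison_sketch *pr)
{
	assert(pr->locked);
	return (pr->pr_ref > 0);
}

/* Usable and also visible to user space (assumed to imply validity). */
static bool
prison_isalive_sketch(const struct prison_sketch *pr)
{
	assert(pr->locked);
	return (pr->pr_ref > 0 && pr->pr_uref > 0);
}
```

In this model, a record with `pr_ref == 1` and `pr_uref == 0` is valid but not alive, which is the dying-jail case the message describes.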
Konstantin Belousov
36bcc44e2c Add ddb 'show timecounter' command.
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2021-01-18 09:51:48 +02:00
Jamie Gritton
25c2c952e3 jail: Add proper prison locking in mqfs_prison_remove. 2021-01-17 17:41:09 -08:00
Konstantin Belousov
3b15beb30b Implement malloc_domainset_aligned(9).
Change the power-of-two malloc zones to require alignment equal to the
size [*].  Current uma allocator already provides such alignment, so in
fact this change does not change anything except providing future-proof
setup.

Suggested by:	markj [*]
Reviewed by:	andrew, jah, markj
Tested by:	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D28147
2021-01-17 19:29:05 +02:00
Mateusz Guzik
fe258f23ef Save on getpid in setproctitle by supporting -1 as curproc. 2021-01-16 09:36:54 +01:00
Kirk McKusick
79a5c790bd Eliminate a locking panic when cleaning up UFS snapshots after a
disk failure.

Each vnode has an embedded lock that controls access to its contents.
However vnodes describing a UFS snapshot all share a single snapshot
lock to coordinate their access and update. As part of mounting a
UFS filesystem with snapshots, each of the vnodes describing a
snapshot has its individual lock replaced with the snapshot lock.
When the filesystem is unmounted the vnode's original lock is
returned replacing the snapshot lock.

When a disk fails while the UFS filesystem it contains is still
mounted (for example when a thumb drive is removed) UFS forcibly
unmounts the filesystem. The loss of the drive causes the GEOM
subsystem to orphan the provider, but the consumer remains until
the filesystem has finished with the unmount. Information describing
the snapshot locks was being prematurely cleared during the orphaning
causing the return of the snapshot vnode's original locks to fail.
The fix is to not clear the needed information prematurely.

Sponsored by: Netflix
2021-01-15 16:36:42 -08:00
Mitchell Horne
818390ce0c arm64: fix early devmap assertion
The purpose of this KASSERT is to ensure that we do not run out of space
in the early devmap. However, the devmap grew beyond its initial size of
2MB in r336519, and this assertion did not grow with it.

A devmap mapping of a 1080p framebuffer requires 1920x1080 bytes, or
1.977 MB, so it is just barely able to fit without triggering the
assertion, provided no other devices are mapped before it. With the
addition of `options GDB` in GENERIC by bbfa199cbc, the uart is now
mapped for the purposes of a debug port, before mapping the framebuffer.
The presence of both these conditions pushes the selected virtual
address just below the threshold, triggering the assertion.

To fix this, use the correct size of the devmap, defined by
PMAP_MAPDEV_EARLY_SIZE. Since this code is shared with RISC-V, define
it for that platform as well (although it is a different size).

PR:		25241
Reported by:	gbe
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
2021-01-13 17:27:44 -04:00
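The size arithmetic in the commit above can be checked with a short sketch. The helper names are hypothetical, the one-byte-per-pixel figure is taken from the message as written, and the model ignores any alignment the real early devmap applies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The initial early devmap size quoted in the message (2MB). */
#define	OLD_EARLY_DEVMAP_SIZE	((size_t)2 * 1024 * 1024)

/* Bytes for a framebuffer at the one byte per pixel the message uses. */
static size_t
fb_bytes_sketch(size_t width, size_t height)
{
	assert(width > 0 && height > 0);
	return (width * height);
}

/* True if `request` more bytes still fit after `already_mapped` bytes. */
static bool
devmap_fits_sketch(size_t devmap_size, size_t already_mapped, size_t request)
{
	return (already_mapped <= devmap_size &&
	    request <= devmap_size - already_mapped);
}
```

With nothing mapped first, the 2,073,600-byte framebuffer fits in 2MB with 23,552 bytes to spare, so in this simplified model any earlier mapping larger than that leaves too little room.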
Mateusz Guzik
ef23df1354 vfs: set NC_KEEPPOSENTRY alongside NOCACHE when creating a file
Arguably the entire NOCACHE logic should get retired, in the meantime
at least prevent the code from evicting existing entries.
2021-01-13 15:29:34 +00:00
Mateusz Guzik
5753be8e43 fd: add refcount argument to falloc_noinstall
This lets callers avoid atomic ops by initializing the count to required
value from the get go.

While here add falloc_abort to backpedal from this without having to
fdrop.
2021-01-13 15:29:34 +00:00
Mateusz Guzik
5171310e66 vfs: use finstall_refed in openat
This avoids 2 atomic ops in the common case: 1 to grab an extra
reference and 1 to release it.
2021-01-13 03:30:38 +00:00
Mateusz Guzik
530b699a62 fd: add finstall_refed
Can be used to consume an already existing reference and consequently
avoid atomic ops.
2021-01-13 03:27:03 +01:00
Mateusz Guzik
4faa375cdd fd: provide a dedicated closef variant for unix socket code
This avoids testing for td != NULL.
2021-01-13 03:27:03 +01:00
Konstantin Belousov
0659df6fad vm_map_protect: allow to set prot and max_prot in one go.
This prevents a situation where another thread modifies map entries'
permissions between setting max_prot, then relocking, then setting prot,
confusing the operation outcome.  E.g. you can get an error that is not
possible if the operation is performed atomically.

Also enable setting rwx for max_prot even if map does not allow to set
effective rwx protection.

Reviewed by:	brooks, markj (previous version)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D28117
2021-01-13 01:35:22 +02:00
Mateusz Guzik
70ba77706d vfs: extend vfs:namei:lookup:return probe with nameidata 2021-01-12 13:35:27 +00:00
Mateusz Guzik
cdb62ab74e vfs: add NDFREE_NOTHING and convert several NDFREE_PNBUF callers
Check the comment above the routine for reasoning.
2021-01-12 13:16:10 +00:00
Mateusz Guzik
6b3a9a0f3d Convert remaining cap_rights_init users to cap_rights_init_one
semantic patch:

@@

expression rights, r;

@@

- cap_rights_init(&rights, r)
+ cap_rights_init_one(&rights, r)
2021-01-12 13:16:10 +00:00
Konstantin Belousov
57f22c828e sigfastblock: do not skip cursig/postsig loop in ast()
Even if sigfastblock block is non-zero, non-blockable signals must be
checked on ast and delivered now.  This also affects debugger ability
to attach, because issignal() also calls ptracestop() if there is
a pending stop for the debuggee.

Instead of checking for sigfastblock, and either setting PENDING flag
for usermode or doing signal delivery loop, always do the loop after
checking, and then handle PENDING bit. issignal() already does the right
thing for fast-blocked case, allowing only STOPs and SIGKILL delivery to
happen.

Reported by:	Vasily Postnicov <shamaz.mazum@gmail.com>, markj
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D28089
2021-01-12 12:45:26 +02:00
Konstantin Belousov
513320c0f1 sigfastblock_setpend(): do not set PEND user flag unless TDP_SIGFASTPENDING is set.
The user pending bit should not be set if the kernel did not note a pending signal.

Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D28089
2021-01-12 12:43:34 +02:00
Alan Somers
ff1a307801 lio_listio: validate aio_lio_opcode
Previously, we would accept any kind of LIO_* opcode, including ones
that were intended for in-kernel use only like LIO_SYNC (which is not
defined in userland).  The situation became more serious with
022ca2fc7f.  After that revision, setting
aio_lio_opcode to LIO_WRITEV or LIO_READV would trigger an assertion.

Note that POSIX does not specify what should happen if aio_lio_opcode is
invalid.

MFC-with:	022ca2fc7f
Reviewed by:	jhb, tmunro, 0mp
Differential Revision:	https://reviews.freebsd.org/D28078
2021-01-11 19:53:01 -07:00
Jason A. Harmening
e8a5a1ad71 rctl(4): support throttling resource usage to 0
For rate-based resources that support throttling (e.g.
readiops/writeiops), this fixes a divide-by-zero panic when rctl(8)
passes 0 as the throttle value.  For these resources, treat
zero-throttle requests as requests to suspend forward progress as long
as possible using the duration specified in
kern.racct.rctl.throttle_max.

PR:		251803
Reported by:	chris@cretaforce.gr
Reviewed by:	kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D27858
2021-01-11 15:36:57 -08:00
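A minimal sketch of the zero-throttle handling described above, assuming the delay is computed by dividing an amount over the limit by the permitted rate. The function name, the parameters, and the clamp to the maximum are illustrative assumptions, not the rctl code.

```c
#include <stdint.h>

/*
 * Hypothetical model: how long to sleep, in milliseconds, to pay back
 * `excess` units at `rate_per_ms` units per millisecond.  A rate of
 * zero would divide by zero, so it is treated as "suspend for as long
 * as allowed", i.e. the configured maximum.
 */
static uint64_t
throttle_sleep_ms_sketch(uint64_t excess, uint64_t rate_per_ms,
    uint64_t throttle_max_ms)
{
	uint64_t ms;

	if (rate_per_ms == 0)
		return (throttle_max_ms);
	ms = excess / rate_per_ms;
	return (ms > throttle_max_ms ? throttle_max_ms : ms);
}
```

Here `throttle_max_ms` plays the role of the kern.racct.rctl.throttle_max duration named in the message.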
Konstantin Belousov
4ea65707d3 exec_new_vmspace: print useful error message on ctty if stack cannot be mapped.
After old vmspace is destroyed during execve(2), but before the new space
is fully constructed, an error during image activation cannot be returned
because there is no executing program to receive it.

In the relatively common case of failure to map stack, print some hints
on the control terminal.  Note that user has enough knobs to cause stack
mapping error, and this is the most common reason for execve(2) aborting
the process.

Requested by:	jhb
Reviewed by:	emaste, jhb
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D28050
2021-01-12 01:15:43 +02:00
Konstantin Belousov
2e1c94aa1f Implement enforcing write XOR execute mapping policy.
It is checked in vm_map_insert() and vm_map_protect() that PROT_WRITE |
PROT_EXEC are never specified together, if vm_map has MAP_WX flag set.
A FreeBSD control flag allows a specific binary to request a WX exemption, and
there are per-ABI boolean sysctls kern.elf{32,64}.allow_wx to enable or
disable the policy globally.

Reviewed by:	emaste, jhb
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D28050
2021-01-12 01:15:43 +02:00
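The policy check described above amounts to rejecting any request that asks for write and execute together when the map has enforcement turned on. A minimal sketch follows; the flag value and the function name are hypothetical, and only the PROT_* constants are the standard ones from <sys/mman.h>.

```c
#include <stdbool.h>
#include <sys/mman.h>	/* PROT_READ, PROT_WRITE, PROT_EXEC */

/* Hypothetical per-map flag meaning "enforce write XOR execute". */
#define	SKETCH_MAP_WX	0x01

/* True if the requested protection must be refused under the policy. */
static bool
wx_denied_sketch(int map_flags, int prot)
{
	if ((map_flags & SKETCH_MAP_WX) == 0)
		return (false);		/* policy not enabled for this map */
	return ((prot & (PROT_WRITE | PROT_EXEC)) ==
	    (PROT_WRITE | PROT_EXEC));
}
```

Requests for read/write or read/execute alone pass; only the combination is refused, and only for maps that carry the flag.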
Robert Watson
30b68ecda8 Changes that improve DTrace FBT reliability on freebsd/arm64:
- Implement a dtrace_getnanouptime(), matching the existing
  dtrace_getnanotime(), to avoid DTrace calling out to a potentially
  instrumentable function.

  (These should probably both be under KDTRACE_HOOKS.  Also, it's not clear
  to me that they are correct implementations for the DTrace thread time
  functions they are used in .. fixes for another commit.)

- Don't allow FBT to instrument functions involved in EL1 exception handling
  that are involved in FBT trap processing: handle_el1h_sync() and
  do_el1h_sync().

- Don't allow FBT to instrument DDB and KDB functions, as that makes it
  rather harder to debug FBT problems.

Prior to these changes, use of FBT on FreeBSD/arm64 rapidly led to kernel
panics due to recursion in DTrace.

Reliable FBT on FreeBSD/arm64 is reliant on another change from @andrew to
have the aarch64 instrumentor more carefully check that instructions it
replaces are against the stack pointer, which can otherwise lead to memory
corruption.  That change remains under review.

MFC after:	2 weeks
Reviewed by:	andrew, kp, markj (earlier version), jrtc27 (earlier version)
Differential revision:	https://reviews.freebsd.org/D27766
2021-01-11 15:42:22 +00:00
Robert Watson
4f2cbaf3cd Track pipe(2) reads and writes as rusage message receives and sends, a
feature misplaced during the transition from BSD 4.4's socket implementation
to the optimised FreeBSD pipe implementation.

MFC after:		1 week
Reviewed by:		arichardson, imp
Differential Revision:	https://reviews.freebsd.org/D27878
2021-01-10 12:16:39 +00:00
Jamie Gritton
2a4b225146 jail: Simplify handling of prison_deref()
Track the current lock/reference state in a single variable,
rather than deducing the proper prison_deref() flags from a
combination of equations and hard-coded values.
2021-01-09 21:05:06 -08:00
Konstantin Belousov
5844bd058a jobc: rework detection of orphaned groups.
Instead of trying to maintain pg_jobc counter on each process group
update (and sometimes before), just calculate the counter when needed.
Still, for the benefit of the signal delivery code, explicitly mark
orphaned groups as such with the new process group flag.

This way we prevent bugs in the corner cases where updates to the counter
were missed due to complicated configuration of p_pptr/p_opptr/real_parent
(debugger).

Since we need to iterate over all children of the process on exit, this
change mostly affects the process group entry and leave, where we need
to iterate all process group members to detect orphaned status.

(For MFC, keep pg_jobc around but unused).

Reported by:	jhb
Reviewed by:	jilles
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27871
2021-01-10 04:41:20 +02:00
Konstantin Belousov
cf4f802e77 kinfo_proc: move job-control related data collection into a new helper.
This improves code structure and allows putting the lock asserts right
where the locks are needed.

Also move zeroing of the kinfo_proc structure from fill_kinfo_proc_only()
to fill_kinfo_proc(), this looks more symmetrical.

Reviewed by:	jilles
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27871
2021-01-10 04:41:20 +02:00
Konstantin Belousov
4daea93813 Lock proctree around fill_kinfo_proc().
Proctree lock is needed for correct calculation and collection of the
job-control related data in kinfo_proc.  There was even an XXX comment
about it.

Satisfy locking and lock ordering requirements by taking proctree lock
around pass over each bucket in proc_iterate(), and in sysctl_kern_proc()
and note_procstat_proc() for individual process reporting.

Reviewed by:	jilles
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27871
2021-01-10 04:41:20 +02:00
Konstantin Belousov
a008bdeda3 tty_wait_background: improve locking.
Increase the scope of the process group lock ownership.  This ensures that
we are consistent in returning EIO for tty write from an orphan and delivery
of TTYOUT signals.

Reviewed by:	jilles
Tested by:	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27871
2021-01-10 04:41:20 +02:00
Konstantin Belousov
ef739c7373 pgrp: Prevent use after free.
Often, we have a process locked and need to get locked process group.
In this case, because process group lock is before process lock,
unlocking process allows the group to be freed.  See for instance
tty_wait_background().

Make pgrp structures allocated from nofree zone, and ensure type stability
of the pgrp mutex.

Reviewed by:	jilles
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27871
2021-01-10 04:41:19 +02:00
Konstantin Belousov
e0d83cd3e4 issignal(): when handling STOP-like signals, drop sigacts mutex earlier.
Reviewed by:	jilles
Tested by:	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27871
2021-01-10 04:41:19 +02:00
Konstantin Belousov
993a1699b1 Style. Improve some KASSERT messages.
Reviewed by:	jilles
Tested by:	pho
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27871
2021-01-10 04:41:19 +02:00
Michael Tuexen
6685e259e3 tcp: don't use KTLS socket option on listening sockets
KTLS socket options make use of socket buffers, which are not
available for listening sockets.

Reported by:		syzbot+a8829e888a93a4a04619@syzkaller.appspotmail.com
Reviewed by:		jhb@
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D27948
2021-01-08 08:57:11 +01:00
Jan Kokemüller
4d0c33be63 kevent(2): Bugfix for wrong EVFILT_TIMER timeouts
When using NOTE_NSECONDS in the kevent(2) API, NS_TO_SBT should be
used instead of US_TO_SBT, otherwise the timeout results are
misleading.

PR:		252539
Reviewed by:	kevans, kib
Approved by:	kevans
MFC after:	3 weeks
2021-01-09 20:00:25 +01:00
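NOTE_NSECONDS supplies the timer value as a nanosecond count, so it has to go through a nanosecond conversion; feeding the same number to a microsecond conversion produces an interval 1000 times too long. The sketch below illustrates this with simplified, hypothetical helpers over a 32.32 fixed-point time value. They are not the kernel's NS_TO_SBT and US_TO_SBT macros, only a model of what each unit conversion does.

```c
#include <stdint.h>

/* Simplified 32.32 fixed-point time: whole seconds in the upper 32 bits. */
typedef int64_t sbt_sketch_t;

/* Convert a nanosecond count (hypothetical helper). */
static sbt_sketch_t
ns_to_sbt_sketch(int64_t ns)
{
	int64_t sec = ns / 1000000000;
	int64_t frac = ns % 1000000000;

	return (sec * ((int64_t)1 << 32) + (frac << 32) / 1000000000);
}

/* Convert a microsecond count (hypothetical helper). */
static sbt_sketch_t
us_to_sbt_sketch(int64_t us)
{
	int64_t sec = us / 1000000;
	int64_t frac = us % 1000000;

	return (sec * ((int64_t)1 << 32) + (frac << 32) / 1000000);
}
```

Passing 1,000,000,000 (one second expressed in nanoseconds) to the nanosecond helper yields one second, while the microsecond helper turns the same number into 1000 seconds.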
Warner Losh
40e6e2c2f7 sysctl: improve debug.kdb.panic_str description
Improve the wording for this sysctl.

Submitted by: rpokala@
2021-01-09 11:10:42 -07:00
Warner Losh
936440560b sysctl: implement debug.kdb.panic_str
This is just like debug.kdb.panic, except the string that's passed in
is reported in the panic message. This allows people with automated
systems to collect kernel panics over a large fleet of machines to
flag panics better. Strings like "Warner look at this hang" or "see
JIRA ABC-1234 for details" allow these automated systems to route the
forced panic to the appropriate engineers like you can with other
types of panics. Other uses are likely possible.

Relnotes: Yes
Sponsored by: Netflix
Reviewed by: allanjude (earlier version)
Suggestions from review folded in by: 0mp, emaste, lwhsu
Differential Revision: https://reviews.freebsd.org/D28041
2021-01-08 14:30:28 -07:00
Andrew Gallatin
52cd25eb1a mbuf: enable ext_pgs ("unmapped") mbufs by default
Ext_pg mbufs allow carrying multiple pages per mbuf. This
reduces mbuf linked list traversals, especially in socket
buffers, thereby reducing cache misses and CPU use for
applications using sendfile.  Note that ext_pages use
unmapped pages, eliminating KVA mapping costs on 32-bit
platforms.

Ext_pg mbufs are also required for ktls (KERN_TLS), and having
them disabled by default is a stumbling block for those
wishing to enable ktls.

Reviewed-by:	jhb, glebius
Sponsored by:	Netflix
2021-01-08 13:43:30 -05:00
Mateusz Guzik
8ddea0b127 cache: just assign ni_resflags = NIRES_ABS
It is guaranteed to be 0 on entry.
2021-01-08 13:57:10 +00:00
Toomas Soome
742653ebd5 sysctl debug.dump_modinfo should recognize font module
Add MODINFOMD_FONT to dump list.
2021-01-08 09:24:49 +02:00
Alan Somers
20321e6225 Regenerate syscall files after reallocation of aio_writev/aio_readv 2021-01-07 19:50:32 -07:00
Alan Somers
b3286afae3 Reallocate syscall numbers for aio_writev and aio_readv
The originally chosen numbers interfere with downstream projects'
syscalls.  Move them to the end of the syscall table instead.

Reported by:	jrtc27
Reviewed by:	brooks
MFC-With:	022ca2fc7f
Differential Revision:	022ca2fc7f
2021-01-07 19:49:27 -07:00
Thomas Munro
801ac943ea aio_fsync(2): Support O_DSYNC.
aio_fsync(O_DSYNC, ...) is the asynchronous version of fdatasync(2).

Reviewed by: kib, asomers, jhb
Differential Review: https://reviews.freebsd.org/D25071
2021-01-08 13:15:56 +13:00
Thomas Munro
a5e284038e open(2): Add O_DSYNC flag.
POSIX O_DSYNC means that writes include an implicit fdatasync(2), just
as O_SYNC implies fsync(2).

VOP_WRITE() functions that understand the new IO_DATASYNC flag can act
accordingly, but we'll still pass down IO_SYNC so that file systems that
don't understand it will continue to provide the stronger O_SYNC
behaviour.

Flag also applies to fcntl(2).

Reviewed by: kib, delphij
Differential Revision: https://reviews.freebsd.org/D25090
2021-01-08 13:15:56 +13:00
Mateusz Guzik
71bd18d373 fd: use seqc_read_notmodify when translating fds 2021-01-07 23:30:04 +00:00
Mateusz Guzik
20ac5cda96 fd: make fd/fp mandatory
They are both always passed anyway.
2021-01-07 23:30:04 +00:00
Mateusz Guzik
fee405e057 cache: stop checkpointing cn_flags
They are only modified, if ever, for the last component.
2021-01-07 23:29:52 +00:00
Mateusz Guzik
ac7715471c cache: stop checkpointing cn_nameptr
For aborts cn_nameptr is the same as cn_pnbuf. For partial results
the same cn_nameptr is to be used.
2021-01-07 23:29:38 +00:00
Mateusz Guzik
0f1fc3a31f cache: stop manipulating pathlen
It is a copy-pasto from regular lookup. Add debug to ensure the result
is the same.
2021-01-07 23:26:53 +00:00
Chuck Silvers
11403bdeb4 vfs: fix rangelock range in vn_rdwr() for IO_APPEND
vn_rdwr() must lock the entire file range for IO_APPEND
just like vn_io_fault() does for O_APPEND.

Reviewed by:	kib, imp, mckusick
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D28008
2021-01-07 13:37:35 -08:00
Mateusz Guzik
f2b794e1e9 cache: unengrish the comment in previous commit
Reported by:	rpokala, brd
2021-01-06 23:46:05 +00:00
Mateusz Guzik
deabdc6868 cache: stop pre-checking seqc when starting the lookup
Tested by:	pho
2021-01-06 07:28:07 +00:00
Mateusz Guzik
71a6a0b545 cache: skip checking for spurious slashes if possible
Tested by:	pho
2021-01-06 07:28:06 +00:00
Mateusz Guzik
33f3e81df5 cache: combine fast path enabled status into one flag
Tested by:	pho
2021-01-06 07:28:06 +00:00
Mateusz Guzik
dbbbc07cc3 cache: split handling of 0 and non-0 error codes
Tested by:	pho
2021-01-06 07:07:24 +01:00
Mateusz Guzik
a1a8f8ada1 cache: deinline state handling
The intent is to reduce branchfest when finishing the lookup.

Tested by:	pho
2021-01-06 07:05:22 +01:00
Mateusz Guzik
05803be000 cache: stop setting cn_nameptr on entry as it matches cn_pnbuf already
While here tidy up other asserts.
2021-01-06 07:03:41 +01:00
Mateusz Guzik
3814bea00a cache: drop the now spurious doomed check when crossing a mount point 2021-01-03 21:22:16 +00:00
Mateusz Guzik
33a195baf3 vfs: keep seqc unchanged as long as the vnode is accessible via SMR 2021-01-03 21:22:16 +00:00
Mark Johnston
214257da3a sendfile: Clear page pointers when handling a pager error
When INVARIANTS is configured, the sendfile_iodone() callback verifies
that pages attached to the sendfile header are wired, but we unwire all
such pages after a synchronous pager error, before calling
sendfile_iodone().

Reported by:	pho
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2021-01-03 11:50:31 -05:00
Mark Johnston
90f580b954 Ensure that dirent's d_off field is initialized
We have the d_off field in struct dirent for providing the seek offset
of the next directory entry.  Several filesystems were not initializing
the field, which ends up being copied out to userland.

Reported by:	Syed Faraz Abrar <faraz@elttam.com>
Reviewed by:	kib
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27792
2021-01-03 11:50:31 -05:00
Mateusz Guzik
82397d7919 vfs: denote vnode being a mount point with VIRF_MOUNTPOINT
Reviewed by:	kib (previous version)
Differential Revision:	https://reviews.freebsd.org/D27794
2021-01-03 06:50:06 +00:00
Mateusz Guzik
3e506a67bb vfs: add v_irflag accessors
Reviewed by:	kib (previous version)
Differential Revision:	https://reviews.freebsd.org/D27793
2021-01-03 06:50:06 +00:00
Mateusz Guzik
51bf55fa6c cache: stop checkpointing cn_namelen
The variable is recomputed by regular lookup from the get go.
2021-01-03 06:50:06 +00:00
Mateusz Guzik
7220a10b5b cache: predict on no spurious slashes in cache_fpl_handle_root
This is a step towards speculatively not handling them.
2021-01-03 06:50:06 +00:00
Mateusz Guzik
30a2fc91fa cache: postpone NAME_MAX check as it may be unnecessary 2021-01-03 06:50:06 +00:00
Mateusz Guzik
eca899bd5d cache: remove spurious null check in sdt probe 2021-01-03 06:50:06 +00:00
Alan Somers
1868a91fac Regenerate syscall files after addition of aio_writev/aio_readv 2021-01-02 19:57:58 -07:00
Alan Somers
022ca2fc7f Add aio_writev and aio_readv
POSIX AIO is great, but it lacks vectored I/O functions. This commit
fixes that shortcoming by adding aio_writev and aio_readv. They aren't
part of the standard, but they're an obvious extension. They work just
like their synchronous equivalents pwritev and preadv.

It isn't yet possible to use vectored aiocbs with lio_listio, but that
could be added in the future.

Reviewed by:    jhb, kib, bcr
Relnotes:       yes
Differential Revision: https://reviews.freebsd.org/D27743
2021-01-02 19:57:58 -07:00
Jamie Gritton
b58a46347c jail: revert the attachment part of b4e87a6329
The change to kern_jail_set that was supposed to "also properly clean
up when attachment fails" didn't fix a memory leak but actually caused
a double free.  Back that part out, and leave the part that manages
allprison_lock state.
2020-12-31 19:55:49 -08:00
Mateusz Guzik
1365b5f86f cache: fold NCF_WHITE check into the rest
Tested by:	pho
2021-01-01 00:10:43 +00:00
Mateusz Guzik
d7c62d98c9 cache: call cache_fplookup_modifying in neg
Tested by:	pho
2021-01-01 00:10:43 +00:00
Mateusz Guzik
6fe7de1a25 cache: refactor cache_fpl_handle_root to fit the rest of the code better
Tested by:	pho
2021-01-01 00:10:43 +00:00
Mateusz Guzik
e17e01bd0e cache: refactor dot handling
Tested by:	pho
2021-01-01 00:10:43 +00:00
Mateusz Guzik
4651db56c7 cache: remove a branch from mount point checking
Tested by:	pho
2021-01-01 00:10:42 +00:00
Mateusz Guzik
0b5bd1afd8 cache: support lockless lookup of degenerate paths
Tested by:	pho
2021-01-01 00:10:42 +00:00
Mateusz Guzik
1d6eb97677 cache: save on branching when parsing the path by inserting a sentinel
Tested by:	pho
2021-01-01 00:10:42 +00:00
Mateusz Guzik
67297766b5 cache: hoist trailing slash and degenerate path handling out of the loop
Tested by:	pho
2021-01-01 00:10:42 +00:00
Mateusz Guzik
bb3a12f0e5 fd: inline pwd_get_smr
Tested by:	pho
2021-01-01 00:10:42 +00:00
John Baldwin
825d234144 Don't check P_INMEM in kdb_thr_*().
Not all debugger operations that enumerate threads require thread
stacks to be resident in memory to be useful.  Instead, push P_INMEM
checks (if needed) into callers.

Reviewed by:	kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D27827
2020-12-31 16:01:12 -08:00
John Baldwin
9acce1c992 Enumerate processes via the pid hash table in kdb_thr_*().
Processes part way through exit1() are not included in allproc.  Using
allproc to enumerate processes prevented getting the stack trace of a
thread in this part of exit1() via ddb.

Reviewed by:	kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D27826
2020-12-31 16:00:54 -08:00
John Baldwin
4e7d1b527c Add a proc_off_p_hash helper variable.
This is used by kernel debuggers to enumerate processes via the pid
hash table.

Reviewed by:	kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D27825
2020-12-31 16:00:33 -08:00
John Baldwin
47877889f2 ddb ps: Use the pidhash to enumerate processes not in allproc.
Exiting processes that have been removed from allproc but are still
executing are not yet marked PRS_ZOMBIE, so they were not listed (for
example, if a thread panics during exit1()).  To detect these
processes, clear p_list.le_prev to NULL explicitly after removing a
process from the allproc list and check for this sentinel rather than
PRS_ZOMBIE when walking the pidhash.

While here, simplify the pidhash walk to use a single outer loop.

Reviewed by:	kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D27824
2020-12-31 16:00:05 -08:00
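The sentinel technique from the commit above can be sketched with the queue(3) list macros: after unlinking an entry, its `le_prev` pointer is cleared explicitly so that later code can tell "no longer on the list" apart from "still linked". The struct and function names are hypothetical; only LIST_ENTRY, LIST_HEAD, LIST_REMOVE and the `le_prev` field come from <sys/queue.h>.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical, trimmed-down process record. */
struct proc_sketch {
	int	pid;
	LIST_ENTRY(proc_sketch) p_list;	/* linkage on the "allproc" list */
};
LIST_HEAD(proclist_sketch, proc_sketch);

/* Unlink the entry and leave a NULL sentinel behind. */
static void
allproc_remove_sketch(struct proc_sketch *p)
{
	LIST_REMOVE(p, p_list);
	p->p_list.le_prev = NULL;
}

/* A linked entry always has a non-NULL le_prev; the sentinel does not. */
static bool
on_allproc_sketch(const struct proc_sketch *p)
{
	return (p->p_list.le_prev != NULL);
}
```

A walker over a second index, such as the pid hash in the message, can call the check on each entry to pick out the ones already removed from the main list.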
Jamie Gritton
b4e87a6329 jail: Clean up allprison_lock handling in kern_jail_set
Keep explicit track of the allprison_lock state during the final part
of kern_jail_set, instead of deducing it from the JAIL_ATTACH flag.

Also properly clean up when the attachment fails, fixing a long-
standing (though minor) memory leak.
2020-12-31 15:18:43 -08:00
Mateusz Guzik
0c09f4b0cc cache: work around corner case of dvp == tvp in cache_fplookup_final_modifying
Fixes a panic where the kernel would unlock an unheld lock coming from
rename looking up "foo/." as the source.

Reported by:	markj (syzkaller)
2020-12-28 21:38:20 +00:00
Mateusz Guzik
4ab7d9f484 cache: reduce engrish in previous commit 2020-12-28 02:05:30 +00:00
Mateusz Guzik
0714f921cd cache: save on some branching in common case mount point traversal 2020-12-28 01:53:28 +00:00
Mateusz Guzik
8c9d74634a vfs: stop open-coding setting WILLBEDIR flag 2020-12-28 01:53:27 +00:00
Mateusz Guzik
002e18eb7f vfs: add FAILIFEXISTS flag
Both FreeBSD and Linux mkdir -p walk the tree up ignoring any EEXIST on
the way and both are used a lot when building respective kernels.

This poses a problem as spurious locking avoidably interferes with
concurrent operations like getdirentries on affected directories.

Work around the problem by adding FAILIFEXISTS flag. In case of lockless
lookup this manages to avoid any work to begin with, there is no speed
up for the locked case but perhaps this can be augmented later on.

For simplicity the only supported semantics are as used by mkdir.

Reviewed by:	kib (previous version)
Differential Revision:	https://reviews.freebsd.org/D27789
2020-12-28 01:53:27 +00:00
Mateusz Guzik
ff97bc034f cache: simplify lockless dot lookups 2020-12-28 01:53:27 +00:00
Mateusz Guzik
abd7ded451 cache: modification and last entry filling support in lockless lookup v2
The previous patch failed to set the ISDOTDOT flag when appropriate,
which in turn failed to properly handle degenerate lookups.

While here sprinkle some extra assertions.

Tested by:	pho (previous version)
2020-12-27 21:03:18 +00:00
Mateusz Guzik
623daa69f9 cache: assert internal flags are not passed by namei 2020-12-27 19:49:24 +00:00
Mateusz Guzik
a1fc1f10c6 Revert "cache: modification and last entry filling support in lockless lookup"
This reverts commit 6dbb07ed68.

Some ports unreliably fail to build with rmdir getting ENOTEMPTY.
2020-12-27 19:02:29 +00:00
Mateusz Guzik
6dbb07ed68 cache: modification and last entry filling support in lockless lookup
Tested by:	pho (previous version)
2020-12-27 17:22:25 +00:00
Konstantin Belousov
9dd48b87e6 Regen. 2020-12-27 12:57:27 +02:00
Konstantin Belousov
7a202823aa Expose eventfd in the native API/ABI using a new __specialfd syscall
eventfd is a Linux system call that produces special file descriptors
for event notification. When porting Linux software, it is currently
usually emulated by epoll-shim on top of kqueues.  Unfortunately, kqueues
are not passable between processes.  And, as noted by the author of
epoll-shim, even if they were, the library state would also have to be
passed somehow.  This came up when debugging strange HW video decode
failures in Firefox.  A native implementation would avoid these problems
and help with porting Linux software.

Since we now already have an eventfd implementation in the kernel (for
the Linuxulator), it's pretty easy to expose it natively, which is what
this patch does.
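
As a rough illustration of what native eventfd support offers to userland, the sketch below posts a value to an eventfd counter and reads it back. It assumes the interface is declared in <sys/eventfd.h> as on Linux; eventfd_roundtrip() is a hypothetical helper and not part of this change.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Post `value` to a fresh eventfd counter and read it back.
 * Returns the value read, or 0 on any failure.
 * Hypothetical helper for illustration only.
 */
static uint64_t
eventfd_roundtrip(uint64_t value)
{
	uint64_t out = 0;
	int fd;

	assert(value != 0);	/* reading a zero counter would block */
	fd = eventfd(0, 0);	/* initial counter 0, default flags */
	if (fd == -1)
		return (0);
	if (write(fd, &value, sizeof(value)) != (ssize_t)sizeof(value) ||
	    read(fd, &out, sizeof(out)) != (ssize_t)sizeof(out))
		out = 0;
	close(fd);
	return (out);
}
```

Unlike a kqueue-based emulation, the resulting descriptor is an ordinary file descriptor, so it can be inherited or passed to another process.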

Submitted by:   greg@unrelenting.technology
Reviewed by:    markj (previous version)
MFC after:      2 weeks
Differential Revision:  https://reviews.freebsd.org/D26668
2020-12-27 12:57:26 +02:00
Jamie Gritton
7f4e724829 jail: add a missing lock around an osd_jail_call().
allprison_lock should be at least held shared when jail OSD methods
are called.  Add a shared lock around one such call where that wasn't
the case.

In another such call, change an exclusive lock grab to be shared in
what is likely the more common case.
2020-12-26 20:49:30 -08:00
Jamie Gritton
0fe74ae624 jail: Consistently handle the pr_allow bitmask
Return a boolean (i.e. 0 or 1) from prison_allow, instead of the flag
value itself, which is what sysctl expects.

Add prison_set_allow(), which can set or clear a permission bit, and
propagates cleared bits down to child jails.

Use prison_allow() and prison_set_allow() in the various jail.allow.*
sysctls, and others that depend on those permissions.

Add locking around checking both pr_allow and pr_enforce_statfs in
prison_priv_check().
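
A minimal sketch of the boolean-versus-flag distinction described above. The flag values and the allow_check()/allow_set() names are hypothetical stand-ins for the PR_ALLOW_* bits and the prison functions, and propagation to child jails is not modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical permission bits, standing in for the PR_ALLOW_* flags. */
#define	ALLOW_SET_HOSTNAME	0x0001u
#define	ALLOW_MOUNT		0x0010u

/* Return 1 if `flag` is set in `allow`, 0 otherwise; never the raw bit. */
static int
allow_check(unsigned int allow, unsigned int flag)
{
	return ((allow & flag) != 0);
}

/* Set or clear a single permission bit in *allow. */
static void
allow_set(unsigned int *allow, unsigned int flag, bool enable)
{
	assert(flag != 0);
	if (enable)
		*allow |= flag;
	else
		*allow &= ~flag;
}
```

Returning `(allow & flag) != 0` rather than `allow & flag` matters for any caller that stores the result as a 0/1 value, such as a sysctl handler.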
2020-12-26 20:25:02 -08:00
Mark Johnston
26b23f07fb sendfile: Ensure that sfio->npages is initialized
We initialize sfio->npages only when some I/O is required to satisfy the
request.  However, sendfile_iodone() contains an INVARIANTS-only check
that references sfio->npages, and this check is executed even if no I/O
is performed, so the check may use an uninitialized value.

Fix the problem by initializing sfio->npages earlier.  Note that
sendfile_swapin() always initializes the page array.  In some rare cases
we need to trim the page array so ensure that sfio->npages gets updated
accordingly.

Reported by:		syzkaller (with KASAN)
Reviewed by:		kib
Sponsored by:		The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27726
2020-12-26 16:07:40 -05:00
Jamie Gritton
5d58f959d3 jail: Fix lock-free access to dynamic pr.allow flags
Use atomic access and a memory barrier to ensure that the flag parameter
in pr_flag_allow is indeed set after the rest of the structure is valid.

Simplify adding flag bits with pr_allow_all, a dynamic version of
PR_ALLOW_ALL_STATIC.
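
The publish-after-initialize pattern mentioned above can be sketched with C11 atomics as follows. This is not the kernel code: struct flag_entry and both functions are hypothetical, and the kernel uses its own atomic(9) primitives rather than <stdatomic.h>.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a dynamically registered permission entry. */
struct flag_entry {
	const char	*name;	/* plain field, written before publication */
	atomic_uint	 flag;	/* 0 means "entry not yet valid" */
};

/* Writer: fill in the entry, then publish the flag with release order. */
static void
entry_publish(struct flag_entry *e, const char *name, unsigned int flag)
{
	assert(flag != 0);
	e->name = name;
	atomic_store_explicit(&e->flag, flag, memory_order_release);
}

/* Reader: only touch the other fields after seeing a nonzero flag. */
static const char *
entry_name(struct flag_entry *e)
{
	if (atomic_load_explicit(&e->flag, memory_order_acquire) == 0)
		return (NULL);
	return (e->name);
}
```

The release store paired with the acquire load guarantees that a reader observing a nonzero flag also observes the name written before it.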
2020-12-26 12:53:28 -08:00
Jamie Gritton
7de883c82f jail: Fix an O(n^2) loop when adding jails
When a jail is added using the default (system-chosen) JID, and
non-default-JID jails already exist, a loop through the allprison
list could restart and result in unnecessary O(n^2) behaviour.
There should never be more than two list passes required.

Also clean up inefficient (though still O(n)) allprison list traversal
when finding jails by ID, or when adding jails in the common case of
all default JIDs.
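
To illustrate why a single pass can suffice, here is a simplified sketch that finds the lowest unused ID in a list that is assumed to be sorted by ID. It uses an array rather than the allprison list, ignores wraparound, and lowest_free_id() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the lowest ID >= 1 that does not appear in `used`, which must be
 * sorted in ascending order with no duplicates.  One pass, O(n).
 */
static int
lowest_free_id(const int *used, size_t n)
{
	int candidate = 1;

	for (size_t i = 0; i < n; i++) {
		assert(i == 0 || used[i - 1] < used[i]);
		if (used[i] > candidate)
			break;		/* gap found before used[i] */
		if (used[i] == candidate)
			candidate++;	/* taken, try the next ID */
	}
	return (candidate);
}
```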
2020-12-26 10:39:34 -08:00
Alan Somers
0120603891 AIO: remove the kaiocb->bio linkage
Vectored aio will require each aiocb to be associated with multiple
bios, so we can't store a link to the latter from the former.  But we
don't really need to.  aio_biowakeup already knows the bio it's using,
and the other fields can be stored within the bio and/or buf itself.

Also, remove the unused kaiocb.backend2 field.

Reviewed By:	kib
Differential Revision: https://reviews.freebsd.org/D27682
2020-12-23 16:06:15 +00:00
Mateusz Guzik
906a73e791 cache: fix up cache_hold_vnode comment 2020-12-23 07:24:29 +00:00
Andrew Gallatin
02bc3865aa Optionally bind ktls threads to NUMA domains
When ktls_bind_thread is 2, we pick a ktls worker thread that is
bound to the same domain as the TCP connection associated with
the socket. We use roughly the same code as netinet/tcp_hpts.c to
do this. This allows crypto to run on the same domain as the TCP
connection is associated with. Assuming TCP_REUSPORT_LB_NUMA
(D21636) is in place & in use, this ensures that the crypto source
and destination buffers are local to the same NUMA domain as we're
running crypto on.

This change (when TCP_REUSPORT_LB_NUMA, D21636, is used) reduces
cross-domain traffic from over 37% down to about 13% as measured
by pcm.x on a dual-socket Xeon using nginx and a Netflix workload.

Reviewed by:	jhb
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21648
2020-12-19 21:46:09 +00:00
Kyle Evans
54a837c8cc kern: cpuset: allow jails to modify child jails' roots
This partially lifts a restriction imposed by r191639 ("Prevent a superuser
inside a jail from modifying the dedicated root cpuset of that jail") that's
perhaps beneficial after r192895 ("Add hierarchical jails."). Jails still
cannot modify their own cpuset, but they can modify child jails' roots to
further restrict them or widen them back to the modifying jails' own mask.

As a side effect of this, the system root may once again widen the mask of
jails as long as they're still using a subset of the parent jails' mask.
This was previously prevented by the fact that cpuset_getroot of a root set
will return that root, rather than the root's parent -- cpuset_modify uses
cpuset_getroot since it was introduced in r327895, previously it was just
validating against set->cs_parent which allowed the system root to widen
jail masks.

Reviewed by:	jamie
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D27352
2020-12-19 03:30:06 +00:00
Konstantin Belousov
673e2dd652 Add ELF flag to disable ASLR stack gap.
Also centralize and unify checks to enable ASLR stack gap in a new
helper exec_stackgap().

PR:	239873
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2020-12-18 23:14:39 +00:00
John Baldwin
a095390344 Use a template assembly file for firmware object files.
Similar to r366897, this uses the .incbin directive to pull in a
firmware file's contents into a .fwo file.  The same scheme for
computing symbol names from the filename is used as before to maximize
compatibility and not require rebuilding existing .fwo files for
NO_CLEAN builds.  Using ld -o binary requires extra hacks in linkers
to either specify ABI options (e.g. soft- vs hard-float) or to ignore
ABI incompatibilities when linking certain objects (e.g.  object files
with only data).  Using the compiler driver avoids the need for these
hacks as the compiler driver is able to set all the appropriate ABI
options.

Reviewed by:	imp, markj
Obtained from:	CheriBSD
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D27579
2020-12-17 20:31:17 +00:00
Konstantin Belousov
551e205f6d Fix a race in tty_signal_sessleader() with unlocked read of s_leader.
Since we do not own the session lock, a parallel killjobc() might
reset s_leader to NULL after we checked it.  Read s_leader only once
and ensure that compiler is not allowed to reload.

While there, make access to t_session somewhat more pretty by using
local variable.
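
The "read once" idea can be sketched in portable C11 as below. The structures and leader_pid() are hypothetical stand-ins, and the sketch deliberately says nothing about keeping the pointed-to process alive, which the kernel has to handle separately.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct proc_stub { int pid; };		/* hypothetical stand-in */

struct session_stub {
	/* May be reset to NULL by another thread at any time. */
	_Atomic(struct proc_stub *) leader;
};

/*
 * Return the leader's pid, or -1 if there is none.  The pointer is loaded
 * exactly once into a local, so the NULL check and the dereference are
 * guaranteed to use the same value.
 */
static int
leader_pid(struct session_stub *s)
{
	struct proc_stub *p;

	assert(s != NULL);
	p = atomic_load_explicit(&s->leader, memory_order_relaxed);
	if (p == NULL)
		return (-1);
	return (p->pid);
}
```

With a plain (non-atomic, non-volatile) field the compiler would be free to load the pointer a second time after the NULL check, which is the reload the fix rules out.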

PR:	251915
Submitted by:	Jakub Piecuch <j.piecuch96@gmail.com>
MFC after:	1 week
2020-12-17 19:51:39 +00:00
Mateusz Guzik
57efe26bcb fd: reimplement close_range to avoid spurious relocking 2020-12-17 18:52:30 +00:00
Mateusz Guzik
08a5615cfe audit: rework AUDIT_SYSCLOSE
This in particular avoids spurious lookups on close.
2020-12-17 18:52:04 +00:00
Mateusz Guzik
1e71e7c4f6 fd: refactor closefp in preparation for close_range rework 2020-12-17 18:51:09 +00:00
Mateusz Guzik
08241fedc4 fd: remove redundant saturation check from fget_unlocked_seq
refcount_acquire_if_not_zero returns true on saturation.
The case of 0 is handled by looping again, after which the originally
found pointer will no longer be there.
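
For readers unfamiliar with the primitive, the general "acquire unless zero" technique looks roughly like this in C11. It is a sketch, not the refcount(9) implementation: saturation handling is omitted and ref_acquire_if_not_zero() is a hypothetical name.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Take a reference only if the count is currently nonzero.
 * Returns true if a reference was taken, false if the count was 0.
 */
static bool
ref_acquire_if_not_zero(atomic_uint *count)
{
	unsigned int old;

	assert(count != NULL);
	old = atomic_load_explicit(count, memory_order_relaxed);
	while (old != 0) {
		/* On failure `old` is refreshed with the current value. */
		if (atomic_compare_exchange_weak_explicit(count, &old,
		    old + 1, memory_order_acquire, memory_order_relaxed))
			return (true);
	}
	return (false);
}
```

A false return means the object was already on its way to being freed, so the caller has to retry its lookup rather than use the pointer.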

Noted by:	kib
2020-12-16 18:01:41 +00:00
Mateusz Guzik
6404d7ffc1 uipc: disable prediction in unp_pcb_lock_peer
The branch is not very predictable one way or the other, at least during
buildkernel where it only correctly matched 57% of calls.
2020-12-13 21:32:19 +00:00
Mateusz Guzik
8ab96e265d cache: fix ups bad predicts
- last level fallback normally sees CREATE; the code should be optimized to not
get there for said case
- fast path commonly fails with ENOENT
2020-12-13 21:29:39 +00:00
Mateusz Guzik
d48c2b8d29 vfs: correctly predict last fdrop on failed open
Arguably since the count is guaranteed to be 1 the code should be modified
to avoid the work.
2020-12-13 21:28:15 +00:00
Konstantin Belousov
203affb291 Fix TDP_WAKEUP/thr_wake(curthread->td_tid) after r366428.
Reported by:	arichardson
Reviewed by:	arichardson, markj
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D27597
2020-12-13 19:45:42 +00:00
Konstantin Belousov
0b459854bc Correct indent.
Sponsored by:	The FreeBSD Foundation
2020-12-13 19:43:45 +00:00
Mateusz Guzik
edcdcefb88 fd: fix fdrop prediction when closing a fd
Most of the time this is the last reference, contrary to typical fdrop use.
2020-12-13 18:06:24 +00:00
Ryan Libby
d3bbf8af68 cache_fplookup: quiet gcc -Wreturn-type
Reviewed by:	markj, mjg
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D27555
2020-12-11 22:51:44 +00:00
Mateusz Guzik
0ecce93dca fd: make serialization in fdescfree_fds conditional on hold count
p_fd nullification in fdescfree serializes against new threads transitioning
the count 1 -> 2, meaning that fdescfree_fds observing the count of 1 can
safely assume there is nobody else using the table. Losing the race and
observing > 1 is harmless.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D27522
2020-12-10 17:17:22 +00:00
Mark Johnston
3309fa7403 Plug a race between fd table teardown and several loops
To export information from fd tables we have several loops which do
this:

FILDESC_SLOCK(fdp);
for (i = 0; fdp->fd_refcount > 0 && i <= lastfile; i++)
	<export info for fd i>;
FILDESC_SUNLOCK(fdp);

Before r367777, fdescfree() acquired the fd table exclusive lock between
decrementing fdp->fd_refcount and freeing table entries.  This
serialized with the loop above, so the file at descriptor i would remain
valid until the lock is dropped.  Now there is no serialization, so the
loops may race with teardown of file descriptor tables.

Acquire the exclusive fdtable lock after releasing the final table
reference to provide a barrier synchronizing with these loops.

Reported by:	pho
Reviewed by:	kib (previous version), mjg
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27513
2020-12-09 14:05:08 +00:00
Mark Johnston
4c1c90ea95 Use refcount_load(9) to load fd table reference counts
No functional change intended.

Reviewed by:	kib, mjg
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27512
2020-12-09 14:04:54 +00:00
Kyle Evans
f1b18a668d cpuset_set{affinity,domain}: do not allow empty masks
cpuset_modify() would not currently catch this, because it only checks that
the new mask is a subset of the root set and circumvents the EDEADLK check
in cpuset_testupdate().

This change both directly validates the mask coming in since we can
trivially detect an empty mask, and it updates cpuset_testupdate to catch
stuff like this going forward by always ensuring we don't end up with an
empty mask.

The check_mask argument has been renamed because the 'check' verbiage does
not imply to me that it's actually doing a different operation. We're either
augmenting the existing mask, or we are replacing it entirely.
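
The two checks described above can be sketched with a single 64-bit word standing in for cpuset_t (an assumption made for brevity). mask_valid() and mask_narrow() are hypothetical names, not the kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Validate a requested CPU mask against the root set's mask. */
static bool
mask_valid(uint64_t requested, uint64_t root)
{
	if (requested == 0)
		return (false);		/* empty mask: nothing could run */
	return ((requested & ~root) == 0);	/* must be a subset */
}

/* Narrow *current by `limit`, refusing a result that would be empty. */
static bool
mask_narrow(uint64_t *current, uint64_t limit)
{
	uint64_t next;

	assert(current != NULL);
	next = *current & limit;
	if (next == 0)
		return (false);
	*current = next;
	return (true);
}
```

The subset test alone accepts an empty mask, since the empty set is a subset of everything; that is why the explicit zero check is needed.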

Reported by:	syzbot+4e3b1009de98d2fabcda@syzkaller.appspotmail.com
Discussed with:	andrew
Reviewed by:	andrew, markj
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D27511
2020-12-08 18:47:22 +00:00
Kyle Evans
b2780e8537 kern: cpuset: resolve race between cpuset_lookup/cpuset_rel
The race plays out like so between threads A and B:

1. A ref's cpuset 10
2. B does a lookup of cpuset 10, grabs the cpuset lock and searches
   cpuset_ids
3. A rel's cpuset 10 and observes the last ref, waits on the cpuset lock
   while B is still searching and not yet ref'd
4. B ref's cpuset 10 and drops the cpuset lock
5. A proceeds to free the cpuset out from underneath B

Resolve the race by only releasing the last reference under the cpuset lock.
Thread A now picks up the spinlock and observes that the cpuset has been
revived, returning immediately for B to deal with later.
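
A userland sketch of the rule "the last reference is only dropped under the lock". It is simplified: every reference operation takes a pthread mutex here, whereas the kernel keeps a lock-free path for non-final releases. All names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

struct obj {
	unsigned int	refs;	/* protected by registry_lock */
	bool		freed;	/* stands in for the real free */
};

static pthread_mutex_t registry_lock = PTHREAD_MUTEX_INITIALIZER;

/* Lookup path: the reference is taken while the lock is held. */
static void
obj_lookup_ref(struct obj *o)
{
	pthread_mutex_lock(&registry_lock);
	o->refs++;
	pthread_mutex_unlock(&registry_lock);
}

/*
 * Release path: the 1 -> 0 transition happens with the lock held, so a
 * concurrent lookup either revives the object first or never finds it.
 */
static void
obj_rel(struct obj *o)
{
	pthread_mutex_lock(&registry_lock);
	assert(o->refs > 0);
	if (--o->refs == 0)
		o->freed = true;	/* would unlink and free here */
	pthread_mutex_unlock(&registry_lock);
}
```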

Reported by:	syzbot+92dff413e201164c796b@syzkaller.appspotmail.com
Reviewed by:	markj
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D27498
2020-12-08 18:45:47 +00:00
Kyle Evans
9c83dab96c kern: cpuset: plug a unr leak
cpuset_rel_defer() is supposed to be functionally equivalent to
cpuset_rel() but with anything that might sleep deferred until
cpuset_rel_complete -- this setup is used specifically for cpuset_setproc.

Add in the missing unr free to match cpuset_rel. This fixes a leak that
was observed when I wrote a small userland application to try and debug
another issue, which effectively did:

cpuset(&newid);
cpuset(&scratch);

newid gets leaked when scratch is created; it's off the list, so there's
no mechanism for anything else to relinquish it. A more realistic reproducer
would likely be a process that inherits some cpuset that it's the only ref
for, but it creates a new one to modify. Alternatively, administratively
reassigning a process' cpuset that it's the last ref for will have the same
effect.

Discovered through D27498.

MFC after:	1 week
2020-12-08 18:44:06 +00:00
Mateusz Guzik
8fcfd0e222 vfs: add cleanup on error missed in r368375
Noted by:	jrtc27
2020-12-06 19:24:38 +00:00
Mateusz Guzik
60e2a0d9a4 vfs: factor buffer allocation/copyin out of namei 2020-12-06 04:59:24 +00:00
Mateusz Guzik
0c23d26230 vfs: keep bad ops on vnode reclaim
They were only modified to accommodate a redundant assertion.

This runs into problems as lockless lookup can still try to use the vnode
and crash instead of getting an error.

The bug was only present in kernels with INVARIANTS.

Reported by:	kevans
2020-12-05 05:56:23 +00:00
Konstantin Belousov
be2535b0a6 Add kern_ntp_adjtime(9).
Reviewed by:	brooks, cy
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D27471
2020-12-04 18:56:44 +00:00
Kyle Evans
34af05ead3 kern: soclose: don't sleep on SO_LINGER w/ timeout=0
This is a valid scenario that's handled in the various protocol layers where
it makes sense (e.g., tcp_disconnect and sctp_disconnect). Given that it
indicates we should immediately drop the connection, it makes little sense
to sleep on it.

This could lead to panics with INVARIANTS. On non-INVARIANTS kernels, this
could result in the thread hanging until a signal interrupts it if the
protocol does not mark the socket as disconnected for whatever reason.
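
For reference, this is how a userland program requests the behaviour in question, using the standard setsockopt(2) interface. set_linger_zero() is a hypothetical wrapper.

```c
#include <assert.h>
#include <sys/socket.h>

/*
 * Enable SO_LINGER with a zero timeout on `fd`, asking for the connection
 * to be dropped immediately on close.
 * Returns 0 on success, -1 on error (errno set by setsockopt).
 */
static int
set_linger_zero(int fd)
{
	struct linger l = { .l_onoff = 1, .l_linger = 0 };

	assert(fd >= 0);
	return (setsockopt(fd, SOL_SOCKET, SO_LINGER, &l, sizeof(l)));
}
```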

Reported by:	syzbot+e625d92c1dd74e402c81@syzkaller.appspotmail.com
Reviewed by:	glebius, markj
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D27407
2020-12-04 04:39:48 +00:00
Mark Johnston
b957b18594 Always use 64-bit physical addresses for dump_avail[] in minidumps
As of r365978, minidumps include a copy of dump_avail[].  This is an
array of vm_paddr_t ranges.  libkvm walks the array assuming that
sizeof(vm_paddr_t) is equal to the platform "word size", but that's not
correct on some platforms.  For instance, i386 uses a 64-bit vm_paddr_t.

Fix the problem by always dumping 64-bit addresses.  On platforms where
vm_paddr_t is 32 bits wide, namely arm and mips (sometimes), translate
dump_avail[] to an array of uint64_t ranges.  With this change, libkvm
no longer needs to maintain a notion of the target word size, so get rid
of it.

This is a no-op on platforms where sizeof(vm_paddr_t) == 8.
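
The translation step can be pictured as a simple widening copy. In this sketch uint32_t stands in for a 32-bit vm_paddr_t and widen_ranges() is a hypothetical name; the real code operates on dump_avail[] while writing the minidump.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Copy an array of 32-bit physical addresses into 64-bit slots so that
 * the dumped layout does not depend on the platform's address width.
 */
static void
widen_ranges(const uint32_t *src, uint64_t *dst, size_t n)
{
	assert(src != NULL && dst != NULL);
	for (size_t i = 0; i < n; i++)
		dst[i] = src[i];	/* zero-extends each address */
}
```

With a fixed 64-bit element size, a reader of the dump no longer has to guess the element width from the target's word size.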

Reviewed by:	alc, kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27082
2020-12-03 17:12:31 +00:00
Oleksandr Tymoshenko
18ce865a4f Add support for hw.physmem tunable for ARM/ARM64/RISC-V platforms
The hw.physmem tunable limits the amount of physical memory available to the
system. It's handled in machdep files for x86 and PowerPC. This patch adds
required logic to the consolidated physmem management interface that is used by
ARM, ARM64, and RISC-V.

Submitted by:	Klara, Inc.
Reviewed by:	mhorne
Sponsored by:	Ampere Computing
Differential Revision:	https://reviews.freebsd.org/D27152
2020-12-03 05:39:27 +00:00
Mateusz Guzik
10e64782ed select: make sure there are no wakeup attempts after selfdfree returns
Prior to the patch, a returning selfdfree could still be racing against
doselwakeup, which set sf_si = NULL and now locks stp to wake up the other thread.

A sufficiently unlucky pair can end up going all the way down to freeing
select-related structures before the lock/wakeup/unlock finishes.

This started manifesting itself as crashes since select data started getting
freed in r367714.
2020-12-02 00:48:15 +00:00
Konstantin Belousov
6814c2dac5 lio_listio(2): send signal even if number of jobs is zero.
Right now, if lio registered zero jobs, the syscall frees the lio job
structure, cleaning up the queued ksi.  As a result, the realtime signal is
dequeued and never delivered.

Fix it by allowing sendsig() to copy ksi when job count is zero.

PR: 220398
Reported and reviewed by:	asomers
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D27421
2020-12-01 22:53:33 +00:00
Konstantin Belousov
2933165666 vfs_aio.c: style.
Mostly re-wrap conditions to split after binary ops.

Reviewed by:	asomers
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D27421
2020-12-01 22:46:51 +00:00
Konstantin Belousov
5c5005ec20 vfs_aio.c: correct comment.
Reviewed by:	asomers
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D27421
2020-12-01 22:30:32 +00:00
Mark Johnston
dad22308a1 vmem: Revert r364744
A pair of bugs are believed to have caused the hangs described in the
commit log message for r364744:

1. uma_reclaim() could trigger reclamation of the reserve of boundary
   tags used to avoid deadlock.  This was fixed by r366840.
2. The loop in vmem_xalloc() would in some cases try to allocate more
   boundary tags than the expected upper bound of BT_MAXALLOC.  The
   reserve is sized based on the value BT_MAXALLOC, so this behaviour
   could deplete the reserve without guaranteeing a successful
   allocation, resulting in a hang.  This was fixed by r366838.

PR:		248008
Tested by:	rmacklem
2020-12-01 16:06:31 +00:00
Alexander V. Chernikov
8db8bebf1f Move inner loop logic out of sysctl_sysctl_next_ls().
Refactor sysctl_sysctl_next_ls():
* Move huge inner loop out of sysctl_sysctl_next_ls() into a separate
 non-recursive function, returning the next step to be taken.
* Update resulting node oid parts only on successful lookup
* Make sysctl_sysctl_next_ls() return boolean success/failure instead of errno,
 slightly simplifying logic

Reviewed by:	freqlabs
Differential Revision:	https://reviews.freebsd.org/D27029
2020-11-30 21:59:52 +00:00
Toomas Soome
93b18e3730 vt: if loader did pass the font via metadata, use it
The built-in 8x16 font may be far too small at large framebuffer
resolutions. To improve readability, use the loader-provided font.
2020-11-30 11:45:47 +00:00
Toomas Soome
a4a10b37d4 Add VT driver for VBE framebuffer device
Implement vt_vbefb to support Vesa Bios Extensions (VBE) framebuffer with VT.
vt_vbefb is based on vt_efifb and assumes similar data for
initialization. It uses MODINFOMD_VBE_FB to identify the structure vbe_fb
in kernel metadata.

struct vbe_fb is populated by the boot loader and passed to the kernel via
the metadata payload.

Differential Revision:	https://reviews.freebsd.org/D27373
2020-11-30 08:22:40 +00:00
Matt Macy
2338da0373 Import kernel WireGuard support
Data path largely shared with the OpenBSD implementation by
Matt Dunwoodie <ncon@nconroy.net>

Reviewed by:	grehan@freebsd.org
MFC after:	1 month
Sponsored by:	Rubicon LLC, (Netgate)
Differential Revision:	https://reviews.freebsd.org/D26137
2020-11-29 19:38:03 +00:00
Konstantin Belousov
a9d4fe977a bio aio: Destroy ephemeral mapping before unwiring page.
Apparently some architectures, like ppc in its hashed page tables
variants, account mappings by pmap_qenter() in the response from
pmap_is_page_mapped().

While there, eliminate useless userp variable.

Noted and reviewed by:	alc (previous version)
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D27409
2020-11-29 10:30:56 +00:00
Alexander Motin
83f6b50123 Remove alignment requirements for KVA buffer mapping.
After r368124 pbuf_zone has extra page to handle this particular case.
2020-11-29 01:30:17 +00:00