freebsd-dev

Author	SHA1	Message	Date
Mateusz Guzik	cdb62ab74e	vfs: add NDFREE_NOTHING and convert several NDFREE_PNBUF callers Check the comment above the routine for reasoning.	2021-01-12 13:16:10 +00:00
Mateusz Guzik	6b3a9a0f3d	Convert remaining cap_rights_init users to cap_rights_init_one semantic patch: @@ expression rights, r; @@ - cap_rights_init(&rights, r) + cap_rights_init_one(&rights, r)	2021-01-12 13:16:10 +00:00
Konstantin Belousov	57f22c828e	sigfastblock: do not skip cursig/postsig loop in ast() Even if sigfastblock block is non-zero, non-blockable signals must be checked on ast and delivered now. This also affects debugger ability to attach, because issignal() also calls ptracestop() if there is a pending stop for debugee. Instead of checking for sigfastblock, and either setting PENDING flag for usermode or doing signal delivery loop, always do the loop after checking, and then handle PENDING bit. issignal() already does the right thing for fast-blocked case, allowing only STOPs and SIGKILL delivery to happen. Reported by: Vasily Postnicov <shamaz.mazum@gmail.com>, markj Reviewed by: markj Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D28089	2021-01-12 12:45:26 +02:00
Konstantin Belousov	513320c0f1	sigfastblock_setpend(): do not set PEND user flag unless TDP_SIGFASTPENDING is set. User pending bit should not be set if kernel did not note a pending signal. Reviewed by: markj Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D28089	2021-01-12 12:43:34 +02:00
Alan Somers	ff1a307801	lio_listio: validate aio_lio_opcode Previously, we would accept any kind of LIO_* opcode, including ones that were intended for in-kernel use only like LIO_SYNC (which is not defined in userland). The situation became more serious with `022ca2fc7f`. After that revision, setting aio_lio_opcode to LIO_WRITEV or LIO_READV would trigger an assertion. Note that POSIX does not specify what should happen if aio_lio_opcode is invalid. MFC-with: `022ca2fc7f` Reviewed by: jhb, tmunro, 0mp Differential Revision: https://reviews.freebsd.org/D28078	2021-01-11 19:53:01 -07:00
Jason A. Harmening	e8a5a1ad71	rctl(4): support throttling resource usage to 0 For rate-based resources that support throttling (e.g. readiops/writeiops), this fixes a divide-by-zero panic when rctl(8) passes 0 as the throttle value. For these resources, treat zero-throttle requests as requests to suspend forward progress as long as possible using the duration specified in kern.racct.rctl.throttle_max. PR: 251803 Reported by: chris@cretaforce.gr Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D27858	2021-01-11 15:36:57 -08:00
Konstantin Belousov	4ea65707d3	exec_new_vmspace: print useful error message on ctty if stack cannot be mapped. After old vmspace is destroyed during execve(2), but before the new space is fully constructed, an error during image activation cannot be returned because there is no executing program to receive it. In the relatively common case of failure to map stack, print some hints on the control terminal. Note that user has enough knobs to cause stack mapping error, and this is the most common reason for execve(2) aborting the process. Requested by: jhb Reviewed by: emaste, jhb Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D28050	2021-01-12 01:15:43 +02:00
Konstantin Belousov	2e1c94aa1f	Implement enforcing write XOR execute mapping policy. It is checked in vm_map_insert() and vm_map_protect() that PROT_WRITE | PROT_EXEC are never specified together, if vm_map has MAP_WX flag set. FreeBSD control flag allows specific binary to request WX exempt, and there are per ABI boolean sysctls kern.elf{32,64}.allow_wx to enable/disable globally. Reviewed by: emaste, jhb Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D28050	2021-01-12 01:15:43 +02:00
Robert Watson	30b68ecda8	Changes that improve DTrace FBT reliability on freebsd/arm64: - Implement a dtrace_getnanouptime(), matching the existing dtrace_getnanotime(), to avoid DTrace calling out to a potentially instrumentable function. (These should probably both be under KDTRACE_HOOKS. Also, it's not clear to me that they are correct implementations for the DTrace thread time functions they are used in .. fixes for another commit.) - Don't allow FBT to instrument functions involved in EL1 exception handling that are involved in FBT trap processing: handle_el1h_sync() and do_el1h_sync(). - Don't allow FBT to instrument DDB and KDB functions, as that makes it rather harder to debug FBT problems. Prior to these changes, use of FBT on FreeBSD/arm64 rapidly led to kernel panics due to recursion in DTrace. Reliable FBT on FreeBSD/arm64 is reliant on another change from @andrew to have the aarch64 instrumentor more carefully check that instructions it replaces are against the stack pointer, which can otherwise lead to memory corruption. That change remains under review. MFC after: 2 weeks Reviewed by: andrew, kp, markj (earlier version), jrtc27 (earlier version) Differential revision: https://reviews.freebsd.org/D27766	2021-01-11 15:42:22 +00:00
Robert Watson	4f2cbaf3cd	Track pipe(2) reads and writes as rusage message receives and sends, a feature misplaced during the transition from BSD 4.4's socket implementation to the optimised FreeBSD pipe implementation. MFC after: 1 week Reviewed by: arichardson, imp Differential Revision: https://reviews.freebsd.org/D27878	2021-01-10 12:16:39 +00:00
Jamie Gritton	2a4b225146	jail: Simplify handling of prison_deref() Track the current lock/reference state in a single variable, rather than deducing the proper prison_deref() flags from a combination of equations and hard-coded values.	2021-01-09 21:05:06 -08:00
Konstantin Belousov	5844bd058a	jobc: rework detection of orphaned groups. Instead of trying to maintain pg_jobc counter on each process group update (and sometimes before), just calculate the counter when needed. Still, for the benefit of the signal delivery code, explicitly mark orphaned groups as such with the new process group flag. This way we prevent bugs in the corner cases where updates to the counter were missed due to complicated configuration of p_pptr/p_opptr/real_parent (debugger). Since we need to iterate over all children of the process on exit, this change mostly affects the process group entry and leave, where we need to iterate all process group members to detect orphaned status. (For MFC, keep pg_jobc around but unused). Reported by: jhb Reviewed by: jilles Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D27871	2021-01-10 04:41:20 +02:00
Konstantin Belousov	cf4f802e77	kinfo_proc: move job-control related data collection into a new helper. This improves code structure and allows to put the lock asserts right into place where the locks are needed. Also move zeroing of the kinfo_proc structure from fill_kinfo_proc_only() to fill_kinfo_proc(), this looks more symmetrical. Reviewed by: jilles Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D27871	2021-01-10 04:41:20 +02:00
Konstantin Belousov	4daea93813	Lock proctree around fill_kinfo_proc(). Proctree lock is needed for correct calculation and collection of the job-control related data in kinfo_proc. There was even an XXX comment about it. Satisfy locking and lock ordering requirements by taking proctree lock around pass over each bucket in proc_iterate(), and in sysctl_kern_proc() and note_procstat_proc() for individual process reporting. Reviewed by: jilles Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D27871	2021-01-10 04:41:20 +02:00
Konstantin Belousov	a008bdeda3	tty_wait_background: improve locking. Increase the scope of the process group lock ownership. This ensures that we are consistent in returning EIO for tty write from an orphan and delivery of TTYOUT signals. Reviewed by: jilles Tested by: pho MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D27871	2021-01-10 04:41:20 +02:00
Konstantin Belousov	ef739c7373	pgrp: Prevent use after free. Often, we have a process locked and need to get locked process group. In this case, because process group lock is before process lock, unlocking process allows the group to be freed. See for instance tty_wait_background(). Make pgrp structures allocated from nofree zone, and ensure type stability of the pgrp mutex. Reviewed by: jilles Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D27871	2021-01-10 04:41:19 +02:00
Konstantin Belousov	e0d83cd3e4	issignal(): when handling STOP-like signals, drop sigacts mutex earlier. Reviewed by: jilles Tested by: pho MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D27871	2021-01-10 04:41:19 +02:00
Konstantin Belousov	993a1699b1	Style. Improve some KASSERTs messages. Reviewed by: jilles Tested by: pho MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D27871	2021-01-10 04:41:19 +02:00
Michael Tuexen	6685e259e3	tcp: don't use KTLS socket option on listening sockets KTLS socket options make use of socket buffers, which are not available for listening sockets. Reported by: syzbot+a8829e888a93a4a04619@syzkaller.appspotmail.com Reviewed by: jhb@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D27948	2021-01-08 08:57:11 +01:00
Jan Kokemüller	4d0c33be63	kevent(2): Bugfix for wrong EVFILT_TIMER timeouts When using NOTE_NSECONDS in the kevent(2) API, US_TO_SBT should be used instead of NS_TO_SBT, otherwise the timeout results are misleading. PR: 252539 Reviewed by: kevans, kib Approved by: kevans MFC after: 3 weeks	2021-01-09 20:00:25 +01:00
Warner Losh	40e6e2c2f7	sysctl: improve debug.kdb.panic_str description Improve the wording for this sysctl. Submitted by: rpokala@	2021-01-09 11:10:42 -07:00
Warner Losh	936440560b	sysctl: implement debug.kdb.panic_str This is just like debug.kdb.panic, except the string that's passed in is reported in the panic message. This allows people with automated systems to collect kernel panics over a large fleet of machines to flag panics better. Strings like "Warner look at this hang" or "see JIRA ABC-1234 for details" allow these automated systems to route the forced panic to the appropriate engineers like you can with other types of panics. Other users are likely possible. Relnotes: Yes Sponsored by: Netflix Reviewed by: allanjude (earlier version) Suggestions from review folded in by: 0mp, emaste, lwhsu Differential Revision: https://reviews.freebsd.org/D28041	2021-01-08 14:30:28 -07:00
Andrew Gallatin	52cd25eb1a	mbuf: enable ext_pgs ("unmapped") mbufs by default Ext_pg mbufs allow carrying multiple pages per mbuf. This reduces mbuf linked list traversals, especially in socket buffers, thereby reducing cache misses and CPU use for applications using sendfile. Note that ext_pages use unmapped pages, eliminating KVA mapping costs on 32-bit platforms. Ext_pg mbufs are also required for ktls (KERN_TLS), and having them disabled by default is a stumbling block for those wishing to enable ktls. Reviewed-by: jhb, glebius Sponsored by: Netflix	2021-01-08 13:43:30 -05:00
Mateusz Guzik	8ddea0b127	cache: just assign ni_resflags = NIRES_ABS It is guaranteed to be 0 on entry.	2021-01-08 13:57:10 +00:00
Toomas Soome	742653ebd5	sysctl debug.dump_modinfo should recognize font module Add MODINFOMD_FONT to dump list.	2021-01-08 09:24:49 +02:00
Alan Somers	20321e6225	Regenerate syscall files after reallocation of aio_writev/aio_readv	2021-01-07 19:50:32 -07:00
Alan Somers	b3286afae3	Reallocate syscall numbers for aio_writev and aio_readv The originally chosen numbers interfere with downstream projects' syscalls. Move them to the end of the syscall table instead. Reported by: jrtc27 Reviewed by: brooks MFC-With: `022ca2fc7f` Differential Revision: `022ca2fc7f`	2021-01-07 19:49:27 -07:00
Thomas Munro	801ac943ea	aio_fsync(2): Support O_DSYNC. aio_fsync(O_DSYNC, ...) is the asynchronous version of fdatasync(2). Reviewed by: kib, asomers, jhb Differential Review: https://reviews.freebsd.org/D25071	2021-01-08 13:15:56 +13:00
Thomas Munro	a5e284038e	open(2): Add O_DSYNC flag. POSIX O_DSYNC means that writes include an implicit fdatasync(2), just as O_SYNC implies fsync(2). VOP_WRITE() functions that understand the new IO_DATASYNC flag can act accordingly, but we'll still pass down IO_SYNC so that file systems that don't understand it will continue to provide the stronger O_SYNC behaviour. Flag also applies to fcntl(2). Reviewed by: kib, delphij Differential Revision: https://reviews.freebsd.org/D25090	2021-01-08 13:15:56 +13:00
Mateusz Guzik	71bd18d373	fd: use seqc_read_notmodify when translating fds	2021-01-07 23:30:04 +00:00
Mateusz Guzik	20ac5cda96	fd: make fd/fp mandatory They are both always passed anyway.	2021-01-07 23:30:04 +00:00
Mateusz Guzik	fee405e057	cache: stop checkpointing cn_flags They are only modified, if ever, for the last component.	2021-01-07 23:29:52 +00:00
Mateusz Guzik	ac7715471c	cache: stop checkpointing cn_nameptr For aborts cn_nameptr is the same as cn_pnbuf. For partial results the same cn_nameptr is to be used.	2021-01-07 23:29:38 +00:00
Mateusz Guzik	0f1fc3a31f	cache: stop manipulating pathlen It is a copy-pasto from regular lookup. Add debug to ensure the result is the same.	2021-01-07 23:26:53 +00:00
Chuck Silvers	11403bdeb4	vfs: fix rangelock range in vn_rdwr() for IO_APPEND vn_rdwr() must lock the entire file range for IO_APPEND just like vn_io_fault() does for O_APPEND. Reviewed by: kib, imp, mckusick Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D28008	2021-01-07 13:37:35 -08:00
Mateusz Guzik	f2b794e1e9	cache: unengrish the comment in previous commit Reported by: rpokala, brd	2021-01-06 23:46:05 +00:00
Mateusz Guzik	deabdc6868	cache: stop pre-checking seqc when starting the lookup Tested by: pho	2021-01-06 07:28:07 +00:00
Mateusz Guzik	71a6a0b545	cache: skip checking for spurious slashes if possible Tested by: pho	2021-01-06 07:28:06 +00:00
Mateusz Guzik	33f3e81df5	cache: combine fast path enabled status into one flag Tested by: pho	2021-01-06 07:28:06 +00:00
Mateusz Guzik	dbbbc07cc3	cache: split handling of 0 and non-0 error codes Tested by: pho	2021-01-06 07:07:24 +01:00
Mateusz Guzik	a1a8f8ada1	cache: deinline state handling The intent is to reduce branchfest when finishing the lookup. Tested by: pho	2021-01-06 07:05:22 +01:00
Mateusz Guzik	05803be000	cache: stop setting cn_nameptr on entry as matches cn_pnbuf already While here tidy up other asserts.	2021-01-06 07:03:41 +01:00
Mateusz Guzik	3814bea00a	cache: drop the now spurious doomed check when crossing a mount point	2021-01-03 21:22:16 +00:00
Mateusz Guzik	33a195baf3	vfs: keep seqc unchanged as long as the vnode is accessible via SMR	2021-01-03 21:22:16 +00:00
Mark Johnston	214257da3a	sendfile: Clear page pointers when handling a pager error When INVARIANTS is configured, the sendfile_iodone() callback verifies that pages attached to the sendfile header are wired, but we unwire all such pages after a synchronous pager error, before calling sendfile_iodone(). Reported by: pho Tested by: pho Sponsored by: The FreeBSD Foundation	2021-01-03 11:50:31 -05:00
Mark Johnston	90f580b954	Ensure that dirent's d_off field is initialized We have the d_off field in struct dirent for providing the seek offset of the next directory entry. Several filesystems were not initializing the field, which ends up being copied out to userland. Reported by: Syed Faraz Abrar <faraz@elttam.com> Reviewed by: kib MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D27792	2021-01-03 11:50:31 -05:00
Mateusz Guzik	82397d7919	vfs: denote vnode being a mount point with VIRF_MOUNTPOINT Reviewed by: kib (previous version) Differential Revision: https://reviews.freebsd.org/D27794	2021-01-03 06:50:06 +00:00
Mateusz Guzik	3e506a67bb	vfs: add v_irflag accessors Reviewed by: kib (previous version) Differential Revision: https://reviews.freebsd.org/D27793	2021-01-03 06:50:06 +00:00
Mateusz Guzik	51bf55fa6c	cache: stop checkpointing cn_namelen The variable is recomputed by regular lookup from the get go.	2021-01-03 06:50:06 +00:00
Mateusz Guzik	7220a10b5b	cache: predict on no spurious slashes in cache_fpl_handle_root This is a step towards speculatively not handling them.	2021-01-03 06:50:06 +00:00
Mateusz Guzik	30a2fc91fa	cache: postpone NAME_MAX check as it may be unnecessary	2021-01-03 06:50:06 +00:00
Mateusz Guzik	eca899bd5d	cache: remove spurious null check in sdt probe	2021-01-03 06:50:06 +00:00
Alan Somers	1868a91fac	Regenerate syscall files after addition of aio_writev/aio_readv	2021-01-02 19:57:58 -07:00
Alan Somers	022ca2fc7f	Add aio_writev and aio_readv POSIX AIO is great, but it lacks vectored I/O functions. This commit fixes that shortcoming by adding aio_writev and aio_readv. They aren't part of the standard, but they're an obvious extension. They work just like their synchronous equivalents pwritev and preadv. It isn't yet possible to use vectored aiocbs with lio_listio, but that could be added in the future. Reviewed by: jhb, kib, bcr Relnotes: yes Differential Revision: https://reviews.freebsd.org/D27743	2021-01-02 19:57:58 -07:00
Jamie Gritton	b58a46347c	jail: revert the attachment part of `b4e87a6329` The change to kern_jail_set that was supposed to "also properly clean up when attachment fails" didn't fix a memory leak but actually caused a double free. Back that part out, and leave the part that manages allprison_lock state.	2020-12-31 19:55:49 -08:00
Mateusz Guzik	1365b5f86f	cache: fold NCF_WHITE check into the rest Tested by: pho	2021-01-01 00:10:43 +00:00
Mateusz Guzik	d7c62d98c9	cache: call cache_fplookup_modifying in neg Tested by: pho	2021-01-01 00:10:43 +00:00
Mateusz Guzik	6fe7de1a25	cache: refactor cache_fpl_handle_root to fit the rest of the code better Tested by: pho	2021-01-01 00:10:43 +00:00
Mateusz Guzik	e17e01bd0e	cache: refactor dot handling Tested by: pho	2021-01-01 00:10:43 +00:00
Mateusz Guzik	4651db56c7	cache: remove a branch from mount point checking Tested by: pho	2021-01-01 00:10:42 +00:00
Mateusz Guzik	0b5bd1afd8	cache: support lockless lookup of degenerate paths Tested by: pho	2021-01-01 00:10:42 +00:00
Mateusz Guzik	1d6eb97677	cache: save on branching when parsing the path by inserting a sentinel Tested by: pho	2021-01-01 00:10:42 +00:00
Mateusz Guzik	67297766b5	cache: hoist trailing slash and degenerate path handling out of the loop Tested by: pho	2021-01-01 00:10:42 +00:00
Mateusz Guzik	bb3a12f0e5	fd: inline pwd_get_smr Tested by: pho	2021-01-01 00:10:42 +00:00
John Baldwin	825d234144	Don't check P_INMEM in kdb_thr_*(). Not all debugger operations that enumerate threads require thread stacks to be resident in memory to be useful. Instead, push P_INMEM checks (if needed) into callers. Reviewed by: kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D27827	2020-12-31 16:01:12 -08:00
John Baldwin	9acce1c992	Enumerate processes via the pid hash table in kdb_thr_*(). Processes part way through exit1() are not included in allproc. Using allproc to enumerate processes prevented getting the stack trace of a thread in this part of exit1() via ddb. Reviewed by: kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D27826	2020-12-31 16:00:54 -08:00
John Baldwin	4e7d1b527c	Add a proc_off_p_hash helper variable. This is used by kernel debuggers to enumerate processes via the pid hash table. Reviewed by: kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D27825	2020-12-31 16:00:33 -08:00
John Baldwin	47877889f2	ddb ps: Use the pidhash to enumerate processes not in allproc. Exiting processes that have been removed from allproc but are still executing are not yet marked PRS_ZOMBIE, so they were not listed (for example, if a thread panics during exit1()). To detect these processes, clear p_list.le_prev to NULL explicitly after removing a process from the allproc list and check for this sentinel rather than PRS_ZOMBIE when walking the pidhash. While here, simplify the pidhash walk to use a single outer loop. Reviewed by: kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D27824	2020-12-31 16:00:05 -08:00
Jamie Gritton	b4e87a6329	jail: Clean up allprison_lock handling in kern_jail_set Keep explicit track of the allprison_lock state during the final part of kern_jail_set, instead of deducing it from the JAIL_ATTACH flag. Also properly clean up when the attachment fails, fixing a long-standing (though minor) memory leak.	2020-12-31 15:18:43 -08:00
Mateusz Guzik	0c09f4b0cc	cache: work around corner case of dvp == tvp in cache_fplookup_final_modifying Fixes a panic where the kernel would unlock an unheld lock coming from rename looking up "foo/." as the source. Reported by: markj (syzkaller)	2020-12-28 21:38:20 +00:00
Mateusz Guzik	4ab7d9f484	cache: reduce engrish in previous commit	2020-12-28 02:05:30 +00:00
Mateusz Guzik	0714f921cd	cache: save on some branching in common case mount point traversal	2020-12-28 01:53:28 +00:00
Mateusz Guzik	8c9d74634a	vfs: stop open-coding setting WILLBEDIR flag	2020-12-28 01:53:27 +00:00
Mateusz Guzik	002e18eb7f	vfs: add FAILIFEXISTS flag Both FreeBSD and Linux mkdir -p walk the tree up ignoring any EEXIST on the way and both are used a lot when building respective kernels. This poses a problem as spurious locking avoidably interferes with concurrent operations like getdirentries on affected directories. Work around the problem by adding FAILIFEXISTS flag. In case of lockless lookup this manages to avoid any work to begin with, there is no speed up for the locked case but perhaps this can be augmented later on. For simplicity the only supported semantics are as used by mkdir. Reviewed by: kib (previous version) Differential Revision: https://reviews.freebsd.org/D27789	2020-12-28 01:53:27 +00:00
Mateusz Guzik	ff97bc034f	cache: simplify lockless dot lookups	2020-12-28 01:53:27 +00:00
Mateusz Guzik	abd7ded451	cache: modification and last entry filling support in lockless lookup v2 The previous patch failed to set the ISDOTDOT flag when appropriate, which in turn failed to properly handle degenerate lookups. While here sprinkle some extra assertions. Tested by: pho (previous version)	2020-12-27 21:03:18 +00:00
Mateusz Guzik	623daa69f9	cache: assert internal flags are not passed by namei	2020-12-27 19:49:24 +00:00
Mateusz Guzik	a1fc1f10c6	Revert "cache: modification and last entry filling support in lockless lookup" This reverts commit `6dbb07ed68`. Some ports unreliably fail to build with rmdir getting ENOTEMPTY.	2020-12-27 19:02:29 +00:00
Mateusz Guzik	6dbb07ed68	cache: modification and last entry filling support in lockless lookup Tested by: pho (previous version)	2020-12-27 17:22:25 +00:00
Konstantin Belousov	9dd48b87e6	Regen.	2020-12-27 12:57:27 +02:00
Konstantin Belousov	7a202823aa	Expose eventfd in the native API/ABI using a new __specialfd syscall eventfd is a Linux system call that produces special file descriptors for event notification. When porting Linux software, it is currently usually emulated by epoll-shim on top of kqueues. Unfortunately, kqueues are not passable between processes. And, as noted by the author of epoll-shim, even if they were, the library state would also have to be passed somehow. This came up when debugging strange HW video decode failures in Firefox. A native implementation would avoid these problems and help with porting Linux software. Since we now already have an eventfd implementation in the kernel (for the Linuxulator), it's pretty easy to expose it natively, which is what this patch does. Submitted by: greg@unrelenting.technology Reviewed by: markj (previous version) MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D26668	2020-12-27 12:57:26 +02:00
Jamie Gritton	7f4e724829	jail: add a missing lock around an osd_jail_call(). allprison_lock should be at least held shared when jail OSD methods are called. Add a shared lock around one such call where that wasn't the case. In another such call, change an exclusive lock grab to be shared in what is likely the more common case.	2020-12-26 20:49:30 -08:00
Jamie Gritton	0fe74ae624	jail: Consistently handle the pr_allow bitmask Return a boolean (i.e. 0 or 1) from prison_allow, instead of the flag value itself, which is what sysctl expects. Add prison_set_allow(), which can set or clear a permission bit, and propagates cleared bits down to child jails. Use prison_allow() and prison_set_allow() in the various jail.allow.* sysctls, and others that depend on those permissions. Add locking around checking both pr_allow and pr_enforce_statfs in prison_priv_check().	2020-12-26 20:25:02 -08:00
Mark Johnston	26b23f07fb	sendfile: Ensure that sfio->npages is initialized We initialize sfio->npages only when some I/O is required to satisfy the request. However, sendfile_iodone() contains an INVARIANTS-only check that references sfio->npages, and this check is executed even if no I/O is performed, so the check may use an uninitialized value. Fix the problem by initializing sfio->npages earlier. Note that sendfile_swapin() always initializes the page array. In some rare cases we need to trim the page array so ensure that sfio->npages gets updated accordingly. Reported by: syzkaller (with KASAN) Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D27726	2020-12-26 16:07:40 -05:00
Jamie Gritton	5d58f959d3	jail: Fix lock-free access to dynamic pr.allow flags Use atomic access and a memory barrier to ensure that the flag parameter in pr_flag_allow is indeed set after the rest of the structure is valid. Simplify adding flag bits with pr_allow_all, a dynamic version of PR_ALLOW_ALL_STATIC.	2020-12-26 12:53:28 -08:00
Jamie Gritton	7de883c82f	jail: Fix an O(n^2) loop when adding jails When a jail is added using the default (system-chosen) JID, and non-default-JID jails already exist, a loop through the allprison list could restart and result in unnecessary O(n^2) behaviour. There should never be more than two list passes required. Also clean up inefficient (though still O(n)) allprison list traversal when finding jails by ID, or when adding jails in the common case of all default JIDs.	2020-12-26 10:39:34 -08:00
Alan Somers	0120603891	AIO: remove the kaiocb->bio linkage Vectored aio will require each aiocb to be associated with multiple bios, so we can't store a link to the latter from the former. But we don't really need to. aio_biowakeup already knows the bio it's using, and the other fields can be stored within the bio and/or buf itself. Also, remove the unused kaiocb.backend2 field. Reviewed By: kib Differential Revision: https://reviews.freebsd.org/D27682	2020-12-23 16:06:15 +00:00
Mateusz Guzik	906a73e791	cache: fix up cache_hold_vnode comment	2020-12-23 07:24:29 +00:00
Andrew Gallatin	02bc3865aa	Optionally bind ktls threads to NUMA domains When ktls_bind_thread is 2, we pick a ktls worker thread that is bound to the same domain as the TCP connection associated with the socket. We use roughly the same code as netinet/tcp_hpts.c to do this. This allows crypto to run on the same domain as the TCP connection is associated with. Assuming TCP_REUSPORT_LB_NUMA (D21636) is in place & in use, this ensures that the crypto source and destination buffers are local to the same NUMA domain as we're running crypto on. This change (when TCP_REUSPORT_LB_NUMA, D21636, is used) reduces cross-domain traffic from over 37% down to about 13% as measured by pcm.x on a dual-socket Xeon using nginx and a Netflix workload. Reviewed by: jhb Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21648	2020-12-19 21:46:09 +00:00
Kyle Evans	54a837c8cc	kern: cpuset: allow jails to modify child jails' roots This partially lifts a restriction imposed by r191639 ("Prevent a superuser inside a jail from modifying the dedicated root cpuset of that jail") that's perhaps beneficial after r192895 ("Add hierarchical jails."). Jails still cannot modify their own cpuset, but they can modify child jails' roots to further restrict them or widen them back to the modifying jails' own mask. As a side effect of this, the system root may once again widen the mask of jails as long as they're still using a subset of the parent jails' mask. This was previously prevented by the fact that cpuset_getroot of a root set will return that root, rather than the root's parent -- cpuset_modify uses cpuset_getroot since it was introduced in r327895, previously it was just validating against set->cs_parent which allowed the system root to widen jail masks. Reviewed by: jamie MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D27352	2020-12-19 03:30:06 +00:00
Konstantin Belousov	673e2dd652	Add ELF flag to disable ASLR stack gap. Also centralize and unify checks to enable ASLR stack gap in a new helper exec_stackgap(). PR: 239873 Sponsored by: The FreeBSD Foundation MFC after: 1 week	2020-12-18 23:14:39 +00:00
John Baldwin	a095390344	Use a template assembly file for firmware object files. Similar to r366897, this uses the .incbin directive to pull in a firmware file's contents into a .fwo file. The same scheme for computing symbol names from the filename is used as before to maximize compatibility and not require rebuilding existing .fwo files for NO_CLEAN builds. Using ld -o binary requires extra hacks in linkers to either specify ABI options (e.g. soft- vs hard-float) or to ignore ABI incompatibilities when linking certain objects (e.g. object files with only data). Using the compiler driver avoids the need for these hacks as the compiler driver is able to set all the appropriate ABI options. Reviewed by: imp, markj Obtained from: CheriBSD Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D27579	2020-12-17 20:31:17 +00:00
Konstantin Belousov	551e205f6d	Fix a race in tty_signal_sessleader() with unlocked read of s_leader. Since we do not own the session lock, a parallel killjobc() might reset s_leader to NULL after we checked it. Read s_leader only once and ensure that compiler is not allowed to reload. While there, make access to t_session somewhat more pretty by using local variable. PR: 251915 Submitted by: Jakub Piecuch <j.piecuch96@gmail.com> MFC after: 1 week	2020-12-17 19:51:39 +00:00
Mateusz Guzik	57efe26bcb	fd: reimplement close_range to avoid spurious relocking	2020-12-17 18:52:30 +00:00
Mateusz Guzik	08a5615cfe	audit: rework AUDIT_SYSCLOSE This in particular avoids spurious lookups on close.	2020-12-17 18:52:04 +00:00
Mateusz Guzik	1e71e7c4f6	fd: refactor closefp in preparation for close_range rework	2020-12-17 18:51:09 +00:00
Mateusz Guzik	08241fedc4	fd: remove redundant saturation check from fget_unlocked_seq refcount_acquire_if_not_zero returns true on saturation. The case of 0 is handled by looping again, after which the originally found pointer will no longer be there. Noted by: kib	2020-12-16 18:01:41 +00:00
Mateusz Guzik	6404d7ffc1	uipc: disable prediction in unp_pcb_lock_peer The branch is not very predictable one way or the other, at least during buildkernel where it only correctly matched 57% of calls.	2020-12-13 21:32:19 +00:00
Mateusz Guzik	8ab96e265d	cache: fix ups bad predicts - last level fallback normally sees CREATE; the code should be optimized to not get there for said case - fast path commonly fails with ENOENT	2020-12-13 21:29:39 +00:00
Mateusz Guzik	d48c2b8d29	vfs: correctly predict last fdrop on failed open Arguably since the count is guaranteed to be 1 the code should be modified to avoid the work.	2020-12-13 21:28:15 +00:00
Konstantin Belousov	203affb291	Fix TDP_WAKEUP/thr_wake(curthread->td_tid) after r366428. Reported by: arichardson Reviewed by: arichardson, markj Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D27597	2020-12-13 19:45:42 +00:00
Konstantin Belousov	0b459854bc	Correct indent. Sponsored by: The FreeBSD Foundation	2020-12-13 19:43:45 +00:00
Mateusz Guzik	edcdcefb88	fd: fix fdrop prediction when closing a fd Most of the time this is the last reference, contrary to typical fdrop use.	2020-12-13 18:06:24 +00:00
Ryan Libby	d3bbf8af68	cache_fplookup: quiet gcc -Wreturn-type Reviewed by: markj, mjg Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D27555	2020-12-11 22:51:44 +00:00
Mateusz Guzik	0ecce93dca	fd: make serialization in fdescfree_fds conditional on hold count p_fd nullification in fdescfree serializes against new threads transitioning the count 1 -> 2, meaning that fdescfree_fds observing the count of 1 can safely assume there is nobody else using the table. Losing the race and observing > 1 is harmless. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D27522	2020-12-10 17:17:22 +00:00
Mark Johnston	3309fa7403	Plug a race between fd table teardown and several loops To export information from fd tables we have several loops which do this: FILDESC_SLOCK(fdp); for (i = 0; fdp->fd_refcount > 0 && i <= lastfile; i++) <export info for fd i>; FILDESC_SUNLOCK(fdp); Before r367777, fdescfree() acquired the fd table exclusive lock between decrementing fdp->fd_refcount and freeing table entries. This serialized with the loop above, so the file at descriptor i would remain valid until the lock is dropped. Now there is no serialization, so the loops may race with teardown of file descriptor tables. Acquire the exclusive fdtable lock after releasing the final table reference to provide a barrier synchronizing with these loops. Reported by: pho Reviewed by: kib (previous version), mjg Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D27513	2020-12-09 14:05:08 +00:00
Mark Johnston	4c1c90ea95	Use refcount_load(9) to load fd table reference counts No functional change intended. Reviewed by: kib, mjg Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D27512	2020-12-09 14:04:54 +00:00
Kyle Evans	f1b18a668d	cpuset_set{affinity,domain}: do not allow empty masks cpuset_modify() would not currently catch this, because it only checks that the new mask is a subset of the root set and circumvents the EDEADLK check in cpuset_testupdate(). This change both directly validates the mask coming in since we can trivially detect an empty mask, and it updates cpuset_testupdate to catch stuff like this going forward by always ensuring we don't end up with an empty mask. The check_mask argument has been renamed because the 'check' verbiage does not imply to me that it's actually doing a different operation. We're either augmenting the existing mask, or we are replacing it entirely. Reported by: syzbot+4e3b1009de98d2fabcda@syzkaller.appspotmail.com Discussed with: andrew Reviewed by: andrew, markj MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D27511	2020-12-08 18:47:22 +00:00
Kyle Evans	b2780e8537	kern: cpuset: resolve race between cpuset_lookup/cpuset_rel The race plays out like so between threads A and B: 1. A ref's cpuset 10 2. B does a lookup of cpuset 10, grabs the cpuset lock and searches cpuset_ids 3. A rel's cpuset 10 and observes the last ref, waits on the cpuset lock while B is still searching and not yet ref'd 4. B ref's cpuset 10 and drops the cpuset lock 5. A proceeds to free the cpuset out from underneath B Resolve the race by only releasing the last reference under the cpuset lock. Thread A now picks up the spinlock and observes that the cpuset has been revived, returning immediately for B to deal with later. Reported by: syzbot+92dff413e201164c796b@syzkaller.appspotmail.com Reviewed by: markj MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D27498	2020-12-08 18:45:47 +00:00
Kyle Evans	9c83dab96c	kern: cpuset: plug a unr leak cpuset_rel_defer() is supposed to be functionally equivalent to cpuset_rel() but with anything that might sleep deferred until cpuset_rel_complete -- this setup is used specifically for cpuset_setproc. Add in the missing unr free to match cpuset_rel. This fixes a leak that was observed when I wrote a small userland application to try and debug another issue, which effectively did: cpuset(&newid); cpuset(&scratch); newid gets leaked when scratch is created; it's off the list, so there's no mechanism for anything else to relinquish it. A more realistic reproducer would likely be a process that inherits some cpuset that it's the only ref for, but it creates a new one to modify. Alternatively, administratively reassigning a process' cpuset that it's the last ref for will have the same effect. Discovered through D27498. MFC after: 1 week	2020-12-08 18:44:06 +00:00
Mateusz Guzik	8fcfd0e222	vfs: add cleanup on error missed in r368375 Noted by: jrtc27	2020-12-06 19:24:38 +00:00
Mateusz Guzik	60e2a0d9a4	vfs: factor buffer allocation/copyin out of namei	2020-12-06 04:59:24 +00:00
Mateusz Guzik	0c23d26230	vfs: keep bad ops on vnode reclaim They were only modified to accommodate a redundant assertion. This runs into problems as lockless lookup can still try to use the vnode and crash instead of getting an error. The bug was only present in kernels with INVARIANTS. Reported by: kevans	2020-12-05 05:56:23 +00:00
Konstantin Belousov	be2535b0a6	Add kern_ntp_adjtime(9). Reviewed by: brooks, cy Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D27471	2020-12-04 18:56:44 +00:00
Kyle Evans	34af05ead3	kern: soclose: don't sleep on SO_LINGER w/ timeout=0 This is a valid scenario that's handled in the various protocol layers where it makes sense (e.g., tcp_disconnect and sctp_disconnect). Given that it indicates we should immediately drop the connection, it makes little sense to sleep on it. This could lead to panics with INVARIANTS. On non-INVARIANTS kernels, this could result in the thread hanging until a signal interrupts it if the protocol does not mark the socket as disconnected for whatever reason. Reported by: syzbot+e625d92c1dd74e402c81@syzkaller.appspotmail.com Reviewed by: glebius, markj MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D27407	2020-12-04 04:39:48 +00:00
Mark Johnston	b957b18594	Always use 64-bit physical addresses for dump_avail[] in minidumps As of r365978, minidumps include a copy of dump_avail[]. This is an array of vm_paddr_t ranges. libkvm walks the array assuming that sizeof(vm_paddr_t) is equal to the platform "word size", but that's not correct on some platforms. For instance, i386 uses a 64-bit vm_paddr_t. Fix the problem by always dumping 64-bit addresses. On platforms where vm_paddr_t is 32 bits wide, namely arm and mips (sometimes), translate dump_avail[] to an array of uint64_t ranges. With this change, libkvm no longer needs to maintain a notion of the target word size, so get rid of it. This is a no-op on platforms where sizeof(vm_paddr_t) == 8. Reviewed by: alc, kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D27082	2020-12-03 17:12:31 +00:00
Oleksandr Tymoshenko	18ce865a4f	Add support for hw.physmem tunable for ARM/ARM64/RISC-V platforms The hw.physmem tunable allows limiting the amount of physical memory available to the system. It's handled in machdep files for x86 and PowerPC. This patch adds the required logic to the consolidated physmem management interface that is used by ARM, ARM64, and RISC-V. Submitted by: Klara, Inc. Reviewed by: mhorne Sponsored by: Ampere Computing Differential Revision: https://reviews.freebsd.org/D27152	2020-12-03 05:39:27 +00:00
Mateusz Guzik	10e64782ed	select: make sure there are no wakeup attempts after selfdfree returns Prior to the patch returning selfdfree could still be racing against doselwakeup which set sf_si = NULL and now locks stp to wake up the other thread. A sufficiently unlucky pair can end up going all the way down to freeing select-related structures before the lock/wakeup/unlock finishes. This started manifesting itself as crashes since select data started getting freed in r367714.	2020-12-02 00:48:15 +00:00
Konstantin Belousov	6814c2dac5	lio_listio(2): send signal even if number of jobs is zero. Right now, if lio registered zero jobs, the syscall frees the lio job structure, cleaning up the queued ksi. As a result, the realtime signal is dequeued and never delivered. Fix it by allowing sendsig() to copy ksi when job count is zero. PR: 220398 Reported and reviewed by: asomers Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D27421	2020-12-01 22:53:33 +00:00
Konstantin Belousov	2933165666	vfs_aio.c: style. Mostly re-wrap conditions to split after binary ops. Reviewed by: asomers Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D27421	2020-12-01 22:46:51 +00:00
Konstantin Belousov	5c5005ec20	vfs_aio.c: correct comment. Reviewed by: asomers Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D27421	2020-12-01 22:30:32 +00:00
Mark Johnston	dad22308a1	vmem: Revert r364744 A pair of bugs are believed to have caused the hangs described in the commit log message for r364744: 1. uma_reclaim() could trigger reclamation of the reserve of boundary tags used to avoid deadlock. This was fixed by r366840. 2. The loop in vmem_xalloc() would in some cases try to allocate more boundary tags than the expected upper bound of BT_MAXALLOC. The reserve is sized based on the value BT_MAXALLOC, so this behaviour could deplete the reserve without guaranteeing a successful allocation, resulting in a hang. This was fixed by r366838. PR: 248008 Tested by: rmacklem	2020-12-01 16:06:31 +00:00
Alexander V. Chernikov	8db8bebf1f	Move inner loop logic out of sysctl_sysctl_next_ls(). Refactor sysctl_sysctl_next_ls(): * Move huge inner loop out of sysctl_sysctl_next_ls() into a separate non-recursive function, returning the next step to be taken. * Update resulting node oid parts only on successful lookup * Make sysctl_sysctl_next_ls() return boolean success/failure instead of errno, slightly simplifying logic Reviewed by: freqlabs Differential Revision: https://reviews.freebsd.org/D27029	2020-11-30 21:59:52 +00:00
Toomas Soome	93b18e3730	vt: if loader did pass the font via metadata, use it The built-in 8x16 font may be way too small with large framebuffer resolutions; to improve readability, use the loader-provided font.	2020-11-30 11:45:47 +00:00
Toomas Soome	a4a10b37d4	Add VT driver for VBE framebuffer device Implement vt_vbefb to support Vesa Bios Extensions (VBE) framebuffer with VT. vt_vbefb is built based on vt_efifb and is assuming similar data for initialization, use MODINFOMD_VBE_FB to identify the structure vbe_fb in kernel metadata. struct vbe_fb, is populated by boot loader, and is passed to kernel via metadata payload. Differential Revision: https://reviews.freebsd.org/D27373	2020-11-30 08:22:40 +00:00
Matt Macy	2338da0373	Import kernel WireGuard support Data path largely shared with the OpenBSD implementation by Matt Dunwoodie <ncon@nconroy.net> Reviewed by: grehan@freebsd.org MFC after: 1 month Sponsored by: Rubicon LLC, (Netgate) Differential Revision: https://reviews.freebsd.org/D26137	2020-11-29 19:38:03 +00:00
Konstantin Belousov	a9d4fe977a	bio aio: Destroy ephemeral mapping before unwiring page. Apparently some architectures, like ppc in its hashed page tables variants, account mappings by pmap_qenter() in the response from pmap_is_page_mapped(). While there, eliminate useless userp variable. Noted and reviewed by: alc (previous version) Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D27409	2020-11-29 10:30:56 +00:00
Alexander Motin	83f6b50123	Remove alignment requirements for KVA buffer mapping. After r368124 pbuf_zone has extra page to handle this particular case.	2020-11-29 01:30:17 +00:00
Konstantin Belousov	cd85379104	Make MAXPHYS tunable. Bump MAXPHYS to 1M. Replace MAXPHYS by runtime variable maxphys. It is initialized from MAXPHYS by default, but can be also adjusted with the tunable kern.maxphys. Make b_pages[] array in struct buf flexible. Size b_pages[] for buffer cache buffers exactly to atop(maxbcachebuf) (currently it is sized to atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1. The +1 for pbufs allows several pbuf consumers, among them vmapbuf(), to use unaligned buffers still sized to maxphys, esp. when such buffers come from userspace (). Overall, we save significant amount of otherwise wasted memory in b_pages[] for buffer cache buffers, while bumping MAXPHYS to desired high value. Eliminate all direct uses of the MAXPHYS constant in kernel and driver sources, except a place which initializes maxphys. Some random (and arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted straight. Some drivers, which use MAXPHYS to size embedded structures, get private MAXPHYS-like constant; their conversion is out of scope for this work. Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs, dev/siis, were either submitted by, or based on changes by mav. Suggested by: mav () Reviewed by: imp, mav, mckusick, scottl (intermediate versions) Tested by: pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D27225	2020-11-28 12:12:51 +00:00
Kyle Evans	e07e3fa3c9	kern: cpuset: drop the lock to allocate domainsets Restructure the loop a little bit to make it a little more clear how it really operates: we never allocate any domains at the beginning of the first iteration, and it will run until we've satisfied the amount we need or we encounter an error. The lock is now taken outside of the loop to make stuff inside the loop easier to evaluate w.r.t. locking. This fixes it to not try and allocate any domains for the freelist under the spinlock, which would have happened before if we needed any new domains. Reported by: syzbot+6743fa07b9b7528dc561@syzkaller.appspotmail.com Reviewed by: markj MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D27371	2020-11-28 01:21:11 +00:00
Mark Johnston	0c56925bc2	callout(9): Remove some leftover APM BIOS support This code is obsolete since r366546. Reviewed by: imp Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D27267	2020-11-27 20:46:02 +00:00
Konstantin Belousov	99c66d3acf	vn_read_from_obj(): fix handling of doomed vnodes. There is no reason why vp->v_object cannot be NULL. If it is, it's fine, handle it by delegating to VOP_READ(). Tested by: pho Reviewed by: markj, mjg Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D27327	2020-11-26 18:13:33 +00:00
Konstantin Belousov	164438a7b9	More careful handling of the mount failure. - VFS_UNMOUNT() requires vn_start_write() around it []. - call VFS_PURGE() before unmount. - do not destroy mp if cleanup unmount did not succeed. - set MNTK_UNMOUNT, and indicate forced unmount with MNTK_UNMOUNTF for VFS_UNMOUNT() in cleanup. PR: 251320 [] Reported by: Tong Zhang <ztong0001@gmail.com> Reviewed by: markj, mjg Discussed with: rmacklem Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D27327	2020-11-26 18:08:42 +00:00
Konstantin Belousov	3b1f974bfb	Make max ticks for pause in vn_lock_pair() adjustable at runtime. Reduce default value from hz / 10 to hz / 100. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation	2020-11-26 18:00:26 +00:00
Mateusz Guzik	b83e94be53	thread: staticize thread_reap and move td_allocdomain thread_init is a much better fit as the value is constant after initialization.	2020-11-26 06:59:27 +00:00
Mateusz Guzik	2e51c2bfd1	pipe: follow up cleanup to previous The committed patch was incomplete. - add back missing goto retry, noted by jhb - 'if (error)' -> 'if (error != 0)' - consistently do: if (error != 0) break; continue; instead of: if (error != 0) break; else continue; This adds some 'continue' uses which are not needed, but line up with the rest of pipe_write.	2020-11-25 22:53:21 +00:00
Mateusz Guzik	c8df8543fd	pipe: drop spurious pipeunlock/pipelock cycle on write	2020-11-25 21:41:23 +00:00
Kyle Evans	d431dea5ac	kern: cpuset: properly rebase when attaching to a jail The current logic is a fine choice for a system administrator modifying process cpusets or a process creating a new cpuset(2), but not ideal for processes attaching to a jail. Currently, when a process attaches to a jail, it does exactly what any other process does and loses any mask it might have applied in the process of doing so because cpuset_setproc() is entirely based around the assumption that non-anonymous cpusets in the process can be replaced with the new parent set. This approach slightly improves the jail attach integration by modifying cpuset_setproc() callers to indicate if they should rebase their cpuset to the indicated set or not (i.e. cpuset_setproc_update_set). If we're rebasing and the process currently has a cpuset assigned that is not the containing jail's root set, then we will now create a new base set for it hanging off the jail's root with the existing mask applied instead of using the jail's root set as the new base set. Note that the common case will be that the process doesn't have a cpuset within the jail root, but the system root can freely assign a cpuset from a jail to a process outside of the jail with no restriction. We assume that that may have happened or that it could happen due to a race when we drop the proc lock, so we must recheck both within the loop to gather up sufficient freed cpusets and after the loop. To recap, here's how it worked before in all cases: 0 4 <-- jail 0 4 <-- jail / process \| \| 1 -> 1 \| 3 <-- process Here's how it works now: 0 4 <-- jail 0 4 <-- jail \| \| \| 1 -> 1 5 <-- process \| 3 <-- process or 0 4 <-- jail 0 4 <-- jail / process \| \| 1 <-- process -> 1 More importantly, in both cases, the attaching process still retains the mask it had prior to attaching or the attach fails with EDEADLK if it's left with no CPUs to run on or the domain policy is incompatible. 
The author of this patch considers this almost a security feature, because a MAC policy could grant PRIV_JAIL_ATTACH to an unprivileged user that's restricted to some subset of available CPUs the ability to attach to a jail, which might lift the user's restrictions if they attach to a jail with a wider mask. In most cases, it's anticipated that admins will use this to be able to, for example, `cpuset -c -l 1 jail -c path=/ command=/long/running/cmd`, and avoid the need for contortions to spawn a command inside a jail with a more limited cpuset than the jail. Reviewed by: jamie MFC after: 1 month (maybe) Differential Revision: https://reviews.freebsd.org/D27298	2020-11-25 03:14:25 +00:00
Kyle Evans	30b7c6f977	kern: cpuset: rename _cpuset_create() to cpuset_init() cpuset_init() is a better descriptor for what the function actually does. The name was previously taken by a sysinit that set up cpuset_zero's mask from all_cpus; it was removed in r331698 before stable/12 branched. A comment referencing the removed sysinit has now also been removed, since the setup previously done was moved into cpuset_thread0(). Suggested by: markj MFC after: 1 week	2020-11-25 02:12:24 +00:00
Kyle Evans	29d04ea8c3	kern: cpuset: allow cpuset_create() to take an allocated setp Currently, it must always allocate a new set to be used for passing to _cpuset_create, but it doesn't have to. This is purely kern_cpuset.c internal and it's sparsely used, so just change it to use setp if it's not-NULL and modify the two consumers to pass in the address of a NULL cpuset. This paves the way for consumers that want the unr allocation without the possibility of sleeping as long as they've done their due diligence to ensure that the mask will properly apply atop the supplied parent (i.e. avoiding the free_unr() in the last failure path). Reviewed by: jamie, markj MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D27297	2020-11-25 01:42:32 +00:00
Kyle Evans	c7ef3490e2	kern: never restart syscalls calling closefp(), e.g. close(2) All paths leading into closefp() will either replace or remove the fd from the filedesc table, and closefp() will call fo_close methods that can and do currently sleep without regard for the possibility of an ERESTART. This can be dangerous in multithreaded applications as another thread could have opened another file in its place that is subsequently operated on upon restart. The following are seemingly the only ones that will pass back ERESTART in-tree: - sockets (SO_LINGER) - fusefs - nfsclient Reviewed by: jilles, kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D27310	2020-11-25 01:08:57 +00:00
Cy Schubert	e5a307c6ac	Fix a typo in a comment. MFC after: 3 days	2020-11-24 06:42:32 +00:00
Mateusz Guzik	f90d57b808	locks: push lock_delay_arg_init calls down Minor cleanup to skip doing them when recursing on locks and so that they can act on found lock value if need be.	2020-11-24 03:49:37 +00:00
Mateusz Guzik	094c148b7a	sx: drop spurious volatile keyword	2020-11-24 03:48:44 +00:00
Mateusz Guzik	598f2b8116	dtrace: stop using eventhandlers for the part compiled into the kernel Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D27311	2020-11-23 18:27:21 +00:00
Mateusz Guzik	a9568cd2bc	thread: stash domain id to work around vtophys problems on ppc64 Adding to zombie list can be performed by idle threads, which on ppc64 leads to panics as it requires a sleepable lock. Reported by: alfredo Reviewed by: kib, markj Fixes: r367842 ("thread: numa-aware zombie reaping") Differential Revision: https://reviews.freebsd.org/D27288	2020-11-23 18:26:47 +00:00
Konstantin Belousov	87a9b18d22	Provide ABI modules hooks for process exec/exit and thread exit. Exec and exit are same as corresponding eventhandler hooks. Thread exit hook is called somewhat earlier, while thread is still owned by the process and enough context is available. Note that the process lock is owned when the hook is called. Reviewed by: markj Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D27309	2020-11-23 17:29:25 +00:00
Edward Tomasz Napierala	9c8c797c1a	Remove the 'wantparent' variable, unused since r145004. Reviewed by: kib MFC after: 2 weeks Sponsored by: NetApp, Inc. Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D27193	2020-11-23 12:47:23 +00:00
Kyle Evans	dac521ebcf	cpuset_setproc: use the appropriate parent for new anonymous sets As far as I can tell, this has been the case since initially committed in 2008. cpuset_setproc is the executor of cpuset reassignment; note this excerpt from the description: * 1) Set is non-null. This reparents all anonymous sets to the provided * set and replaces all non-anonymous td_cpusets with the provided set. However, reviewing cpuset_setproc_setthread() for some jail related work unearthed the error: if tdset was not anonymous, we were replacing it with `set`. If it was anonymous, then we'd rebase it onto `set` (i.e. copy the thread's mask over and AND it with `set`) but give the new anonymous set the original tdset as the parent (i.e. the base of the set we're supposed to be leaving behind). The primary visible consequences were that: 1.) cpuset_getid() following such assignment returns the wrong result, the setid that we left behind rather than the one we joined. 2.) When a process attached to the jail, the base set of any anonymous threads was a set outside of the jail. This was initially bundled in D27298, but it's a minor fix that's fairly easy to verify the correctness of. A test is included in D27307 ("badparent"), which demonstrates the issue with, effectively: osetid = cpuset_getid() newsetid = cpuset() cpuset_setaffinity(thread) cpuset_setid(osetid) cpuset_getid(thread) -> observe that it matches newsetid instead of osetid. MFC after: 1 week	2020-11-23 02:49:53 +00:00
Kyle Evans	60e60e73fd	freebsd32: take the _umtx_op struct definitions back Providing these in freebsd32.h facilitates local testing/measuring of the structs rather than forcing one to locally recreate them. Sanity checking offsets/sizes remains in kern_umtx.c where these are typically used.	2020-11-23 00:58:14 +00:00
Kyle Evans	f96078b8fe	kern: dup: do not assume oldfde is valid oldfde may be invalidated if the table has grown due to the operation that we're performing, either via fdalloc() or a direct fdgrowtable_exp(). This was technically OK before rS367927 because the old table remained valid until the filedesc became unused, but now it may be freed immediately if it's an unshared table in a single-threaded process, so it is no longer a good assumption to make. This fixes dup/dup2 invocations that grow the file table; in the initial report, it manifested as a kernel panic in devel/gmake's configure script. Reported by: Guy Yur <guyyur gmail com> Reviewed by: rew Differential Revision: https://reviews.freebsd.org/D27319	2020-11-23 00:33:06 +00:00
Kyle Evans	e0cb5b2a77	[2/2] _umtx_op: introduce 32-bit/i386 flags for operations This patch takes advantage of the consolidation that happened to provide two flags that can be used with the native _umtx_op(2): UMTX_OP__32BIT and UMTX_OP__I386. UMTX_OP__32BIT indicates that we are being provided with 32-bit structures. Note that this flag alone indicates a 64bit time_t, since this is the majority case. UMTX_OP__I386 has been provided so that we can emulate i386 as well, regardless of whether the host is amd64 or not. Both imply a different set of copyops in sysumtx_op. freebsd32__umtx_op simply ignores the flags, since it's already doing a 32-bit operation and it's unlikely we'll be running an emulator under compat32. Future work could consider it, but the author sees little benefit. This will be used by qemu-bsd-user to pass on all _umtx_op calls to the native interface as long as the host/target endianness matches, effectively eliminating most if not all of the remaining unresolved deadlocks for most. This version changed a fair amount from what was under review, mostly in response to refactoring of the prereq reorganization and battle-testing it with qemu-bsd-user. The main changes are as follows: 1.) The i386 flag got renamed to omit '32BIT' since this is redundant. 2.) The flags are now properly handled on 32-bit platforms to emulate other 32-bit platforms. 3.) Robust list handling was fixed, and the 32-bit functionality that was previously gated by COMPAT_FREEBSD32 is now unconditional. 4.) Robust list handling was also improved, including the error reported when a process has already registered 32-bit ABI lists and also detecting if native robust lists have already been registered. Both scenarios now return EBUSY rather than EINVAL, because the input is technically valid but we're too busy with another ABI's lists. libsysdecode/kdump/truss support will go into review soon-ish, along with the associated manpage update. 
Reviewed by: kib (earlier version) MFC after: 3 weeks	2020-11-22 05:47:45 +00:00
Kyle Evans	15eaec6a5c	_umtx_op: move compat32 definitions back in These are reasonably compact, and a future commit will blur the compat32 lines by supporting 32-bit operations with the native _umtx_op.	2020-11-22 05:34:51 +00:00
Robert Wing	3c85ca21d1	fd: free old file descriptor tables when not shared During the life of a process, new file descriptor tables may be allocated. When a new table is allocated, the old table is placed in a free list and held onto until all processes referencing them exit. When a new file descriptor table is allocated, the old file descriptor table can be freed when the current process has a single-thread and the file descriptor table is not being shared with any other processes. Reviewed by: kevans Approved by: kevans (mentor) Differential Revision: https://reviews.freebsd.org/D18617	2020-11-22 05:00:28 +00:00
Konstantin Belousov	e68c619144	Stop using eventhandlers for itimers subsystem exec and exit hooks. While there, do some minor cleanup for kclocks. They are only registered from kern_time.c, make registration function static. Remove event hooks, they are not used by both registered kclocks. Add some consts. Perhaps we can stop registering kclocks at all and statically initialize them. Reviewed by: mjg Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D27305	2020-11-21 21:43:36 +00:00
Konstantin Belousov	5a2a4551f5	Remove unused prototype. Missed part of r367918. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2020-11-21 10:58:19 +00:00
Konstantin Belousov	74a093eb98	Stop using eventhandler to invoke umtx_exec hook. There is no point in dynamic registration, umtx hook is there always. Reviewed by: mjg Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D27303	2020-11-21 10:32:40 +00:00
Kirk McKusick	e75f0f2b48	Only attempt a VOP_UNLOCK() when the vn_lock() has been successful. No MFC as this code is not present in 12-stable. Reported by: Peter Holm Reviewed by: Mateusz Guzik Tested by: Peter Holm Sponsored by: Netflix	2020-11-20 20:22:01 +00:00
Michal Meloun	d9de80d614	Also pass interrupt binding request to non-root interrupt controllers. There are message based controllers that can bind interrupts even if they are not implemented as root controllers (such as the ITS subblock of GIC). MFC after: 3 weeks	2020-11-20 09:05:36 +00:00
Mateusz Guzik	f9fe7b28bc	pipe: thundering herd problem in pipelock All reads and writes are serialized with a hand-rolled lock, but unlocking it always wakes up all waiters. Existing flag fields get resized to make room for introduction of waiter counter without growing the struct. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D27273	2020-11-19 19:25:47 +00:00
Mark Johnston	a33fef5e25	callout(9): Fix a race between CPU migration and callout_drain() Suppose a running callout re-arms itself, and before the callout finishes running another CPU calls callout_drain() and goes to sleep. softclock_call_cc() will wake up the draining thread, which may not run immediately if there is a lot of CPU load. Furthermore, the callout is still in the callout wheel so it can continue to run and re-arm itself. Then, suppose that the callout migrates to another CPU before the draining thread gets a chance to run. The draining thread is in this loop in _callout_stop_safe(): while (cc_exec_curr(cc) == c) { CC_UNLOCK(cc); sleep(); CC_LOCK(cc); } but after the migration, cc points to the wrong CPU's callout state. Then the draining thread goes off and removes the callout from the wheel, but does so using the wrong lock and per-CPU callout state. Fix the problem by doing a re-lookup of the callout CPU after sleeping. Reported by: syzbot+79569cd4d76636b2cc1c@syzkaller.appspotmail.com Reported by: syzbot+1b27e0237aa22d8adffa@syzkaller.appspotmail.com Reported by: syzbot+e21aa5b85a9aff90ef3e@syzkaller.appspotmail.com Reviewed by: emaste, hselasky Tested by: pho MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D27266	2020-11-19 18:37:28 +00:00
Mitchell Horne	c8a96cdcd9	Add an option for entering KDB on recursive panics There are many cases where one would choose to avoid entering the debugger on a normal panic, opting instead to reboot and possibly save a kernel dump. However, recursive kernel panics are an unusual case that might warrant attention from a human, so provide a secondary tunable, debug.debugger_on_recursive_panic, to allow entering the debugger only when this occurs. For simplicity in maintaining existing behaviour, the tunable defaults to zero. Reviewed by: cem, markj Sponsored by: NetApp, Inc. Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D27271	2020-11-19 18:03:40 +00:00
Mateusz Guzik	d116b9f1ad	thread: numa-aware zombie reaping The current global list is a significant problem, in particular induces a lot of cross-domain thread frees. When running poudriere on a 2 domain box about half of all frees were of that nature. Patch below introduces per-domain thread data containing zombie lists and domain-aware reaping. By default it only reaps from the current domain, only reaping from others if there is free TID shortage. A dedicated callout is introduced to reap lingering threads if there happens to be no activity. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D27185	2020-11-19 10:00:48 +00:00
Mateusz Guzik	b8cb628534	pipe: tidy up pipelock	2020-11-19 08:16:45 +00:00
Mateusz Guzik	89744405e6	pipe: allow for lockless pipe_stat Pipes get stat'ed all the time and this avoidably contributed to contention. The pipe lock is only held to accommodate MAC and to check the type. Since normally there is no probe for pipe stat depessimize this by having the flag. The pipe_state field gets modified with locks held all the time and it's not feasible to convert them to use atomic store. Move the type flag away to a separate variable as a simple cleanup and to provide a stable field to read. Use short for both fields to avoid growing the struct. While here short-circuit MAC for pipe_poll as well.	2020-11-19 06:30:25 +00:00
Mateusz Guzik	2f5b0b48ac	cred: fix minor nits in r367695 Noted by: jhb	2020-11-19 04:28:39 +00:00
Mateusz Guzik	c48f897bbe	smp: fix smp_rendezvous_cpus_retry usage before smp starts Since none of the other CPUs are running there is nobody to clear their entries and the routine spins indefinitely.	2020-11-19 04:27:51 +00:00
Mark Johnston	a28c28e6ef	Remove NO_EVENTTIMERS support The arm configs that required it have been removed from the tree. Removing this option makes the callout code easier to read and discourages developers from adding new configs without eventtimer drivers. Reviewed by: ian, imp, mav Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D27270	2020-11-19 02:50:48 +00:00
Mariusz Zaborski	f488d5b797	Add CTLFLAG_MPSAFE to the suser_enabled sysctl. Pointed out by: mjg	2020-11-18 21:26:14 +00:00
Mariusz Zaborski	05e1e482c7	jail: introduce per jail suser_enabled setting The suser_enabled sysctl allows removing privileged rights from uid 0. This change introduces a per-jail setting which allows making root a normal user. Reviewed by: jamie Previous version reviewed by: kevans, emaste, markj, me_igalic.co Discussed with: pjd Differential Revision: https://reviews.freebsd.org/D27128	2020-11-18 21:07:08 +00:00
Mariusz Zaborski	21fe9441e1	Fix style nits.	2020-11-18 20:59:58 +00:00
John Baldwin	5335f6434b	Fix a few nits in vn_printf(). - Mask out recently added VV_* bits to avoid printing them twice. - Keep VI_LOCKed on the same line as the rest of the flags. Reviewed by: kib Obtained from: CheriBSD Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D27261	2020-11-18 16:21:37 +00:00
Kyle Evans	27a9392d54	_umtx_op: fix robust lists after r367744 A copy-pasto left us copying in 24-bytes at the address of the rb pointer instead of the intended target. Reported by: sigsys@gmail.com Sighing: kevans	2020-11-18 03:30:31 +00:00
Conrad Meyer	f8f74aaa84	linux(4) clone(2): Correctly handle CLONE_FS and CLONE_FILES The two flags are distinct and it is impossible to correctly handle clone(2) without the assistance of fork1(). This change depends on the pwddesc split introduced in r367777. I've added a fork_req flag, FR2_SHARE_PATHS, which indicates that p_pd should be treated the opposite way p_fd is (based on RFFDG flag). This is a little ugly, but the benefit is that existing RFFDG API is preserved. Holding FR2_SHARE_PATHS disabled, RFFDG indicates both p_fd and p_pd are copied, while !RFFDG indicates both should be cloned. In Chrome, clone(2) is used with CLONE_FS, without CLONE_FILES, and expects independent fd tables. The previous conflation of CLONE_FS and CLONE_FILES was introduced in r163371 (2006). Discussed with: markj, trasz (earlier version) Differential Revision: https://reviews.freebsd.org/D27016	2020-11-17 21:20:11 +00:00
Conrad Meyer	85078b8573	Split out cwd/root/jail, cmask state from filedesc table No functional change intended. Tracking these structures separately for each proc enables future work to correctly emulate clone(2) in linux(4). __FreeBSD_version is bumped (to 1300130) for consumption by, e.g., lsof. Reviewed by: kib Discussed with: markj, mjg Differential Revision: https://reviews.freebsd.org/D27037	2020-11-17 21:14:13 +00:00
Conrad Meyer	ede4af47ae	unix(4): Enhance LOCAL_CREDS_PERSISTENT ABI As this ABI is still fresh (r367287), let's correct some mistakes now: - Version the structure to allow for future changes - Include sender's pid in control message structure - Use a distinct control message type from the cmsgcred / sockcred mess Discussed with: kib, markj, trasz Differential Revision: https://reviews.freebsd.org/D27084	2020-11-17 20:01:21 +00:00
Conrad Meyer	de774e422e	linux(4): Implement name_to_handle_at(), open_by_handle_at() They are similar to our getfhat(2) and fhopen(2) syscalls. Differential Revision: https://reviews.freebsd.org/D27111	2020-11-17 19:51:47 +00:00
Kyle Evans	bd4bcd14e3	Fix !COMPAT_FREEBSD32 kernel build One of the last shifts inadvertently moved these static assertions out of a COMPAT_FREEBSD32 block, which the relevant definitions are limited to. Fix it. Pointy hat: kevans	2020-11-17 04:22:10 +00:00
Kyle Evans	63ecb272a0	umtx_op: reduce redundancy required for compat32 All of the compat32 variants are substantially the same, save for copyin/copyout (mostly). Apply the same kind of technique used with kevent here by having the syscall routines supply a umtx_copyops describing the operations needed. umtx_copyops carries the bare minimum needed- size of timespec and _umtx_time are used for determining if copyout is needed in the sem2_wait case. Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D27222	2020-11-17 03:36:58 +00:00
Kyle Evans	4be0a1b587	_umtx_op: fix a compat32 bug in UMTX_OP_NWAKE_PRIVATE Specifically, if we're waking up some value n > BATCH_SIZE, then the copyin(9) is wrong on the second iteration due to upp being the wrong type. upp is currently a uint32_t **, so upp + pos advances it by twice as many elements as it should (host pointer size vs. compat32 pointer size). Fix it by just making upp a uint32_t *; it's still technically a double pointer, but the distinction doesn't matter all that much here since we're just doing arithmetic on it. Add a test case that demonstrates the problem, placed with the libthr tests since one messing with _umtx_op should be running these tests. Running under compat32, the new test case will hang as threads after the first 128 get missed in the wake. It's not immediately clear how to hit it in practice, since pthread_cond_broadcast() uses a smaller (sleepq batch?) size observed to be around ~50 -- I did not spend much time digging into it. The uintptr_t change makes no functional difference, but I've tossed it in since it's more accurate (semantically). Reported by: Andrew Gierth (andrew_tao173.riddles.org.uk, inspection) Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D27231	2020-11-17 03:34:01 +00:00
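The stride mismatch described in 4be0a1b587 can be illustrated in isolation. This is a minimal userland sketch, not the kernel code: the function names are hypothetical, a 64-bit host pointer is modelled as `uint64_t`, and a compat32 user pointer as `uint32_t`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte offset of element `pos` in an array of `elem_size`-byte elements. */
size_t
byte_offset(size_t pos, size_t elem_size)
{
	return (pos * elem_size);
}

/* Offset computed with the host pointer width (the stride the bug used). */
size_t
host_stride_offset(size_t pos)
{
	return (byte_offset(pos, sizeof(uint64_t)));
}

/* Offset computed with the 32-bit user pointer width (the intended stride). */
size_t
compat32_stride_offset(size_t pos)
{
	return (byte_offset(pos, sizeof(uint32_t)));
}
```

With `pos` equal to the 128 entries the message mentions, the host stride lands twice as far into the user array as the compat32 stride, which is consistent with the second batch being read from the wrong address.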
Konstantin Belousov	cb596eea82	vmem: trivial warning and style fixes. Add __unused to some args. Change type of the iterator variables to match loop control. Remove excessive {}. Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D27220	2020-11-17 02:18:34 +00:00
Mateusz Guzik	1a7bb89629	cpuset: refcount-clean	2020-11-17 00:04:05 +00:00
Mateusz Guzik	89deca0a33	malloc: make malloc_large closer to standalone This moves entire large alloc handling out of all consumers, apart from deciding to go there. This is a step towards creating a fast path. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D27198	2020-11-16 17:56:58 +00:00
Mateusz Guzik	19d3e47dca	select: call seltdfini on process and thread exit Since thread_zone is marked NOFREE the thread_fini callback is never executed, meaning memory allocated by seltdinit is never released. Adding the call to thread_dtor is not sufficient as exiting processes cache the main thread.	2020-11-16 03:12:21 +00:00
Mateusz Guzik	31b2ac4b5a	select: replace reference counting with memory barriers in selfd Refcounting was added to combat a race between selfdfree and doselwakeup, but it adds avoidable overhead. selfdfree detects it can free the object by ->sf_si == NULL, thus we can ensure that the condition only holds after all accesses are completed.	2020-11-16 03:09:18 +00:00
Mateusz Guzik	b77594bbbf	sched: fix an incorrect comparison in sched_lend_user_prio_cond Compare with sched_lend_user_prio.	2020-11-15 01:54:44 +00:00
Mateusz Guzik	f34a2f56c3	thread: batch credential freeing	2020-11-14 19:22:02 +00:00
Mateusz Guzik	fb8ab68084	thread: batch resource limit free calls	2020-11-14 19:21:46 +00:00
Mateusz Guzik	5ef7b7a0f3	thread: rework tid batch to use helpers	2020-11-14 19:20:58 +00:00
Mateusz Guzik	d1ca25be49	thread: pad tid lock On a kernel with other changes this bumps 104-way thread creation/destruction from 0.96 mln ops/s to 1.1 mln ops/s.	2020-11-14 19:19:27 +00:00
Mateusz Guzik	9b9bb9ffa5	malloc: retire MALLOC_PROFILE The global array has prohibitive performance impact on multicore systems. The same data (and more) can be obtained with dtrace. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D27199	2020-11-13 19:22:53 +00:00
Konstantin Belousov	441eb16a95	Allow some VOPs to return ERELOOKUP to indicate VFS operation restart at top level. Restart syscalls and some sync operations when filesystem indicated ERELOOKUP condition, mostly for VOPs operating on metadata. In particular, lookup results cached in the inode/v_data are no longer valid and need recalculating. Right now this should be a nop. Assert that ERELOOKUP is caught everywhere and not returned to userspace, by asserting that td_errno != ERELOOKUP on syscall return path. In collaboration with: pho Reviewed by: mckusick (previous version), markj Tested by: markj (syzkaller), pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D26136	2020-11-13 09:42:32 +00:00
Konstantin Belousov	7cde2ec4fd	Implement vn_lock_pair(). In collaboration with: pho Reviewed by: mckusick (previous version), markj (previous version) Tested by: markj (syzkaller), pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D26136	2020-11-13 09:31:57 +00:00
Mateusz Guzik	9aa6d792b5	malloc: retire malloc_last_fail The routine does not serve any practical purpose. Memory can be allocated in many other ways and most consumers pass the M_WAITOK flag, making malloc not fail in the first place. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D27143	2020-11-12 20:22:58 +00:00
Mateusz Guzik	62dbc992ad	thread: move nthread management out of tid_alloc While this adds more work single-threaded, it also enables SMP-related speed ups.	2020-11-12 00:29:23 +00:00
Kyle Evans	38033780a3	umtx: drop incorrect timespec32 definition This works for amd64, but none others -- drop it, because we already have a proper definition in sys/compat/freebsd32/freebsd32.h that correctly uses time32_t. MFC after: 1 week	2020-11-11 22:35:23 +00:00
Mateusz Guzik	755341df4f	thread: batch tid_free calls in thread_reap This eliminates the highly pessimal pattern of relocking from multiple CPUs in quick succession. Note this is still globally serialized.	2020-11-11 18:45:06 +00:00
Mateusz Guzik	c5315f5196	thread: lockless zombie list manipulation This gets rid of the most contended spinlock seen when creating/destroying threads in a loop. (modulo kstack) Tested by: alfredo (ppc64), bdragon (ppc64)	2020-11-11 18:43:51 +00:00
Mark Johnston	f52979098d	Fix a pair of races in SIGIO registration First, funsetownlst() looks at the first element of the list to see whether it's processing a process or a process group list. Then it acquires the global sigio lock and processes the list. However, nothing prevents the first sigio tracker from being freed by a concurrent funsetown() before the sigio lock is acquired. Fix this by acquiring the global sigio lock immediately after checking whether the list is empty. Callers of funsetownlst() ensure that new sigio trackers cannot be added concurrently. Second, fsetown() uses funsetown() to remove an existing sigio structure from a file object. However, funsetown() uses a racy check to avoid the sigio lock, so two threads may call fsetown() on the same file object, both observe that no sigio tracker is present, and enqueue two sigio trackers for the same file object. However, if the file object is destroyed, funsetown() will only remove one sigio tracker, and funsetownlst() may later trigger a use-after-free when it clears the file object reference for each entry in the list. Fix this by introducing funsetown_locked(), which avoids the racy check. Reviewed by: kib Reported by: pho Tested by: pho MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D27157	2020-11-11 13:44:27 +00:00
Mateusz Guzik	26007fe37c	thread: add more fine-grained tidhash locking Note this still does not scale but is enough to move it out of the way for the foreseeable future. In particular a trivial benchmark spawning/killing threads stops contesting on tidhash.	2020-11-11 08:51:04 +00:00
Mateusz Guzik	aae3547be3	thread: rework tidhash vs proc lock interaction Apart from minor clean up this gets rid of proc unlock/lock cycle on thread exit to work around LOR against tidhash lock.	2020-11-11 08:50:04 +00:00
Mateusz Guzik	cf31cadeb6	thread: fix thread0 tid allocation Startup code hardcodes the value instead of allocating it. The first spawned thread would then be a duplicate. Pointy hat: mjg	2020-11-11 08:48:43 +00:00
Mateusz Guzik	40aad3e477	thread: tidy up r367543 "locked" variable is spurious in the committed version.	2020-11-10 21:29:10 +00:00
Mateusz Guzik	5c5ca843b7	Allow rtprio_thread to operate on threads of any process This in particular unbreaks rtkit. The limitation was a leftover of previous state, to quote a comment: /* * Though lwpid is unique, only current process is supported * since there is no efficient way to look up a LWP yet. */ Long since then a global tid hash was introduced to remedy the problem. Permission checks still apply. Submitted by: greg_unrelenting.technology (Greg V) Differential Revision: https://reviews.freebsd.org/D27158	2020-11-10 18:10:50 +00:00
Mateusz Guzik	5c100123a3	thread: retire thread_find tdfind should be used instead.	2020-11-10 01:57:48 +00:00
Mateusz Guzik	f837888a3e	thread: use tdfind in sysctl_kern_proc_kstack This trades linear scans for locked lookup, but more importantly removes the only consumer of thread_find.	2020-11-10 01:57:19 +00:00
Mateusz Guzik	94275e3e69	threads: remove the unused TID_BUFFER_SIZE macro	2020-11-10 01:31:06 +00:00
Mateusz Guzik	934e7e5ec9	thread: adds newer bits for r367537 The committed patch was an older version.	2020-11-10 01:13:58 +00:00
Mateusz Guzik	35bb59edc5	threads: reimplement tid allocation on top of a bitmap There are workloads with very bursty tid allocation and since unr tries very hard to have small-sized bitmaps it keeps reallocating memory. Just doing buildkernel gives almost 150k calls to free coming from unr. This also gets rid of the hack which tried to postpone TID reuse. Reviewed by: kib, markj Tested by: pho Differential Revision: https://reviews.freebsd.org/D27101	2020-11-09 23:05:28 +00:00
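The idea behind 35bb59edc5, a fixed-size bitmap in place of a structure that keeps reallocating, can be sketched as follows. This is a simplified userland illustration under stated assumptions: `TID_MAX`, `tid_alloc()` and `tid_free()` here are hypothetical names and sizes, locking is omitted, and the lowest-free-bit policy is chosen for brevity rather than taken from the commit.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

#define	TID_MAX		256	/* hypothetical pool size */
#define	WORD_BITS	(sizeof(uint64_t) * CHAR_BIT)
#define	TID_WORDS	(TID_MAX / WORD_BITS)

/* One bit per ID; a set bit means the ID is in use. Storage never grows. */
static uint64_t tid_bitmap[TID_WORDS];

/* Return the lowest free ID and mark it used, or -1 if none is left. */
int
tid_alloc(void)
{
	for (size_t w = 0; w < TID_WORDS; w++) {
		if (tid_bitmap[w] == UINT64_MAX)
			continue;	/* every ID in this word is taken */
		for (size_t b = 0; b < WORD_BITS; b++) {
			if ((tid_bitmap[w] & (UINT64_C(1) << b)) == 0) {
				tid_bitmap[w] |= UINT64_C(1) << b;
				return ((int)(w * WORD_BITS + b));
			}
		}
	}
	return (-1);
}

/* Mark a previously allocated ID as free again. */
void
tid_free(int tid)
{
	assert(tid >= 0 && tid < TID_MAX);
	size_t w = (size_t)tid / WORD_BITS;
	size_t b = (size_t)tid % WORD_BITS;

	assert((tid_bitmap[w] & (UINT64_C(1) << b)) != 0);
	tid_bitmap[w] &= ~(UINT64_C(1) << b);
}
```

Because the bitmap is sized once from a known upper bound, bursts of allocation and release only flip bits and never call into the memory allocator.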
Mateusz Guzik	1bd3cf5de5	threads: introduce a limit for total number The intent is to replace the current id allocation method and a known upper bound will be useful. Reviewed by: kib (previous version), markj (previous version) Tested by: pho Differential Revision: https://reviews.freebsd.org/D27100	2020-11-09 23:04:30 +00:00
Mateusz Guzik	f6dd1aefb7	vfs: group mount per-cpu vars into one struct While here move frequently read stuff into the same cacheline. This shrinks struct mount by 64 bytes. Tested by: pho	2020-11-09 23:02:13 +00:00
Mateusz Guzik	f0c90a0931	malloc: provide 384 byte zone Total page count after buildworld on ZFS for 384 (if present) and 512 zones: before: 29713 after: 25946 per-zone page use: vm.uma.malloc_384.keg.domain.1.pages: 11621 vm.uma.malloc_384.keg.domain.0.pages: 11597 vm.uma.malloc_512.keg.domain.1.pages: 1280 vm.uma.malloc_512.keg.domain.0.pages: 1448 Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D27145	2020-11-09 22:59:41 +00:00
Mateusz Guzik	8e6526e966	malloc: retire mt_stats_zone in favor of pcpu_zone_64 Reviewed by: markj, imp Differential Revision: https://reviews.freebsd.org/D27142	2020-11-09 22:58:29 +00:00
Mateusz Guzik	3a440a421d	Add more per-cpu zones. This covers powers of 2 up to 64. Example pending user is ZFS.	2020-11-09 00:34:23 +00:00
Mateusz Guzik	523d66730c	procdesc: convert the zone to a malloc type The object is 128 bytes in size.	2020-11-09 00:05:21 +00:00
Mateusz Guzik	e90afaa015	kqueue: save space by using only one func pointer for assertions	2020-11-09 00:04:35 +00:00
Edward Tomasz Napierala	a1bd83fede	Move syscall_thread_{enter,exit}() into the slow path. This is only needed for syscalls from unloadable modules. Reviewed by: kib MFC after: 2 weeks Sponsored by: EPSRC Differential Revision: https://reviews.freebsd.org/D26988	2020-11-08 15:54:59 +00:00
Kyle Evans	8c28aa5e45	imgact_binmisc: limit the extent of match on incoming entries imgact_binmisc matches magic/mask from imgp->image_header, which is only a single page in size mapped from the first page of an image. One can specify an interpreter that matches on, e.g., --offset 4096 --size 256 to read up to 256 bytes past the mapped first page. The limitation is that we cannot specify a magic string that exceeds a single page, and we can't allow offset + size to exceed a single page either. A static assert has been added in case someone finds it useful to try and expand the size, but it does seem a little unlikely. While this looks kind of exploitable at a sideways squinty-glance, there are a couple of mitigating factors: 1.) imgact_binmisc is not enabled by default, 2.) entries may only be added by the superuser, 3.) trying to exploit this information to read what's mapped past the end would be worse than a root canal or some other relatably painful experience, and 4.) there's no way one could pull this off without it being completely obvious. The first page is mapped out of an sf_buf, the implementation of which (or lack thereof) depends on your platform. MFC after: 1 week	2020-11-08 04:24:29 +00:00
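The constraint described in 8c28aa5e45, that offset + size must stay within the single mapped header page, amounts to a bounds check like the one below. This is a sketch only: `HDR_PAGE_SIZE` and `match_fits_in_header()` are hypothetical names, and the 4096-byte page size is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define	HDR_PAGE_SIZE	4096	/* assumed size of the mapped first page */

/* True if [offset, offset + size) lies entirely inside the header page. */
bool
match_fits_in_header(size_t offset, size_t size)
{
	if (size > HDR_PAGE_SIZE)
		return (false);
	/* Compare by subtraction so that offset + size cannot wrap around. */
	return (offset <= HDR_PAGE_SIZE - size);
}
```

Under these assumptions, the example from the message (offset 4096, size 256) is rejected, while a 256-byte match ending exactly at the page boundary is accepted.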
Michael Tuexen	f908d8247e	The ioctl() calls using FIONREAD, FIONWRITE, FIONSPACE, and SIOCATMARK access the socket send or receive buffer. This is not possible for listening sockets since r319722. Because send()/recv() calls fail on listening sockets, fail also ioctl() indicating EINVAL. PR: 250366 Reported by: Yong-Hao Zou Reviewed by: glebius, rscheff MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D26897	2020-11-07 21:17:49 +00:00
Kyle Evans	1024ef27fe	imgact_binmisc: move some calculations out of the exec path The offset we need to account for in the interpreter string comes in two variants: 1. Fixed - macros other than #a that will not vary from invocation to invocation 2. Variable - #a, which is substituted with the argv0 that we're replacing Note that we don't have a mechanism to modify an existing entry. By recording both of these offset requirements when the interpreter is added, we can avoid some unnecessary calculations in the exec path. Most importantly, we can know up-front whether we need to calculate/grab the filename for this interpreter. We also get to avoid walking the string a first time looking for macros. For most invocations, it's a swift exit as they won't have any, but there's no point entering a loop and searching for the macro indicator if we already know there will not be one. While we're here, go ahead and only calculate the argv0 name length once per invocation. While it's unlikely that we'll have more than one #a, there's no reason to recalculate it every time we encounter an #a when it will not change. I have not bothered trying to benchmark this at all, because it's arguably a minor and straightforward/obvious improvement. MFC after: 1 week	2020-11-07 18:07:55 +00:00
Mateusz Guzik	42e7abd5db	rms: several cleanups + debug read lockers handling This adds a dedicated counter updated with atomics when INVARIANTS is used. As a side effect one can reliably determine the lock is held for reading by at least one thread, but it's still not possible to find out whether curthread has the lock in said mode. This should be good enough in practice. Problem spotted by avg.	2020-11-07 16:57:53 +00:00
Kyle Evans	ecb4fdf943	imgact_binmisc: reorder members of struct imgact_binmisc_entry (NFC) This doesn't change anything at the moment since the out-of-order elements were a pair of uint32_t, but future additions may have caused unnecessary padding by following the existing precedent. MFC after: 1 week	2020-11-07 16:41:59 +00:00
Michal Meloun	eb20867f52	Add a method to determine whether given interrupt is per CPU or not. MFC after: 2 weeks	2020-11-07 14:58:01 +00:00
Edward Tomasz Napierala	da45ea6bc6	Move TDB_USERWR check under 'if (traced)'. If we hadn't been traced in the first place when syscallenter() started executing, we can ignore TDB_USERWR. TDB_USERWR can get set, sure, but if it does, it's because the debugger raced with the syscall, and it cannot depend on winning that race. Reviewed by: kib MFC after: 2 weeks Sponsored by: EPSRC Differential Revision: https://reviews.freebsd.org/D26585	2020-11-07 13:09:51 +00:00
Kyle Evans	2192cd125f	imgact_binmisc: abstract away the list lock (NFC) This module handles relatively few execs (initial qemu-user-static, then qemu-user-static handles exec'ing itself for binaries it's already running), but all execs pay the price of at least taking the relatively expensive sx/slock to check for a match when this module is loaded. Future work will almost certainly swap this out for another lock, perhaps an rmslock. The RLOCK/WLOCK phrasing was chosen based on what the callers are really wanting, rather than using the verbiage typically appropriate for an sx. MFC after: 1 week	2020-11-07 05:10:46 +00:00
Kyle Evans	7d3ed9777a	imgact_binmisc: validate flags coming from userland We may want to reserve bits in the future for kernel-only use, so start rejecting any that aren't the two that we're currently expecting from userland. MFC after: 1 week	2020-11-07 04:10:23 +00:00
Kyle Evans	7667824ade	epoch: support non-preemptible epochs checking in_epoch() Previously, non-preemptible epochs could not check; in_epoch() would always fail, usually because non-preemptible epochs don't imply THREAD_NO_SLEEPING. For default epochs, it's easy enough to verify that we're in the given epoch: if we're in a critical section and our record for the given epoch is active, then we're in it. This patch also adds some additional INVARIANTS bookkeeping. Notably, we set and check the recorded thread in epoch_enter/epoch_exit to try and catch some edge-cases for the caller. It also checks upon freeing that none of the records had a thread in the epoch, which may make it a little easier to diagnose some improper use if epoch_free() took place while some other thread was inside. This version differs slightly from what was just previously reviewed by the below-listed, in that in_epoch() will assert that no CPU has this thread recorded even if it is currently in a critical section. This is intended to catch cases where the caller might have somehow messed up critical section nesting, we can catch both if they exited the critical section or if they exited, migrated, then re-entered (on the wrong CPU). Reviewed by: kib, markj (both previous version) MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D27098	2020-11-07 03:29:04 +00:00
Kyle Evans	80083216cb	imgact_binmisc: minor re-organization of imgact_binmisc_exec exits Notably, streamline error paths through the existing 'done' label, making it easier to quickly verify correct cleanup. Future work might add a kernel-only flag to indicate that an interpreter uses #a. Currently, all executions via imgact_binmisc pay the penalty of constructing sname/fname, even if they will not use it. qemu-user-static doesn't need it, the stock rc script for qemu-user-static certainly doesn't use it, and I suspect these are the vast majority of (if not the only) current users. MFC after: 1 week	2020-11-07 03:28:32 +00:00
Mateusz Guzik	e25d8b67c3	malloc: tweak the version check in r367432 to include type name While here fix a whitespace problem.	2020-11-07 01:32:16 +00:00
Mateusz Guzik	bdcc222644	malloc: move malloc_type_internal into malloc_type According to code comments the original motivation was to allow for malloc_type_internal changes without ABI breakage. This can be trivially accomplished by providing spare fields and versioning the struct, as implemented in the patch below. The upshots are one less memory indirection on each alloc and disappearance of mt_zone. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D27104	2020-11-06 21:33:59 +00:00
Konstantin Belousov	f10845877e	Suspend all writeable local filesystems on power suspend. This ensures that no writes are pending in memory, either metadata or user data, but not including dirty pages not yet converted to fs writes. Only filesystems declared local are suspended. Note that this does not guarantee absence of the metadata errors or leaks if resume is not done: for instance, on UFS unlinked but opened inodes are leaked and require fsck to gc. Reviewed by: markj Discussed with: imp Tested by: imp (previous version), pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D27054	2020-11-05 20:52:49 +00:00
Mateusz Guzik	16b971ed6d	malloc: add a helper returning size allocated for given request Sample usage: kernel modules can decide whether to stick to malloc or create their own zone. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D27097	2020-11-05 16:21:21 +00:00
Mateusz Guzik	2dee296a3d	Rationalize per-cpu zones. The 2 provided zones had inconsistent naming between each other ("int" and "64") and other allocator zones (which use bytes). Follow malloc by naming them "pcpu-" + size in bytes. This is a step towards replacing ad-hoc per-cpu zones with general slabs.	2020-11-05 15:08:56 +00:00
Mateusz Guzik	ea33cca971	poll/select: change selfd_zone into a malloc type On a sample box vmstat -z shows: ITEM SIZE LIMIT USED FREE REQ 64: 64, 0, 1043784, 4367538,3698187229 selfd: 64, 0, 1520, 13726,182729008 But at the same time: vm.uma.selfd.keg.domain.1.pages: 121 vm.uma.selfd.keg.domain.0.pages: 121 Thus 242 pages got pulled even though the malloc zone would likely accommodate the load without using extra memory.	2020-11-05 12:24:37 +00:00
Mateusz Guzik	2fbb45c601	vfs: change nt_zone into a malloc type Elements are small in size and allocated for short periods.	2020-11-05 12:06:50 +00:00
Kyle Evans	df69035d7f	imgact_binmisc: fix up some minor nits - Removed a bunch of redundant headers - Don't explicitly initialize to 0 - The !error check prior to setting imgp->interpreter_name is redundant, all error paths should and do return or go to 'done'. We have larger problems otherwise.	2020-11-05 04:19:48 +00:00
Mateusz Guzik	3c50616fc1	fd: make all f_count uses go through refcount_*	2020-11-05 02:12:33 +00:00
Mateusz Guzik	d737e9eaf5	fd: hide _fdrop 0 count check behind INVARIANTS While here use refcount_load and make sure to report the tested value.	2020-11-05 02:12:08 +00:00
Mateusz Guzik	331c21dd5e	pipe: whitespace nit in previous	2020-11-04 23:17:41 +00:00
Mateusz Guzik	c22ba7bb06	pipe: fix POLLHUP handling if no events were specified Linux allows polling without any events specified and it happens to be the case in FreeBSD as well. POLLHUP has to be delivered regardless of the event mask and this works fine if the condition is already present. However, if it is missing, selrecord is only called if the eventmask has relevant bits set. This in particular leads to a condition where pipe_poll can return 0 events and neglect to selrecord, while kern_poll takes it as an indication it has to go to sleep, but then there is nobody to wake it up. While the problem seems systemic to *_poll handlers the least we can do is fix it up for pipes. Reported by: Jeremie Galarneau <jeremie.galarneau at efficios.com> Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D27094	2020-11-04 23:11:54 +00:00
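The userland side of c22ba7bb06 can be exercised with a small program. The sketch below covers only the case the message says already worked, where the hang-up condition is present before poll(2) is called; reproducing the fixed case would need a second thread closing the write end while the first sleeps. The function name is hypothetical, and the one-second timeout is there only so the call cannot block indefinitely.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/*
 * Poll the read end of a pipe with an empty event mask after the write end
 * has been closed.  Returns the revents reported by poll(2), or -1 on error.
 */
int
poll_hup_without_events(void)
{
	struct pollfd pfd;
	int fds[2];
	int ret;

	if (pipe(fds) == -1)
		return (-1);
	close(fds[1]);		/* hang-up condition is now present */

	pfd.fd = fds[0];
	pfd.events = 0;		/* no events requested */
	pfd.revents = 0;
	ret = poll(&pfd, 1, 1000);
	close(fds[0]);
	if (ret == -1)
		return (-1);
	return (pfd.revents);
}
```

POLLHUP is expected in the returned mask even though `events` is zero, since hang-up is reported regardless of the requested events.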
Mateusz Guzik	6fc2b069ca	rms: fixup concurrent writer handling and add more features Previously the code had one wait channel for all pending writers. This could result in a buggy scenario where, after a writer switches the lock mode from readers to writers and goes off CPU, another writer queues itself and then the last reader wakes up the latter instead of the former. Use a separate channel. While here add features to reliably detect whether curthread has the lock write-owned. This will be used by ZFS.	2020-11-04 21:18:08 +00:00
Mark Johnston	f7db0c9532	vmspace: Convert to refcount(9) This is mostly mechanical except for vmspace_exit(). There, use the new refcount_release_if_last() to avoid switching to vmspace0 unless other processes are sharing the vmspace. In that case, upon switching to vmspace0 we can unconditionally release the reference. Remove the volatile qualifier from vm_refcnt now that accesses are protected using refcount(9) KPIs. Reviewed by: alc, kib, mmel MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D27057	2020-11-04 16:30:56 +00:00
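The "release only if last" operation that f7db0c9532 relies on can be approximated in portable C11. This is a sketch, not the refcount(9) implementation: the function names are hypothetical, and the overflow/saturation handling of the kernel KPI is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Drop the reference only if it is the last one (count goes 1 -> 0).
 * If other holders exist, the count is left untouched and false is returned.
 */
bool
release_if_last(atomic_uint *count)
{
	unsigned int expected = 1;

	return (atomic_compare_exchange_strong(count, &expected, 0));
}

/* Unconditionally drop one reference; true if it was the last one. */
bool
release(atomic_uint *count)
{
	return (atomic_fetch_sub(count, 1) == 1);
}
```

A caller in the position of vmspace_exit() could try `release_if_last()` first and take the more expensive shared path only when it returns false, finishing with an unconditional `release()`.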
Brooks Davis	19647e76fc	sysvshm: pass relevant uap members as arguments Alter shmget_allocate_segment and shmget_existing to take the values they want from struct shmget_args rather than passing the struct around. In general, uap structures should only be the interface to sys_<foo> functions. This makes one small functional change: it records the allocated space rather than the requested space. If this turns out to be a problem (e.g. if software tries to find undersized segments by exact size rather than using keys), we can correct that easily. Reviewed by: kib Obtained from: CheriBSD MFC after: 1 week Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D27077	2020-11-03 19:14:03 +00:00
Conrad Meyer	2de07e4096	unix(4): Add SOL_LOCAL:LOCAL_CREDS_PERSISTENT This option is intended to be semantically identical to Linux's SOL_SOCKET:SO_PASSCRED. For now, it is mutually exclusive with the pre-existing sockopt SOL_LOCAL:LOCAL_CREDS. Reviewed by: markj (penultimate version) Differential Revision: https://reviews.freebsd.org/D27011	2020-11-03 01:17:45 +00:00
Mateusz Guzik	e1b6a7f83f	malloc: prefix zones with malloc- Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D27038	2020-11-02 17:39:15 +00:00
Mateusz Guzik	828afdda17	malloc: export kernel zones instead of relying on them being power-of-2 Reviewed by: markj (previous version) Differential Revision: https://reviews.freebsd.org/D27026	2020-11-02 17:38:08 +00:00
Stefan Eßer	1ebef47735	Make sysctl user.localbase a tunable that can be written at run-time This sysctl value had been provided as a read-only variable that is compiled into the C library based on the value of _PATH_LOCALBASE in paths.h. After this change, the value is compiled into the kernel as an empty string, which is translated to _PATH_LOCALBASE by the C library. This empty string can be overridden at boot time or by a privileged user at run time and will then be returned by sysctl. When set to an empty string, the value returned by sysctl reverts to _PATH_LOCALBASE. This update does not change the behavior on any system that does not modify the default value of user.localbase. I consider this change as experimental and would prefer if the run-time write permission was reconsidered and the sysctl variable defined with CTLFLAG_RDTUN instead to restrict it to be set at boot time. MFC after: 1 month	2020-10-31 23:48:41 +00:00
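The fallback described in 1ebef47735, where an empty kernel value is translated to the compiled-in default, reduces to a check like the following. This is a sketch under assumptions: `effective_localbase()` is a hypothetical helper, `DEFAULT_LOCALBASE` stands in for _PATH_LOCALBASE, and its value of "/usr/local" is assumed rather than taken from the message.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define	DEFAULT_LOCALBASE	"/usr/local"	/* stand-in for _PATH_LOCALBASE */

/*
 * Return the value a caller should see: the kernel-provided string if it
 * was overridden, otherwise the compiled-in default.
 */
const char *
effective_localbase(const char *kernel_value)
{
	if (kernel_value == NULL || kernel_value[0] == '\0')
		return (DEFAULT_LOCALBASE);
	return (kernel_value);
}
```

Keeping the default in the library and only an empty string in the kernel means a system that never sets the tunable behaves as before.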
Mateusz Guzik	82c174a3b4	malloc: delegate M_EXEC handling to dedicated routines It is almost never needed and adds an avoidable branch. While here do minor clean ups in preparation for larger changes. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D27019	2020-10-30 20:02:32 +00:00
Stefan Eßer	147eea393f	Add read only sysctl variable user.localbase The value is provided by the C library as for other sysctl variables in the user tree. It is compiled in and returns the value of _PATH_LOCALBASE defined in paths.h. Reviewed by: imp, scottl Differential Revision: https://reviews.freebsd.org/D27009	2020-10-30 18:48:09 +00:00
Mateusz Guzik	0685574968	vfs: change vnode poll to just a malloc type The size is 120, close fit for 128 and rarely used. The infrequent use avoidably populates per-CPU caches and ends up with more memory.	2020-10-30 14:02:56 +00:00

18278 Commits