freebsd-nq

Author	SHA1	Message	Date
Jamie Gritton	7f4e724829	jail: add a missing lock around an osd_jail_call(). allprison_lock should be at least held shared when jail OSD methods are called. Add a shared lock around one such call where that wasn't the case. In another such call, change an exclusive lock grab to be shared in what is likely the more common case.	2020-12-26 20:49:30 -08:00
Jamie Gritton	0fe74ae624	jail: Consistently handle the pr_allow bitmask Return a boolean (i.e. 0 or 1) from prison_allow, instead of the flag value itself, which is what sysctl expects. Add prison_set_allow(), which can set or clear a permission bit, and propagates cleared bits down to child jails. Use prison_allow() and prison_set_allow() in the various jail.allow.* sysctls, and others that depend on those permissions. Add locking around checking both pr_allow and pr_enforce_statfs in prison_priv_check().	2020-12-26 20:25:02 -08:00
Mark Johnston	26b23f07fb	sendfile: Ensure that sfio->npages is initialized We initialize sfio->npages only when some I/O is required to satisfy the request. However, sendfile_iodone() contains an INVARIANTS-only check that references sfio->npages, and this check is executed even if no I/O is performed, so the check may use an uninitialized value. Fix the problem by initializing sfio->npages earlier. Note that sendfile_swapin() always initializes the page array. In some rare cases we need to trim the page array so ensure that sfio->npages gets updated accordingly. Reported by: syzkaller (with KASAN) Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D27726	2020-12-26 16:07:40 -05:00
Jamie Gritton	5d58f959d3	jail: Fix lock-free access to dynamic pr.allow flags Use atomic access and a memory barrier to ensure that the flag parameter in pr_flag_allow is indeed set after the rest of the structure is valid. Simplify adding flag bits with pr_allow_all, a dynamic version of PR_ALLOW_ALL_STATIC.	2020-12-26 12:53:28 -08:00
Jamie Gritton	7de883c82f	jail: Fix an O(n^2) loop when adding jails When a jail is added using the default (system-chosen) JID, and non-default-JID jails already exist, a loop through the allprison list could restart and result in unnecessary O(n^2) behaviour. There should never be more than two list passes required. Also clean up inefficient (though still O(n)) allprison list traversal when finding jails by ID, or when adding jails in the common case of all default JIDs.	2020-12-26 10:39:34 -08:00
Alan Somers	0120603891	AIO: remove the kaiocb->bio linkage Vectored aio will require each aiocb to be associated with multiple bios, so we can't store a link to the latter from the former. But we don't really need to. aio_biowakeup already knows the bio it's using, and the other fields can be stored within the bio and/or buf itself. Also, remove the unused kaiocb.backend2 field. Reviewed By: kib Differential Revision: https://reviews.freebsd.org/D27682	2020-12-23 16:06:15 +00:00
Mateusz Guzik	906a73e791	cache: fix up cache_hold_vnode comment	2020-12-23 07:24:29 +00:00
Andrew Gallatin	02bc3865aa	Optionally bind ktls threads to NUMA domains When ktls_bind_thread is 2, we pick a ktls worker thread that is bound to the same domain as the TCP connection associated with the socket. We use roughly the same code as netinet/tcp_hpts.c to do this. This allows crypto to run on the same domain as the TCP connection is associated with. Assuming TCP_REUSPORT_LB_NUMA (D21636) is in place & in use, this ensures that the crypto source and destination buffers are local to the same NUMA domain as we're running crypto on. This change (when TCP_REUSPORT_LB_NUMA, D21636, is used) reduces cross-domain traffic from over 37% down to about 13% as measured by pcm.x on a dual-socket Xeon using nginx and a Netflix workload. Reviewed by: jhb Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21648	2020-12-19 21:46:09 +00:00
Kyle Evans	54a837c8cc	kern: cpuset: allow jails to modify child jails' roots This partially lifts a restriction imposed by r191639 ("Prevent a superuser inside a jail from modifying the dedicated root cpuset of that jail") that's perhaps beneficial after r192895 ("Add hierarchical jails."). Jails still cannot modify their own cpuset, but they can modify child jails' roots to further restrict them or widen them back to the modifying jails' own mask. As a side effect of this, the system root may once again widen the mask of jails as long as they're still using a subset of the parent jails' mask. This was previously prevented by the fact that cpuset_getroot of a root set will return that root, rather than the root's parent -- cpuset_modify uses cpuset_getroot since it was introduced in r327895, previously it was just validating against set->cs_parent which allowed the system root to widen jail masks. Reviewed by: jamie MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D27352	2020-12-19 03:30:06 +00:00
Konstantin Belousov	673e2dd652	Add ELF flag to disable ASLR stack gap. Also centralize and unify checks to enable ASLR stack gap in a new helper exec_stackgap(). PR: 239873 Sponsored by: The FreeBSD Foundation MFC after: 1 week	2020-12-18 23:14:39 +00:00
John Baldwin	a095390344	Use a template assembly file for firmware object files. Similar to r366897, this uses the .incbin directive to pull in a firmware file's contents into a .fwo file. The same scheme for computing symbol names from the filename is used as before to maximize compatibility and not require rebuilding existing .fwo files for NO_CLEAN builds. Using ld -o binary requires extra hacks in linkers to either specify ABI options (e.g. soft- vs hard-float) or to ignore ABI incompatibilities when linking certain objects (e.g. object files with only data). Using the compiler driver avoids the need for these hacks as the compiler driver is able to set all the appropriate ABI options. Reviewed by: imp, markj Obtained from: CheriBSD Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D27579	2020-12-17 20:31:17 +00:00
Konstantin Belousov	551e205f6d	Fix a race in tty_signal_sessleader() with unlocked read of s_leader. Since we do not own the session lock, a parallel killjobc() might reset s_leader to NULL after we checked it. Read s_leader only once and ensure that the compiler is not allowed to reload it. While there, make access to t_session somewhat tidier by using a local variable. PR: 251915 Submitted by: Jakub Piecuch <j.piecuch96@gmail.com> MFC after: 1 week	2020-12-17 19:51:39 +00:00
Mateusz Guzik	57efe26bcb	fd: reimplement close_range to avoid spurious relocking	2020-12-17 18:52:30 +00:00
Mateusz Guzik	08a5615cfe	audit: rework AUDIT_SYSCLOSE This in particular avoids spurious lookups on close.	2020-12-17 18:52:04 +00:00
Mateusz Guzik	1e71e7c4f6	fd: refactor closefp in preparation for close_range rework	2020-12-17 18:51:09 +00:00
Mateusz Guzik	08241fedc4	fd: remove redundant saturation check from fget_unlocked_seq refcount_acquire_if_not_zero returns true on saturation. The case of 0 is handled by looping again, after which the originally found pointer will no longer be there. Noted by: kib	2020-12-16 18:01:41 +00:00
Mateusz Guzik	6404d7ffc1	uipc: disable prediction in unp_pcb_lock_peer The branch is not very predictable one way or the other, at least during buildkernel where it only correctly matched 57% of calls.	2020-12-13 21:32:19 +00:00
Mateusz Guzik	8ab96e265d	cache: fix ups bad predicts - last level fallback normally sees CREATE; the code should be optimized to not get there for said case - fast path commonly fails with ENOENT	2020-12-13 21:29:39 +00:00
Mateusz Guzik	d48c2b8d29	vfs: correctly predict last fdrop on failed open Arguably since the count is guaranteed to be 1 the code should be modified to avoid the work.	2020-12-13 21:28:15 +00:00
Konstantin Belousov	203affb291	Fix TDP_WAKEUP/thr_wake(curthread->td_tid) after r366428. Reported by: arichardson Reviewed by: arichardson, markj Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D27597	2020-12-13 19:45:42 +00:00
Konstantin Belousov	0b459854bc	Correct indent. Sponsored by: The FreeBSD Foundation	2020-12-13 19:43:45 +00:00
Mateusz Guzik	edcdcefb88	fd: fix fdrop prediction when closing a fd Most of the time this is the last reference, contrary to typical fdrop use.	2020-12-13 18:06:24 +00:00
Ryan Libby	d3bbf8af68	cache_fplookup: quiet gcc -Wreturn-type Reviewed by: markj, mjg Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D27555	2020-12-11 22:51:44 +00:00
Mateusz Guzik	0ecce93dca	fd: make serialization in fdescfree_fds conditional on hold count p_fd nullification in fdescfree serializes against new threads transitioning the count 1 -> 2, meaning that fdescfree_fds observing the count of 1 can safely assume there is nobody else using the table. Losing the race and observing > 1 is harmless. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D27522	2020-12-10 17:17:22 +00:00
Mark Johnston	3309fa7403	Plug a race between fd table teardown and several loops To export information from fd tables we have several loops which do this: FILDESC_SLOCK(fdp); for (i = 0; fdp->fd_refcount > 0 && i <= lastfile; i++) <export info for fd i>; FILDESC_SUNLOCK(fdp); Before r367777, fdescfree() acquired the fd table exclusive lock between decrementing fdp->fd_refcount and freeing table entries. This serialized with the loop above, so the file at descriptor i would remain valid until the lock is dropped. Now there is no serialization, so the loops may race with teardown of file descriptor tables. Acquire the exclusive fdtable lock after releasing the final table reference to provide a barrier synchronizing with these loops. Reported by: pho Reviewed by: kib (previous version), mjg Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D27513	2020-12-09 14:05:08 +00:00
Mark Johnston	4c1c90ea95	Use refcount_load(9) to load fd table reference counts No functional change intended. Reviewed by: kib, mjg Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D27512	2020-12-09 14:04:54 +00:00
Kyle Evans	f1b18a668d	cpuset_set{affinity,domain}: do not allow empty masks cpuset_modify() would not currently catch this, because it only checks that the new mask is a subset of the root set and circumvents the EDEADLK check in cpuset_testupdate(). This change both directly validates the mask coming in since we can trivially detect an empty mask, and it updates cpuset_testupdate to catch stuff like this going forward by always ensuring we don't end up with an empty mask. The check_mask argument has been renamed because the 'check' verbiage does not imply to me that it's actually doing a different operation. We're either augmenting the existing mask, or we are replacing it entirely. Reported by: syzbot+4e3b1009de98d2fabcda@syzkaller.appspotmail.com Discussed with: andrew Reviewed by: andrew, markj MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D27511	2020-12-08 18:47:22 +00:00
Kyle Evans	b2780e8537	kern: cpuset: resolve race between cpuset_lookup/cpuset_rel The race plays out like so between threads A and B: 1. A ref's cpuset 10 2. B does a lookup of cpuset 10, grabs the cpuset lock and searches cpuset_ids 3. A rel's cpuset 10 and observes the last ref, waits on the cpuset lock while B is still searching and not yet ref'd 4. B ref's cpuset 10 and drops the cpuset lock 5. A proceeds to free the cpuset out from underneath B Resolve the race by only releasing the last reference under the cpuset lock. Thread A now picks up the spinlock and observes that the cpuset has been revived, returning immediately for B to deal with later. Reported by: syzbot+92dff413e201164c796b@syzkaller.appspotmail.com Reviewed by: markj MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D27498	2020-12-08 18:45:47 +00:00
Kyle Evans	9c83dab96c	kern: cpuset: plug a unr leak cpuset_rel_defer() is supposed to be functionally equivalent to cpuset_rel() but with anything that might sleep deferred until cpuset_rel_complete -- this setup is used specifically for cpuset_setproc. Add in the missing unr free to match cpuset_rel. This fixes a leak that was observed when I wrote a small userland application to try and debug another issue, which effectively did: cpuset(&newid); cpuset(&scratch); newid gets leaked when scratch is created; it's off the list, so there's no mechanism for anything else to relinquish it. A more realistic reproducer would likely be a process that inherits some cpuset that it's the only ref for, but it creates a new one to modify. Alternatively, administratively reassigning a process' cpuset that it's the last ref for will have the same effect. Discovered through D27498. MFC after: 1 week	2020-12-08 18:44:06 +00:00
Mateusz Guzik	8fcfd0e222	vfs: add cleanup on error missed in r368375 Noted by: jrtc27	2020-12-06 19:24:38 +00:00
Mateusz Guzik	60e2a0d9a4	vfs: factor buffer allocation/copyin out of namei	2020-12-06 04:59:24 +00:00
Mateusz Guzik	0c23d26230	vfs: keep bad ops on vnode reclaim They were only modified to accommodate a redundant assertion. This runs into problems as lockless lookup can still try to use the vnode and crash instead of getting an error. The bug was only present in kernels with INVARIANTS. Reported by: kevans	2020-12-05 05:56:23 +00:00
Konstantin Belousov	be2535b0a6	Add kern_ntp_adjtime(9). Reviewed by: brooks, cy Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D27471	2020-12-04 18:56:44 +00:00
Kyle Evans	34af05ead3	kern: soclose: don't sleep on SO_LINGER w/ timeout=0 This is a valid scenario that's handled in the various protocol layers where it makes sense (e.g., tcp_disconnect and sctp_disconnect). Given that it indicates we should immediately drop the connection, it makes little sense to sleep on it. This could lead to panics with INVARIANTS. On non-INVARIANTS kernels, this could result in the thread hanging until a signal interrupts it if the protocol does not mark the socket as disconnected for whatever reason. Reported by: syzbot+e625d92c1dd74e402c81@syzkaller.appspotmail.com Reviewed by: glebius, markj MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D27407	2020-12-04 04:39:48 +00:00
Mark Johnston	b957b18594	Always use 64-bit physical addresses for dump_avail[] in minidumps As of r365978, minidumps include a copy of dump_avail[]. This is an array of vm_paddr_t ranges. libkvm walks the array assuming that sizeof(vm_paddr_t) is equal to the platform "word size", but that's not correct on some platforms. For instance, i386 uses a 64-bit vm_paddr_t. Fix the problem by always dumping 64-bit addresses. On platforms where vm_paddr_t is 32 bits wide, namely arm and mips (sometimes), translate dump_avail[] to an array of uint64_t ranges. With this change, libkvm no longer needs to maintain a notion of the target word size, so get rid of it. This is a no-op on platforms where sizeof(vm_paddr_t) == 8. Reviewed by: alc, kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D27082	2020-12-03 17:12:31 +00:00
Oleksandr Tymoshenko	18ce865a4f	Add support for hw.physmem tunable for ARM/ARM64/RISC-V platforms The hw.physmem tunable limits the amount of physical memory available to the system. It is handled in machdep files for x86 and PowerPC. This patch adds the required logic to the consolidated physmem management interface that is used by ARM, ARM64, and RISC-V. Submitted by: Klara, Inc. Reviewed by: mhorne Sponsored by: Ampere Computing Differential Revision: https://reviews.freebsd.org/D27152	2020-12-03 05:39:27 +00:00
Mateusz Guzik	10e64782ed	select: make sure there are no wakeup attempts after selfdfree returns Prior to the patch, a returning selfdfree could still be racing against doselwakeup, which set sf_si = NULL and now locks stp to wake up the other thread. A sufficiently unlucky pair can end up going all the way down to freeing select-related structures before the lock/wakeup/unlock finishes. This started manifesting itself as crashes since select data started getting freed in r367714.	2020-12-02 00:48:15 +00:00
Konstantin Belousov	6814c2dac5	lio_listio(2): send signal even if number of jobs is zero. Right now, if lio registered zero jobs, the syscall frees the lio job structure, cleaning up the queued ksi. As a result, the realtime signal is dequeued and never delivered. Fix it by allowing sendsig() to copy ksi when the job count is zero. PR: 220398 Reported and reviewed by: asomers Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D27421	2020-12-01 22:53:33 +00:00
Konstantin Belousov	2933165666	vfs_aio.c: style. Mostly re-wrap conditions to split after binary ops. Reviewed by: asomers Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D27421	2020-12-01 22:46:51 +00:00
Konstantin Belousov	5c5005ec20	vfs_aio.c: correct comment. Reviewed by: asomers Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D27421	2020-12-01 22:30:32 +00:00
Mark Johnston	dad22308a1	vmem: Revert r364744 A pair of bugs are believed to have caused the hangs described in the commit log message for r364744: 1. uma_reclaim() could trigger reclamation of the reserve of boundary tags used to avoid deadlock. This was fixed by r366840. 2. The loop in vmem_xalloc() would in some cases try to allocate more boundary tags than the expected upper bound of BT_MAXALLOC. The reserve is sized based on the value BT_MAXALLOC, so this behaviour could deplete the reserve without guaranteeing a successful allocation, resulting in a hang. This was fixed by r366838. PR: 248008 Tested by: rmacklem	2020-12-01 16:06:31 +00:00
Alexander V. Chernikov	8db8bebf1f	Move inner loop logic out of sysctl_sysctl_next_ls(). Refactor sysctl_sysctl_next_ls(): * Move huge inner loop out of sysctl_sysctl_next_ls() into a separate non-recursive function, returning the next step to be taken. * Update resulting node oid parts only on successful lookup * Make sysctl_sysctl_next_ls() return boolean success/failure instead of errno, slightly simplifying logic Reviewed by: freqlabs Differential Revision: https://reviews.freebsd.org/D27029	2020-11-30 21:59:52 +00:00
Toomas Soome	93b18e3730	vt: if loader did pass the font via metadata, use it The built-in 8x16 font may be way too small with large framebuffer resolutions; to improve readability, use the loader-provided font.	2020-11-30 11:45:47 +00:00
Toomas Soome	a4a10b37d4	Add VT driver for VBE framebuffer device Implement vt_vbefb to support the VESA BIOS Extensions (VBE) framebuffer with VT. vt_vbefb is based on vt_efifb and assumes similar data for initialization; use MODINFOMD_VBE_FB to identify the structure vbe_fb in kernel metadata. struct vbe_fb is populated by the boot loader and is passed to the kernel via the metadata payload. Differential Revision: https://reviews.freebsd.org/D27373	2020-11-30 08:22:40 +00:00
Matt Macy	2338da0373	Import kernel WireGuard support Data path largely shared with the OpenBSD implementation by Matt Dunwoodie <ncon@nconroy.net> Reviewed by: grehan@freebsd.org MFC after: 1 month Sponsored by: Rubicon LLC, (Netgate) Differential Revision: https://reviews.freebsd.org/D26137	2020-11-29 19:38:03 +00:00
Konstantin Belousov	a9d4fe977a	bio aio: Destroy ephemeral mapping before unwiring page. Apparently some architectures, like ppc in its hashed page tables variants, account mappings by pmap_qenter() in the response from pmap_is_page_mapped(). While there, eliminate useless userp variable. Noted and reviewed by: alc (previous version) Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D27409	2020-11-29 10:30:56 +00:00
Alexander Motin	83f6b50123	Remove alignment requirements for KVA buffer mapping. After r368124 pbuf_zone has extra page to handle this particular case.	2020-11-29 01:30:17 +00:00
Konstantin Belousov	cd85379104	Make MAXPHYS tunable. Bump MAXPHYS to 1M. Replace MAXPHYS by runtime variable maxphys. It is initialized from MAXPHYS by default, but can also be adjusted with the tunable kern.maxphys. Make b_pages[] array in struct buf flexible. Size b_pages[] for buffer cache buffers exactly to atop(maxbcachebuf) (currently it is sized to atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1. The +1 for pbufs allows several pbuf consumers, among them vmapbuf(), to use unaligned buffers still sized to maxphys, esp. when such buffers come from userspace (). Overall, we save a significant amount of otherwise wasted memory in b_pages[] for buffer cache buffers, while bumping MAXPHYS to the desired high value. Eliminate all direct uses of the MAXPHYS constant in kernel and driver sources, except the place which initializes maxphys. Some random (and arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted straight. Some drivers, which use MAXPHYS to size embedded structures, get a private MAXPHYS-like constant; their conversion is out of scope for this work. Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs, dev/siis, were either submitted by, or based on changes by mav. Suggested by: mav () Reviewed by: imp, mav, imp, mckusick, scottl (intermediate versions) Tested by: pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D27225	2020-11-28 12:12:51 +00:00
Kyle Evans	e07e3fa3c9	kern: cpuset: drop the lock to allocate domainsets Restructure the loop a little bit to make it a little more clear how it really operates: we never allocate any domains at the beginning of the first iteration, and it will run until we've satisfied the amount we need or we encounter an error. The lock is now taken outside of the loop to make stuff inside the loop easier to evaluate w.r.t. locking. This fixes it to not try and allocate any domains for the freelist under the spinlock, which would have happened before if we needed any new domains. Reported by: syzbot+6743fa07b9b7528dc561@syzkaller.appspotmail.com Reviewed by: markj MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D27371	2020-11-28 01:21:11 +00:00
Mark Johnston	0c56925bc2	callout(9): Remove some leftover APM BIOS support This code is obsolete since r366546. Reviewed by: imp Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D27267	2020-11-27 20:46:02 +00:00
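
The pr_allow change in commit 0fe74ae624 above describes two behaviours: a query that returns 0 or 1 rather than the raw flag value, and a setter that propagates cleared bits down to child jails. The following is a minimal user-space sketch of that idea only, not the kernel code. Every name in it (`sk_prison`, `sk_prison_allow`, `sk_prison_set_allow`, the `SK_ALLOW_*` bits) is hypothetical, the child array stands in for the kernel's real jail hierarchy, and the locking mentioned in the commit is left out.

```c
#include <stddef.h>

/* Hypothetical permission bits; the kernel's real PR_ALLOW_* values differ. */
#define SK_ALLOW_SET_HOSTNAME	0x01u
#define SK_ALLOW_MOUNT		0x10u

#define SK_MAX_CHILDREN		4

/* Simplified stand-in for a jail: a permission mask plus direct children. */
struct sk_prison {
	unsigned int allow;
	struct sk_prison *children[SK_MAX_CHILDREN];
	size_t nchildren;
};

/* Return 0 or 1, never the raw flag value. */
static int
sk_prison_allow(const struct sk_prison *pr, unsigned int flag)
{
	return ((pr->allow & flag) != 0);
}

/*
 * Set or clear one permission bit.  Setting affects only this jail;
 * clearing also clears the bit in every descendant, so a child can
 * never hold a permission its parent has lost.
 */
static void
sk_prison_set_allow(struct sk_prison *pr, unsigned int flag, int enable)
{
	size_t i;

	if (enable) {
		pr->allow |= flag;
		return;
	}
	pr->allow &= ~flag;
	for (i = 0; i < pr->nchildren; i++)
		sk_prison_set_allow(pr->children[i], flag, 0);
}
```

Returning a normalized 0 or 1 matters when the result is exported as an integer: a flag such as 0x10 would otherwise show up as 16 rather than 1. Re-enabling a bit in the parent deliberately does not re-enable it in the children in this sketch.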

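
Commit 551e205f6d above fixes a race by reading s_leader only once and preventing the compiler from reloading it. The general pattern can be sketched in portable C11 as shown here. This is an illustration under stated assumptions rather than the kernel change itself: `sk_session`, `sk_proc` and `sk_session_leader_pid` are hypothetical names, C11 `<stdatomic.h>` stands in for whatever primitive the kernel uses, and keeping the pointed-to object alive after the load is out of scope for the sketch.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for a process and a session. */
struct sk_proc {
	int pid;
};

struct sk_session {
	/* May be reset to NULL concurrently by another thread. */
	_Atomic(struct sk_proc *) leader;
};

/*
 * Return the leader's pid, or -1 if there is no leader.
 *
 * The shared pointer is loaded exactly once into a local variable, and
 * both the NULL check and the dereference use that local.  Checking
 * s->leader and then dereferencing s->leader again would allow a
 * concurrent reset to NULL to slip in between the two reads.
 */
static int
sk_session_leader_pid(struct sk_session *s)
{
	struct sk_proc *p;

	p = atomic_load_explicit(&s->leader, memory_order_relaxed);
	if (p == NULL)
		return (-1);
	return (p->pid);
}
```

Because the load is atomic, the compiler may not split it into two reads or re-fetch the field after the NULL check, which is the property the commit message asks for.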
17997 Commits