The routine does not serve any practical purpose.
Memory can be allocated in many other ways, and most consumers pass the
M_WAITOK flag, which means malloc does not fail for them in the first place.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D27143
This works for amd64 but not for other architectures -- drop it, because we already have a
proper definition in sys/compat/freebsd32/freebsd32.h that correctly uses
time32_t.
MFC after: 1 week
This gets rid of the most contended spinlock seen when creating/destroying
threads in a loop (modulo kstack).
Tested by: alfredo (ppc64), bdragon (ppc64)
First, funsetownlst() looks at the first element of the list to see
whether it's processing a process or a process group list. Then it
acquires the global sigio lock and processes the list. However, nothing
prevents the first sigio tracker from being freed by a concurrent
funsetown() before the sigio lock is acquired.
Fix this by acquiring the global sigio lock immediately after checking
whether the list is empty. Callers of funsetownlst() ensure that new
sigio trackers cannot be added concurrently.
Second, fsetown() uses funsetown() to remove an existing sigio structure
from a file object. However, funsetown() uses a racy check to avoid the
sigio lock, so two threads may call fsetown() on the same file object,
both observe that no sigio tracker is present, and enqueue two sigio
trackers for the same file object. However, if the file object is
destroyed, funsetown() will only remove one sigio tracker, and
funsetownlst() may later trigger a use-after-free when it clears the
file object reference for each entry in the list.
Fix this by introducing funsetown_locked(), which avoids the racy check.
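To illustrate the shape of the fix, here is a minimal userland sketch
(illustrative names only, not the kernel code) of taking the global lock
before looking at the per-file tracker, so two racing fsetown() callers can no
longer both observe "no tracker" and install one each:

#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t sigio_lock = PTHREAD_MUTEX_INITIALIZER;

struct file_model {
    void *f_sigio;      /* tracker, or NULL */
};

static void
funsetown_locked_model(struct file_model *fp)
{
    /* Caller holds sigio_lock; no racy unlocked peek at f_sigio here. */
    free(fp->f_sigio);
    fp->f_sigio = NULL;
}

static void
fsetown_model(struct file_model *fp, void *new_sigio)
{
    pthread_mutex_lock(&sigio_lock);
    funsetown_locked_model(fp);     /* detach any existing tracker */
    fp->f_sigio = new_sigio;        /* at most one tracker is installed */
    pthread_mutex_unlock(&sigio_lock);
}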
Reviewed by: kib
Reported by: pho
Tested by: pho
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27157
Note that this still does not scale, but it is enough to move it out of the
way for the foreseeable future.
In particular, a trivial benchmark spawning/killing threads stops contending
on tidhash.
This in particular unbreaks rtkit.
The limitation was a leftover of the previous state; to quote a
comment:
/*
* Though lwpid is unique, only current process is supported
* since there is no efficient way to look up a LWP yet.
*/
Since then, a global tid hash has been introduced to remedy
the problem.
Permission checks still apply.
Submitted by: greg_unrelenting.technology (Greg V)
Differential Revision: https://reviews.freebsd.org/D27158
There are workloads with very bursty tid allocation, and since unr tries very
hard to keep its bitmaps small, it keeps reallocating memory. Just doing
buildkernel gives almost 150k calls to free coming from unr.
This also gets rid of the hack which tried to postpone TID reuse.
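For illustration only, a toy fixed-size bitmap ID allocator of the general
kind such a replacement might use (not the committed code):

#include <stdint.h>

#define ID_MAX      4096
#define WORD_BITS   64

static uint64_t id_bitmap[ID_MAX / WORD_BITS];  /* bit set => ID in use */

static int
id_alloc(void)
{
    for (int w = 0; w < ID_MAX / WORD_BITS; w++) {
        if (id_bitmap[w] != UINT64_MAX) {
            int bit = __builtin_ctzll(~id_bitmap[w]);
            id_bitmap[w] |= (uint64_t)1 << bit;
            return (w * WORD_BITS + bit);
        }
    }
    return (-1);        /* space exhausted */
}

static void
id_free(int id)
{
    id_bitmap[id / WORD_BITS] &= ~((uint64_t)1 << (id % WORD_BITS));
}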
Reviewed by: kib, markj
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D27101
The intent is to replace the current id allocation method and a known upper
bound will be useful.
Reviewed by: kib (previous version), markj (previous version)
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D27100
imgact_binmisc matches magic/mask from imgp->image_header, which is only a
single page in size mapped from the first page of an image. One can specify
an interpreter that matches on, e.g., --offset 4096 --size 256 to read up to
256 bytes past the mapped first page.
The limitation is that we cannot specify a magic string that exceeds a
single page, and we can't allow offset + size to exceed a single page
either. A static assert has been added in case someone finds it useful to
try and expand the size, but it does seem a little unlikely.
While this looks kind of exploitable at a sideways squinty-glance, there are
a couple of mitigating factors:
1.) imgact_binmisc is not enabled by default,
2.) entries may only be added by the superuser,
3.) trying to exploit this information to read what's mapped past the end
would be worse than a root canal or some other relatably painful
experience, and
4.) there's no way one could pull this off without it being completely
obvious.
The first page is mapped out of an sf_buf, the implementation of which (or
lack thereof) depends on your platform.
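As an illustration of the bound described above (assumed constants and names,
not the committed code), the check and static assert amount to:

#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE       4096    /* assumed 4k page for the example */
#define IBE_MAGIC_MAX   256     /* hypothetical cap on magic/mask size */

_Static_assert(IBE_MAGIC_MAX <= PAGE_SIZE,
    "magic/mask must fit within the single mapped page");

static int
entry_fits_first_page(uint32_t offset, uint32_t size)
{
    /* Reject entries whose match window would run past the mapped page. */
    return (size <= PAGE_SIZE && offset <= PAGE_SIZE - size);
}

int
main(void)
{
    printf("%d\n", entry_fits_first_page(4096, 256));  /* 0: past the page */
    printf("%d\n", entry_fits_first_page(0, 256));     /* 1: fits */
    return (0);
}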
MFC after: 1 week
access the socket send or receive buffer. This is not possible for
listening sockets since r319722.
Because send()/recv() calls fail on listening sockets, make ioctl() fail as
well, returning EINVAL.
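A minimal userland sketch of the resulting behavior, assuming FIONREAD is one
of the affected ioctls:

#include <sys/ioctl.h>
#include <sys/socket.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
    int s = socket(AF_INET, SOCK_STREAM, 0);
    int avail;

    listen(s, 1);
    /* Expected to fail with EINVAL now that the socket is listening. */
    if (ioctl(s, FIONREAD, &avail) == -1)
        printf("FIONREAD on a listening socket: %s\n", strerror(errno));
    return (0);
}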
PR: 250366
Reported by: Yong-Hao Zou
Reviewed by: glebius, rscheff
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D26897
The offset we need to account for in the interpreter string comes in two
variants:
1. Fixed - macros other than #a that will not vary from invocation to
invocation
2. Variable - #a, which is substituted with the argv0 that we're replacing
Note that we don't have a mechanism to modify an existing entry. By
recording both of these offset requirements when the interpreter is added,
we can avoid some unnecessary calculations in the exec path.
Most importantly, we know up-front whether we need to calculate/grab the
filename for this interpreter. We also get to avoid walking the string an
extra time looking for macros. For most invocations,
it's a swift exit as they won't have any, but there's no point entering a
loop and searching for the macro indicator if we already know there will not
be one.
While we're here, go ahead and only calculate the argv0 name length once per
invocation. While it's unlikely that we'll have more than one #a, there's no
reason to recalculate it every time we encounter an #a when it will not
change.
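A userland sketch of the idea, with hypothetical field names: scan once at
entry-add time, then compute the expanded length (and argv0's length) once per
exec:

#include <stddef.h>
#include <string.h>

struct interp_entry {
    const char *ie_str;         /* interpreter string, may contain #a */
    size_t ie_fixed_extra;      /* net bytes added by fixed macros */
    int ie_argv0_count;         /* number of #a occurrences */
};

/* Done once, when the entry is added. */
static void
interp_entry_scan(struct interp_entry *ie)
{
    const char *p;

    ie->ie_fixed_extra = 0;
    ie->ie_argv0_count = 0;
    for (p = ie->ie_str; (p = strchr(p, '#')) != NULL && p[1] != '\0';
        p += 2) {
        if (p[1] == 'a')
            ie->ie_argv0_count++;
        /* other (fixed) macros would bump ie_fixed_extra here */
    }
}

/* Done on each exec; argv0's length is computed exactly once. */
static size_t
interp_string_length(const struct interp_entry *ie, const char *argv0)
{
    size_t argv0_len = ie->ie_argv0_count > 0 ? strlen(argv0) : 0;

    /* Each two-byte "#a" is replaced by argv0. */
    return (strlen(ie->ie_str) - 2 * (size_t)ie->ie_argv0_count +
        ie->ie_fixed_extra + (size_t)ie->ie_argv0_count * argv0_len);
}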
I have not bothered trying to benchmark this at all, because it's arguably a
minor and straightforward/obvious improvement.
MFC after: 1 week
This adds a dedicated counter updated with atomics when INVARIANTS
is used. As a side effect one can reliably determine the lock is held
for reading by at least one thread, but it's still not possible to
find out whether curthread has the lock in said mode.
This should be good enough in practice.
Problem spotted by avg.
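A toy userland model of the debug-only counter (not the kernel code): it can
show that some thread holds the lock for reading, but not that curthread does:

#include <assert.h>
#include <stdatomic.h>

static atomic_int rms_debug_readers;

static void
rms_rlock_debug(void)
{
    atomic_fetch_add(&rms_debug_readers, 1);
}

static void
rms_runlock_debug(void)
{
    atomic_fetch_sub(&rms_debug_readers, 1);
}

static void
rms_assert_rlocked(void)
{
    /* At least one reader exists, though not necessarily curthread. */
    assert(atomic_load(&rms_debug_readers) > 0);
}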
This doesn't change anything at the moment since the out-of-order elements
were a pair of uint32_t, but future additions could have introduced unnecessary
padding by following the existing precedent.
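A generic illustration of the concern (not related to the actual struct):

#include <stdint.h>
#include <stdio.h>

struct bad  { uint32_t a; uint64_t b; uint32_t c; };    /* 24 bytes on LP64 */
struct good { uint64_t b; uint32_t a; uint32_t c; };    /* 16 bytes on LP64 */

int
main(void)
{
    printf("bad: %zu, good: %zu\n", sizeof(struct bad), sizeof(struct good));
    return (0);
}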
MFC after: 1 week
If we weren't being traced in the first place when syscallenter()
started executing, we can ignore TDB_USERWR. TDB_USERWR can get set,
sure, but if it does, it's because the debugger raced with the syscall,
and it cannot depend on winning that race.
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: EPSRC
Differential Revision: https://reviews.freebsd.org/D26585
This module handles relatively few execs (initial qemu-user-static, then
qemu-user-static handles exec'ing itself for binaries it's already running),
but all execs pay the price of at least taking the relatively expensive
sx/slock to check for a match when this module is loaded. Future work will
almost certainly swap this out for another lock, perhaps an rmslock.
The RLOCK/WLOCK phrasing was chosen based on what the callers really want,
rather than using the verbiage typically appropriate for an sx.
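A sketch of the kind of wrappers meant here, with assumed names; a future
switch to another lock type would then only touch these definitions:

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/sx.h>

/* sx_init(&interp_list_sx, "binmisc interp list") at module load time. */
static struct sx interp_list_sx;

#define INTERP_LIST_RLOCK()     sx_slock(&interp_list_sx)
#define INTERP_LIST_RUNLOCK()   sx_sunlock(&interp_list_sx)
#define INTERP_LIST_WLOCK()     sx_xlock(&interp_list_sx)
#define INTERP_LIST_WUNLOCK()   sx_xunlock(&interp_list_sx)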
MFC after: 1 week
We may want to reserve bits in the future for kernel-only use, so start
rejecting any that aren't the two that we're currently expecting from
userland.
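A minimal sketch of the check, with hypothetical flag names:

#include <errno.h>

#define FOO_FLAG_A      0x01    /* hypothetical flag accepted from userland */
#define FOO_FLAG_B      0x02    /* hypothetical flag accepted from userland */
#define FOO_VALID_FLAGS (FOO_FLAG_A | FOO_FLAG_B)

static int
foo_check_flags(int flags)
{
    /* Reject any bits not currently defined for userland so they can
     * later be reserved for kernel-only use. */
    if ((flags & ~FOO_VALID_FLAGS) != 0)
        return (EINVAL);
    return (0);
}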
MFC after: 1 week
Previously, non-preemptible epochs could not be checked: in_epoch() would always
fail, usually because non-preemptible epochs don't imply THREAD_NO_SLEEPING.
For default epochs, it's easy enough to verify that we're in the given
epoch: if we're in a critical section and our record for the given epoch
is active, then we're in it.
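A toy model of that check (userland types, not the kernel structures):

#include <stdbool.h>

struct epoch_record_model {
    int  er_active;     /* nonzero while this CPU has an open section */
    void *er_td;        /* thread recorded at epoch_enter (INVARIANTS) */
};

static bool
in_epoch_model(const struct epoch_record_model *er, int td_critnest, void *td)
{
    /* Not in a critical section at all: cannot be in the epoch. */
    if (td_critnest == 0)
        return (false);
    /* In a critical section and this CPU's record is active for us. */
    return (er->er_active != 0 && er->er_td == td);
}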
This patch also adds some additional INVARIANTS bookkeeping. Notably, we set
and check the recorded thread in epoch_enter/epoch_exit to try and catch
some edge-cases for the caller. It also checks upon freeing that none of the
records had a thread in the epoch, which may make it a little easier to
diagnose some improper use if epoch_free() took place while some other
thread was inside.
This version differs slightly from what was previously reviewed by those
listed below, in that in_epoch() will assert that no CPU has this thread
recorded even if it *is* currently in a critical section. This is intended
to catch cases where the caller might have somehow messed up critical
section nesting; we can catch both the case where they exited the critical
section and the case where they exited, migrated, and then re-entered (on the
wrong CPU).
Reviewed by: kib, markj (both previous version)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D27098
Notably, streamline error paths through the existing 'done' label, making it
easier to quickly verify correct cleanup.
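The general pattern, shown here as a generic userland sketch rather than the
actual diff:

#include <errno.h>
#include <stdlib.h>

static int
do_work(void)
{
    char *a = NULL, *b = NULL;
    int error = 0;

    a = malloc(16);
    if (a == NULL) {
        error = ENOMEM;
        goto done;
    }
    b = malloc(16);
    if (b == NULL) {
        error = ENOMEM;
        goto done;
    }
    /* ... actual work ... */
done:
    free(b);    /* free(NULL) is a no-op, so cleanup stays unconditional */
    free(a);
    return (error);
}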
Future work might add a kernel-only flag to indicate that an interpreter uses
#a. Currently, all executions via imgact_binmisc pay the penalty of
constructing sname/fname, even if they will not use it. qemu-user-static
doesn't need it, the stock rc script for qemu-user-static certainly doesn't
use it, and I suspect these are the vast majority of (if not the only)
current users.
MFC after: 1 week
According to code comments the original motivation was to allow for
malloc_type_internal changes without ABI breakage. This can be trivially
accomplished by providing spare fields and versioning the struct, as
implemented in the patch below.
The upshots are one less memory indirection on each alloc and the
disappearance of mt_zone.
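A generic sketch of the spare-fields/versioning idiom (field names are
illustrative only, not the actual struct):

#include <stdint.h>

#define M_TYPE_VERSION  1       /* bumped when the layout changes */

struct malloc_type_compat {
    uint32_t    mt_version;     /* consumers record the version they built against */
    uint32_t    mt_flags;
    const char *mt_name;
    uint64_t    mt_spare[4];    /* reserved for future members */
};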
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D27104
This ensures that no writes are pending in memory, either metadata or
user data, though dirty pages not yet converted to fs writes are not included.
Only filesystems declared local are suspended.
Note that this does not guarantee the absence of metadata errors or
leaks if resume is not done: for instance, on UFS unlinked but opened
inodes are leaked and require fsck to gc.
Reviewed by: markj
Discussed with: imp
Tested by: imp (previous version), pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D27054
Sample usage: kernel modules can decide whether to stick to malloc or
create their own zone.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D27097
The 2 provided zones had inconsistent naming between each other
("int" and "64") and other allocator zones (which use bytes).
Follow malloc by naming them "pcpu-" + size in bytes.
This is a step towards replacing ad-hoc per-cpu zones with
general slabs.
On a sample box vmstat -z shows:
ITEM SIZE LIMIT USED FREE REQ
64: 64, 0, 1043784, 4367538,3698187229
selfd: 64, 0, 1520, 13726,182729008
But at the same time:
vm.uma.selfd.keg.domain.1.pages: 121
vm.uma.selfd.keg.domain.0.pages: 121
Thus 242 pages got pulled even though the malloc zone would likely accommodate
the load without using extra memory.
- Remove a bunch of redundant headers
- Don't explicitly initialize to 0
- The !error check prior to setting imgp->interpreter_name is redundant; all
error paths should and do return or go to 'done'. We have larger problems
otherwise.
Linux allows polling without any events specified, and that happens to work on
FreeBSD as well. POLLHUP has to be delivered regardless of the event mask,
and this works fine if the condition is already present. However, if it is
missing, selrecord is only called if the eventmask has relevant bits set. This
in particular leads to a condition where pipe_poll can return 0 events and
neglect to call selrecord, while kern_poll takes that as an indication that it
has to go to sleep, but then there is nobody to wake it up.
While the problem seems systemic to *_poll handlers, the least we can do is
fix it up for pipes.
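A userland sketch of the scenario (not taken from the report): the parent
blocks in poll() with no events requested and relies on POLLHUP arriving later:

#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    int fds[2];
    struct pollfd pfd;

    pipe(fds);
    if (fork() == 0) {
        close(fds[0]);
        sleep(1);           /* let the parent block in poll() first */
        close(fds[1]);      /* hang up the write side */
        _exit(0);
    }
    close(fds[1]);          /* parent keeps only the read side */
    pfd.fd = fds[0];
    pfd.events = 0;         /* no events requested */
    poll(&pfd, 1, -1);      /* could previously sleep forever */
    printf("revents: %#x\n", (unsigned)pfd.revents);    /* expect POLLHUP */
    return (0);
}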
Reported by: Jeremie Galarneau <jeremie.galarneau at efficios.com>
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D27094
Previously the code had one wait channel for all pending writers.
This could result in a buggy scenario where a writer that has switched
the lock mode from readers to writers goes off CPU, another writer
queues itself, and then the last reader wakes up the latter instead
of the former.
Use a separate channel.
While here add features to reliably detect whether curthread has
the lock write-owned. This will be used by ZFS.
This is mostly mechanical except for vmspace_exit(). There, use the new
refcount_release_if_last() to avoid switching to vmspace0 unless other
processes are sharing the vmspace. In that case, upon switching to
vmspace0 we can unconditionally release the reference.
Remove the volatile qualifier from vm_refcnt now that accesses are
protected using refcount(9) KPIs.
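A toy userland model of the vmspace_exit() shape described above, with C11
atomics standing in for refcount(9):

#include <stdatomic.h>
#include <stdbool.h>

/* Stand-ins for refcount(9): both return true when the last reference drops. */
static bool
release_model(atomic_uint *cnt)
{
    return (atomic_fetch_sub(cnt, 1) == 1);
}

static bool
release_if_last_model(atomic_uint *cnt)
{
    unsigned int old = 1;

    /* Succeed only if we hold the sole reference. */
    return (atomic_compare_exchange_strong(cnt, &old, 0));
}

static void
vmspace_exit_model(atomic_uint *vm_refcnt)
{
    if (release_if_last_model(vm_refcnt)) {
        /* Sole user: free directly, never borrowing vmspace0. */
        return;
    }
    /* Shared: switch the exiting process to vmspace0 here, after which the
     * reference can be released unconditionally. */
    if (release_model(vm_refcnt)) {
        /* We still ended up dropping the last reference: free it. */
    }
}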
Reviewed by: alc, kib, mmel
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27057