freebsd-dev

Author	SHA1	Message	Date
Eric van Gyzen	3cd1f28e4a	clamp kernel dump compression level when using gzip If the configured compression level for kernel dumps it outside the supported range, clamp it to the closest supported level. Previously, dumpon would fail. zstd already does this internally, so the compressor needs no change. Reviewed by: cem markj MFC after: 2 weeks Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D23765	2020-02-20 23:53:48 +00:00
Konstantin Belousov	74cb9a5333	Fix a bug in r358168, do not call sigfastblock_setpend() under a mutex. PR: 244250 Reported and tested by: lwhsu Sponsored by: The FreeBSD Foundation	2020-02-20 21:25:12 +00:00
Mateusz Guzik	65cdfb4caa	make sysent for r358172 ("vfs: add realpathat syscall")	2020-02-20 16:58:57 +00:00
Mateusz Guzik	0573d0a9b8	vfs: add realpathat syscall realpath(3) is used a lot e.g., by clang and is a major source of getcwd and fstatat calls. This can be done more efficiently in the kernel. This works by performing a regular lookup while saving the name and found parent directory. If the terminal vnode is a directory we can resolve it using usual means. Otherwise we can use the name saved by lookup and resolve the parent. See the review for sample syscall counts. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D23574	2020-02-20 16:58:19 +00:00
Konstantin Belousov	a113b17f10	Do not read sigfastblock word on syscall entry. On machines with SMAP, fueword executes two serializing instructions which can be seen in microbenchmarks. As a measure to restore microbenchmark numbers, only read the word on the attempt to deliver signal in ast(). If the word is set, signal is not delivered and word is kept, preventing interruption of interruptible sleeps by signals until userspace calls sigfastblock(UNBLOCK) which clears the word. This way, the spurious EINTR that userspace can see while in critical section is on first interruptible sleep, if a signal is pending, and on signal posting. It is believed that it is not important for rtld and lbithr critical sections. It might be visible for the application code e.g. for the callback of dl_iterate_phdr(3), but again the belief is that the non-compliance is acceptable. Most important is that the retry of the sleeping syscall does not interrupt unless additional signal is posted. For now I added the knob kern.sigfastblock_fetch_always to enable the word read on syscall entry to be able to diagnose possible issues due to spurious EINTR. While there, do some code restructuting to have all sigfastblock() handling located in kern_sig.c. Reviewed by: jeff Discussed with: mjg Tested by: pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D23622	2020-02-20 15:34:02 +00:00
Jeff Roberson	6c5f36ff30	Eliminate some unnecessary uses of UMA_ZONE_VM. Only zones involved in virtual address or physical page allocation need to be marked with this flag. Reviewed by: markj Tested by: pho Differential Revision: https://reviews.freebsd.org/D23712	2020-02-19 08:17:27 +00:00
Mateusz Guzik	d8a84f08e8	refcount: update comments about fencing when releasing counts after r357989 Requested by: kib Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D23719	2020-02-16 18:20:09 +00:00
Mateusz Guzik	3403d5245e	vfs: fix vlrureclaim ->v_object access The routine was checking for ->v_type == VBAD. Since vgone drops the interlock early sets this type at the end of the process of dooming a vnode, this opens a time window where it can clear the pointer while the inerlock-holders is accessing it. Another note is that the code was: (vp->v_object != NULL && vp->v_object->resident_page_count > trigger) With the compiler being fully allowed to emit another read to get the pointer, and in fact it did on the kernel used by pho. Use atomic_load_ptr and remember the result. Note that this depends on type-safety of vm_object. Reported by: pho	2020-02-16 03:33:34 +00:00
Mateusz Guzik	c615009461	vfs: check early for VCHR in vput_final to short-circuit in the common case Otherwise the compiler inlines v_decr_devcount which keps getting jumped over in the common case of not dealing with a device.	2020-02-16 03:16:28 +00:00
Matt Macy	45035becfe	Add zfree to zero allocation before free Key and cookie management typically wants to avoid information leaks by explicitly zeroing before free. This routine simplifies that by permitting consumers to do so without carrying the size around. Reviewed by: jeff@, jhb@ MFC after: 1 week Sponsored by: Rubicon Communications, LLC (Netgate) Differential Revision: https://reviews.freebsd.org/D22790	2020-02-16 00:12:53 +00:00
Konstantin Belousov	a7b61c0af1	sem_remove(): fix the loop that compacts sem array on semaphores removal. As written now, it copies random kernel memory from beyond the bounds of the array. Reported and tested by: pho Reviewed by: markj Sponsored by: The FreeBSD Foundation (kib) MFC after: 1 week Differential revision: https://reviews.freebsd.org/D23694	2020-02-15 23:19:23 +00:00
Konstantin Belousov	4cb6ea7e8e	sem_remove(): add some asserts. Assert that sema[idx] allocation from sem[] is sane. Also assert that sem_mtx is owned, it protects the SEM_ALLOC flag. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation (kib) MFC after: 1 week Differential revision: https://reviews.freebsd.org/D23694	2020-02-15 23:18:02 +00:00
Konstantin Belousov	8095050846	Use designated initializers for seminfo. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation (kib) MFC after: 1 week Differential revision: https://reviews.freebsd.org/D23694	2020-02-15 23:15:42 +00:00
Pawel Biernacki	e0d69c5a88	Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (1 of many) r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are still not MPSAFE (or already are but aren’t properly marked). Use it in preparation for a general review of all nodes. This is non-functional change that adds annotations to SYSCTL_NODE and SYSCTL_PROC nodes using one of the soon-to-be-required flags. Reviewed by: kib, trasz Approved by: kib (mentor) Differential Revision: https://reviews.freebsd.org/D23640	2020-02-15 18:48:38 +00:00
Mateusz Guzik	074ad60a4c	vfs: make write suspension mandatory At the time opt-in was introduced adding yourself as a writer was esrializing across the mount point. Nowadays it is fully per-cpu, the only impact being a small single-threaded hit on top of what's there right now. Vast majority of the overhead stems from the call to VOP_GETWRITEMOUNT which has is done regardless. Should someone want to microoptimize this single-threaded they can coalesce looking the mount up with adding a write to it.	2020-02-15 13:00:39 +00:00
Mateusz Guzik	eb40664d83	capsicum: use new helpers	2020-02-15 01:30:27 +00:00
Mateusz Guzik	445faddf7f	kqueue: use new capsicum helpers	2020-02-15 01:30:13 +00:00
Mateusz Guzik	32a86c44ee	fd: use new capsicum helpers	2020-02-15 01:28:55 +00:00
Mateusz Guzik	e126c5a3e8	vfs: use new capsicum helpers	2020-02-15 01:28:42 +00:00
Konstantin Belousov	6cf2362e2c	Consolidate read code for timecounters and fix possible overflow in bintime()/binuptime(). The algorithm to read the consistent snapshot of current timehand is repeated in each accessor, including the details proper rollup detection and synchronization with the writer. In fact there are only two different kind of readers: one for bintime()/binuptime() which has to do the in-place calculation, and another kind which fetches some member from struct timehand. Extract the logic into type-checked macros, GETTHBINTIME() for bintime calculation, and GETTHMEMBER() for safe read of a structure' member. This way, the synchronization is only written in bintime_off() and getthmember(). In bintime_off(), use overflow-safe calculation of th_scale * delta(timecounter). In tc_windup, pre-calculate the min delta value which overflows and require slow algorithm, into the new timehands th_large_delta member. This part with overflow fix was written by Bruce Evans. Reported by: Mark Millard <marklmi@yahoo.com> (the overflow issue) Tested by: pho Discussed with: emaste Sponsored by: The FreeBSD Foundation (kib) MFC after: 3 weeks	2020-02-14 23:27:45 +00:00
Mateusz Guzik	df0d5a2a85	vfs: remove no longer needed atomic_load_ptr casts	2020-02-14 23:18:32 +00:00
Mateusz Guzik	8f86349f8b	fd: remove no longer needed atomic_load_ptr casts	2020-02-14 23:18:22 +00:00
Mateusz Guzik	5bc6a91f54	kcov: remove no longer needed atomic_load_ptr casts	2020-02-14 23:18:03 +00:00
Mateusz Guzik	2f7292437d	Merge audit and systrace checks This further shortens the syscall routine by not having to re-check after the system call.	2020-02-14 13:09:41 +00:00
Mateusz Guzik	0e84a878c0	Annotate branches in the syscall path This in particular significantly shortens amd64_syscall, which otherwise keeps jumping forward over 2KB of code in total. Note some of these branches should be either eliminated altogether or coalesced.	2020-02-14 13:08:46 +00:00
Mateusz Guzik	ba8dd40bb1	lockmgr: add a change missed in r357907	2020-02-14 11:56:50 +00:00
Mateusz Guzik	6ed30ea4c0	fd: annotate finstall with prediction branches	2020-02-14 11:22:12 +00:00
Mateusz Guzik	c1b57fa7d3	lockmgr: rename lock_fast_path to lock_flags The routine is not much of a fast path and the flags name better describes its purpose.	2020-02-14 11:21:28 +00:00
Mateusz Guzik	943c4932f3	lockmgr: retire the unused lockmgr_unlock_fast_path routine	2020-02-14 11:20:25 +00:00
Kyle Evans	0f5f49eff7	u_char -> vm_prot_t in a couple of places, NFC The latter is a typedef of the former; the typedef exists and these bits are representing vmprot values, so use the correct type. Submitted by: sigsys@gmail.com MFC after: 3 days	2020-02-14 02:22:08 +00:00
Mateusz Guzik	6ebab6bad2	vfs: use mac fastpath for lookup, open, read, write, mmap	2020-02-13 22:22:55 +00:00
Mateusz Guzik	7b2ff0dcb2	Partially decompose priv_check by adding priv_check_cred_vfs_generation During buildkernel there are very frequent calls to priv_check and they all are for PRIV_VFS_GENERATION (coming from stat/fstat). This results in branching on several potential privileges checking if perhaps that's the one which has to be evaluated. Instead of the kitchen-sink approach provide a way to have commonly used privs directly evaluated.	2020-02-13 22:22:15 +00:00
Mateusz Guzik	e6081fe899	Inline jailed(). It is constantly called from priv_check.	2020-02-13 22:16:30 +00:00
Mateusz Guzik	8bdcfb10d3	Annotate suser_enabled as __read_mostly It is read a lot in priv code.	2020-02-13 22:16:02 +00:00
Jeff Roberson	1f2a6b8501	Since r357804 pcpu zones are required to use zalloc_pcpu(). Prior to this it was only required if you were zeroing. Switch to these interfaces. Reviewed by: mjg	2020-02-13 21:10:17 +00:00
Jeff Roberson	a4d50e49da	Add more precise SMR entry asserts.	2020-02-13 20:50:21 +00:00
Kyle Evans	b30ab6d8fe	sys/kern sysent: re-add dependency on capabilities.conf r356868 inadvertently removed this, so changes to capabilities.conf were no longer considered for being outdated.	2020-02-12 19:06:34 +00:00
Ed Maste	fe16bad415	regen sysent after r357831, r357838 Capability mode changes allowing fdatasync and getloginclass. Sponsored by: The FreeBSD Foundation	2020-02-12 19:05:10 +00:00
Ed Maste	e953765f15	Allow getloginclass in capability mode As with e.g. getgroups and getlogin it allows querying current process credential state. Reported by: sigsys@gmail.com via kevans Sponsored by: The FreeBSD Foundation	2020-02-12 18:59:00 +00:00
Ed Maste	9cdfb2d69a	Allow fdatasync in capability mode fdatasync is essentially a subset of fsync (and may be exactly fsync, depending on filesystem and development effort) and operates only on a provided fd. MFC after: 1 week Sponsored by: The FreeBSD Foundation	2020-02-12 17:12:26 +00:00
Mateusz Guzik	4602214772	vfs: refactor vputx and add more comment Reviewed by: jeff (previous version) Tested by: pho (previous version) Differential Revision: https://reviews.freebsd.org/D23530	2020-02-12 11:19:07 +00:00
Mateusz Guzik	ed67a63c39	vfs: drop remaining zpcpu casts	2020-02-12 11:18:12 +00:00
Mateusz Guzik	123c519731	vfs: switch to smp_rendezvous_cpus_retry for vfs_op_thread_enter/exit In particular on amd64 this eliminates an atomic op in the common case, trading it for IPIs in the uncommon case of catching CPUs executing the code while the filesystem is getting suspended or unmounted.	2020-02-12 11:17:45 +00:00
Mateusz Guzik	00ac9d2632	rms: use smp_rendezvous_cpus_retry instead of a hand-rolled variant	2020-02-12 11:17:18 +00:00
Mateusz Guzik	e4f584971b	Add smp_rendezvous_cpus_retry This is a wrapper around smp_rendezvous_cpus which enables use of IPI handlers which can fail and require retrying. wait_func argument is added to to provide a routine which can be used to poll CPU of interest for when the IPI can be retried. Handlers which succeed must call smp_rendezvous_cpus_done to denote that fact. Discussed with: jeff Differential Revision: https://reviews.freebsd.org/D23582	2020-02-12 11:16:55 +00:00
Mateusz Guzik	3acb6572fc	Store offset into zpcpu allocations in the per-cpu area. This shorten zpcpu_get and allows more optimizations. Reviewed by: jeff Differential Revision: https://reviews.freebsd.org/D23570	2020-02-12 11:11:22 +00:00
Mateusz Guzik	48baf00f54	epoch: convert zpcpu_get_cpua(.., curcpu) to zpcpu_get	2020-02-12 11:10:10 +00:00
Gleb Smirnoff	4426b2e64b	Add flag to struct task to mark the task as requiring network epoch. When processing a taskqueue and a task has associated epoch, then enter for duration of the task. If consecutive tasks belong to the same epoch, batch them. Now we are talking about the network epoch only. Shrink the ta_priority size to 8-bits. No current consumers use a priority that won't fit into 8 bits. Also complexity of taskqueue_enqueue() is a square of maximum value of priority, so we unlikely ever want to go over UCHAR_MAX here. Reviewed by: hselasky Differential Revision: https://reviews.freebsd.org/D23518	2020-02-11 18:48:07 +00:00
Mateusz Guzik	57349a4f41	vfs: fix vhold race in mnt_vnode_next_lazy_relock vdrop can set the hold count to 0 and wait for the ->mnt_listmtx held by mnt_vnode_next_lazy_relock caller. The routine incorrectly asserted the count has to be > 0. Reported by: pho Tested by: pho	2020-02-11 18:19:56 +00:00
Mateusz Guzik	1b853b62f3	capsicum: restore the cap_rights_contains symbol It is expected to be provided by libc. PR: 244033 Reported by: Jan Kokemueller	2020-02-11 18:13:53 +00:00
Mateusz Guzik	2e57c8fde7	vfs: fix device count leak on vrele racing with vgone The race is: CPU1 CPU2 devfs_reclaim_vchr make v_usecount 0 VI_LOCK sees v_usecount == 0, no updates vp->v_rdev = NULL; ... VI_UNLOCK VI_LOCK v_decr_devcount sees v_rdev == NULL, no updates In this scenario si_devcount decrement is not performed. Note this can only happen if the vnode lock is not held. Reviewed by: kib Tested by: pho Differential Revision: https://reviews.freebsd.org/D23529	2020-02-10 22:28:54 +00:00
Li-Wen Hsu	37d4ece7c5	Restore the behavior of allowing empty string in a string sysctl Added as a special case to avoid unnecessary memory operations. Reviewed by: delphij Sponsored by: The FreeBSD Foundation	2020-02-10 20:53:59 +00:00
Hans Petter Selasky	f912e8f2ff	Fix for unbalanced EPOCH(9) usage in the generic kernel interrupt handler. Interrupt handlers are removed via intr_event_execute_handlers() when IH_DEAD is set. The thread removing the interrupt is woken up, and calls intr_event_update(). When this happens, the ie_hflags are cleared and re-built from all the remaining handlers sharing the event. When the last IH_NET handler is removed, the IH_NET flag will be cleared from ih_hflags (or ie_hflags may still be being rebuilt in a different context), and the ithread_execute_handlers() may return with ie_hflags missing IH_NET. This can lead to a scenario where IH_NET was present before calling ithread_execute_handlers, and is not present at its return, meaning the need for epoch must be cached locally. This can happen when loading and unloading network drivers. Also make sure the ie_hflags is not cleared before being updated. This is a regression issue after r357004. Backtrace: panic() # trying to access epoch tracker on stack of dead thread _epoch_enter_preempt() ifunit_ref() ifioctl() fo_ioctl() kern_ioctl() sys_ioctl() syscallenter() amd64_syscall() Differential Revision: https://reviews.freebsd.org/D23483 Reviewed by: glebius@, gallatin@, mav@, jeff@ and kib@ Sponsored by: Mellanox Technologies	2020-02-10 20:23:08 +00:00
Mateusz Guzik	cd951a0d8e	vfs: fix lock recursion in vrele vrele is supposed to be called with an unlocked vnode, but this was never asserted for if v_usecount was > 0. For such counts the lock is never touched by the routine. As a result the kernel has several consumers which expect vunref semantics and get away with calling vrele since they happen to never do it when this is the last reference (and for some of them this may happen to be a guarantee). Work around the problem by changing vrele semantics to tolerate being called with a lock. This eliminates a possible bug where the lock is already held and vputx takes it anyway. Reviewed by: kib Tested by: pho Differential Revision: https://reviews.freebsd.org/D23528	2020-02-10 13:54:34 +00:00
Konstantin Belousov	48fcb46311	Add sysctl kern.proc.sigfastblk for reporting sigfastblock word address. Tested by: pho Disscussed with: cem, emaste, jilles Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D12773	2020-02-09 12:29:51 +00:00
Konstantin Belousov	944cf37bb5	Add AT_BSDFLAGS auxv entry. The intent is to provide bsd-specific flags relevant to interpreter and C runtime. I did not want to reuse AT_FLAGS which is common ELF auxv entry. Use bsdflags to report kernel support for sigfastblock(2). This allows rtld and libthr to safely infer the syscall presence without SIGSYS. The tunable kern.elf{32,64}.sigfastblock blocks reporting. Tested by: pho Disscussed with: cem, emaste, jilles Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D12773	2020-02-09 12:10:37 +00:00
Konstantin Belousov	f88c67a625	Regen.	2020-02-09 11:53:37 +00:00
Konstantin Belousov	146fc63fce	Add a way to manage thread signal mask using shared word, instead of syscall. A new syscall sigfastblock(2) is added which registers a uint32_t variable as containing the count of blocks for signal delivery. Its content is read by kernel on each syscall entry and on AST processing, non-zero count of blocks is interpreted same as the signal mask blocking all signals. The biggest downside of the feature that I see is that memory corruption that affects the registered fast sigblock location, would cause quite strange application misbehavior. For instance, the process would be immune to ^C (but killable by SIGKILL). With consumers (rtld and libthr added), benchmarks do not show a slow-down of the syscalls in micro-measurements, and macro benchmarks like buildworld do not demonstrate a difference. Part of the reason is that buildworld time is dominated by compiler, and clang already links to libthr. On the other hand, small utilities typically used by shell scripts have the total number of syscalls cut by half. The syscall is not exported from the stable libc version namespace on purpose. It is intended to be used only by our C runtime implementation internals. Tested by: pho Disscussed with: cem, emaste, jilles Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D12773	2020-02-09 11:53:12 +00:00
Mateusz Guzik	2f7f11b7de	vfs: tidy up vget_finish and vn_lock - remove assertion which duplicates vn_lock - use VNPASS instead of retyping the failure - report what flags were passed if panicking on them	2020-02-08 15:52:20 +00:00
Mateusz Guzik	3eb6b656c2	vfs: remove now useless ENODEV handling from vn_fullpath consumers Noted by: ngie	2020-02-08 15:51:08 +00:00
Konstantin Belousov	300b525d29	Correct the function name in the comment. Sponsored by: The FreeBSD Foundation MFC after: 3 days	2020-02-08 15:06:06 +00:00
Mateusz Guzik	ea77ce6ef9	rms: use newly added zpcpu routines instead of direct access where appropriate	2020-02-07 22:44:41 +00:00
Jeff Roberson	a40068e524	Fix a race in smr_advance() that could result in unnecessary poll calls. This was relatively harmless but surprising to see in counters. The race occurred when rd_seq was read after the goal was updated and we incorrectly calculated the delta between them. Reviewed by: rlibby Differential Revision: https://reviews.freebsd.org/D23464	2020-02-06 20:51:46 +00:00
Jeff Roberson	8d7f16a5db	Add some global counters for SMR. These may eventually become per-smr counters. In my stress test there is only one poll for every 15,000 frees. This means we are effectively amortizing the cache coherency overhead even with very high write rates (3M/s/core). Reviewed by: markj, rlibby Differential Revision: https://reviews.freebsd.org/D23463	2020-02-06 20:10:21 +00:00
Pawel Biernacki	210176ad76	sysctl(9): add CTLFLAG_NEEDGIANT flag Add CTLFLAG_NEEDGIANT flag (modelled after D_NEEDGIANT) that will be used to mark sysctls that still require locking Giant. Rewrite sysctl_handle_string() to use internal locking instead of locking Giant. Mark SYSCTL_STRING, SYSCTL_OPAQUE and their variants as MPSAFE. Add infrastructure support for enforcing proper use of CTLFLAG_NEEDGIANT and CTLFLAG_MPSAFE flags with SYSCTL_PROC and SYSCTL_NODE, not enabled yet. Reviewed by: kib (mentor) Approved by: kib (mentor) Differential Revision: https://reviews.freebsd.org/D23378	2020-02-06 12:45:58 +00:00
Mark Johnston	d3631aa582	Avoid releasing object PIP in vn_sendfile() if no pages were grabbed. sendfile(2) optionally takes a set of headers that get prepended to the file data. If the request length is less than that of the headers, sendfile may not allocate an sfio structure, in which case its pointer is null and we should be careful not to dereference. This was introduced in r356902. Reported by: syzkaller Sponsored by: The FreeBSD Foundation	2020-02-05 16:09:21 +00:00
Leandro Lupori	eb5a41cf2f	Add SYSCTL to get KERNBASE and relocated KERNBASE This change adds 2 new SYSCTLs, to retrieve the original and relocated KERNBASE values. This provides an easy, architecture independent way to calculate the running kernel displacement (current/load address minus original base address). The initial goal for this change is to add a new libkvm function that returns the kernel displacement, both for live kernels and crashdumps. This would in turn be used by kgdb to find out how to relocate kernel symbols (if needed). Reviewed by: jhb Differential Revision: https://reviews.freebsd.org/D23284	2020-02-05 11:34:10 +00:00
Mateusz Guzik	1a9fe4528b	fd: always nullify fdp in fget routines Some consumers depend on the pointer being NULL if an error is returned. The guarantee got broken in r357469. Reported by: https://syzkaller.appspot.com/bug?extid=0c9b05e2b727aae21eef Noted by: markj	2020-02-05 00:20:26 +00:00
Ryan Libby	10c8fb47d9	uma: convert mbuf_jumbo_alloc to UMA_ZONE_CONTIG & tag others Remove mbuf_jumbo_alloc and let large mbuf zones use the new uma default contig allocator (a copy of mbuf_jumbo_alloc). Tag other zones which require contiguous objects, even if they don't use the new default contig allocator, so that uma knows about their constraints. Reviewed by: jeff, markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D23238	2020-02-04 22:40:23 +00:00
Konstantin Belousov	0783b70974	Remove unneeded assert for curproc. Simplify. Reported by: syzkaller by markj Sponsored by: The FreeBSD Foundation	2020-02-04 21:02:08 +00:00
Mark Johnston	60185d649b	Correct the malloc tag used when freeing the temporary semop(2) buffer. MFC after: 1 week Sponsored by: The FreeBSD Foundation	2020-02-04 20:00:45 +00:00
Dmitry Chagin	cbc1089190	For code reuse in Linuxulator rename get_proccess_cputime() and get_thread_cputime() and add prototypes for it to <sys/syscallsubr.h>. As both functions become a public interface add process lock assert to ensure that the process is not exiting under it. Fix whitespace nit while here. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D23340 MFC after 2 weeks	2020-02-04 05:25:51 +00:00
Jeff Roberson	bc6509845d	Implement a deferred write advancement feature that can be used to further amortize shared cacheline writes. Discussed with: rlibby Differential Revision: https://reviews.freebsd.org/D23462	2020-02-04 02:44:52 +00:00
Jeff Roberson	c8ea36e881	Fix a recursion on the thread lock by acquiring it after call rtp_to_pri(). Reported by: swills Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D23495	2020-02-04 02:42:54 +00:00
Mark Johnston	e489450589	Fix the !SMP case in sched_add() after r355779. If the thread's lock is already that of the runqueue, don't recurse on the queue lock. Reviewed by: jeff, kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D23492	2020-02-03 22:49:05 +00:00
Mateusz Guzik	8151b6e92a	fd: partially unengrish the previous commit	2020-02-03 22:34:50 +00:00
Mateusz Guzik	e10f063b30	fd: streamline fget_unlocked clang has the unfortunate property of paying little attention to prediction hints when faced with a loop spanning the majority of the rotuine. In particular fget_unlocked has an unlikely corner case where it starts almost from scratch. Faced with this clang generates a maze of taken jumps, whereas gcc produces jump-free code (in the expected case). Work around the problem by providing a variant which only tries once and resorts to calling the original code if anything goes wrong. While here note that the 'seq' parameter is almost never passed, thus the seldom users are redirected to call it directly.	2020-02-03 22:32:49 +00:00
Mateusz Guzik	52604ed792	fd: remove the seq argument from fget_unlocked It is almost always NULL.	2020-02-03 22:27:55 +00:00
Mateusz Guzik	7f1566f884	fd: remove the seq argument from fget routines It is almost always NULL.	2020-02-03 22:27:03 +00:00
Mateusz Guzik	0a1427c5ab	ktrace: provide ktrstat_error This eliminates a branch from its consumers trading it for an extra call if ktrace is enabled for curthread. Given that this is almost never true, the tradeoff is worth it.	2020-02-03 22:26:00 +00:00
Gleb Smirnoff	0017b2adac	Couple protocol drain routines (frag6_drain and sctp_drain) may send packets. An unexpected behaviour for memory reclamation routine. Anyway, we need enter the network epoch for doing that.	2020-02-03 20:48:57 +00:00
Kyle Evans	3d62f685d5	namei: preserve errors from fget_cap_locked Most notably, we want to make sure we don't clobber any capabilities-related errors. This is a regression from r357412 (O_SEARCH) that was picked up by the capsicum tests. PR: 243839 Reviewed by: kib (committed form recommended by) Tested by: lwhsu Differential Revision: https://reviews.freebsd.org/D23479	2020-02-03 18:59:07 +00:00
Warner Losh	58aa35d429	Remove sparc64 kernel support Remove all sparc64 specific files Remove all sparc64 ifdefs Removee indireeect sparc64 ifdefs	2020-02-03 17:35:11 +00:00
Mateusz Guzik	bcd1cf4f03	capsicum: faster cap_rights_contains Instead of doing a 2 iteration loop (determined at runeimt), take advantage of the fact that the size is already known. While here provdie cap_check_inline so that fget_unlocked does not have to do a function call. Verified with the capsicum suite /usr/tests.	2020-02-03 17:08:11 +00:00
Mateusz Guzik	fee204544e	fd: fix f_count acquire in fget_unlocked The code was using a hand-rolled fcmpset loop, while in other places the same count is manipulated with the refcount API. This transferred from a stylistic issue into a bug after the API got extended to support flags. As a result the hand-rolled loop could bump the count high enough to set the bit flag. Another bump + refcount_release would then free the file prematurely. The bug is only present in -CURRENT.	2020-02-03 14:28:31 +00:00
Mateusz Guzik	f1fa1ba3d0	Fix up various vnode-related asserts which did not dump the used vnode	2020-02-03 14:25:32 +00:00
Kyle Evans	6a5abb1ee5	Provide O_SEARCH O_SEARCH is defined by POSIX [0] to open a directory for searching, skipping permissions checks on the directory itself after the initial open(). This is close to the semantics we've historically applied for O_EXEC on a directory, which is UB according to POSIX. Conveniently, O_SEARCH on a file is also explicitly undefined behavior according to POSIX, so O_EXEC would be a fine choice. The spec goes on to state that O_SEARCH and O_EXEC need not be distinct values, but they're not defined to be the same value. This was pointed out as an incompatibility with other systems that had made its way into libarchive, which had assumed that O_EXEC was an alias for O_SEARCH. This defines compatibility O_SEARCH/FSEARCH (equivalent to O_EXEC and FEXEC respectively) and expands our UB for O_EXEC on a directory. O_EXEC on a directory is checked in vn_open_vnode already, so for completeness we add a NOEXECCHECK when O_SEARCH has been specified on the top-level fd and do not re-check that when descending in namei. [0] https://pubs.opengroup.org/onlinepubs/9699919799/ Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D23247	2020-02-02 16:34:57 +00:00
Mateusz Guzik	2568d5bb79	fd: sprinkle some predits around fget clang inlines fget -> _fget into kern_fstat and eliminates several checkes, but prior to this change it would assume fget_unlocked was likely to fail and consequently avoidable jumps got generated.	2020-02-02 09:38:40 +00:00
Mateusz Guzik	da4f45ea5c	fd: use atomic_load_ptr instead of hand-rolled cast through volatile No change in assembly.	2020-02-02 09:37:16 +00:00
Mateusz Guzik	6698e11f4b	vfs: remove the now empty vop_unlock_post	2020-02-02 09:36:32 +00:00
Mateusz Guzik	7739d92766	cache: replace kern___getcwd with vn_getcwd The previous routine was resulting in extra data copies most notably in linux_getcwd.	2020-02-01 20:38:38 +00:00
Mateusz Guzik	921e7210f8	cache: return the total length from vn_fullpath1 This removes strlen from getcwd.	2020-02-01 20:37:11 +00:00
Mateusz Guzik	4511dd9d41	cache: remove vnode -> path lookup disablement It seems to be of little to no use even when debugging. Interested parties can resurrect it and gate compilation with a macro.	2020-02-01 20:36:35 +00:00
Mateusz Guzik	45757984f8	vfs: consistently use size_t for buflen around VOP_VPTOCNP	2020-02-01 20:34:43 +00:00
Mateusz Guzik	643656cfaf	vfs: replace VOP_MARKATIME with VOP_MMAPPED The routine is only provided by ufs and is only used on mmap and exec. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D23422	2020-02-01 06:46:55 +00:00
Mateusz Guzik	90f4ec3328	vfs: save on atomics on the root vnode for absolute lookups There are 2 back-to-back atomics on the vnode, but we can check upfront if one is sufficient. Similarly we can handle relative lookups where current working directory == root directory. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D23427	2020-02-01 06:40:35 +00:00
Mateusz Guzik	21c4f1041e	vfs: add vrefactn Differential Revision: https://reviews.freebsd.org/D23427	2020-02-01 06:39:49 +00:00
Jeff Roberson	915c367e8e	Add two missing fences with comments describing them. These were found by inspection and after a lengthy discussion with jhb and kib. They have not produced test failures. Don't pointer chase through cpu0's smr. Use cpu correct smr even when not in a critical section to reduce the likelihood of false sharing.	2020-01-31 22:21:15 +00:00
Mark Johnston	1c29da0279	Reimplement stack capture of running threads on i386 and amd64. After r355784 the td_oncpu field is no longer synchronized by the thread lock, so the stack capture interrupt cannot be delievered precisely. Fix this using a loop which drops the thread lock and restarts if the wrong thread was sampled from the stack capture interrupt handler. Change the implementation to use a regular interrupt instead of an NMI. Now that we drop the thread lock, there is no advantage to the latter. Simplify the KPIs. Remove stack_save_td_running() and add a return value to stack_save_td(). On platforms that do not support stack capture of running threads, stack_save_td() returns EOPNOTSUPP. If the target thread is running in user mode, stack_save_td() returns EBUSY. Reviewed by: kib Reported by: mjg, pho Tested by: pho Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D23355	2020-01-31 15:43:33 +00:00
Mateusz Guzik	0f4d8b77c0	vfs: revert the overzealous assert added in r357285 to vgone The intent was to make it more likely to catch filesystems with custom need_inactive routines which fail to call vn_need_pageq_flush (or do an equivalent). One immediate case which is missed is vgone from called by inactive itself. A better assertion may land later. The routine is not added to vputx because it is of no use to tmpfs et al. Reported by: syzbot+5f697ec11f89b60941db@syzkaller.appspotmail.com	2020-01-31 11:31:14 +00:00
Mateusz Guzik	1a78ac2416	Add rms_try_rlock and rms_wowned.	2020-01-31 08:36:49 +00:00
Mateusz Guzik	cedad2916e	Remove an overzealous assert from rms_runlock.	2020-01-31 08:36:23 +00:00
Jeff Roberson	da6e9935e4	Don't use "All rights reserved" in new copyrights. Requested by: rgrimes	2020-01-31 02:08:09 +00:00
Jeff Roberson	d4665eaa66	Implement a safe memory reclamation feature that is tightly coupled with UMA. This is in the same family of algorithms as Epoch/QSBR/RCU/PARSEC but is a unique algorithm. This has 3x the performance of epoch in a write heavy workload with less than half of the read side cost. The memory overhead is significantly lessened by limiting the free-to-use latency. A synthetic test uses 1/20th of the memory vs Epoch. There is significant further discussion in the comments and code review. This code should be considered experimental. I will write a man page after it has settled. After further validation the VM will begin using this feature to permit lockless page lookups. Both markj and cperciva tested on arm64 at large core counts to verify fences on weaker ordering architectures. I will commit a stress testing tool in a follow-up. Reviewed by: mmacy, markj, rlibby, hselasky Discussed with: sbahara Differential Revision: https://reviews.freebsd.org/D22586	2020-01-31 00:49:51 +00:00
Mateusz Guzik	3ff65f71cb	Remove duplicated empty lines from kern/*.c No functional changes.	2020-01-30 20:05:05 +00:00
Mateusz Guzik	2823710f05	Tidy up 2 comments in smp_rendezvous_cpus.	2020-01-30 20:02:14 +00:00
Mateusz Guzik	7ab99925fd	Assert that smp_rendezvous_cpus is called with interrupts enabled.	2020-01-30 19:38:51 +00:00
Mateusz Guzik	d53d924f60	vfs: keep the mount point referenced across sys_quotactl Otherwise we risk running into use-after-free. In particular this codepath ends up dropping all protection before suspending writes: ufs_quotactl -> quotaoff_inchange -> vfs_write_suspend_umnt Reported by: pho	2020-01-30 19:38:12 +00:00
John Baldwin	fbb9879c0c	Fix use of an uninitialized variable. ctx (and thus ctx.flags) is stack garbage at the start of this function, so initialize ctx.flags to an explicit value instead of using binary operations on the garbage. Reported by: gcc9 Reviewed by: imp Differential Revision: https://reviews.freebsd.org/D23368	2020-01-30 18:28:02 +00:00
Mateusz Guzik	c2ef6aa3d5	vfs: assert that doomed vnodes don't need to call vm_object_page_clean ... after the optional inactive processing.	2020-01-30 04:59:08 +00:00
Mateusz Guzik	07c6e2f4ab	vfs: unlazy before dooming the vnode With this change having the listmtx lock held postpones dooming the vnode. Use this fact to simplify iteration over the lazy list. It also allows filters to safely access ->v_data. Reviewed by: kib (early version) Differential Revision: https://reviews.freebsd.org/D23397	2020-01-30 02:12:52 +00:00
Gleb Smirnoff	79674264df	Fix text format definition for kern.maxvnodes, vfs.wantfreevnodes. This is a regression from r356642, r356645.	2020-01-30 00:18:00 +00:00
Conrad Meyer	07a65f9d38	hwpstate_intel(4): Silence/fix Coverity reports These were all introduced in the initial import of hwpstate_intel(4). Reported by: Coverity CIDs: 1413161, 1413164, 1413165, 1413167 X-MFC-With: r357002	2020-01-29 03:15:34 +00:00
Warner Losh	42ec4f05a3	Make mqueue objects work across a fork again. In r110908 (2003) alfred added DFLAG_PASSABLE to tag those types of FD that can be passed via unix pipes, but mqueuefs didn't exist yet. Later, in r152825 (2005) davidxu neglected to include DFLAG_PASSABLE since people don't normally pass these things via unix sockets (it's a FreeBSD implementation detail that it's a file descriptor, nobody noticed). Then r223866 (2011) by jonathan used the new flag in fdcopy, which fork uses. Due to that, mqueuefs actually broke mqueue objects being propagated by fork. No mention of mqueuefs was made in r223866, so I think it was an unintended consequence. Fix this by tagging mqueuefs as passable as well. They were prior to alfred's change (and it's clear there's no intent in his change to change this behavior), and POSIX requires this to be the case as well. PR: 243103 Reviewed by: kib@, jiles@ Differential Revision: https://reviews.freebsd.org/D23038	2020-01-27 22:36:54 +00:00
John Baldwin	425e5f9dcf	Revert accidental change from r357146.	2020-01-26 14:23:27 +00:00
John Baldwin	c73222d0e6	Fix some misleading indentation warnings reported by recent clang. These should not be any functional change. While the change in emul10kx-pcm.c looks like a real bug fix (as opposed to inconsistent whitespace), the extra statements were not harmful. Reviewed by: kib Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D23363	2020-01-26 14:20:57 +00:00
Mateusz Guzik	1513f80391	vfs: do an unlocked check before iterating the lazy list For most filesystems it is expected to be empty most of the time.	2020-01-26 07:06:18 +00:00
Mateusz Guzik	cd0e46c66b	vfs: remove vop loop from vop_sigdefer All ops are guaranteed to be present since r357131.	2020-01-26 07:05:06 +00:00
Mateusz Guzik	6d69e665dd	vfs: fix freevnodes count update race against preemption vdbatch_process leaves the critical section too early, openign a time window where another thread can get scheduled and modify vd->freevnodes. Once it the preempted thread gets back it overrides the value with 0. Just move critical_exit to the end of the function.	2020-01-26 00:40:27 +00:00
Mateusz Guzik	dc9a1cb60b	vfs: predict vn_lock failure as unlikely in vget	2020-01-26 00:34:57 +00:00
Jason A. Harmening	a9aa06f7b1	Implement cycle-detecting garbage collector for AF_UNIX sockets The existing AF_UNIX socket garbage collector destroys any socket which may potentially be in a cycle, as indicated by its file reference count being equal to its enqueue count. However, this can produce false positives for in-flight sockets which aren't part of a cycle but are part of one or more SCM_RIGHTS mssages and which have been closed on the sending side. If the garbage collector happens to run at exactly the wrong time, destruction of these sockets will render them unusable on the receiving side, such that no previously-written data may be read. This change rewrites the garbage collector to precisely detect cycles: 1. The existing check of msgcount==f_count is still used to determine whether the socket is potentially in a cycle. 2. The socket is now placed on a local "dead list", which is used to reduce iteration time (and therefore contention on the global unp_link_rwlock). 3. The first pass through the dead list removes each potentially-dead socket's outgoing references from the graph of potentially-dead sockets, using a gc-specific copy of the original reference count. 4. The second series of passes through the dead list removes from the list any socket whose remaining gc refcount is non-zero, as this indicates the socket is actually accessible outside of any possible cycle. Iteration is repeated until no further sockets are removed from the dead list. 5. Sockets remaining in the dead list are destroyed as before. PR: 227285 Submitted by: jan.kokemueller@gmail.com (prior version) Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D23142	2020-01-25 08:57:26 +00:00
Mark Johnston	a89c2c8c34	Revert r357050. It seems to have introduced a couple of regressions. Reported by: cy, pho	2020-01-24 14:58:02 +00:00
Edward Tomasz Napierala	b3fb13eb55	Add kern_unmount() and use in Linuxulator. No functional changes. Reviewed by: kib MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D22646	2020-01-24 11:57:55 +00:00
Mateusz Guzik	28eb39a5ab	vfs: allow v_usecount to transition 0->1 without the interlock There is nothing to do but to bump the count even during said transition. There are 2 places which can do it: - vget only does this after locking the vnode, meaning there is no change in contract versus inactive or reclamantion - vref only ever did it with the interlock held which did not protect against either (that is, it would always succeed) VCHR vnodes retain special casing due to the need to maintain dev use count. Reviewed by: jeff, kib Tested by: pho (previous version) Differential Revision: https://reviews.freebsd.org/D23185	2020-01-24 07:47:44 +00:00
Mateusz Guzik	d93762b94d	vfs: stop handling VI_OWEINACT in vget vget is almost always called with LK_SHARED, meaning the flag (if present) is almost guaranteed to get cleared. Stop handling it in the first place and instead let the thread which wanted to do inactive handle the bumepd usecount. Reviewed by: jeff Tested by: pho Differential Revision: https://reviews.freebsd.org/D23184	2020-01-24 07:45:59 +00:00
Mateusz Guzik	74c4b7cc60	vfs: stop unlocking the vnode upfront in vput Doing so runs into races with filesystems which make half-constructed vnodes visible to other users, while depending on the chain vput -> vinactive -> vrecycle to be executed without dropping the vnode lock. Impediments for making this work got cleared up (notably vop_unlock_post now does not do anything and lockmgr stops touching the lock after the final write). Stacked filesystems keep vhold/vdrop across unlock, which arguably can now be eliminated. Reviewed by: jeff Differential Revision: https://reviews.freebsd.org/D23344	2020-01-24 07:44:25 +00:00
Mateusz Guzik	c00115f108	lockmgr: don't touch the lock past unlock This evens it up with other locking primitives. Note lock profiling still touches the lock, which again is in line with the rest. Reviewed by: jeff Differential Revision: https://reviews.freebsd.org/D23343	2020-01-24 07:42:57 +00:00
Mark Johnston	1bfca40c57	Set td_oncpu before dropping the thread lock during a switch. After r355784 we no longer hold a thread's thread lock when switching it out. Preserve the previous synchronization protocol for td_oncpu by setting it together with td_state, before dropping the thread lock during a switch. Reported and tested by: pho Reviewed by: kib Discussed with: jeff Differential Revision: https://reviews.freebsd.org/D23270	2020-01-23 16:24:51 +00:00
Jeff Roberson	91e31c3c08	Consistently use busy and vm_page_valid() rather than touching page bits directly. This improves API compliance, asserts, etc. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D23283	2020-01-23 04:54:49 +00:00
Jeff Roberson	1eb13fce84	Block the thread lock in sched_throw() and use cpu_switch() to unblock it. The introduction of lockless switch in r355784 created a race to re-use the exiting thread that was only possible to hit on a hypervisor. Reported/Tested by: rlibby Discussed with: rlibby, jhb	2020-01-23 03:36:50 +00:00
Gleb Smirnoff	ad3980121b	DEVICE_POLLING is an alternative to network interrupts and also needs to enter epoch. Assert that in the netisr_poll() and do the work for the idle poll routine.	2020-01-23 01:30:50 +00:00
Gleb Smirnoff	511d1afb6b	Enter the network epoch for interrupt handlers of INTR_TYPE_NET. Provide tunable to limit how many times handlers may be executed without reentering epoch. Differential Revision: https://reviews.freebsd.org/D23242	2020-01-23 01:24:47 +00:00
Gleb Smirnoff	c4eb66309f	Add ie_hflags to struct intr_event, which accumulates flags from all handlers on this event. For now handle only IH_ENTROPY in that manner.	2020-01-23 01:20:59 +00:00
Conrad Meyer	4577cf3744	cpufreq(4): Add support for Intel Speed Shift Intel Speed Shift is Intel's technology to control frequency in hardware, with hints from software. Let's get a working version of this in the tree and we can refine it from here. Submitted by: bwidawsk, scottph Reviewed by: bcr (manpages), myself Discussed with: jhb, kib (earlier versions) With feedback from: Greg V, gallatin, freebsdnewbie AT freenet.de Relnotes: yes Differential Revision: https://reviews.freebsd.org/D18028	2020-01-22 23:28:42 +00:00
Hans Petter Selasky	1f69a50940	Make sure the VNET is properly set when calling tcp_drop() from the ktls taskqueue callback function. A valid VNET is needed when updating statistics. panic() tcp_state_change() tcp_drop() ktls_reset_send_tag() taskqueue_run_locked() taskqueue_thread_loop() Sponsored by: Mellanox Technologies	2020-01-21 11:43:25 +00:00
Mateusz Guzik	6403455301	cache: revert r352613 now that vhold does not take locks	2020-01-20 19:52:23 +00:00
Mateusz Guzik	8bba93c7e0	cache: make numcachehv use counter(9) on all archs Requested by: kib	2020-01-20 14:42:11 +00:00
Jeff Roberson	d6e13f3b4d	Don't hold the object lock while calling getpages. The vnode pager does not want the object lock held. Moving this out allows further object lock scope reduction in callers. While here add some missing paging in progress calls and an assert. The object handle is now protected explicitly with pip. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D23033	2020-01-19 23:47:32 +00:00
Mateusz Guzik	a9099e5b10	vfs: switch vop_stdunlock to call lockmgr_unlock Since the flags argument is now alawys 0 the new call provides the same behavior.	2020-01-19 21:41:34 +00:00
Jeff Roberson	811d05fcb7	Provide an API for interlocked refcount sleeps. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D22908	2020-01-19 18:18:17 +00:00
Mateusz Guzik	28479aaae2	vfs: allow v_holdcnt to transition 0->1 without the interlock Since r356672 ("vfs: rework vnode list management") there is nothing to do apart from altering freevnodes count, but this much can be safely done based on the result of atomic_fetchadd. Reviewed by: kib Tested by: pho Differential Revision: https://reviews.freebsd.org/D23186	2020-01-19 17:47:04 +00:00
Mateusz Guzik	059cb4843b	cache: counter_u64_add_protected -> counter_u64_add Fixes booting on RISC-V where it does happen to not be equivalent. Reported by: lwhsu	2020-01-19 17:05:26 +00:00
Mateusz Guzik	1399033590	cache: convert numcachehv to counter(9) on 64-bit platforms	2020-01-19 05:37:27 +00:00
Mateusz Guzik	512fa9a4e0	vfs: plug a conditional assigment of lo_name in getnewvnode It only matters for witness. No functional changes.	2020-01-19 05:36:45 +00:00
Kyle Evans	05d7dd739c	sysent targets: further cleanup and deduplication r355473 vastly improved the readability and cleanliness of these Makefiles. Every single one of them follows the same pattern and duplicates the exact same logic. Now that we have GENERATED/SRCS, split SRCS up into the two parameters we'll use for ${MAKESYSCALLS} rather than assuming a specific ordering of SRCS and include a common sysent.mk to handle the rest. This makes it less tedious to make sweeping changes. Some default values are provided for GENERATED/SYSENT_*; almost all of these just use a 'syscalls.master' and 'syscalls.conf' in cwd, and they all use effectively the same filenames with an arbitrary prefix. Most ABIs will be able to get away with just setting GENERATED_PREFIX and including ^/sys/conf/sysent.mk, while others only need light additions. kern/Makefile is the notable exception, as it doesn't take a SYSENT_CONF and the generated files are spread out between ^/sys/kern and ^/sys/sys, but it otherwise fits the pattern enough to use the common version. Reviewed by: brooks, imp Nice!: emaste Differential Revision: https://reviews.freebsd.org/D23197	2020-01-18 20:37:45 +00:00
Mateusz Guzik	2d0c620272	vfs: distribute freevnodes counter per-cpu It gets rolled up to the global when deferred requeueing is performed. A dedicated read routine makes sure to return a value only off by a certain amount. This soothes a global serialisation point for all 0<->1 hold count transitions. Reviewed by: jeff Differential Revision: https://reviews.freebsd.org/D23235	2020-01-18 01:29:02 +00:00
Mateusz Guzik	d3cc535474	vfs: provide F_ISUNIONSTACK as a kludge for libc Prior to introduction of this op libc's readdir would call fstatfs(2), in effect unnecessarily copying kilobytes of data just to check fs name and a mount flag. Reviewed by: kib (previous version) Differential Revision: https://reviews.freebsd.org/D23162	2020-01-17 14:42:25 +00:00
Mateusz Guzik	1ad72b270c	vfs: shorten lock hold time in vdbatch_process	2020-01-17 14:39:00 +00:00
Gleb Smirnoff	66c6c556b6	Change argument order of epoch_call() to more natural, first function, then its argument. Reviewed by: imp, cem, jhb	2020-01-17 06:10:24 +00:00
Mateusz Guzik	66f67d5e5e	vfs: increment numvnodes without the vnode list lock unless under pressure The vnode list lock is only needed to reclaim free vnodes or kick the vnlru thread (or to block and not miss a wake up (but note the sleep has a timeout so this would not be a correctness issue)). Try to get away without the lock by just doing an atomic increment. The lock is contended e.g., during poudriere -j 104 where about half of all acquires come from vnode allocation code. Note the entire scheme needs a rewrite, the above just reduces it's SMP impact. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D23140	2020-01-16 21:45:21 +00:00
Mateusz Guzik	b7f50b9ad1	vfs: refcator vnode allocation Semantics are almost identical. Some code is deduplicated and there are fewer memory accesses. Reviewed by: kib, jeff Differential Revision: https://reviews.freebsd.org/D23158	2020-01-16 21:43:13 +00:00
Mateusz Guzik	875cfc082d	vfs: reimplement vlrureclaim to actually use LRU Take advantage of global ordering introduced in r356672. Reviewed by: mckusick (previous version) Differential Revision: https://reviews.freebsd.org/D23067	2020-01-16 10:44:02 +00:00
Jeff Roberson	a81c400e75	Simplify VM and UMA startup by eliminating boot pages. Instead use careful ordering to allocate early pages in the same way boot pages were but only as needed. After the KVA allocator has started up we allocate the KVA that we consumed during boot. This also makes the boot pages freeable since they have vm_page structures allocated with the rest of memory. Parts of this patch were written and tested by markj. Reviewed by: glebius, markj Differential Revision: https://reviews.freebsd.org/D23102	2020-01-16 05:01:21 +00:00
Kirk McKusick	bbb1e07d65	Peter Holm reports that his test that does an umount(8) on an active mount point while numerous tests are running that are writing to files on that mount point cause the unmount(8) to hang forever. The unmount(8) system call is handled in the kernel by the dounmount() function. The cause of the hang is that prior to dounmount() calling VFS_UNMOUNT() it is calling VFS_SYNC(mp, MNT_WAIT). The MNT_WAIT flag indicates that VFS_SYNC() should not return until all the dirty buffers associated with the mount point have been written to disk. Because user processes are allowed to continue writing and can do so faster than the data can be written to disk, the call to VFS_SYNC() can never finish. Unlike VFS_SYNC(), the VFS_UNMOUNT() routine can suspend all processes when they request to do a write thus having a finite number of dirty buffers to write that cannot be expanded. There is no need to call VFS_SYNC() before calling VFS_UNMOUNT(), because VFS_UNMOUNT() needs to flush everything again anyway after suspending writes, to catch anything that was dirtied between the VFS_SYNC() and writes being suspended. The fix is to simply remove the unnecessary call to VFS_SYNC() from dounmount(). Reported by: Peter Holm Analysis by: Chuck Silvers Tested by: Peter Holm MFC after: 7 days Sponsored by: Netflix	2020-01-15 18:53:32 +00:00
Gleb Smirnoff	9074694339	Since this code uses if_ref()/if_rele() it must include if_var.h explicitly, not via header pollution.	2020-01-15 03:39:11 +00:00
Gleb Smirnoff	3264dcadc9	- Move global network epoch definition to epoch.h, as more different subsystems tend to need to know about it, and including if_var.h is huge header pollution for them. Polluting possible non-network users with single symbol seems much lesser evil. - Remove non-preemptible network epoch. Not used yet, and unlikely to get used in close future.	2020-01-15 03:34:21 +00:00
Mateusz Guzik	cda3176851	vfs: in vop_stdadd_writecount only vlazy vnodes on mounts using msync The only reason to vlazy there is to (overzealously) ensure all vnodes which need to be visited by msync scan can be found there. In particluar this is of no use zfs and tmpfs. While here depessimize the check.	2020-01-15 01:34:05 +00:00
Ryan Libby	51871224c0	malloc: remove assumptions about MINALLOCSIZE Remove assumptions about the minimum MINALLOCSIZE, in order to allow testing of smaller MINALLOCSIZE. A following patch will lower the MINALLOCSIZE, but not so much that the present patch is required for correctness at these sites. Reviewed by: jeff, markj Sponsored by: Dell EMC Isilon	2020-01-14 02:14:02 +00:00
Konstantin Belousov	fedab1b499	Code must not unlock a mutex while owning the thread lock. Reviewed by: hselasky, markj Sponsored by: Mellanox Technologies MFC after: 1 week Differential revision: https://reviews.freebsd.org/D23150	2020-01-13 14:30:19 +00:00
Mateusz Guzik	0c236d3d52	vfs: per-cpu batched requeuing of free vnodes Constant requeuing adds significant lock contention in certain workloads. Lessen the problem by batching it. Per-cpu areas are locked in order to synchronize against UMA freeing memory. vnode's v_mflag is converted to short to prevent the struct from growing. Sample result from an incremental make -s -j 104 bzImage on tmpfs: stock: 122.38s user 1780.45s system 6242% cpu 30.480 total patched: 144.84s user 985.90s system 4856% cpu 23.282 total Reviewed by: jeff Tested by: pho (in a larger patch, previous version) Differential Revision: https://reviews.freebsd.org/D22998	2020-01-13 02:39:41 +00:00
Mateusz Guzik	cc3593fbd9	vfs: rework vnode list management The current notion of an active vnode is eliminated. Vnodes transition between 0<->1 hold counts all the time and the associated traversal between different lists induces significant scalability problems in certain workloads. Introduce a global list containing all allocated vnodes. They get unlinked only when UMA reclaims memory and are only requeued when hold count reaches 0. Sample result from an incremental make -s -j 104 bzImage on tmpfs: stock: 118.55s user 3649.73s system 7479% cpu 50.382 total patched: 122.38s user 1780.45s system 6242% cpu 30.480 total Reviewed by: jeff Tested by: pho (in a larger patch, previous version) Differential Revision: https://reviews.freebsd.org/D22997	2020-01-13 02:37:25 +00:00
Mateusz Guzik	57083d2576	vfs: add per-mount vnode lazy list and use it for deferred inactive + msync This obviates the need to scan the entire active list looking for vnodes of interest. msync is handled by adding all vnodes with write count to the lazy list. deferred inactive directly adds vnodes as it sets the VI_DEFINACT flag. Vnodes get dequeued from the list when their hold count reaches 0. Newly added MNT_VNODE_FOREACH_LAZY* macros support filtering so that spurious locking is avoided in the common case. Reviewed by: jeff Tested by: pho (in a larger patch, previous version) Differential Revision: https://reviews.freebsd.org/D22995	2020-01-13 02:34:02 +00:00
Conrad Meyer	365cd52245	Fix a typo in r356667 comment No functional change. Reported by: bdragon Approved by: csprng(markm), earlier version X-MFC-With: r356667	2020-01-12 23:52:16 +00:00
Conrad Meyer	86def3dcd6	getrandom(2): Add Linux GRND_INSECURE API flag Treat it as a synonym for GRND_NONBLOCK. The reasoning is this: We have two choices for handling Linux's GRND_INSECURE API flag. 1. We could ignore it completely (like GRND_RANDOM). However, this might produce the surprising result of GRND_INSECURE requests blocking, when the Linux API does not block. 2. Alternatively, we could treat GRND_INSECURE requests as requests for GRND_NONBLOCk. Here, the surprising result for Linux programs is that invocations with unseeded random(4) will produce EAGAIN, rather than garbage. Honoring the flag in the way Linux does seems fraught. If we actually use the output of a random(4) implementation prior to seeding, we leak some entropy (in an information theory and also practical sense) from what will be the initial seed to attackers (or allow attackers to arbitrary DoS initial seeding, if we don't leak). This seems unacceptable -- it defeats the purpose of blocking on initial seeding. Secondary to that concern, before seeding we may have arbitrarily little entropy collected; producing output from zero or a handful of entropy bits does not seem particularly useful to userspace. If userspace can accept garbage, insecure, non-random bytes, they can create their own insecure garbage with srandom(time(NULL)) or similar. Any program which would be satisfied with a 3-bit key CTR stream has no need for CSPRNG bytes. So asking the kernel to produce such an output from the secure getrandom(2) API seems inane. For now, we've elected to emulate GRND_INSECURE as an alternative spelling of GRND_NONBLOCK (2). Consider this API not-quite stable for now. We guarantee it will never block. But we will attempt to monitor actual port uptake of this bizarre API and may revise our plans for the unseeded behavior (prior stable/13 branching). Approved by: csprng(markm), manpages(bcr) See also: https://lwn.net/ml/linux-kernel/cover.1577088521.git.luto@kernel.org/ See also: https://lwn.net/ml/linux-kernel/20200107204400.GH3619@mit.edu/ Differential Revision: https://reviews.freebsd.org/D23130	2020-01-12 20:47:38 +00:00
Edward Tomasz Napierala	ca603bb1ee	dd kern_getpriority(), make Linuxulator use it. Reviewed by: kib, emaste MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D22842	2020-01-12 14:25:44 +00:00
Edward Tomasz Napierala	7a0ef283e6	Add kern_setpriority(), use it in Linuxulator. Reviewed by: kib MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D22841	2020-01-12 13:38:51 +00:00
Mateusz Guzik	d199ad3b44	Add "panicked" boolean which can be tested instead of panicstr The test is performed all the time and reading entire panicstr to do it wastes space.	2020-01-12 06:09:10 +00:00
Mateusz Guzik	879e0604ee	Add KERNEL_PANICKED macro for use in place of direct panicstr tests	2020-01-12 06:07:54 +00:00
Mateusz Guzik	91de98e6d4	vfs: only recalculate watermarks when limits are changing Previously they would get recalculated all the time, in particular in: getnewvnode -> vcheckspace -> vspace	2020-01-11 23:00:57 +00:00
Mateusz Guzik	e6ae744e0e	vfs: deduplicate vnode allocation logic This creates a dedicated routine (vn_alloc) to allocate vnodes. As a side effect code duplicationw with getnewvnode_reserve is eleminated. Add vn_free for symmetry.	2020-01-11 22:59:44 +00:00
Mateusz Guzik	b52d50cf69	vfs: prealloc vnodes in getnewvnode_reserve Having a reserved vnode count does not guarantee that getnewvnodes wont block later. Said blocking partially defeats the purpose of reserving in the first place. Preallocate instaed. The only consumer was always passing "1" as count and never nesting reservations.	2020-01-11 22:58:14 +00:00
Mateusz Guzik	6928306764	vfs: incomplete pass at converting more ints to u_long Most notably numvnodes and freevnodes were u_long, but parameters used to govern them remained as ints.	2020-01-11 22:56:20 +00:00
Mateusz Guzik	bf62296f35	vfs: add missing CLTFLA_MPSAFE annotations This covers all kern/vfs_*.c files.	2020-01-11 22:55:12 +00:00
Kyle Evans	1171c633fb	Set .ORDER for makesyscalls generated files When either makesyscalls.lua or syscalls.master changes, all of the ${GENERATED} targets are now out-of-date. With make jobs > 1, this means we will run the makesyscalls script in parallel for the same ABI, generating the same set of output files. Prior to r356603 , there is a large window for interlacing output for some of the generated files that we were generating in-place rather than staging in a temp dir. After that, we still should't need to run the script more than once per-ABI as the first invocation should update all of them. Add .ORDER to do so cleanly. Reviewed by: brooks Discussed with: sjg Differential Revision: https://reviews.freebsd.org/D23099	2020-01-10 18:24:17 +00:00
Mark Johnston	dc727127f1	Change malloc_domain() to return the allocation size to the caller. Otherwise the malloc type accounting in malloc_domainset(9) is wrong after r355203. Reviewed by: rlibby Reported by: kaktus Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D23095	2020-01-09 15:02:48 +00:00
Kyle Evans	6a38cd3a54	kern/Makefile: systrace_args.c is also generated	2020-01-09 06:10:25 +00:00
Kyle Evans	39eae263cd	shmfd: posix_fallocate(2): only take rangelock for section we need Other mechanisms that resize the shmfd grab a write lock from 0 to OFF_MAX for safety, so we still get proper synchronization of shmfd->shm_size in effect. There's no need to block readers/writers of earlier segments when we're just reserving more space, so narrow the scope -- it would likely be safe to narrow it completely to just the section of the range that extends beyond our current size, but this likely isn't worth it since the size isn't stable until the writelock is granted the first time. Suggested by: cem (passing comment)	2020-01-09 04:03:17 +00:00
Kyle Evans	f10405323a	posixshm: implement posix_fallocate(2) Linux expects to be able to use posix_fallocate(2) on a memfd. Other places would use this with shm_open(2) to act as a smarter ftruncate(2). Test has been added to go along with this. Reviewed by: kib (earlier version) Differential Revision: https://reviews.freebsd.org/D23042	2020-01-08 19:08:44 +00:00
Kyle Evans	2856d85ecb	posix_fallocate: push vnop implementation into the fileop layer This opens the door for other descriptor types to implement posix_fallocate(2) as needed. Reviewed by: kib, bcr (manpages) Differential Revision: https://reviews.freebsd.org/D23042	2020-01-08 19:05:32 +00:00
Mateusz Guzik	a9a047bc87	vfs: handle doomed vnodes in vdefer_inactive vgone dooms the vnode while keeping VI_OWEINACT set and then drops the interlock. vputx can pick up the interlock and pass it to vdefer_inactive since the flag is set. The race is harmless, just don't defer anything as vgone will take care of it. Reported by: pho	2020-01-07 20:24:21 +00:00
Mateusz Guzik	c8b3463dd0	vfs: reimplement deferred inactive to use a dedicated flag (VI_DEFINACT) The previous behavior of leaving VI_OWEINACT vnodes on the active list without a hold count is eliminated. Hold count is kept and inactive processing gets explicitly deferred by setting the VI_DEFINACT flag. The syncer is then responsible for vdrop. Reviewed by: kib (previous version) Tested by: pho (in a larger patch, previous version) Differential Revision: https://reviews.freebsd.org/D23036	2020-01-07 15:56:24 +00:00
Mateusz Guzik	b7cc9d1847	vfs: trylock in vfs_msync and refactor the func - use LK_NOWAIT instead of calling VOP_ISLOCKED before deciding to lock - evaluate flags before looping over vnodes Reviewed by: kib Tested by: pho (in a larger patch, previous version) Differential Revision: https://reviews.freebsd.org/D23035	2020-01-07 15:44:19 +00:00
Mateusz Guzik	c92fe112a7	vfs: use a dedicated counter for free vnode recycling Otherwise vlrureclaim activitity is mixed in and it is hard to tell which vnodes got reclaimed.	2020-01-07 15:42:01 +00:00
Mateusz Guzik	cc2b586d69	vfs: prevent numvnodes and freevnodes re-reads when appropriate Otherwise in code like this: if (numvnodes > desiredvnodes) vnlru_free_locked(numvnodes - desiredvnodes, NULL); numvnodes can drop below desiredvnodes prior to the call and if the compiler generated another read the subtraction would get a negative value.	2020-01-07 04:34:03 +00:00
Mateusz Guzik	37fe521a6f	vfs: annotate numvnodes and vnode_free_list_mtx with __exclusive_cache_line	2020-01-07 04:30:49 +00:00
Mateusz Guzik	478368ca41	vfs: eliminate v_tag from struct vnode There was only one consumer and it was using it incorrectly. It is given an equivalent hack. Reviewed by: jeff Differential Revision: https://reviews.freebsd.org/D23037	2020-01-07 04:29:34 +00:00
Mateusz Guzik	a91190c63e	vfs: add a helper for allocating marker vnodes	2020-01-07 04:27:40 +00:00
Pawel Biernacki	91c4b68fa3	kern_sysctl: make sysctl.debug work as intended r136999 introduced SYSTCL_DEBUG but apparently "opt_sysctl.h" was never included making the option ignored. r322954 introduced sysctl.reuse_test with OID number equal to 0, effectively shadowing the very special sysctl.debug one. Use OID_AUTO as it doesn't need any special treatment. Reviewed by: kib (mentor) Approved by: kib (mentor) Differential Revision: https://reviews.freebsd.org/D23056	2020-01-06 19:47:59 +00:00
Mateusz Guzik	2e77cad11d	locks: add default delay struct Use it for all primitives. This makes everything fit in 8 bytes.	2020-01-05 12:48:19 +00:00
Mateusz Guzik	6b8dd26e7c	locks: convert delay times to u_short int is just a waste of space for this purpose.	2020-01-05 12:47:29 +00:00
Mateusz Guzik	d6ae918835	Mark mtxpool_sleep as read mostly, not frequently. The latter is not justified.	2020-01-05 12:46:35 +00:00
Kyle Evans	535b1df993	shm: correct KPI mistake introduced around memfd_create When file sealing and shm_open2 were introduced, we should have grown a new kern_shm_open2 helper that did the brunt of the work with the new interface while kern_shm_open remains the same. Instead, more complexity was introduced to kern_shm_open to handle the additional features and consumers had to keep changing in somewhat awkward ways, and a kern_shm_open2 was added to wrap kern_shm_open. Backpedal on this and correct the situation- kern_shm_open returns to the interface it had prior to file sealing being introduced, and neither function needs an initial_seals argument anymore as it's handled in kern_shm_open2 based on the shmflags.	2020-01-05 04:06:40 +00:00
Kyle Evans	58366f05c0	shmfd/mmap: restrict maxprot with MAP_SHARED + F_SEAL_WRITE If a write seal is set on a shared mapping, we must exclude VM_PROT_WRITE as the fd is effectively read-only. This was discovered by running devel/linux-ltp, which mmap's with acceptable protections specified then attempts to raise to PROT_READ\|PROT_WRITE with mprotect(2), which we allowed. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D22978	2020-01-05 03:15:16 +00:00
Mateusz Guzik	7e2ea5772b	vfs: factor out avoidable branches in _vn_lock	2020-01-05 01:00:11 +00:00
Mateusz Guzik	8dbc63520c	vfs: drop thread argument from vinactive	2020-01-05 00:59:47 +00:00
Mateusz Guzik	867fd730c6	vfs: patch up vnode count assertions to report found value	2020-01-05 00:59:16 +00:00
Jeff Roberson	727c691857	Use a separate lock for the zone and keg. This provides concurrency between populating buckets from the slab layer and fetching full buckets from the zone layer. Eliminate some nonsense locking patterns where we lock to fetch a single variable. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D22828	2020-01-04 03:15:34 +00:00
Mateusz Guzik	b249ce48ea	vfs: drop the mostly unused flags argument from VOP_UNLOCK Filesystems which want to use it in limited capacity can employ the VOP_UNLOCK_FLAGS macro. Reviewed by: kib (previous version) Differential Revision: https://reviews.freebsd.org/D21427	2020-01-03 22:29:58 +00:00
Kyle Evans	3a22f09cbf	emulated atomic64: disable interrupts as the lock mechanism on !SMP Reviewed by: jhibbits, bdragon Differential Revision: https://reviews.freebsd.org/D23015	2020-01-03 18:29:20 +00:00
Brandon Bergren	9aafc7c052	[PowerPC] [MIPS] Implement 32-bit kernel emulation of atomic64 operations This is a lock-based emulation of 64-bit atomics for kernel use, split off from an earlier patch by jhibbits. This is needed to unblock future improvements that reduce the need for locking on 64-bit platforms by using atomic updates. The implementation allows for future integration with userland atomic64, but as that implies going through sysarch for every use, the current status quo of userland doing its own locking may be for the best. Submitted by: jhibbits (original patch), kevans (mips bits) Reviewed by: jhibbits, jeff, kevans Differential Revision: https://reviews.freebsd.org/D22976	2020-01-02 23:20:37 +00:00
Konstantin Belousov	478ca4b004	Rename umtxq_check_susp() to thread_check_susp() and make it usable outside of kern_umtx.c. To be used in several future changes. Discussed with: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week	2020-01-02 22:13:59 +00:00
Konstantin Belousov	8f4d74eb1e	Style: remove trailing spaces/tabs. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2020-01-02 22:07:03 +00:00
Pawel Biernacki	e8cdbb4815	sysctl: hide 2.x era compat node r23081 introduced kern.dummy oid as a semi ABI compat for kern.maxsockbuf that was moved to a new namespace. It never functioned as an alias of any kind and was just returning 0 unconditionally, hence it was probably provided to keep some 3rd party programmes happy about sysctl(3) not reporting an error because of non-existing oid. After nearly 23 years it seems reasonable to just hide it from sysctl(8) list not to cause unnecessary confusion as for its purpose. Reported by: Antranig Vartanian <antranigv@freebsd.am> Reviewed by: kib (mentor) Approved by: kib (mentor) Differential Revision: https://reviews.freebsd.org/D22982	2020-01-02 01:23:43 +00:00
Mateusz Guzik	57db0e12c8	vfs: drop an always-false check from vlrureclaim The vnode gets held few lines prior, making the VI_FREE condition illegal.	2020-01-01 22:51:17 +00:00
Alexander V. Chernikov	c83dda362e	Split gigantic rtsock route_output() into smaller functions. Amount of changes to the original code has been intentionally minimised to ease diffing. The changes are mostly mechanical, with the following exceptions: * lltable handler is now called directly based of RTF_LLINFO flag presense. * "report" logic for updating rtm in RTM_GET/RTM_DELETE has been simplified, fixing several potential use-after-free cases in rt_addrinfo. * llable asserts has been replaced with error-returning, preventing kernel crashes when lltable gw af family is invalid (root required). MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D22864	2019-12-31 17:26:53 +00:00
Alexander Motin	024932aae9	Use atomic for start_count in devstat_start_transaction(). Combined with earlier nstart/nend removal it allows to remove several locks from request path of GEOM and few other places. It would be cool if we had more SMP-friendly statistics, but this helps too. Sponsored by: iXsystems, Inc.	2019-12-30 03:13:38 +00:00
Mark Johnston	9f5632e6c8	Remove page locking for queue operations. With the previous reviews, the page lock is no longer required in order to perform queue operations on a page. It is also no longer needed in the page queue scans. This change effectively eliminates remaining uses of the page lock and also the false sharing caused by multiple pages sharing a page lock. Reviewed by: jeff Tested by: pho Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D22885	2019-12-28 19:04:00 +00:00
Justin Hibbits	741dfd86b3	Fix the powerpc copyout fixup from r356113 Summary: r356113 used an older patch, which predated the freebsd_copyout_auxargs() addition. Fix this by using a private powerpc_copyout_auxargs() instead, and keep it private to powerpc, not in MI files. Reviewed by: kib, bdragon Differential Revision: https://reviews.freebsd.org/D22935	2019-12-27 17:38:25 +00:00
Mateusz Guzik	3983dc32d7	Plug a warning in read-mostly spinlocks reported by gcc.	2019-12-27 13:37:19 +00:00
Mateusz Guzik	eb9764615d	vfs: remove production kernel checks and mp == NULL support from vdrop 1. The only place in the tree which calls getnewvnode with mp == NULL does it for vp_crossmp which will never execute this codepath. Any vnode which legally has ->v_mount == NULL is also doomed, which once more wont execute this code. 2. Remove an assertion for v_holdcnt from production kernels. It gets taken care of by refcount macros in debug kernels. Any code which would want to pass NULL mp can construct a fake one instead. Reviewed by: kib (previous version) Differential Revision: https://reviews.freebsd.org/D22722	2019-12-27 11:26:12 +00:00
Mateusz Guzik	1f162fef76	Add read-mostly sleepable locks To be used when like rmlocks, except when sleeping for readers needs to be allowed. See the manpage for more information. Reviewed by: kib (previous version) Differential Revision: https://reviews.freebsd.org/D22823	2019-12-27 11:19:57 +00:00
Justin Hibbits	6b45727338	Fix the build from r356113. Types had changed from when the patch was first created, and a final build was not done pre-commit.	2019-12-27 04:52:17 +00:00
Justin Hibbits	adea0d6368	Eliminate the last MI difference in AT_* definitions (for powerpc). Summary: As a transition aide, implement an alternative elfN_freebsd_fixup which is called for old powerpc binaries. Similarly, add a translation to rtld to convert old values to new ones (as expected by a new rtld). Translation of old<->new values is incomplete, but sufficient to allow an installworld of a new userspace from an old one when a new kernel is running. Test Plan: Someone needs to see how a new kernel/rtld/libc works with an old binary. If if works we can probalby ship this. If not we probalby need some more compat bits. Submitted by: brooks Reviewed by: jhibbits Differential Revision: https://reviews.freebsd.org/D20799	2019-12-27 04:07:03 +00:00
Conrad Meyer	f3bae413e9	random(9): Deprecate random(9), remove meaningless srandom(9) srandom(9) is meaningless on SMP systems or any system with, say, interrupts. One could never rely on random(9) to produce a reproducible sequence of outputs on the basis of a specific srandom() seed because the global state was shared by all kernel contexts. As such, removing it is literally indistinguishable to random(9) consumers (as compared with retaining it). Mark random(9) as deprecated and slated for quick removal. This is not to say we intend to remove all fast, non-cryptographic PRNG(s) in the kernel. It/they just won't be random(9), as it exists today, in either name or implementation. Before random(9) is removed, a replacement will be provided and in-tree consumers will be converted. Note that despite the name, the random(9) interface does not bear any resemblance to random(3). Instead, it is the same crummy 1988 Park-Miller LCG used in libc rand(3).	2019-12-26 19:41:09 +00:00
Conrad Meyer	af00898b5d	gone_in(9): Trivial string grammar and style cleanups	2019-12-26 18:25:07 +00:00
Kyle Evans	f46412c021	kern_cons: add a stub kbdinit for configs with no keyboard/console drivers A weak symbol here is decidedly cleaner than any #ifdef soup or relocating kbdinit, the former leading to maintenance required on addition of any console/keyboard drivers and the latter pushing kbd init bits away from where they're used.	2019-12-26 15:47:19 +00:00
Kyle Evans	3ed7166aca	kbd: merge linker set drivers into standard kbd driver list This leads to the revert of r355806; this reduces duplication in keyboard registration and driver switch lookup and leaves us with one authoritative source for currently registered drivers. The reduced duplication later is nice as we have more procedure involved in keyboard setup. keyboard_driver->flags is used to more quickly detect bogus adds/removes. From KPI consumers' perspective, nothing changes- kbd_add_driver of an already-registered driver will succeed, and a single kbd_delete_driver will later remove it as expected. In contrast to historical behavior, kbd_delete_driver on a driver registered via linker set will now actually de-register the driver so that it may not be used -- e.g. if kbdmux's MOD_LOAD handler fails somewhere. Detection for already-registered drivers in kbd_add_driver has improved, as the previous SLIST_NEXT(driver) != NULL check would not have caught a driver that's at the tail end. kbdinit is now called from cninit() rather than via SYSINIT so that keyboard drivers are available as early as console drivers. This is particularly important as cnprobe will, in both syscons and vt, attempt to do any early configuration of keyboard drivers built-in (see: kbd_configure). Reviewed by: imp (earlier version, pre-cninit change) Differential Revision: https://reviews.freebsd.org/D22835	2019-12-26 15:21:34 +00:00
Conrad Meyer	fea73412a0	sleep(9), sleepqueue(9): const'ify wchan pointers _sleep(9), wakeup(9), sleepqueue(9), et al do not dereference or modify the channel pointers provided in any way; they are merely used as intptrs into a dictionary structure to match waiters with wakers. Correctly annotate this such that _sleep() and wakeup() may be used on const pointers without invoking ugly patterns like __DECONST(). Plumb const through all of the underlying sleepqueue bits. No functional change. Reviewed by: rlibby Discussed with: kib, markj Differential Revision: https://reviews.freebsd.org/D22914	2019-12-24 16:19:33 +00:00
Brandon Bergren	7821a820d0	[PowerPC] Implement Secure-PLT jump table processing for ppc32. Due to clang and LLD's tendency to use a PLT for builtins, and as they don't have full support for EABI, we sometimes have to deal with a PLT in .ko files in a clang-built kernel. As such, augment the in-kernel linker to support jump table processing. As there is no particular reason to support lazy binding in kernel modules, only implement Secure-PLT immediate binding. As part of these changes, add elf_cpu_parse_dynamic() to the MD API of the in-kernel linker (except on platforms that use raw object files.) The new function will allow MD code to act on MD tags in _DYNAMIC. Use this new function in the PowerPC MD code to ensure BSS-PLT modules using PLT will be rejected during insertion, and to poison the runtime resolver to ensure we get a clear panic reason if a call is made to the resolver. Reviewed by: jhibbits Differential Revision: https://reviews.freebsd.org/D22608	2019-12-24 15:56:24 +00:00
Conrad Meyer	e30f025ff9	kern_synch: Fix some UB It is UB to evaluate pointer comparisons when pointers do not point within the same object. Instead, convert the pointers to numbers and compare the numbers. Reported by: kib Discussed with: rlibby	2019-12-24 06:08:29 +00:00
Konstantin Belousov	52f3524cfd	Do not use waitable allocation of pbuf when creating cluster for write. Previously just ensuring that we do not sleep when clustering for md(4) vnode was enough. Now, with the switch of the pbuf allocator to uma and completely broken per-subsystem pbuf limits, it might cause unbounded sleep even for non-md(4) vnodes. Reported and tested by: pho Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D22899	2019-12-23 20:15:19 +00:00
Philip Paeps	1b6dd6d772	Print upper 32 bits in devmap table entries If the devmap entry uses the upper 32 bits they wouldn't be printed in devmap_dump_table(). This fixes that. Submitted by: Nicholas O'Brien <nickisobrien_gmail.com> Sponsored by: Axiado	2019-12-20 03:40:53 +00:00
Mark Johnston	5d3bd9e242	Fix SIGINFO stack collection to ignore threads with swapped-out stacks. We by definition cannot trace the stack of such a thread. Also remove a redundant stack_zero() call in the SIGINFO handler, the stack structure is cleared by the MD stack_capture(). Sponsored by: The FreeBSD Foundation	2019-12-19 19:34:25 +00:00
Jeff Roberson	d8d5f03610	Fix a bug in r355784. I missed a sched_add() call that needed to reacquire the thread lock. Reported by: mjg	2019-12-19 18:22:11 +00:00
Hans Petter Selasky	cc79ea3a26	Restore important comment in RCU/EPOCH support in FreeBSD after r355784. Sponsored by: Mellanox Technologies	2019-12-18 09:30:32 +00:00
Mateusz Guzik	6fa079fc3f	vfs: flatten vop vectors This eliminates the following loop from all VOP calls: while(vop != NULL && \ vop->vop_spare2 == NULL && vop->vop_bypass == NULL) vop = vop->vop_default; Reviewed by: jeff Tesetd by: pho Differential Revision: https://reviews.freebsd.org/D22738	2019-12-16 00:06:22 +00:00
Mateusz Guzik	3fd19ce7a5	mtx: eliminate recursion support from thread lock Now that it is not used after schedlock changes got merged. Note the unlock routine temporarily still checks for it on account of just using regular spin unlock. This is a prelude towards a general clean up.	2019-12-16 00:04:33 +00:00
Jeff Roberson	686bcb5c14	schedlock 4/4 Don't hold the scheduler lock while doing context switches. Instead we unlock after selecting the new thread and switch within a spinlock section leaving interrupts and preemption disabled to prevent local concurrency. This means that mi_switch() is entered with the thread locked but returns without. This dramatically simplifies scheduler locking because we will not hold the schedlock while spinning on blocked lock in switch. This change has not been made to 4BSD but in principle it would be more straightforward. Discussed with: markj Reviewed by: kib Tested by: pho Differential Revision: https://reviews.freebsd.org/D22778	2019-12-15 21:26:50 +00:00
Jeff Roberson	1c81a87efd	schedlock 3/4 Eliminate lock recursion from turnstiles. This was simply used to avoid tracking the top-level turnstile lock. explicitly check for it before picking up and dropping locks. Reviewed by: kib Tested by: pho Differential Revision: https://reviews.freebsd.org/D22746	2019-12-15 21:19:41 +00:00
Jeff Roberson	045b7c6084	schedlock 2/4 Do all sleepqueue post-processing in sleepq_remove_thread() so that we do not require the thread lock after a context switch. Reviewed by: jhb, kib Differential Revision: https://reviews.freebsd.org/D22745	2019-12-15 21:18:07 +00:00
Ian Lepore	b19c9dea3e	Rewrite arm kernel stack unwind code to work when unwinding through modules. The arm kernel stack unwinder has apparently never been able to unwind when the path of execution leads through a kernel module. There was code that tried to handle modules by looking for the unwind data in them, but it did so by trying to find symbols which have never existed in arm kernel modules. That caused the unwind code to panic, and because part of panic handling calls into the unwind code, that just created a recursion loop. Locating the unwind data in a loaded module requires accessing the Elf section headers to find the SHT_ARM_EXIDX section. For preloaded modules those headers are present in a metadata blob. For dynamically loaded modules, the headers are present only while the loading is in progress; the memory is freed once the module is ready to use. For that reason, there is new code in kern/link_elf.c, wrapped in #ifdef __arm__, to extract the unwind info while the headers are loaded. The values are saved into new fields in the linker_file structure which are also conditional on __arm__. In arm/unwind.c there is new code to locally cache the per-module info needed to find the unwind tables. The local cache is crafted for lockless read access, because the unwind code often needs to run in context where sleeping is not allowed. A large comment block describes the local cache list, so I won't repeat it all here.	2019-12-15 21:16:35 +00:00
Jeff Roberson	61a74c5ccd	schedlock 1/4 Eliminate recursion from most thread_lock consumers. Return from sched_add() without the thread_lock held. This eliminates unnecessary atomics and lock word loads as well as reducing the hold time for scheduler locks. This will eventually allow for lockless remote adds. Discussed with: kib Reviewed by: jhb Tested by: pho Differential Revision: https://reviews.freebsd.org/D22626	2019-12-15 21:11:15 +00:00
Jeff Roberson	d29f674f2e	Fix a mistake in r355765. We need to activate the page if it is not yet on a pagequeue. Reported by: pho	2019-12-15 06:26:47 +00:00
Jeff Roberson	a808177864	Add a deferred free mechanism for freeing swap space that does not require an exclusive object lock. Previously swap space was freed on a best effort basis when a page that had valid swap was dirtied, thus invalidating the swap copy. This may be done inconsistently and requires the object lock which is not always convenient. Instead, track when swap space is present. The first dirty is responsible for deleting space or setting PGA_SWAP_FREE which will trigger background scans to free the swap space. Simplify the locking in vm_fault_dirty() now that we can reliably identify the first dirty. Discussed with: alc, kib, markj Differential Revision: https://reviews.freebsd.org/D22654	2019-12-15 03:15:06 +00:00
Jeff Roberson	af00971419	Handle pagein clustering in vm_page_grab_valid() so that it can be used by exec_map_first_page(). This will also enable pagein clustering for other interested consumers (tmpfs, md, etc). Discussed with: alc Approved by: kib Differential Revision: https://reviews.freebsd.org/D22731	2019-12-15 02:00:32 +00:00
Doug Moore	9f70442a04	Simplify the processing a leaf mask to find big-enough ranges of set bits, by storing and modifying the complement of the original leaf mask, and by avoiding some unnecessary intermediate variables in computing the shift amounts. The logic is similar to what has recently been committed to sys/sys/bitstring.h. Compute better hint updates for the case when the cursor starts in mid-leaf, and eliminates some otherwise viable solutions. Assume the worst case, that all the eliminated offsets could have been solutions, and you can still compute a better hint than we use now. Eliminate some unnecessary conditional control flow. Approved by: alc Tested by: pho Differential Revision: https://reviews.freebsd.org/D22666	2019-12-14 19:44:42 +00:00
Mateusz Guzik	6f836483ec	Remove the useless return value from proc_set_cred	2019-12-14 00:43:17 +00:00
John Baldwin	4b28d96e5d	Remove the deprecated timeout(9) interface. All in-tree consumers have been converted to callout(9). Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D22602	2019-12-13 21:03:12 +00:00
Warner Losh	b832a7e505	Create new wrapper function: bus_delayed_attach_children() Delay the attachment of children, when requested, until after interrutps are running. This is often needed to allow children to run transactions on i2c or spi busses. It's a common enough idiom that it will be useful to have its own wrapper. Reviewed by: ian Differential Revision: https://reviews.freebsd.org/D21465	2019-12-13 19:39:33 +00:00
John Baldwin	bf2276f378	Use callout(9) instead of deprecated timeout(9) for fail points. Allocate the callout structure on-demand from fail_point_use_timeout_path() since most fail points do not use timeouts. Reviewed by: markj (earlier version), cem Differential Revision: https://reviews.freebsd.org/D22599	2019-12-13 19:26:04 +00:00
Edward Tomasz Napierala	34ad5ac242	Add kern_kill() and use it in Linuxulator. It's just a cleanup, no functional changes. Reviewed by: kib MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D22645	2019-12-13 18:44:02 +00:00
Edward Tomasz Napierala	be2cfdbc86	Add kern_getsid() and use it in Linuxulator; no functional changes. Reviewed by: kib MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D22647	2019-12-13 18:39:36 +00:00
Ryan Libby	9825eadf2c	bitset: rename confusing macro NAND to ANDNOT s/BIT_NAND/BIT_ANDNOT/, and for CPU and DOMAINSET too. The actual implementation is "and not" (or "but not"), i.e. A but not B. Fortunately this does appear to be what all existing callers want. Don't supply a NAND (not (A and B)) operation at this time. Discussed with: jeff Reviewed by: cem Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D22791	2019-12-13 09:32:16 +00:00
Conrad Meyer	cd5650407e	kern/subr_unit: Rip srandomdev, random(3) out of dead code The simulation cannot be reproduced, so the value of using a deterministic PRNG like random(3) is dubious. The number of repitions used in the sample isn't a problem for the Chacha implementation of arc4random we have today. (Also, no one actually runs this code; it was provided as an example of the work the author did validating the implementation. It's not even test code.)	2019-12-13 04:48:20 +00:00
Rick Macklem	ea9a16b252	r355677 requires that vop_stdioctl() be global so it can be called from NFS. r355677 modified the NFS client so that it does lseek(SEEK_DATA/SEEK_HOLE) for NFSv4.2, but calls vop_stdioctl() otherwise. As such, vop_stdioctl() needs to be a global function. Missed during the code merge for r355677.	2019-12-13 00:14:12 +00:00
Edward Tomasz Napierala	d6fee74a0c	Add kern_sync(9), and make kernel code call it instead of going via sys_sync(2). Minor cleanup, no functional changes. Reviewed by: kib MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D19366	2019-12-12 18:45:31 +00:00
Mark Johnston	7789ab32b3	Rename tdq_ipipending and clear it in sched_switch(). This fixes a regression after r355311. Specifically, sched_preempt() may trigger a context switch by calling thread_lock(), since thread_lock() calls critical_exit() in its slow path and the interrupted thread may have already been marked for preemption. This would happen before tdq_ipipending is cleared, blocking further preemption IPIs. The CPU can be left in this state indefinitely if the interrupted thread migrates. Rename tdq_ipipending to tdq_owepreempt. Any switch satisfies a remote preemption request, so clear tdq_owepreempt in sched_switch() instead of sched_preempt() to avoid subtle problems of the sort described above. Reviewed by: jeff, kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D22758	2019-12-12 02:43:24 +00:00
Mateusz Guzik	c8b29d1212	vfs: locking primitives which elide ->v_vnlock and shared locking disablement Both of these features are not needed by many consumers and result in avoidable reads which in turn puts them on profiles due to cache-line ping ponging. On top of that the current lockgmr entry point is slower than necessary single-threaded. As an attempted clean up preparing for other changes, provide new routines which don't support any of the aforementioned features. With these patches in place vop_stdlock and vop_stdunlock disappear from flamegraphs during -j 104 buildkernel. Reviewed by: jeff (previous version) Tested by: pho Differential Revision: https://reviews.freebsd.org/D22665	2019-12-11 23:11:21 +00:00
Mateusz Guzik	55eb92db8d	fd: static-ize and devolatile openfiles Almost all access is using atomics. The only read is sysctl which should use a whole-int-at-a-time friendly read internally.	2019-12-11 23:09:12 +00:00
Andriy Gapon	64ebbdd54d	add a sanity check to the system call registration code A system call number should be at least reserved. We do not expect an attempt to register a fixed number system call when nothing at all is known about it. MFC after: 3 weeks Sponsored by: Panzura	2019-12-11 15:52:29 +00:00

... 3 4 5 6 7 ...

17445 Commits