freebsd-dev

Author	SHA1	Message	Date
Mark Johnston	569a186d02	Remove UNP_NASCENT, reverting r303855. unp_connectat() no longer holds the link lock across calls to sonewconn(), so the recursion described in r303855 can no longer occur. No functional change intended. Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2020-03-20 16:17:54 +00:00
Mark Johnston	429537caeb	kern_dup(): Call filecaps_free_prep() in a write section. filecaps_free_prep() bzeros the capabilities structure and we need to be careful to synchronize with unlocked readers, which expect a consistent rights structure. Reviewed by: kib, mjg Reported by: syzbot+5f30b507f91ddedded21@syzkaller.appspotmail.com MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D24120	2020-03-19 15:40:05 +00:00
Mark Johnston	2d896b816b	Enter a write sequence when updating rights. The Capsicum system calls modify file descriptor table entries. To ensure that readers observe a consistent snapshot of descriptor writes, the system calls need to signal to unlocked readers that an update is pending. Note that ioctl rights are always checked with the descriptor table lock held, so it is not strictly necessary to signal unlocked readers. However, we probably want to enable lockless ioctl checks eventually, so use seqc_write_begin() in kern_cap_ioctls_limit() too. Reviewed by: kib MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D24119	2020-03-19 15:39:45 +00:00
Brandon Bergren	3069380898	[PowerPC][Book-E] Fix missing load base in elf_cpu_parse_dynamic(). When I implemented MD DYNAMIC parsing, I was originally passing a linker_file_t so that the MD code could relocate pointers. However, it turns out this isn't even filled in until later, so it was always 0. Just pass the load base (ef->address) directly, as that's really the only thing we were interested in in the first place. This fixes a crash on RB800 where it was trying to write to an unmapped address when updating the GOT. Reviewed by: jhibbits Sponsored by: Tag1 Consulting, Inc. Differential Revision: https://reviews.freebsd.org/D24105	2020-03-18 02:58:18 +00:00
Conrad Meyer	34086d5bda	Implement sysctl kern.boot_id Boot IDs are random, opaque 128-bit identifiers that distinguish distinct system boots. A new ID is generated each time the system boots. Unlike kern.boottime, the value is not modified by NTP adjustments. It remains fixed until the machine is restarted. PR: 244867 Reported by: Ricardo Fraile <rfraile AT rfraile.eu> MFC after: I do not intend to, but feel free	2020-03-17 22:27:16 +00:00
Conrad Meyer	a99c321802	Remove misleading / redundant bzero in callout_callwheel_init The intent seems to be zeroing all of the cc_cpu array, or its singleton on such platforms. The assumption made is that the BSP is always zero. The code smell was introduced in r326218, which changed the prior explicit zero to 'curcpu'. The change is only valid if curcpu continues to be zero, contrary to the aim expressed in that commit message. So, more succinctly, the expression could be: memset(cc_cpu,0,sizeof(cc_cpu)). However, there's no point. cc_cpu lives in the data section and has a zero initial value already. So this revision just removes the problematic statement. No functional change. Appeases a (false positive, ish) Coverity CID. CID: 1383567 Reported by: Puneeth Jothaiah <puneethkumar.jothaia AT dell.com> Reviewed by: kib Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D24089	2020-03-16 22:25:25 +00:00
Bjoern A. Zeeb	1b786d0191	kern_jail: missing \0 termination check on osrelease parameter If a user supplies a non-\0 terminated osrelease parameter reading it back may disclose kernel memory. This is a problem in case of nested jails (children.max > 0, which is not the default). Otherwise root outside the jail has access to kernel memory by other means and root inside a jail cannot create a child jail. Add the proper \0 check at the end of a supplied osrelease parameter and make sure any copies of the field will be \0-terminated. Submitted by: Hans Christian Woithe (chwoithe yahoo.com) MFC after: 3 days	2020-03-14 14:04:55 +00:00
Michael Tuexen	db4493f7b6	sendfile() does currently not support SCTP sockets. Therefore, fail the call. Reviewed by: markj@ MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D24059	2020-03-13 18:38:28 +00:00
Conrad Meyer	7a119578a4	kern_shutdown: Add missing EKCD ifdef Submitted by: Puneeth Jothaiah <puneethkumar.jothaia AT dell.com> Reviewed by: bdrewery Sponsored by: Dell EMC Isilon	2020-03-12 21:26:36 +00:00
Konstantin Belousov	00ebd80972	Fix signal delivery might be on sigfastblock clearing. When clearing sigfastblock, either by sigfastblock(UNSETPTR) call or implicitly on execve(2), kernel must check for pending signals and reschedule them if needed. E.g. on execve, all other threads are terminated, and current thread fast block pointer is cleaned. If any signal was left pending, it can now be delivered to the current thread, and we should prepare for ast() on return to userspace to notice the signals. Reported and tested by: pho Sponsored by: The FreeBSD Foundation	2020-03-10 20:25:03 +00:00
Konstantin Belousov	0bc52b0bdb	Return reschedule_signals() to being static again. It was used after sigfastblock_setpend() call in ast() when current thread fast-blocks signals. Add a flag to sigfastblock_setpend() to request reschedule, and remove the direct use of the function from subr_trap.c Tested by: pho Sponsored by: The FreeBSD Foundation	2020-03-10 20:04:38 +00:00
Konstantin Belousov	2d3c083fd7	pipe: explain why not deallocating inode number is fine. Suggested and reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D24009	2020-03-09 23:40:25 +00:00
Konstantin Belousov	c6d3d601c9	Preallocate pipe buffers on pipe creation. Return ENOMEM if one of the buffers cannot be created even with the minimal size. This should avoid subsequent spurious ENOMEM errors from write(2) when buffer cannot be allocated on the fly, after we reported that the pipe was created successfully. Reported by: Keno Fischer <keno@juliacomputing.com> Reviewed by: markj (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D23993	2020-03-09 21:55:26 +00:00
Konstantin Belousov	1213de28f8	Style. Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D23993	2020-03-09 19:46:28 +00:00
Andrew Gallatin	98085bae8c	make lacp's use_numa hashing aware of send tags When I did the use_numa support, I missed the fact that there is a separate hash function for send tag nic selection. So when use_numa is enabled, ktls offload does not work properly, as it does not reliably allocate a send tag on the proper egress nic since different egress nics are selected for send-tag allocation and packet transmit. To fix this, this change: - refactors lacp_select_tx_port_by_hash() and lacp_select_tx_port() to make lacp_select_tx_port_by_hash() always called by lacp_select_tx_port() - pre-shifts flowids to convert them to hashes when calling lacp_select_tx_port_by_hash() - adds a numa_domain field to if_snd_tag_alloc_params - plumbs the numa domain into places where we allocate send tags In testing with NIC TLS setup on a NUMA machine, I see thousands of output errors before the change when enabling kern.ipc.tls.ifnet.permitted=1. After the change, I see no errors, and I see the NIC sysctl counters showing active TLS offload sessions. Reviewed by: rrs, hselasky, jhb Sponsored by: Netflix	2020-03-09 13:44:51 +00:00
Mateusz Guzik	d2222aa0e9	fd: use smr for managing struct pwd This has a side effect of eliminating filedesc slock/sunlock during path lookup, which in turn removes contention vs concurrent modifications to the fd table. Reviewed by: markj, kib Differential Revision: https://reviews.freebsd.org/D23889	2020-03-08 00:23:36 +00:00
Mark Johnston	d869a17e62	Use COUNTER_U64_DEFINE_EARLY() in places where it simplifies things. Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D23978	2020-03-06 19:10:00 +00:00
Mark Johnston	fffcb56f7a	Add COUNTER_U64_SYSINIT() and COUNTER_U64_DEFINE_EARLY(). The aim is to reduce the boilerplate needed today to define and initialize global counters. Also add SI_SUB_COUNTER to the sysinit ordering. Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D23977	2020-03-06 19:09:01 +00:00
Chuck Silvers	f15ccf8836	Add a new "mntfs" pseudo file system which provides private device vnodes for file systems to safely access their disk devices, and adapt FFS to use it. Also add a new BO_NOBUFS flag to allow enforcing that file systems using mntfs vnodes do not accidentally use the original devfs vnode to create buffers. Reviewed by: kib, mckusick Approved by: imp (mentor) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D23787	2020-03-06 18:41:37 +00:00
Konstantin Belousov	695e0701a0	buffer pager: deref ucred immediately after read. Ucred is passed to bread(9) so that non-local filesystems use proper credentials. But, since clean buffer might be cached unless buf_pager_relbuf is not enabled, it makes credentials to have extra reference until buffer is reclaimed. Ucred reference would prevent jail from destroying if creds are jailed. Dereferencing the read credentials on the valid buffer avoid that, and should be fine because the buffer is valid and does not need re-read. PR: 238032 Reported by: bz Reproduced and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D23775	2020-03-05 15:52:34 +00:00
Mateusz Guzik	8d4d271e92	execve: use LOCKSHARED when looking up the interpreter Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D23956	2020-03-04 19:52:34 +00:00
Chuck Silvers	37bf88e790	if vm_pager_get_pages_async() returns an error, release the sfio->nios refcount that we took earlier that represents the I/O that ended up not being started. Reviewed by: glebius Approved by: imp (mentor) Sponsored by: Netflix	2020-03-04 00:22:50 +00:00
Bjoern A. Zeeb	a2fba2a700	uipc_ktls: make RSS compile again here The results of ktls_get_cpu() are stored in u_int and NETISR_CPUID_NONE requires u_int. Adjust uint16_t to uint_t in order to make RSS kernels compile some more again. HPTS still has to be fixed, which is a bit more complicated. Reviewed by: jhb, gallatin, rrs Differential Revision: https://reviews.freebsd.org/D23726	2020-03-03 14:07:44 +00:00
Mark Johnston	4cf919edb9	Fix the malloc type used in sys_shm_unlink() after r354808. PR: 244563 Reported by: swills	2020-03-03 00:28:37 +00:00
Pawel Biernacki	b05ca4290c	sys/: Document few more sysctls. Submitted by: Antranig Vartanian <antranigv@freebsd.am> Reviewed by: kaktus Commented by: jhb Approved by: kib (mentor) Sponsored by: illuria security Differential Revision: https://reviews.freebsd.org/D23759	2020-03-02 15:30:52 +00:00
Mateusz Guzik	2f423bce54	vfs: stop taking additional refs on root vnode during lookup They are spurious since introduction of struct pwd, which provides them implicitly. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D23885	2020-03-01 21:54:28 +00:00
Mateusz Guzik	8d03b99b9d	fd: move vnodes out of filedesc into a dedicated structure The new structure is copy-on-write. With the assumption that path lookups are significantly more frequent than chdirs and chrooting this is a win. This provides stable root and jail root vnodes without the need to reference them on lookup, which in turn means less work on globally shared structures. Note this also happens to fix a bug where jail vnode was never referenced, meaning subsequent access on lookup could run into use-after-free. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D23884	2020-03-01 21:53:46 +00:00
Mateusz Guzik	8243063f9b	fd: make fgetvp_rights work without the filedesc lock Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D23883	2020-03-01 21:50:13 +00:00
Mark Johnston	5aa5420ff2	Ensure that arm64 thread structures are allocated from the direct map. Otherwise we can fail to handle translation faults on curthread, leading to a panic. Reviewed by: alc, rlibby Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D23895	2020-02-29 18:41:48 +00:00
Jeff Roberson	6be21eb778	Provide a lock free alternative to resolve bogus pages. This is not likely to be much of a perf win, just a nice code simplification. Reviewed by: markj, kib Differential Revision: https://reviews.freebsd.org/D23866	2020-02-28 21:42:48 +00:00
Jeff Roberson	7aaf252c96	Convert a few trivial consumers to the new unlocked grab API. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D23847	2020-02-28 20:34:30 +00:00
Jeff Roberson	f72eaaeb03	Use unlocked grab for uipc_shm/tmpfs. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D23865	2020-02-28 20:33:28 +00:00
Mark Johnston	46994ec2b1	Fix standalone builds of systrace.ko after r357912. Sponsored by: The FreeBSD Foundation	2020-02-28 17:05:04 +00:00
Mark Johnston	c99d0c5801	Add a blocking counter KPI. refcount(9) was recently extended to support waiting on a refcount to drop to zero, as this was needed for a lockless VM object paging-in-progress counter. However, this adds overhead to all uses of refcount(9) and doesn't really match traditional refcounting semantics: once a counter has dropped to zero, the protected object may be freed at any point and it is not safe to dereference the counter. This change removes that extension and instead adds a new set of KPIs, blockcount_*, for use by VM object PIP and busy. Reviewed by: jeff, kib, mjg Tested by: pho Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D23723	2020-02-28 16:05:18 +00:00
Jeff Roberson	561af25fa7	Simplify lazy advance with a 64bit atomic cmpset. This provides the potential to force a lazy (tick based) SMR to advance when there are blocking waiters by decoupling the wr_seq value from the ticks value. Add some missing compiler barriers. Reviewed by: rlibby Differential Revision: https://reviews.freebsd.org/D23825	2020-02-27 19:05:26 +00:00
Warner Losh	729ea680be	Remove trailing white space.	2020-02-26 16:22:28 +00:00
Pawel Biernacki	7029da5c36	Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many) r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are still not MPSAFE (or already are but aren’t properly marked). Use it in preparation for a general review of all nodes. This is non-functional change that adds annotations to SYSCTL_NODE and SYSCTL_PROC nodes using one of the soon-to-be-required flags. Mark all obvious cases as MPSAFE. All entries that haven't been marked as MPSAFE before are by default marked as NEEDGIANT Approved by: kib (mentor, blanket) Commented by: kib, gallatin, melifaro Differential Revision: https://reviews.freebsd.org/D23718	2020-02-26 14:26:36 +00:00
Gleb Smirnoff	6bc27f086a	Generalize resources freeing in sendfile with different scenarios. Now we execute sendfile_iodone() in all possible cases, which guarantees that vm_object_pip_wakeup() is called and sfio structure is freed. At the beginning of sendfile initialize sfio->m to NULL, that would indicate that the mbuf chain either doesn't exist, or belongs to the syscall (not to I/O completion). Fill sfio->m only at a point when we are positive that there are I/Os ongoing and before releasing syscall's reference on sfio. In sendfile_iodone() perform vm_object_pip_wakeup() once last reference is released, then check for sfio->m. NULL pointer indicates that we need only to free the memory. Reviewed by: jtl, gallatin	2020-02-25 19:29:05 +00:00
Gleb Smirnoff	f85e1a806b	Make ktls_frame() never fail. Caller must supply correct mbufs. This makes sendfile code a bit simpler.	2020-02-25 19:26:40 +00:00
Gleb Smirnoff	69302907d6	When sendfile_swapin() sweeps through pages in search for a bogus page skip first and last pages. This is a micro optimisation.	2020-02-25 19:11:20 +00:00
Ryan Libby	fe20aaec0a	sys/kern: quiet -Wwrite-strings Quiet a variety of Wwrite-strings warnings in sys/kern at low-impact sites. This patch avoids addressing certain others which would need to plumb const through structure definitions. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D23798	2020-02-23 03:32:16 +00:00
Ryan Libby	2782c00c04	vfs: quiet -Wwrite-strings Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D23797	2020-02-23 03:32:11 +00:00
Ryan Libby	eaa17d4291	sys/vm: quiet -Wwrite-strings Discussed with: kib Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D23796	2020-02-23 03:32:04 +00:00
Konstantin Belousov	04869b812b	Add td_pflags2, yet another thread-private flags word. There is no more free bits in td_pflags. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2020-02-22 20:43:04 +00:00
Jeff Roberson	226dd6db47	Add an atomic-free tick moderated lazy update variant of SMR. This enables very cheap read sections with free-to-use latencies and memory overhead similar to epoch. On a recent AMD platform a read section cost 1ns vs 5ns for the default SMR. On Xeon the numbers should be more like 1 ns vs 11. The memory consumption should be proportional to the product of the free rate and 2*1/hz while normal SMR consumption is proportional to the product of free rate and maximum read section time. While here refactor the code to make future additions more straightforward. Name the overall technique Global Unbound Sequences (GUS) and adjust some comments accordingly. This helps distinguish discussions of the general technique (SMR) vs this specific implementation (GUS). Discussed with: rlibby, markj	2020-02-22 03:44:10 +00:00
Mateusz Guzik	721a81c369	vfs: stop duplicating vnode work in audit during path lookup Duplicating the work was putting an avoidable requirement that the filedesc lock is held across the entire operation (otherwise by the time audit reads vnode pointers another thread in the same process can chdir somewhere else, making audit log things using different vnode than the one which will be used for actual lookup). Do the obvious thing and pass down vnodes which will be used.	2020-02-21 01:44:31 +00:00
Eric van Gyzen	3cd1f28e4a	clamp kernel dump compression level when using gzip If the configured compression level for kernel dumps is outside the supported range, clamp it to the closest supported level. Previously, dumpon would fail. zstd already does this internally, so the compressor needs no change. Reviewed by: cem markj MFC after: 2 weeks Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D23765	2020-02-20 23:53:48 +00:00
Konstantin Belousov	74cb9a5333	Fix a bug in r358168, do not call sigfastblock_setpend() under a mutex. PR: 244250 Reported and tested by: lwhsu Sponsored by: The FreeBSD Foundation	2020-02-20 21:25:12 +00:00
Mateusz Guzik	65cdfb4caa	make sysent for r358172 ("vfs: add realpathat syscall")	2020-02-20 16:58:57 +00:00
Mateusz Guzik	0573d0a9b8	vfs: add realpathat syscall realpath(3) is used a lot e.g., by clang and is a major source of getcwd and fstatat calls. This can be done more efficiently in the kernel. This works by performing a regular lookup while saving the name and found parent directory. If the terminal vnode is a directory we can resolve it using usual means. Otherwise we can use the name saved by lookup and resolve the parent. See the review for sample syscall counts. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D23574	2020-02-20 16:58:19 +00:00
Konstantin Belousov	a113b17f10	Do not read sigfastblock word on syscall entry. On machines with SMAP, fueword executes two serializing instructions which can be seen in microbenchmarks. As a measure to restore microbenchmark numbers, only read the word on the attempt to deliver signal in ast(). If the word is set, signal is not delivered and word is kept, preventing interruption of interruptible sleeps by signals until userspace calls sigfastblock(UNBLOCK) which clears the word. This way, the spurious EINTR that userspace can see while in critical section is on first interruptible sleep, if a signal is pending, and on signal posting. It is believed that it is not important for rtld and libthr critical sections. It might be visible for the application code e.g. for the callback of dl_iterate_phdr(3), but again the belief is that the non-compliance is acceptable. Most important is that the retry of the sleeping syscall does not interrupt unless additional signal is posted. For now I added the knob kern.sigfastblock_fetch_always to enable the word read on syscall entry to be able to diagnose possible issues due to spurious EINTR. While there, do some code restructuring to have all sigfastblock() handling located in kern_sig.c. Reviewed by: jeff Discussed with: mjg Tested by: pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D23622	2020-02-20 15:34:02 +00:00
Jeff Roberson	6c5f36ff30	Eliminate some unnecessary uses of UMA_ZONE_VM. Only zones involved in virtual address or physical page allocation need to be marked with this flag. Reviewed by: markj Tested by: pho Differential Revision: https://reviews.freebsd.org/D23712	2020-02-19 08:17:27 +00:00
Mateusz Guzik	d8a84f08e8	refcount: update comments about fencing when releasing counts after r357989 Requested by: kib Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D23719	2020-02-16 18:20:09 +00:00
Mateusz Guzik	3403d5245e	vfs: fix vlrureclaim ->v_object access The routine was checking for ->v_type == VBAD. Since vgone drops the interlock early and sets this type at the end of the process of dooming a vnode, this opens a time window where it can clear the pointer while the interlock holder is accessing it. Another note is that the code was: (vp->v_object != NULL && vp->v_object->resident_page_count > trigger) With the compiler being fully allowed to emit another read to get the pointer, and in fact it did on the kernel used by pho. Use atomic_load_ptr and remember the result. Note that this depends on type-safety of vm_object. Reported by: pho	2020-02-16 03:33:34 +00:00
Mateusz Guzik	c615009461	vfs: check early for VCHR in vput_final to short-circuit in the common case Otherwise the compiler inlines v_decr_devcount which keeps getting jumped over in the common case of not dealing with a device.	2020-02-16 03:16:28 +00:00
Matt Macy	45035becfe	Add zfree to zero allocation before free Key and cookie management typically wants to avoid information leaks by explicitly zeroing before free. This routine simplifies that by permitting consumers to do so without carrying the size around. Reviewed by: jeff@, jhb@ MFC after: 1 week Sponsored by: Rubicon Communications, LLC (Netgate) Differential Revision: https://reviews.freebsd.org/D22790	2020-02-16 00:12:53 +00:00
Konstantin Belousov	a7b61c0af1	sem_remove(): fix the loop that compacts sem array on semaphores removal. As written now, it copies random kernel memory from beyond the bounds of the array. Reported and tested by: pho Reviewed by: markj Sponsored by: The FreeBSD Foundation (kib) MFC after: 1 week Differential revision: https://reviews.freebsd.org/D23694	2020-02-15 23:19:23 +00:00
Konstantin Belousov	4cb6ea7e8e	sem_remove(): add some asserts. Assert that sema[idx] allocation from sem[] is sane. Also assert that sem_mtx is owned, it protects the SEM_ALLOC flag. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation (kib) MFC after: 1 week Differential revision: https://reviews.freebsd.org/D23694	2020-02-15 23:18:02 +00:00
Konstantin Belousov	8095050846	Use designated initializers for seminfo. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation (kib) MFC after: 1 week Differential revision: https://reviews.freebsd.org/D23694	2020-02-15 23:15:42 +00:00
Pawel Biernacki	e0d69c5a88	Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (1 of many) r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are still not MPSAFE (or already are but aren’t properly marked). Use it in preparation for a general review of all nodes. This is non-functional change that adds annotations to SYSCTL_NODE and SYSCTL_PROC nodes using one of the soon-to-be-required flags. Reviewed by: kib, trasz Approved by: kib (mentor) Differential Revision: https://reviews.freebsd.org/D23640	2020-02-15 18:48:38 +00:00
Mateusz Guzik	074ad60a4c	vfs: make write suspension mandatory At the time opt-in was introduced adding yourself as a writer was serializing across the mount point. Nowadays it is fully per-cpu, the only impact being a small single-threaded hit on top of what's there right now. Vast majority of the overhead stems from the call to VOP_GETWRITEMOUNT which is done regardless. Should someone want to microoptimize this single-threaded they can coalesce looking the mount up with adding a write to it.	2020-02-15 13:00:39 +00:00
Mateusz Guzik	eb40664d83	capsicum: use new helpers	2020-02-15 01:30:27 +00:00
Mateusz Guzik	445faddf7f	kqueue: use new capsicum helpers	2020-02-15 01:30:13 +00:00
Mateusz Guzik	32a86c44ee	fd: use new capsicum helpers	2020-02-15 01:28:55 +00:00
Mateusz Guzik	e126c5a3e8	vfs: use new capsicum helpers	2020-02-15 01:28:42 +00:00
Konstantin Belousov	6cf2362e2c	Consolidate read code for timecounters and fix possible overflow in bintime()/binuptime(). The algorithm to read the consistent snapshot of current timehand is repeated in each accessor, including the details of proper rollup detection and synchronization with the writer. In fact there are only two different kinds of readers: one for bintime()/binuptime() which has to do the in-place calculation, and another kind which fetches some member from struct timehand. Extract the logic into type-checked macros, GETTHBINTIME() for bintime calculation, and GETTHMEMBER() for safe read of a structure's member. This way, the synchronization is only written in bintime_off() and getthmember(). In bintime_off(), use overflow-safe calculation of th_scale * delta(timecounter). In tc_windup, pre-calculate the min delta value which overflows and requires the slow algorithm, into the new timehands th_large_delta member. This part with overflow fix was written by Bruce Evans. Reported by: Mark Millard <marklmi@yahoo.com> (the overflow issue) Tested by: pho Discussed with: emaste Sponsored by: The FreeBSD Foundation (kib) MFC after: 3 weeks	2020-02-14 23:27:45 +00:00
Mateusz Guzik	df0d5a2a85	vfs: remove no longer needed atomic_load_ptr casts	2020-02-14 23:18:32 +00:00
Mateusz Guzik	8f86349f8b	fd: remove no longer needed atomic_load_ptr casts	2020-02-14 23:18:22 +00:00
Mateusz Guzik	5bc6a91f54	kcov: remove no longer needed atomic_load_ptr casts	2020-02-14 23:18:03 +00:00
Mateusz Guzik	2f7292437d	Merge audit and systrace checks This further shortens the syscall routine by not having to re-check after the system call.	2020-02-14 13:09:41 +00:00
Mateusz Guzik	0e84a878c0	Annotate branches in the syscall path This in particular significantly shortens amd64_syscall, which otherwise keeps jumping forward over 2KB of code in total. Note some of these branches should be either eliminated altogether or coalesced.	2020-02-14 13:08:46 +00:00
Mateusz Guzik	ba8dd40bb1	lockmgr: add a change missed in r357907	2020-02-14 11:56:50 +00:00
Mateusz Guzik	6ed30ea4c0	fd: annotate finstall with prediction branches	2020-02-14 11:22:12 +00:00
Mateusz Guzik	c1b57fa7d3	lockmgr: rename lock_fast_path to lock_flags The routine is not much of a fast path and the flags name better describes its purpose.	2020-02-14 11:21:28 +00:00
Mateusz Guzik	943c4932f3	lockmgr: retire the unused lockmgr_unlock_fast_path routine	2020-02-14 11:20:25 +00:00
Kyle Evans	0f5f49eff7	u_char -> vm_prot_t in a couple of places, NFC The latter is a typedef of the former; the typedef exists and these bits are representing vmprot values, so use the correct type. Submitted by: sigsys@gmail.com MFC after: 3 days	2020-02-14 02:22:08 +00:00
Mateusz Guzik	6ebab6bad2	vfs: use mac fastpath for lookup, open, read, write, mmap	2020-02-13 22:22:55 +00:00
Mateusz Guzik	7b2ff0dcb2	Partially decompose priv_check by adding priv_check_cred_vfs_generation During buildkernel there are very frequent calls to priv_check and they all are for PRIV_VFS_GENERATION (coming from stat/fstat). This results in branching on several potential privileges checking if perhaps that's the one which has to be evaluated. Instead of the kitchen-sink approach provide a way to have commonly used privs directly evaluated.	2020-02-13 22:22:15 +00:00
Mateusz Guzik	e6081fe899	Inline jailed(). It is constantly called from priv_check.	2020-02-13 22:16:30 +00:00
Mateusz Guzik	8bdcfb10d3	Annotate suser_enabled as __read_mostly It is read a lot in priv code.	2020-02-13 22:16:02 +00:00
Jeff Roberson	1f2a6b8501	Since r357804 pcpu zones are required to use zalloc_pcpu(). Prior to this it was only required if you were zeroing. Switch to these interfaces. Reviewed by: mjg	2020-02-13 21:10:17 +00:00
Jeff Roberson	a4d50e49da	Add more precise SMR entry asserts.	2020-02-13 20:50:21 +00:00
Kyle Evans	b30ab6d8fe	sys/kern sysent: re-add dependency on capabilities.conf r356868 inadvertently removed this, so changes to capabilities.conf were no longer considered for being outdated.	2020-02-12 19:06:34 +00:00
Ed Maste	fe16bad415	regen sysent after r357831, r357838 Capability mode changes allowing fdatasync and getloginclass. Sponsored by: The FreeBSD Foundation	2020-02-12 19:05:10 +00:00
Ed Maste	e953765f15	Allow getloginclass in capability mode As with e.g. getgroups and getlogin it allows querying current process credential state. Reported by: sigsys@gmail.com via kevans Sponsored by: The FreeBSD Foundation	2020-02-12 18:59:00 +00:00
Ed Maste	9cdfb2d69a	Allow fdatasync in capability mode fdatasync is essentially a subset of fsync (and may be exactly fsync, depending on filesystem and development effort) and operates only on a provided fd. MFC after: 1 week Sponsored by: The FreeBSD Foundation	2020-02-12 17:12:26 +00:00
Mateusz Guzik	4602214772	vfs: refactor vputx and add more comment Reviewed by: jeff (previous version) Tested by: pho (previous version) Differential Revision: https://reviews.freebsd.org/D23530	2020-02-12 11:19:07 +00:00
Mateusz Guzik	ed67a63c39	vfs: drop remaining zpcpu casts	2020-02-12 11:18:12 +00:00
Mateusz Guzik	123c519731	vfs: switch to smp_rendezvous_cpus_retry for vfs_op_thread_enter/exit In particular on amd64 this eliminates an atomic op in the common case, trading it for IPIs in the uncommon case of catching CPUs executing the code while the filesystem is getting suspended or unmounted.	2020-02-12 11:17:45 +00:00
Mateusz Guzik	00ac9d2632	rms: use smp_rendezvous_cpus_retry instead of a hand-rolled variant	2020-02-12 11:17:18 +00:00
Mateusz Guzik	e4f584971b	Add smp_rendezvous_cpus_retry This is a wrapper around smp_rendezvous_cpus which enables use of IPI handlers which can fail and require retrying. wait_func argument is added to provide a routine which can be used to poll CPU of interest for when the IPI can be retried. Handlers which succeed must call smp_rendezvous_cpus_done to denote that fact. Discussed with: jeff Differential Revision: https://reviews.freebsd.org/D23582	2020-02-12 11:16:55 +00:00
Mateusz Guzik	3acb6572fc	Store offset into zpcpu allocations in the per-cpu area. This shortens zpcpu_get and allows more optimizations. Reviewed by: jeff Differential Revision: https://reviews.freebsd.org/D23570	2020-02-12 11:11:22 +00:00
Mateusz Guzik	48baf00f54	epoch: convert zpcpu_get_cpua(.., curcpu) to zpcpu_get	2020-02-12 11:10:10 +00:00
Gleb Smirnoff	4426b2e64b	Add flag to struct task to mark the task as requiring network epoch. When processing a taskqueue and a task has associated epoch, then enter for duration of the task. If consecutive tasks belong to the same epoch, batch them. Now we are talking about the network epoch only. Shrink the ta_priority size to 8-bits. No current consumers use a priority that won't fit into 8 bits. Also complexity of taskqueue_enqueue() is a square of maximum value of priority, so we are unlikely to ever want to go over UCHAR_MAX here. Reviewed by: hselasky Differential Revision: https://reviews.freebsd.org/D23518	2020-02-11 18:48:07 +00:00
Mateusz Guzik	57349a4f41	vfs: fix vhold race in mnt_vnode_next_lazy_relock vdrop can set the hold count to 0 and wait for the ->mnt_listmtx held by mnt_vnode_next_lazy_relock caller. The routine incorrectly asserted the count has to be > 0. Reported by: pho Tested by: pho	2020-02-11 18:19:56 +00:00
Mateusz Guzik	1b853b62f3	capsicum: restore the cap_rights_contains symbol It is expected to be provided by libc. PR: 244033 Reported by: Jan Kokemueller	2020-02-11 18:13:53 +00:00
Mateusz Guzik	2e57c8fde7	vfs: fix device count leak on vrele racing with vgone The race is: CPU1 CPU2 devfs_reclaim_vchr make v_usecount 0 VI_LOCK sees v_usecount == 0, no updates vp->v_rdev = NULL; ... VI_UNLOCK VI_LOCK v_decr_devcount sees v_rdev == NULL, no updates In this scenario si_devcount decrement is not performed. Note this can only happen if the vnode lock is not held. Reviewed by: kib Tested by: pho Differential Revision: https://reviews.freebsd.org/D23529	2020-02-10 22:28:54 +00:00
Li-Wen Hsu	37d4ece7c5	Restore the behavior of allowing empty string in a string sysctl Added as a special case to avoid unnecessary memory operations. Reviewed by: delphij Sponsored by: The FreeBSD Foundation	2020-02-10 20:53:59 +00:00
Hans Petter Selasky	f912e8f2ff	Fix for unbalanced EPOCH(9) usage in the generic kernel interrupt handler. Interrupt handlers are removed via intr_event_execute_handlers() when IH_DEAD is set. The thread removing the interrupt is woken up, and calls intr_event_update(). When this happens, the ie_hflags are cleared and re-built from all the remaining handlers sharing the event. When the last IH_NET handler is removed, the IH_NET flag will be cleared from ih_hflags (or ie_hflags may still be being rebuilt in a different context), and the ithread_execute_handlers() may return with ie_hflags missing IH_NET. This can lead to a scenario where IH_NET was present before calling ithread_execute_handlers, and is not present at its return, meaning the need for epoch must be cached locally. This can happen when loading and unloading network drivers. Also make sure the ie_hflags is not cleared before being updated. This is a regression issue after r357004. Backtrace: panic() # trying to access epoch tracker on stack of dead thread _epoch_enter_preempt() ifunit_ref() ifioctl() fo_ioctl() kern_ioctl() sys_ioctl() syscallenter() amd64_syscall() Differential Revision: https://reviews.freebsd.org/D23483 Reviewed by: glebius@, gallatin@, mav@, jeff@ and kib@ Sponsored by: Mellanox Technologies	2020-02-10 20:23:08 +00:00
Mateusz Guzik	cd951a0d8e	vfs: fix lock recursion in vrele vrele is supposed to be called with an unlocked vnode, but this was never asserted for if v_usecount was > 0. For such counts the lock is never touched by the routine. As a result the kernel has several consumers which expect vunref semantics and get away with calling vrele since they happen to never do it when this is the last reference (and for some of them this may happen to be a guarantee). Work around the problem by changing vrele semantics to tolerate being called with a lock. This eliminates a possible bug where the lock is already held and vputx takes it anyway. Reviewed by: kib Tested by: pho Differential Revision: https://reviews.freebsd.org/D23528	2020-02-10 13:54:34 +00:00
Konstantin Belousov	48fcb46311	Add sysctl kern.proc.sigfastblk for reporting sigfastblock word address. Tested by: pho Discussed with: cem, emaste, jilles Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D12773	2020-02-09 12:29:51 +00:00
Konstantin Belousov	944cf37bb5	Add AT_BSDFLAGS auxv entry. The intent is to provide bsd-specific flags relevant to interpreter and C runtime. I did not want to reuse AT_FLAGS which is common ELF auxv entry. Use bsdflags to report kernel support for sigfastblock(2). This allows rtld and libthr to safely infer the syscall presence without SIGSYS. The tunable kern.elf{32,64}.sigfastblock blocks reporting. Tested by: pho Discussed with: cem, emaste, jilles Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D12773	2020-02-09 12:10:37 +00:00
Konstantin Belousov	f88c67a625	Regen.	2020-02-09 11:53:37 +00:00
Konstantin Belousov	146fc63fce	Add a way to manage thread signal mask using shared word, instead of syscall. A new syscall sigfastblock(2) is added which registers a uint32_t variable as containing the count of blocks for signal delivery. Its content is read by kernel on each syscall entry and on AST processing, non-zero count of blocks is interpreted same as the signal mask blocking all signals. The biggest downside of the feature that I see is that memory corruption that affects the registered fast sigblock location, would cause quite strange application misbehavior. For instance, the process would be immune to ^C (but killable by SIGKILL). With consumers (rtld and libthr added), benchmarks do not show a slow-down of the syscalls in micro-measurements, and macro benchmarks like buildworld do not demonstrate a difference. Part of the reason is that buildworld time is dominated by compiler, and clang already links to libthr. On the other hand, small utilities typically used by shell scripts have the total number of syscalls cut by half. The syscall is not exported from the stable libc version namespace on purpose. It is intended to be used only by our C runtime implementation internals. Tested by: pho Discussed with: cem, emaste, jilles Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D12773	2020-02-09 11:53:12 +00:00
Mateusz Guzik	2f7f11b7de	vfs: tidy up vget_finish and vn_lock - remove assertion which duplicates vn_lock - use VNPASS instead of retyping the failure - report what flags were passed if panicking on them	2020-02-08 15:52:20 +00:00
Mateusz Guzik	3eb6b656c2	vfs: remove now useless ENODEV handling from vn_fullpath consumers Noted by: ngie	2020-02-08 15:51:08 +00:00
Konstantin Belousov	300b525d29	Correct the function name in the comment. Sponsored by: The FreeBSD Foundation MFC after: 3 days	2020-02-08 15:06:06 +00:00
Mateusz Guzik	ea77ce6ef9	rms: use newly added zpcpu routines instead of direct access where appropriate	2020-02-07 22:44:41 +00:00
Jeff Roberson	a40068e524	Fix a race in smr_advance() that could result in unnecessary poll calls. This was relatively harmless but surprising to see in counters. The race occurred when rd_seq was read after the goal was updated and we incorrectly calculated the delta between them. Reviewed by: rlibby Differential Revision: https://reviews.freebsd.org/D23464	2020-02-06 20:51:46 +00:00
Jeff Roberson	8d7f16a5db	Add some global counters for SMR. These may eventually become per-smr counters. In my stress test there is only one poll for every 15,000 frees. This means we are effectively amortizing the cache coherency overhead even with very high write rates (3M/s/core). Reviewed by: markj, rlibby Differential Revision: https://reviews.freebsd.org/D23463	2020-02-06 20:10:21 +00:00
Pawel Biernacki	210176ad76	sysctl(9): add CTLFLAG_NEEDGIANT flag Add CTLFLAG_NEEDGIANT flag (modelled after D_NEEDGIANT) that will be used to mark sysctls that still require locking Giant. Rewrite sysctl_handle_string() to use internal locking instead of locking Giant. Mark SYSCTL_STRING, SYSCTL_OPAQUE and their variants as MPSAFE. Add infrastructure support for enforcing proper use of CTLFLAG_NEEDGIANT and CTLFLAG_MPSAFE flags with SYSCTL_PROC and SYSCTL_NODE, not enabled yet. Reviewed by: kib (mentor) Approved by: kib (mentor) Differential Revision: https://reviews.freebsd.org/D23378	2020-02-06 12:45:58 +00:00
Mark Johnston	d3631aa582	Avoid releasing object PIP in vn_sendfile() if no pages were grabbed. sendfile(2) optionally takes a set of headers that get prepended to the file data. If the request length is less than that of the headers, sendfile may not allocate an sfio structure, in which case its pointer is null and we should be careful not to dereference. This was introduced in r356902. Reported by: syzkaller Sponsored by: The FreeBSD Foundation	2020-02-05 16:09:21 +00:00
Leandro Lupori	eb5a41cf2f	Add SYSCTL to get KERNBASE and relocated KERNBASE This change adds 2 new SYSCTLs, to retrieve the original and relocated KERNBASE values. This provides an easy, architecture independent way to calculate the running kernel displacement (current/load address minus original base address). The initial goal for this change is to add a new libkvm function that returns the kernel displacement, both for live kernels and crashdumps. This would in turn be used by kgdb to find out how to relocate kernel symbols (if needed). Reviewed by: jhb Differential Revision: https://reviews.freebsd.org/D23284	2020-02-05 11:34:10 +00:00
Mateusz Guzik	1a9fe4528b	fd: always nullify fdp in fget routines Some consumers depend on the pointer being NULL if an error is returned. The guarantee got broken in r357469. Reported by: https://syzkaller.appspot.com/bug?extid=0c9b05e2b727aae21eef Noted by: markj	2020-02-05 00:20:26 +00:00
Ryan Libby	10c8fb47d9	uma: convert mbuf_jumbo_alloc to UMA_ZONE_CONTIG & tag others Remove mbuf_jumbo_alloc and let large mbuf zones use the new uma default contig allocator (a copy of mbuf_jumbo_alloc). Tag other zones which require contiguous objects, even if they don't use the new default contig allocator, so that uma knows about their constraints. Reviewed by: jeff, markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D23238	2020-02-04 22:40:23 +00:00
Konstantin Belousov	0783b70974	Remove unneeded assert for curproc. Simplify. Reported by: syzkaller by markj Sponsored by: The FreeBSD Foundation	2020-02-04 21:02:08 +00:00
Mark Johnston	60185d649b	Correct the malloc tag used when freeing the temporary semop(2) buffer. MFC after: 1 week Sponsored by: The FreeBSD Foundation	2020-02-04 20:00:45 +00:00
Dmitry Chagin	cbc1089190	For code reuse in Linuxulator rename get_proccess_cputime() and get_thread_cputime() and add prototypes for it to <sys/syscallsubr.h>. As both functions become a public interface add process lock assert to ensure that the process is not exiting under it. Fix whitespace nit while here. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D23340 MFC after 2 weeks	2020-02-04 05:25:51 +00:00
Jeff Roberson	bc6509845d	Implement a deferred write advancement feature that can be used to further amortize shared cacheline writes. Discussed with: rlibby Differential Revision: https://reviews.freebsd.org/D23462	2020-02-04 02:44:52 +00:00
Jeff Roberson	c8ea36e881	Fix a recursion on the thread lock by acquiring it after calling rtp_to_pri(). Reported by: swills Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D23495	2020-02-04 02:42:54 +00:00
Mark Johnston	e489450589	Fix the !SMP case in sched_add() after r355779. If the thread's lock is already that of the runqueue, don't recurse on the queue lock. Reviewed by: jeff, kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D23492	2020-02-03 22:49:05 +00:00
Mateusz Guzik	8151b6e92a	fd: partially unengrish the previous commit	2020-02-03 22:34:50 +00:00
Mateusz Guzik	e10f063b30	fd: streamline fget_unlocked clang has the unfortunate property of paying little attention to prediction hints when faced with a loop spanning the majority of the routine. In particular fget_unlocked has an unlikely corner case where it starts almost from scratch. Faced with this clang generates a maze of taken jumps, whereas gcc produces jump-free code (in the expected case). Work around the problem by providing a variant which only tries once and resorts to calling the original code if anything goes wrong. While here note that the 'seq' parameter is almost never passed, thus the seldom users are redirected to call it directly.	2020-02-03 22:32:49 +00:00
Mateusz Guzik	52604ed792	fd: remove the seq argument from fget_unlocked It is almost always NULL.	2020-02-03 22:27:55 +00:00
Mateusz Guzik	7f1566f884	fd: remove the seq argument from fget routines It is almost always NULL.	2020-02-03 22:27:03 +00:00
Mateusz Guzik	0a1427c5ab	ktrace: provide ktrstat_error This eliminates a branch from its consumers trading it for an extra call if ktrace is enabled for curthread. Given that this is almost never true, the tradeoff is worth it.	2020-02-03 22:26:00 +00:00
Gleb Smirnoff	0017b2adac	A couple of protocol drain routines (frag6_drain and sctp_drain) may send packets. This is unexpected behaviour for a memory reclamation routine. Anyway, we need to enter the network epoch for doing that.	2020-02-03 20:48:57 +00:00
Kyle Evans	3d62f685d5	namei: preserve errors from fget_cap_locked Most notably, we want to make sure we don't clobber any capabilities-related errors. This is a regression from r357412 (O_SEARCH) that was picked up by the capsicum tests. PR: 243839 Reviewed by: kib (committed form recommended by) Tested by: lwhsu Differential Revision: https://reviews.freebsd.org/D23479	2020-02-03 18:59:07 +00:00
Warner Losh	58aa35d429	Remove sparc64 kernel support Remove all sparc64 specific files Remove all sparc64 ifdefs Remove indirect sparc64 ifdefs	2020-02-03 17:35:11 +00:00
Mateusz Guzik	bcd1cf4f03	capsicum: faster cap_rights_contains Instead of doing a 2 iteration loop (determined at runtime), take advantage of the fact that the size is already known. While here provide cap_check_inline so that fget_unlocked does not have to do a function call. Verified with the capsicum suite /usr/tests.	2020-02-03 17:08:11 +00:00
Mateusz Guzik	fee204544e	fd: fix f_count acquire in fget_unlocked The code was using a hand-rolled fcmpset loop, while in other places the same count is manipulated with the refcount API. This transferred from a stylistic issue into a bug after the API got extended to support flags. As a result the hand-rolled loop could bump the count high enough to set the bit flag. Another bump + refcount_release would then free the file prematurely. The bug is only present in -CURRENT.	2020-02-03 14:28:31 +00:00
Mateusz Guzik	f1fa1ba3d0	Fix up various vnode-related asserts which did not dump the used vnode	2020-02-03 14:25:32 +00:00
Kyle Evans	6a5abb1ee5	Provide O_SEARCH O_SEARCH is defined by POSIX [0] to open a directory for searching, skipping permissions checks on the directory itself after the initial open(). This is close to the semantics we've historically applied for O_EXEC on a directory, which is UB according to POSIX. Conveniently, O_SEARCH on a file is also explicitly undefined behavior according to POSIX, so O_EXEC would be a fine choice. The spec goes on to state that O_SEARCH and O_EXEC need not be distinct values, but they're not defined to be the same value. This was pointed out as an incompatibility with other systems that had made its way into libarchive, which had assumed that O_EXEC was an alias for O_SEARCH. This defines compatibility O_SEARCH/FSEARCH (equivalent to O_EXEC and FEXEC respectively) and expands our UB for O_EXEC on a directory. O_EXEC on a directory is checked in vn_open_vnode already, so for completeness we add a NOEXECCHECK when O_SEARCH has been specified on the top-level fd and do not re-check that when descending in namei. [0] https://pubs.opengroup.org/onlinepubs/9699919799/ Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D23247	2020-02-02 16:34:57 +00:00
Mateusz Guzik	2568d5bb79	fd: sprinkle some predicts around fget clang inlines fget -> _fget into kern_fstat and eliminates several checks, but prior to this change it would assume fget_unlocked was likely to fail and consequently avoidable jumps got generated.	2020-02-02 09:38:40 +00:00
Mateusz Guzik	da4f45ea5c	fd: use atomic_load_ptr instead of hand-rolled cast through volatile No change in assembly.	2020-02-02 09:37:16 +00:00
Mateusz Guzik	6698e11f4b	vfs: remove the now empty vop_unlock_post	2020-02-02 09:36:32 +00:00
Mateusz Guzik	7739d92766	cache: replace kern___getcwd with vn_getcwd The previous routine was resulting in extra data copies most notably in linux_getcwd.	2020-02-01 20:38:38 +00:00
Mateusz Guzik	921e7210f8	cache: return the total length from vn_fullpath1 This removes strlen from getcwd.	2020-02-01 20:37:11 +00:00
Mateusz Guzik	4511dd9d41	cache: remove vnode -> path lookup disablement It seems to be of little to no use even when debugging. Interested parties can resurrect it and gate compilation with a macro.	2020-02-01 20:36:35 +00:00
Mateusz Guzik	45757984f8	vfs: consistently use size_t for buflen around VOP_VPTOCNP	2020-02-01 20:34:43 +00:00
Mateusz Guzik	643656cfaf	vfs: replace VOP_MARKATIME with VOP_MMAPPED The routine is only provided by ufs and is only used on mmap and exec. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D23422	2020-02-01 06:46:55 +00:00
Mateusz Guzik	90f4ec3328	vfs: save on atomics on the root vnode for absolute lookups There are 2 back-to-back atomics on the vnode, but we can check upfront if one is sufficient. Similarly we can handle relative lookups where current working directory == root directory. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D23427	2020-02-01 06:40:35 +00:00
Mateusz Guzik	21c4f1041e	vfs: add vrefactn Differential Revision: https://reviews.freebsd.org/D23427	2020-02-01 06:39:49 +00:00
Jeff Roberson	915c367e8e	Add two missing fences with comments describing them. These were found by inspection and after a lengthy discussion with jhb and kib. They have not produced test failures. Don't pointer chase through cpu0's smr. Use cpu correct smr even when not in a critical section to reduce the likelihood of false sharing.	2020-01-31 22:21:15 +00:00
Mark Johnston	1c29da0279	Reimplement stack capture of running threads on i386 and amd64. After r355784 the td_oncpu field is no longer synchronized by the thread lock, so the stack capture interrupt cannot be delivered precisely. Fix this using a loop which drops the thread lock and restarts if the wrong thread was sampled from the stack capture interrupt handler. Change the implementation to use a regular interrupt instead of an NMI. Now that we drop the thread lock, there is no advantage to the latter. Simplify the KPIs. Remove stack_save_td_running() and add a return value to stack_save_td(). On platforms that do not support stack capture of running threads, stack_save_td() returns EOPNOTSUPP. If the target thread is running in user mode, stack_save_td() returns EBUSY. Reviewed by: kib Reported by: mjg, pho Tested by: pho Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D23355	2020-01-31 15:43:33 +00:00
Mateusz Guzik	0f4d8b77c0	vfs: revert the overzealous assert added in r357285 to vgone The intent was to make it more likely to catch filesystems with custom need_inactive routines which fail to call vn_need_pageq_flush (or do an equivalent). One immediate case which is missed is vgone called by inactive itself. A better assertion may land later. The routine is not added to vputx because it is of no use to tmpfs et al. Reported by: syzbot+5f697ec11f89b60941db@syzkaller.appspotmail.com	2020-01-31 11:31:14 +00:00
Mateusz Guzik	1a78ac2416	Add rms_try_rlock and rms_wowned.	2020-01-31 08:36:49 +00:00
Mateusz Guzik	cedad2916e	Remove an overzealous assert from rms_runlock.	2020-01-31 08:36:23 +00:00
Jeff Roberson	da6e9935e4	Don't use "All rights reserved" in new copyrights. Requested by: rgrimes	2020-01-31 02:08:09 +00:00
Jeff Roberson	d4665eaa66	Implement a safe memory reclamation feature that is tightly coupled with UMA. This is in the same family of algorithms as Epoch/QSBR/RCU/PARSEC but is a unique algorithm. This has 3x the performance of epoch in a write heavy workload with less than half of the read side cost. The memory overhead is significantly lessened by limiting the free-to-use latency. A synthetic test uses 1/20th of the memory vs Epoch. There is significant further discussion in the comments and code review. This code should be considered experimental. I will write a man page after it has settled. After further validation the VM will begin using this feature to permit lockless page lookups. Both markj and cperciva tested on arm64 at large core counts to verify fences on weaker ordering architectures. I will commit a stress testing tool in a follow-up. Reviewed by: mmacy, markj, rlibby, hselasky Discussed with: sbahara Differential Revision: https://reviews.freebsd.org/D22586	2020-01-31 00:49:51 +00:00
Mateusz Guzik	3ff65f71cb	Remove duplicated empty lines from kern/*.c No functional changes.	2020-01-30 20:05:05 +00:00
Mateusz Guzik	2823710f05	Tidy up 2 comments in smp_rendezvous_cpus.	2020-01-30 20:02:14 +00:00
Mateusz Guzik	7ab99925fd	Assert that smp_rendezvous_cpus is called with interrupts enabled.	2020-01-30 19:38:51 +00:00
Mateusz Guzik	d53d924f60	vfs: keep the mount point referenced across sys_quotactl Otherwise we risk running into use-after-free. In particular this codepath ends up dropping all protection before suspending writes: ufs_quotactl -> quotaoff_inchange -> vfs_write_suspend_umnt Reported by: pho	2020-01-30 19:38:12 +00:00
John Baldwin	fbb9879c0c	Fix use of an uninitialized variable. ctx (and thus ctx.flags) is stack garbage at the start of this function, so initialize ctx.flags to an explicit value instead of using binary operations on the garbage. Reported by: gcc9 Reviewed by: imp Differential Revision: https://reviews.freebsd.org/D23368	2020-01-30 18:28:02 +00:00
Mateusz Guzik	c2ef6aa3d5	vfs: assert that doomed vnodes don't need to call vm_object_page_clean ... after the optional inactive processing.	2020-01-30 04:59:08 +00:00
Mateusz Guzik	07c6e2f4ab	vfs: unlazy before dooming the vnode With this change having the listmtx lock held postpones dooming the vnode. Use this fact to simplify iteration over the lazy list. It also allows filters to safely access ->v_data. Reviewed by: kib (early version) Differential Revision: https://reviews.freebsd.org/D23397	2020-01-30 02:12:52 +00:00
Gleb Smirnoff	79674264df	Fix text format definition for kern.maxvnodes, vfs.wantfreevnodes. This is a regression from r356642, r356645.	2020-01-30 00:18:00 +00:00
Conrad Meyer	07a65f9d38	hwpstate_intel(4): Silence/fix Coverity reports These were all introduced in the initial import of hwpstate_intel(4). Reported by: Coverity CIDs: 1413161, 1413164, 1413165, 1413167 X-MFC-With: r357002	2020-01-29 03:15:34 +00:00
Warner Losh	42ec4f05a3	Make mqueue objects work across a fork again. In r110908 (2003) alfred added DFLAG_PASSABLE to tag those types of FD that can be passed via unix pipes, but mqueuefs didn't exist yet. Later, in r152825 (2005) davidxu neglected to include DFLAG_PASSABLE since people don't normally pass these things via unix sockets (it's a FreeBSD implementation detail that it's a file descriptor, nobody noticed). Then r223866 (2011) by jonathan used the new flag in fdcopy, which fork uses. Due to that, mqueuefs actually broke mqueue objects being propagated by fork. No mention of mqueuefs was made in r223866, so I think it was an unintended consequence. Fix this by tagging mqueuefs as passable as well. They were prior to alfred's change (and it's clear there's no intent in his change to change this behavior), and POSIX requires this to be the case as well. PR: 243103 Reviewed by: kib@, jiles@ Differential Revision: https://reviews.freebsd.org/D23038	2020-01-27 22:36:54 +00:00
John Baldwin	425e5f9dcf	Revert accidental change from r357146.	2020-01-26 14:23:27 +00:00
John Baldwin	c73222d0e6	Fix some misleading indentation warnings reported by recent clang. These should not be any functional change. While the change in emul10kx-pcm.c looks like a real bug fix (as opposed to inconsistent whitespace), the extra statements were not harmful. Reviewed by: kib Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D23363	2020-01-26 14:20:57 +00:00
Mateusz Guzik	1513f80391	vfs: do an unlocked check before iterating the lazy list For most filesystems it is expected to be empty most of the time.	2020-01-26 07:06:18 +00:00
Mateusz Guzik	cd0e46c66b	vfs: remove vop loop from vop_sigdefer All ops are guaranteed to be present since r357131.	2020-01-26 07:05:06 +00:00
Mateusz Guzik	6d69e665dd	vfs: fix freevnodes count update race against preemption vdbatch_process leaves the critical section too early, opening a time window where another thread can get scheduled and modify vd->freevnodes. Once the preempted thread gets back it overrides the value with 0. Just move critical_exit to the end of the function.	2020-01-26 00:40:27 +00:00
Mateusz Guzik	dc9a1cb60b	vfs: predict vn_lock failure as unlikely in vget	2020-01-26 00:34:57 +00:00
Jason A. Harmening	a9aa06f7b1	Implement cycle-detecting garbage collector for AF_UNIX sockets The existing AF_UNIX socket garbage collector destroys any socket which may potentially be in a cycle, as indicated by its file reference count being equal to its enqueue count. However, this can produce false positives for in-flight sockets which aren't part of a cycle but are part of one or more SCM_RIGHTS messages and which have been closed on the sending side. If the garbage collector happens to run at exactly the wrong time, destruction of these sockets will render them unusable on the receiving side, such that no previously-written data may be read. This change rewrites the garbage collector to precisely detect cycles: 1. The existing check of msgcount==f_count is still used to determine whether the socket is potentially in a cycle. 2. The socket is now placed on a local "dead list", which is used to reduce iteration time (and therefore contention on the global unp_link_rwlock). 3. The first pass through the dead list removes each potentially-dead socket's outgoing references from the graph of potentially-dead sockets, using a gc-specific copy of the original reference count. 4. The second series of passes through the dead list removes from the list any socket whose remaining gc refcount is non-zero, as this indicates the socket is actually accessible outside of any possible cycle. Iteration is repeated until no further sockets are removed from the dead list. 5. Sockets remaining in the dead list are destroyed as before. PR: 227285 Submitted by: jan.kokemueller@gmail.com (prior version) Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D23142	2020-01-25 08:57:26 +00:00
Mark Johnston	a89c2c8c34	Revert r357050. It seems to have introduced a couple of regressions. Reported by: cy, pho	2020-01-24 14:58:02 +00:00
Edward Tomasz Napierala	b3fb13eb55	Add kern_unmount() and use in Linuxulator. No functional changes. Reviewed by: kib MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D22646	2020-01-24 11:57:55 +00:00
Mateusz Guzik	28eb39a5ab	vfs: allow v_usecount to transition 0->1 without the interlock There is nothing to do but to bump the count even during said transition. There are 2 places which can do it: - vget only does this after locking the vnode, meaning there is no change in contract versus inactive or reclamation - vref only ever did it with the interlock held which did not protect against either (that is, it would always succeed) VCHR vnodes retain special casing due to the need to maintain dev use count. Reviewed by: jeff, kib Tested by: pho (previous version) Differential Revision: https://reviews.freebsd.org/D23185	2020-01-24 07:47:44 +00:00
Mateusz Guzik	d93762b94d	vfs: stop handling VI_OWEINACT in vget vget is almost always called with LK_SHARED, meaning the flag (if present) is almost guaranteed to get cleared. Stop handling it in the first place and instead let the thread which wanted to do inactive handle the bumped usecount. Reviewed by: jeff Tested by: pho Differential Revision: https://reviews.freebsd.org/D23184	2020-01-24 07:45:59 +00:00
Mateusz Guzik	74c4b7cc60	vfs: stop unlocking the vnode upfront in vput Doing so runs into races with filesystems which make half-constructed vnodes visible to other users, while depending on the chain vput -> vinactive -> vrecycle to be executed without dropping the vnode lock. Impediments for making this work got cleared up (notably vop_unlock_post now does not do anything and lockmgr stops touching the lock after the final write). Stacked filesystems keep vhold/vdrop across unlock, which arguably can now be eliminated. Reviewed by: jeff Differential Revision: https://reviews.freebsd.org/D23344	2020-01-24 07:44:25 +00:00
Mateusz Guzik	c00115f108	lockmgr: don't touch the lock past unlock This evens it up with other locking primitives. Note lock profiling still touches the lock, which again is in line with the rest. Reviewed by: jeff Differential Revision: https://reviews.freebsd.org/D23343	2020-01-24 07:42:57 +00:00
Mark Johnston	1bfca40c57	Set td_oncpu before dropping the thread lock during a switch. After r355784 we no longer hold a thread's thread lock when switching it out. Preserve the previous synchronization protocol for td_oncpu by setting it together with td_state, before dropping the thread lock during a switch. Reported and tested by: pho Reviewed by: kib Discussed with: jeff Differential Revision: https://reviews.freebsd.org/D23270	2020-01-23 16:24:51 +00:00
Jeff Roberson	91e31c3c08	Consistently use busy and vm_page_valid() rather than touching page bits directly. This improves API compliance, asserts, etc. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D23283	2020-01-23 04:54:49 +00:00
Jeff Roberson	1eb13fce84	Block the thread lock in sched_throw() and use cpu_switch() to unblock it. The introduction of lockless switch in r355784 created a race to re-use the exiting thread that was only possible to hit on a hypervisor. Reported/Tested by: rlibby Discussed with: rlibby, jhb	2020-01-23 03:36:50 +00:00
Gleb Smirnoff	ad3980121b	DEVICE_POLLING is an alternative to network interrupts and also needs to enter epoch. Assert that in the netisr_poll() and do the work for the idle poll routine.	2020-01-23 01:30:50 +00:00
Gleb Smirnoff	511d1afb6b	Enter the network epoch for interrupt handlers of INTR_TYPE_NET. Provide tunable to limit how many times handlers may be executed without reentering epoch. Differential Revision: https://reviews.freebsd.org/D23242	2020-01-23 01:24:47 +00:00
Gleb Smirnoff	c4eb66309f	Add ie_hflags to struct intr_event, which accumulates flags from all handlers on this event. For now handle only IH_ENTROPY in that manner.	2020-01-23 01:20:59 +00:00
Conrad Meyer	4577cf3744	cpufreq(4): Add support for Intel Speed Shift Intel Speed Shift is Intel's technology to control frequency in hardware, with hints from software. Let's get a working version of this in the tree and we can refine it from here. Submitted by: bwidawsk, scottph Reviewed by: bcr (manpages), myself Discussed with: jhb, kib (earlier versions) With feedback from: Greg V, gallatin, freebsdnewbie AT freenet.de Relnotes: yes Differential Revision: https://reviews.freebsd.org/D18028	2020-01-22 23:28:42 +00:00
Hans Petter Selasky	1f69a50940	Make sure the VNET is properly set when calling tcp_drop() from the ktls taskqueue callback function. A valid VNET is needed when updating statistics. panic() tcp_state_change() tcp_drop() ktls_reset_send_tag() taskqueue_run_locked() taskqueue_thread_loop() Sponsored by: Mellanox Technologies	2020-01-21 11:43:25 +00:00
Mateusz Guzik	6403455301	cache: revert r352613 now that vhold does not take locks	2020-01-20 19:52:23 +00:00
Mateusz Guzik	8bba93c7e0	cache: make numcachehv use counter(9) on all archs Requested by: kib	2020-01-20 14:42:11 +00:00
Jeff Roberson	d6e13f3b4d	Don't hold the object lock while calling getpages. The vnode pager does not want the object lock held. Moving this out allows further object lock scope reduction in callers. While here add some missing paging in progress calls and an assert. The object handle is now protected explicitly with pip. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D23033	2020-01-19 23:47:32 +00:00
Mateusz Guzik	a9099e5b10	vfs: switch vop_stdunlock to call lockmgr_unlock Since the flags argument is now always 0 the new call provides the same behavior.	2020-01-19 21:41:34 +00:00
Jeff Roberson	811d05fcb7	Provide an API for interlocked refcount sleeps. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D22908	2020-01-19 18:18:17 +00:00
Mateusz Guzik	28479aaae2	vfs: allow v_holdcnt to transition 0->1 without the interlock Since r356672 ("vfs: rework vnode list management") there is nothing to do apart from altering freevnodes count, but this much can be safely done based on the result of atomic_fetchadd. Reviewed by: kib Tested by: pho Differential Revision: https://reviews.freebsd.org/D23186	2020-01-19 17:47:04 +00:00
Mateusz Guzik	059cb4843b	cache: counter_u64_add_protected -> counter_u64_add Fixes booting on RISC-V where it does happen to not be equivalent. Reported by: lwhsu	2020-01-19 17:05:26 +00:00
Mateusz Guzik	1399033590	cache: convert numcachehv to counter(9) on 64-bit platforms	2020-01-19 05:37:27 +00:00
Mateusz Guzik	512fa9a4e0	vfs: plug a conditional assignment of lo_name in getnewvnode It only matters for witness. No functional changes.	2020-01-19 05:36:45 +00:00
Kyle Evans	05d7dd739c	sysent targets: further cleanup and deduplication r355473 vastly improved the readability and cleanliness of these Makefiles. Every single one of them follows the same pattern and duplicates the exact same logic. Now that we have GENERATED/SRCS, split SRCS up into the two parameters we'll use for ${MAKESYSCALLS} rather than assuming a specific ordering of SRCS and include a common sysent.mk to handle the rest. This makes it less tedious to make sweeping changes. Some default values are provided for GENERATED/SYSENT_*; almost all of these just use a 'syscalls.master' and 'syscalls.conf' in cwd, and they all use effectively the same filenames with an arbitrary prefix. Most ABIs will be able to get away with just setting GENERATED_PREFIX and including ^/sys/conf/sysent.mk, while others only need light additions. kern/Makefile is the notable exception, as it doesn't take a SYSENT_CONF and the generated files are spread out between ^/sys/kern and ^/sys/sys, but it otherwise fits the pattern enough to use the common version. Reviewed by: brooks, imp Nice!: emaste Differential Revision: https://reviews.freebsd.org/D23197	2020-01-18 20:37:45 +00:00
Mateusz Guzik	2d0c620272	vfs: distribute freevnodes counter per-cpu It gets rolled up to the global when deferred requeueing is performed. A dedicated read routine makes sure to return a value only off by a certain amount. This soothes a global serialisation point for all 0<->1 hold count transitions. Reviewed by: jeff Differential Revision: https://reviews.freebsd.org/D23235	2020-01-18 01:29:02 +00:00
Mateusz Guzik	d3cc535474	vfs: provide F_ISUNIONSTACK as a kludge for libc Prior to introduction of this op libc's readdir would call fstatfs(2), in effect unnecessarily copying kilobytes of data just to check fs name and a mount flag. Reviewed by: kib (previous version) Differential Revision: https://reviews.freebsd.org/D23162	2020-01-17 14:42:25 +00:00
Mateusz Guzik	1ad72b270c	vfs: shorten lock hold time in vdbatch_process	2020-01-17 14:39:00 +00:00
Gleb Smirnoff	66c6c556b6	Change argument order of epoch_call() to more natural, first function, then its argument. Reviewed by: imp, cem, jhb	2020-01-17 06:10:24 +00:00
Mateusz Guzik	66f67d5e5e	vfs: increment numvnodes without the vnode list lock unless under pressure The vnode list lock is only needed to reclaim free vnodes or kick the vnlru thread (or to block and not miss a wake up (but note the sleep has a timeout so this would not be a correctness issue)). Try to get away without the lock by just doing an atomic increment. The lock is contended e.g., during poudriere -j 104 where about half of all acquires come from vnode allocation code. Note the entire scheme needs a rewrite, the above just reduces its SMP impact. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D23140	2020-01-16 21:45:21 +00:00
Mateusz Guzik	b7f50b9ad1	vfs: refactor vnode allocation Semantics are almost identical. Some code is deduplicated and there are fewer memory accesses. Reviewed by: kib, jeff Differential Revision: https://reviews.freebsd.org/D23158	2020-01-16 21:43:13 +00:00
Mateusz Guzik	875cfc082d	vfs: reimplement vlrureclaim to actually use LRU Take advantage of global ordering introduced in r356672. Reviewed by: mckusick (previous version) Differential Revision: https://reviews.freebsd.org/D23067	2020-01-16 10:44:02 +00:00
Jeff Roberson	a81c400e75	Simplify VM and UMA startup by eliminating boot pages. Instead use careful ordering to allocate early pages in the same way boot pages were but only as needed. After the KVA allocator has started up we allocate the KVA that we consumed during boot. This also makes the boot pages freeable since they have vm_page structures allocated with the rest of memory. Parts of this patch were written and tested by markj. Reviewed by: glebius, markj Differential Revision: https://reviews.freebsd.org/D23102	2020-01-16 05:01:21 +00:00
Kirk McKusick	bbb1e07d65	Peter Holm reports that his test that does an umount(8) on an active mount point, while numerous tests are running that are writing to files on that mount point, causes the umount(8) to hang forever. The unmount(2) system call is handled in the kernel by the dounmount() function. The cause of the hang is that prior to dounmount() calling VFS_UNMOUNT() it is calling VFS_SYNC(mp, MNT_WAIT). The MNT_WAIT flag indicates that VFS_SYNC() should not return until all the dirty buffers associated with the mount point have been written to disk. Because user processes are allowed to continue writing and can do so faster than the data can be written to disk, the call to VFS_SYNC() can never finish. Unlike VFS_SYNC(), the VFS_UNMOUNT() routine can suspend all processes when they request to do a write thus having a finite number of dirty buffers to write that cannot be expanded. There is no need to call VFS_SYNC() before calling VFS_UNMOUNT(), because VFS_UNMOUNT() needs to flush everything again anyway after suspending writes, to catch anything that was dirtied between the VFS_SYNC() and writes being suspended. The fix is to simply remove the unnecessary call to VFS_SYNC() from dounmount(). Reported by: Peter Holm Analysis by: Chuck Silvers Tested by: Peter Holm MFC after: 7 days Sponsored by: Netflix	2020-01-15 18:53:32 +00:00
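
The reasoning in commit bbb1e07d65 (a wait-for-all-dirty-buffers sync cannot finish while writers keep up with the flusher, whereas suspending writes leaves a finite amount of work) can be illustrated with a toy userland model. This is only a sketch and not kernel code: `drain()`, its parameters, and the per-tick rates are hypothetical names and simplifications introduced here for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model (hypothetical, not FreeBSD code): each tick the flusher
 * writes up to flush_rate dirty buffers; unless writes are suspended,
 * writers then dirty write_rate new buffers.
 * Returns the number of ticks needed to reach zero dirty buffers, or
 * -1 if the backlog is still non-empty after max_ticks.
 */
static int
drain(int dirty, int flush_rate, int write_rate, bool suspend_writes,
    int max_ticks)
{
	assert(flush_rate > 0 && write_rate >= 0 && dirty >= 0);

	for (int tick = 0; tick < max_ticks; tick++) {
		if (dirty == 0)
			return (tick);
		/* Flush as much as the rate allows this tick. */
		dirty -= (dirty < flush_rate) ? dirty : flush_rate;
		/* Writers refill the backlog unless they are suspended. */
		if (!suspend_writes)
			dirty += write_rate;
	}
	return (dirty == 0 ? max_ticks : -1);
}
```

In this model, `drain(100, 10, 10, false, 1000)` never reaches zero because writers replace every buffer the flusher retires, while `drain(100, 10, 10, true, 1000)` finishes after 10 ticks once writes are suspended.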

17441 Commits