freebsd-skq

Author	SHA1	Message	Date
Mark Johnston	435c7cfb24	Rename _cscan_atomic.h and _cscan_bus.h to atomic_san.h and bus_san.h Other kernel sanitizers (KMSAN, KASAN) require interceptors as well, so put these in a more generic place as a step towards importing the other sanitizers. No functional change intended. MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D29103	2021-03-08 12:39:06 -05:00
Mark Johnston	7995dae9d3	posix timers: Improve the overrun calculation timer_settime(2) may be used to configure a timeout in the past. If the timer is also periodic, we also try to compute the number of timer overruns that occurred between the initial timeout and the time at which the timer fired. This is done in a loop which iterates once per period between the initial timeout and now. If the period is small and the initial timeout was a long time ago, this loop can take forever to run, so the system is effectively DOSed. Replace the loop with a more direct calculation of (now - initial timeout) / period to compute the number of overruns. Reported by: syzkaller Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D29093	2021-03-08 12:39:06 -05:00
Mark Johnston	60d12ef952	posix timers: Sprinkle some style fixes MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-03-08 12:39:06 -05:00
Mark Johnston	8ff2b41c05	posix timers: Declare unexported functions as static MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-03-08 12:39:06 -05:00
Konstantin Belousov	56b9bee63a	Make kern.timecounter.hardware tunable Noted and reviewed by: kevans MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D29122	2021-03-08 03:48:21 +02:00
Hans Petter Selasky	c743a6bd4f	Implement mallocarray_domainset(9) variant of mallocarray(9). Reviewed by: kib @ MFC after: 1 week Sponsored by: Mellanox Technologies // NVIDIA Networking	2021-03-06 11:38:55 +01:00
Konstantin Belousov	ead7697f04	Restore AT_RESOLVE_BENEATH support for funlinkat(2)/unlinkat(2). MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-03-06 07:24:18 +02:00
Mark Johnston	89b650872b	ktls: Hide initialization message behind bootverbose We don't typically print anything when a subsystem initializes itself, and KTLS is currently disabled by default anyway. Reviewed by: jhb MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D29097	2021-03-05 13:11:02 -05:00
Mark Johnston	5e6989ba4f	link_elf_obj: Handle init_array sections in KLDs Reuse existing handling for .ctors, print a warning if multiple constructor sections are present. Destructors are not handled as of yet. This is required for KASAN. Reviewed by: kib MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D29049	2021-03-04 10:07:10 -05:00
Mark Johnston	49f6925ca3	ktls: Cache output buffers for software encryption Maintain a cache of physically contiguous runs of pages for use as output buffers when software encryption is configured and in-place encryption is not possible. This makes allocation and free cheaper since in the common case we avoid touching the vm_page structures for the buffer, and fewer calls into UMA are needed. gallatin@ reports a ~10% absolute decrease in CPU usage with sendfile/KTLS on a Xeon after this change. It is possible that we will not be able to allocate these buffers if physical memory is fragmented. To avoid frequently calling into the physical memory allocator in this scenario, rate-limit allocation attempts after a failure. In the failure case we fall back to the old behaviour of allocating a page at a time. N.B.: this scheme could be simplified, either by simply using malloc() and looking up the PAs of the pages backing the buffer, or by falling back to page by page allocation and creating a mapping in the cache zone. This requires some way to save a mapping of an M_EXTPG page array in the mbuf, though. m_data is not really appropriate. The second approach may be possible by saving the mapping in the plinks union of the first vm_page structure of the array, but this would force a vm_page access when freeing an mbuf. Reviewed by: gallatin, jhb Tested by: gallatin Sponsored by: Ampere Computing Submitted by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D28556	2021-03-03 17:34:01 -05:00
Alan Somers	ab63da3564	Speed up geom_stats_resync in the presence of many devices The old code had an O(n) loop, where n is the size of /dev/devstat. Multiply that by another O(n) loop in devstat_mmap for a total of O(n^2). This change adds DIOCGMEDIASIZE support to /dev/devstat so userland can quickly determine the right amount of memory to map, eliminating the O(n) loop in userland. This change decreases the time to run "gstat -bI0.001" with 16,384 md devices from 29.7s to 4.2s. Also, fix a memory leak first reported as PR 203097. Sponsored by: Axcient Reviewed by: mav, imp MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D28968	2021-03-02 18:33:45 -07:00
Robert Wing	0cc746f193	filt_fsevent: only record interested events Respect filter-specific flags for the EVFILT_FS filter. When a kevent is registered with the EVFILT_FS filter, it is always triggered when an EVFILT_FS event occurs, regardless of the filter-specific flags used. Fix that. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D28974	2021-03-02 14:19:22 -09:00
Konstantin Belousov	28cd3a673e	O_RELATIVE_BENEATH: return ENOTCAPABLE instead of EINVAL for abs path Requested and reviewed by: markj Tested by: arichardson, pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D28907	2021-03-02 20:21:40 +02:00
Konstantin Belousov	49c98a4bf3	nameicap_check_dotdot: trim tracker on check Tracker should contain exactly the path from the starting directory to the current lookup point. Otherwise we might not detect some cases of dotdot escape. Consequently, if we are walking up the tree by dotdot lookup, we must remove any entries below the walked directory. Reviewed by: markj Tested by: arichardson, pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D28907	2021-03-02 20:21:35 +02:00
Konstantin Belousov	e8a2862aa0	Add nameicap_cleanup_from(), to clean tracker list starting from some element Reviewed by: markj Tested by: arichardson, pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D28907	2021-03-02 20:21:30 +02:00
Konstantin Belousov	2388ad7c29	nameicap_tracker_add: avoid duplicates in the tracker list Reviewed by: markj Tested by: arichardson, pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D28907	2021-03-02 20:21:23 +02:00
Konstantin Belousov	59e7494281	Do not call nameicap_tracker_add() for dotdot case. Reviewed by: markj Tested by: arichardson, pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D28907	2021-03-02 20:21:14 +02:00
Konstantin Belousov	20e91ca36a	open(2): Remove O_BENEATH and AT_BENEATH with the reasoning that the flags did not work properly, and were not shipped in a release. O_RESOLVE_BENEATH is kept as useful. Reviewed by: markj Tested by: arichardson, pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D28907	2021-03-02 20:16:55 +02:00
Kyle Evans	60c4ec806d	jail: allow root to implicitly widen its cpuset to attach The default behavior for attaching processes to jails is that the jail's cpuset augments the attaching processes, so that it cannot be used to escalate a user's ability to take advantage of more CPUs than the administrator wanted them to. This is problematic when root needs to manage jails that have disjoint sets with whatever process is attaching, as this would otherwise result in a deadlock. Therefore, if we did not have an appropriate common subset of cpus/domains for our new policy, we now allow the process to simply take on the jail set if it has the privilege to widen its mask anyways. With the new logic, root can still usefully cpuset a process that attaches to a jail with the desire of maintaining the set it was given pre-attachment while still retaining the ability to manage child jails without jumping through hoops. A test has been added to demonstrate the issue; cpuset of a process down to just the first CPU and attempting to attach to a jail without access to any of the same CPUs previously resulted in EDEADLK and now results in taking on the jail's mask for privileged users. PR: 253724 Reviewed by: jamie (also discussed with) MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D28952	2021-03-01 12:38:31 -06:00
Rick Macklem	a5f9fe2bab	copy_file_range(2): Fix for small values of input file offset and len r366302 broke copy_file_range(2) for small values of input file offset and len. It was possible for rem to be greater than len and then "len - rem" was a large value, since both variables are unsigned. Reported by: koobs, Pablo <pablogsal gmail com> (Python) Reviewed by: asomers, koobs MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D28981	2021-03-01 06:31:10 -08:00
Konstantin Belousov	b5449c92b4	Use atomic_interrupt_fence() instead of bare __compiler_membar() for the uses which definitely use membar to sync with interrupt handlers. libc and rtld uses of __compiler_membar() seem to want compiler barriers proper. The barrier in sched_unpin_lite() after td_pinned decrement seems to be not needed and is removed instead of converted. Reviewed by: markj MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D28956	2021-02-28 01:27:29 +02:00
Mateusz Guzik	1239a72221	cache: temporarily drop the assert that dvp != vp when adding an entry Historically it was allowed for any names, but arguably should never be even attempted. Allow it again since there is a release pending and allowing it is bug-compatible with previous behavior. Reported by: otis	2021-02-27 22:29:50 +00:00
Jamie Gritton	589e4c1df4	jail: Add safety around prison_deref() flags. do_jail_attach() now only uses the PD_XXX flags that refer to lock status, so make sure that something else like PD_KILL doesn't slip through. Add a KASSERT() in prison_deref() to catch any further PD_KILL misuse.	2021-02-25 20:10:42 -08:00
Jamie Gritton	108a9384e9	jail: Fix locking on an early jail_set error. I had locked allprison_lock without immediately setting PD_LIST_LOCKED.	2021-02-25 19:52:58 -08:00
John Baldwin	90972f0402	ktls: Use COUNTER_U64_DEFINE_EARLY for the ktls_toe_chacha20 counter. I missed updating this counter when rebasing the changes in `9c64fc4029` after the switch to COUNTER_U64_DEFINE_EARLY in `1755b2b989`. Fixes: `9c64fc4029` Add Chacha20-Poly1305 as a KTLS cipher suite. Sponsored by: Netflix	2021-02-25 15:00:13 -08:00
Ryan Libby	d7671ad8d6	Close races in vm object chain traversal for unlock We were unlocking the vm object before reading the backing_object field. In the meantime, the object could be freed and reused. This could cause us to go off the rails in the object chain traversal, failing to unlock the rest of the objects in the original chain and corrupting the lock state of the victim chain. Reviewed by: bdrewery, kib, markj, vangyzen MFC after: 3 days Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D28926	2021-02-25 12:11:19 -08:00
Edward Tomasz Napierala	e7a5b3bd05	Modify lock_delay() to increase the delay time after spinning Modify lock_delay() to increase the delay time after spinning, not before. Previously we would spin at least twice instead of once. In NetApp's benchmarks this fixes a performance regression compared to FreeBSD 10, which called cpu_spinwait() directly. Reviewed By: mjg Sponsored by: NetApp, Inc. Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D27331	2021-02-25 18:55:26 +00:00
Mark Johnston	369706a6f8	buf: Fix the dirtybufthresh check dirtybufthresh is a watermark, slightly below the high watermark for dirty buffers. When a delayed write is issued, the dirtying thread will start flushing buffers if the dirtybufthresh watermark is reached. This helps ensure that the high watermark is not reached, otherwise performance will degrade as clustering and other optimizations are disabled (see buf_dirty_count_severe()). When the buffer cache was partitioned into "domains", the dirtybufthresh threshold checks were not updated. Fix this. Reported by: Shrikanth R Kamath <kshrikanth@juniper.net> Reviewed by: rlibby, mckusick, kib, bdrewery Sponsored by: Juniper Networks, Inc., Klara, Inc. Fixes: `3cec5c77d6` MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D28901	2021-02-25 10:04:44 -05:00
Mark Johnston	faa998f6ff	sendfile: Use the pager size to determine the file extent when possible Previously sendfile would issue a VOP_GETATTR and use the returned size, i.e., the file size. When paging in file data, sendfile_swapin() will use the pager to determine whether it needs to zero-fill, most often because of a hole in a sparse file. An attempt to page in beyond the end of a file is treated this way, and occurs when the requested page is past the end of the pager. In other words, both the file size and pager size were used interchangeably. With ZFS, updates to the pager and file sizes are not synchronized by the exclusive vnode lock, at least partially due to its use of MNTK_SHARED_WRITES. In particular, the pager size is updated after the file size, so in the presence of a writer concurrently extending the file, sendfile could incorrectly instantiate "holes" in the page cache pages backing the file, which manifests as data corruption when reading the file back from the page cache. The on-disk copy is unaffected. Fix this by consistently using the pager size when available. Reported by: dumbbell Reviewed by: chs, kib Tested by: dumbbell, pho MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D28811	2021-02-25 10:04:44 -05:00
Jamie Gritton	c861373bdf	jail: re-commit `811e27fa3c` with fixes Make sure PD_KILL isn't passed to do_jail_attach, where it might end up trying to kill the caller's prison (even prison0). Fix the child jail loop in prison_deref_kill, which was doing the post-order part during the pre-order part. That's not a system-killer, but made jails not always die correctly.	2021-02-24 21:54:49 -08:00
Jamie Gritton	ddfffb41a2	jail: back out `811e27fa3c` until it doesn't break Jenkins Reported by: arichardson	2021-02-24 21:10:47 -08:00
Mark Johnston	1d44514fcd	rmlock: Add a required compiler membar to the rlock slow path The tracker flags need to be loaded only after the tracker is removed from its per-CPU queue. Otherwise, readers may fail to synchronize with pending writers attempting to propagate priority to active readers, and readers and writers deadlock on each other. This was observed in a stable/12-based armv7 kernel where the compiler had reordered the load of rmp_flags to before the stores updating the queue. Reviewed by: rlibby, scottl Discussed with: kib Sponsored by: Rubicon Communications, LLC ("Netgate") MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D28821	2021-02-23 21:17:12 -05:00
Alex Richardson	fa32350347	close_range: add audit support This fixes the closefrom test in sys/audit. Includes cherry-picks of the following commits from openbsm: `4dfc628aaf` `99ff6fe32a` `da48a0399e` Reviewed By: kevans Differential Revision: https://reviews.freebsd.org/D28388	2021-02-23 17:47:07 +00:00
Jamie Gritton	0a2a96f35a	jail: Don't allow jails under dying parents If a jail is created with jail_set(...JAIL_DYING), and it has a parent currently in a dying state, that will bring the parent jail back to life. Restrict that to require that the parent itself be explicitly brought back first, and not implicitly created along with the new child jail. Differential Revision: https://reviews.freebsd.org/D28515	2021-02-22 17:04:06 -08:00
Jamie Gritton	701d6b50ae	jail: Fix a LOR introduced in `1158508a80`	2021-02-22 15:51:10 -08:00
Jamie Gritton	811e27fa3c	jail: Add PD_KILL to remove a prison in prison_deref(). Add the PD_KILL flag that instructs prison_deref() to take steps to actively kill a prison and its descendents, namely marking it PRISON_STATE_DYING, clearing its PR_PERSIST flag, and killing any attached processes. This replaces a similar loop in sys_jail_remove(), bringing the operation under the same single hold on allprison_lock that it already has. It is also used to clean up failed jail (re-)creations in kern_jail_set(), which didn't generally take all the proper steps. Differential Revision: https://reviews.freebsd.org/D28473	2021-02-22 12:27:44 -08:00
Mark Johnston	608c44f96e	m_uiotombuf_nomap(): Stop clearing PG_ZERO in newly allocated pages The caller should not be passing M_ZERO in the first place, so PG_ZERO will not be preserved by the page allocator and clearing it accomplishes nothing. Reviewed by: gallatin, jhb MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D28808	2021-02-22 10:04:46 -05:00
Jamie Gritton	1158508a80	jail: Add pr_state to struct prison Rather than using references (pr_ref and pr_uref) to deduce the state of a prison, keep track of its state explicitly. A prison is either "invalid" (pr_ref == 0), "alive" (pr_uref > 0) or "dying" (pr_uref == 0). State transitions are generally tied to the reference counts, but with some flexibility: a new prison is "invalid" even though it now starts with a reference, and jail_remove(2) sets the state to "dying" before the user reference count drops to zero (which was previously accomplished via the PR_REMOVE flag). pr_state is protected by both the prison mutex and allprison_lock, so it has the same availability guarantees as the reference counts do. Differential Revision: https://reviews.freebsd.org/D27876	2021-02-21 13:24:47 -08:00
Mateusz Guzik	ee9b37ae5c	jail: fix build after the previous commit Noted by: Michael Butler <imb protected-networks.net>	2021-02-21 21:05:25 +00:00
Jamie Gritton	f7496dcab0	jail: Change the locking around pr_ref and pr_uref Require both the prison mutex and allprison_lock when pr_ref or pr_uref go to/from zero. Adding a non-first or removing a non-last reference remain lock-free. This means that a shared hold on allprison_lock is sufficient for prison_isalive() to be useful, which removes a number of cases of lock/check/unlock on the prison mutex. Expand the locking in kern_jail_set() to keep allprison_lock held exclusive until the new prison is valid, thus making invalid prisons invisible to any thread holding allprison_lock (except of course the one creating or destroying the prison). This renders prison_isvalid() nearly redundant, now used only in asserts. Differential Revision: https://reviews.freebsd.org/D28419 Differential Revision: https://reviews.freebsd.org/D28458	2021-02-21 10:55:44 -08:00
Konstantin Belousov	2bfd8992c7	vnode: move write cluster support data to inodes. The data is only needed by filesystems that 1. use buffer cache 2. utilize clustering write support. Requested by: mjg Reviewed by: asomers (previous version), fsu (ext2 parts), mckusick Tested by: pho Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D28679	2021-02-21 11:38:21 +02:00
Konstantin Belousov	750ea20d3f	Delete dead CLUSTERDEBUG config option. Reviewed by: mckusick Tested by: pho MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D28679	2021-02-21 11:38:21 +02:00
Mateusz Guzik	81174cd8e2	vfs: employ vfs_ref_from_vp in statfs and fstatfs Avoids locking and unlocking the vnode. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D28695	2021-02-21 00:43:05 +00:00
Mateusz Guzik	a15f787adb	vfs: add vfs_ref_from_vp This generalizes what vop_stdgetwritemount used to be doing. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D28695	2021-02-21 00:43:05 +00:00
Jamie Gritton	6e1d1bfcac	jail: Improve locking when removing prisons Change the flow of prison_deref() so it doesn't let go of allprison_lock until it's completely done using it (except for a possible drop as part of an upgrade on its first try). Differential Revision: https://reviews.freebsd.org/D28458 MFC after: 3 days	2021-02-20 14:38:58 -08:00
Jamie Gritton	d4380c0cdd	jail: Change both root and working directories in jail_attach(2) jail_attach(2) performs an internal chroot operation, leaving it up to the calling process to assure the working directory is inside the jail. Add a matching internal chdir operation to the jail's root. Also ignore kern.chroot_allow_open_directories, and always disallow the operation if there are any directory descriptors open. Reported by: mjg Approved by: markj, kib MFC after: 3 days	2021-02-19 14:13:35 -08:00
John Baldwin	9c64fc4029	Add Chacha20-Poly1305 as a KTLS cipher suite. Chacha20-Poly1305 for TLS is an AEAD cipher suite for both TLS 1.2 and TLS 1.3 (RFCs 7905 and 8446). For both versions, Chacha20 uses the server and client IVs as implicit nonces xored with the record sequence number to generate the per-record nonce matching the construction used with AES-GCM for TLS 1.3. Reviewed by: gallatin Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D27839	2021-02-18 09:26:32 -08:00
Thomas Skibo	9976b42b69	ddb: fix show devmap output on 32-bit arm The output has been broken since `1b6dd6d772`. Casting to uintmax_t before the call to printf is necessary to ensure that 32-bit addresses are interpreted correctly. PR: 243236 MFC after: 3 days	2021-02-18 11:53:14 -04:00
Konstantin Belousov	662283b108	vn_printf: handle VI_FOPENING Noted by: mjg Sponsored by: The FreeBSD Foundation MFC after: 6 days Fixes: `fa3bd463ce`	2021-02-18 16:28:28 +02:00
Alex Richardson	fa2528ac64	Use atomic loads/stores when updating td->td_state KCSAN complains about racy accesses in the locking code. Those races are fine since they are inside a TD_SET_RUNNING() loop that expects the value to be changed by another CPU. Use relaxed atomic stores/loads to indicate that this variable can be written/read by multiple CPUs at the same time. This will also prevent the compiler from doing unexpected re-ordering. Reported by: GENERIC-KCSAN Test Plan: KCSAN no longer complains, kernel still runs fine. Reviewed By: markj, mjg (earlier version) Differential Revision: https://reviews.freebsd.org/D28569	2021-02-18 14:02:48 +00:00
Konstantin Belousov	fa3bd463ce	lockf: ensure atomicity of lockf for open(O_CREAT|O_EXCL|O_EXLOCK) or EX_SHLOCK. Do it by setting a vnode iflag indicating that the locking exclusive open is in progress, and not allowing F_LOCK request to make progress until the first open finishes. Requested by: mckusick Reviewed by: markj, mckusick Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D28697	2021-02-18 01:22:05 +02:00
Warner Losh	00065c7630	Giant: move back Giant removal until 14 Update the Giant Lock warning message to FreeBSD 14. It's growing increasingly clear that this won't be done before 13.0. MFC: Insta (re@'s request)	2021-02-17 14:33:09 -07:00
Jamie Gritton	cc7b730653	jail: Handle a possible race between jail_remove(2) and fork(2) jail_remove(2) includes a loop that sends SIGKILL to all processes in a jail, but skips processes in PRS_NEW state. Thus it is possible that a process in mid-fork(2) during jail removal can survive the jail being removed. Add a prison flag PR_REMOVE, which is checked before the new process returns. If the jail is being removed, the process will then exit. Also check this flag in jail_attach(2) which has a similar issue. Reported by: trasz Approved by: kib MFC after: 3 days	2021-02-16 11:19:13 -08:00
Konstantin Belousov	c61fae1475	pgcache read: protect against reads past end of the vm object size If uio_offset is past end of the object size, calculated resid is negative. Delegate handling this case to the locked read, as any other non-trivial situation. PR: 253158 Reported by: Harald Schmalzbauer <bugzilla.freebsd@omnilan.de> Tested by: cy Sponsored by: The FreeBSD Foundation MFC after: 1 week	2021-02-16 07:09:37 +02:00
Alex Richardson	0482d7c9e9	Fix fget_only_user() to return ENOTCAPABLE on a failed capsicum check After `eaad8d1303` four additional capsicum-test tests started failing. It turns out this is because fget_only_user() was returning EBADF on a failed capsicum check instead of forwarding the return value of cap_check_inline() like fget_unlocked_seq(). capsicum-test failures before this: ``` [ FAILED ] 7 tests, listed below: [ FAILED ] Capability.OperationsForked [ FAILED ] Capability.NoBypassDAC [ FAILED ] Pdfork.OtherUserForked [ FAILED ] PipePdfork.WildcardWait [ FAILED ] OpenatTest.WithFlag [ FAILED ] ForkedOpenatTest_WithFlagInCapabilityMode._ [ FAILED ] Select.LotsOFileDescriptorsForked ``` After: ``` [ FAILED ] 3 tests, listed below: [ FAILED ] Capability.NoBypassDAC [ FAILED ] Pdfork.OtherUserForked [ FAILED ] PipePdfork.WildcardWait ``` Reviewed By: mjg MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D28691	2021-02-15 22:55:12 +00:00
Jason A. Harmening	41032835dc	Fix divide-by-zero panic when ASLR is enabled and superpages disabled When locating the anonymous memory region for a vm_map with ASLR enabled, we try to keep the slid base address aligned on a superpage boundary to minimize pagetable fragmentation and maximize the potential usage of superpage mappings. We can't (portably) do this if superpages have been disabled by loader tunable and pagesizes[1] is 0, and it would be less beneficial in that case anyway. PR: 253511 Reported by: johannes@jo-t.de MFC after: 1 week Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D28678	2021-02-15 10:38:04 -08:00
Mateusz Guzik	eac22dd480	lockmgr: shrink struct lock by 8 bytes on LP64 Currently the struct has a 4 byte padding stemming from 3 ints. 1. prio comfortably fits in short, unfortunately there is no dedicated type for it and plumbing it throughout the codebase is not worth it right now, instead an assert is added which covers also flags for safety 2. lk_exslpfail can in principle exceed u_short, but the count is already not considered reliable and it only ever gets modified straight to 0. In other words it can be incrementing with an upper bound of USHRT_MAX With these in place struct lock shrinks from 48 to 40 bytes. Reviewed by: kib (previous version) Differential Revision: https://reviews.freebsd.org/D28680	2021-02-15 13:57:25 +00:00
Konstantin Belousov	25c6318c79	procstat: distinguish vm map guards in procstat vm output. Requested and reviewed by: rwatson (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D28658	2021-02-14 03:24:58 +02:00
Konstantin Belousov	adf28ab456	fifo: minor comment and assert improvements. In particular, replace a note that reload through vget() is obsoleted, with explanation why this code is required. Reviewed by: chs, mckusick Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2021-02-12 03:02:22 +02:00
Konstantin Belousov	b59a8e63d6	Stop ignoring ERELOOKUP from VOP_INACTIVE() When possible, relock the vnode and retry inactivation. Only vunref() is required not to drop the vnode lock, so handle it specially by not retrying. This is a part of the efforts to ensure that an unlinked, unreferenced vnode does not prevent its inode from being reused. Reviewed by: chs, mckusick Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2021-02-12 03:02:21 +02:00
Konstantin Belousov	3b2aa36024	Use VOP_VPUT_PAIR() for eligible VFS syscalls. The current list is limited to the cases where UFS needs to handle vput(dvp) specially, which means VOP_CREATE(), VOP_MKDIR(), VOP_MKNOD(), VOP_LINK(), and VOP_SYMLINK(). Reviewed by: chs, mckusick Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2021-02-12 03:02:20 +02:00
Konstantin Belousov	49c117c193	Add VOP_VPUT_PAIR() with trivial default implementation. The VOP is intended to be used in situations where VFS has two referenced locked vnodes, typically a directory vnode dvp and a vnode vp that is linked from the directory, and at least dvp is vput(9)ed. The child vnode can be also vput-ed, but optionally left referenced and locked. There, at least UFS may need to do some actions with dvp which cannot be done while vp is also locked, so its lock might be dropped temporarily. For instance, in some cases UFS needs to sync dvp to avoid filesystem state that is currently not handled by either the kernel or fsck. Having such a VOP provides the necessary context for the filesystem, which can do correct locking and handle potential reclamation of vp after relock. Trivial implementation does vput(dvp) and optionally vput(vp). Reviewed by: chs, mckusick Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2021-02-12 03:02:20 +02:00
Konstantin Belousov	ee965dfa64	vn_open(): If the vnode is reclaimed during open(2), do not return error. Most future operations on the returned file descriptor will fail anyway, and application should be ready to handle those failures. Not forcing it to understand the transient failure mode on open, which is implementation-specific, should make us less special without loss of reporting of errors. Suggested by: chs Reviewed by: chs, mckusick Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2021-02-12 03:02:20 +02:00
Konstantin Belousov	bf0db19339	buf SU hooks: track buf_start() calls with B_IOSTARTED flag and only call buf_complete() if previously started. Some error paths, like CoW failure, might skip buf_start() and do bufdone(), which itself calls buf_complete(). Various SU handle_written_XXX() functions check that io was started and incomplete parts of the buffer data reverted before restoring them. This is a useful invariant that B_IOSTARTED on buffer layer allows to keep instead of changing check and panic into check and return. Reported by: pho Reviewed by: chs, mckusick Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2021-02-12 03:02:19 +02:00
Mateusz Guzik	39e0c3f686	cache: assorted comment fixups	2021-02-09 17:09:44 +01:00
Kyle Evans	504ebd612e	kern: sonewconn: set so_options before pru_attach() Protocol attachment has historically been able to observe and modify so->so_options as needed, and it still can for newly created sockets. `779f106aa1` moved this to after pru_attach() when we re-acquire the lock on the listening socket. Restore the historical behavior so that pru_attach implementations can consistently use it. Note that some pru_attach() do currently rely on this, though that may change in the future. D28265 contains a change to remove the use in TCP and IB/SDP bits, as resetting the requested linger time on incoming connections seems questionable at best. This does move the assignment out from under the head's listen lock, but glebius notes that head won't be going away and applications cannot assume any specific ordering with a race between a connection coming in and the application changing socket options anyways. Discussed-with: glebius MFC-after: 1 week	2021-02-08 21:44:43 -06:00
Alexander V. Chernikov	924d1c9a05	Revert "SO_RERROR indicates that receive buffer overflows should be handled as errors." Wrong version of the change was pushed inadvertently. This reverts commit `4a01b854ca`.	2021-02-08 22:32:32 +00:00
Alexander V. Chernikov	4a01b854ca	SO_RERROR indicates that receive buffer overflows should be handled as errors. Historically receive buffer overflows have been ignored and programs could not tell if they missed messages or messages had been truncated because of overflows. Since programs historically do not expect to get receive overflow errors, this behavior is not the default. This is really really important for programs that use route(4) to keep in sync with the system. If we lose a message then we need to reload the full system state, otherwise the behaviour from that point is undefined and can lead to chasing bogus bug reports.	2021-02-08 21:42:20 +00:00
Mark Johnston	b5aa9ad43a	ktls: Make configuration sysctls available as tunables Reviewed by: gallatin, jhb Sponsored by: Ampere Computing Submitted by: Klara, Inc. MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D28499	2021-02-08 09:19:02 -05:00
Mark Johnston	1755b2b989	ktls: Use COUNTER_U64_DEFINE_EARLY This makes it a bit more straightforward to add new counters when debugging. No functional change intended. Reviewed by: jhb Sponsored by: Ampere Computing Submitted by: Klara, Inc. MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D28498	2021-02-08 09:18:51 -05:00
Mateusz Guzik	2f8a844635	cache: remove the largely obsolete general description Examples of inconsistencies with the current state: - references LRU of all entries, removed years ago - references a non-existent lock (neglist) - claims negative entries have a NULL target It will be replaced with a more accurate and more informative description. In the meantime take it out so it stops misleading.	2021-02-06 00:28:40 +01:00
Mateusz Guzik	0e1594e60e	cache: fix vfs:namecache:lookup:miss probe call sites	2021-02-06 00:28:40 +01:00
Mateusz Guzik	2e96132a7d	cache: drop spurious arg from panic in cache_validate vp is already reported when noting mismatch	2021-02-06 00:28:39 +01:00
Mateusz Guzik	b54ed778fe	cache: comment on FNV	2021-02-06 00:13:57 +01:00
Ed Maste	edc374e7c4	Correct description for kern.proc.proc_td kern.proc.proc_td returns the process table with an entry for each thread. Previously the description included "no threads", presumably a cut-and-pasteo in `2648efa621`. Description suggested by PauAmma. PR: 253146 MFC after: 3 days Sponsored by: The FreeBSD Foundation	2021-02-02 17:00:05 -05:00
Mateusz Guzik	45456abc4c	cache: fix trailing slash support in face of permission problems Reported by: Johan Hendriks <joh.hendriks gmail.com> Tested by: kevans	2021-02-02 18:13:51 +00:00
Mateusz Guzik	6f19dc2124	cache: add delayed degenerate path handling	2021-02-01 04:53:23 +00:00
Mateusz Guzik	bbfb1edd70	cache: move hash computation into the parsing loop	2021-02-01 04:36:45 +00:00
Mateusz Guzik	e027e24bfa	cache: add trailing slash support Tested by: pho	2021-01-31 12:02:46 +00:00
Mateusz Guzik	8cbd164a17	cache: handle NOFOLLOW requests for symlinks Tested by: pho	2021-01-31 12:02:46 +00:00
Gleb Smirnoff	3f43ada98c	Catch up with `6edfd179c8`: mechanically rename IFCAP_NOMAP to IFCAP_MEXTPG. Originally IFCAP_NOMAP meant that the mbuf has external storage pointer that points to unmapped address. Then, this was extended to array of such pointers. Then, such mbufs were augmented with header/trailer. Basically, extended mbufs are extended, and set of features is subject to change. The new name should be generic enough to avoid further renaming.	2021-01-29 11:46:24 -08:00
Mateusz Guzik	45e1f85414	poll: use fget_unlocked or fget_only_user when feasible This follows select by eliminating the use of filedesc lock. This is a win for single-threaded processes and a mixed bag for others as at small concurrency it is faster to take the lock instead of refing/unrefing each file descriptor. Nonetheless, removal of shared lock usage is a step towards a mtx-protected fd table.	2021-01-29 11:23:44 +00:00
Mateusz Guzik	6affe1b712	select: employ fget_only_user Since most select users are single-threaded this avoids a lot of work in the common case. For example select of 16 fds (ops/s): before: 2114536 after: 2991010	2021-01-29 11:23:44 +00:00
Mateusz Guzik	eaad8d1303	fd: add fget_only_user This can be used by single-threaded processes which don't share a file descriptor table to access their file objects without having to reference them. For example select consumers tend to match the requirement and have several file descriptors to inspect.	2021-01-29 11:23:43 +00:00
Jamie Gritton	c050ea803e	jail: Handle a parent jail when a child is added to it It's possible when adding a jail that its dying parent comes back to life. Only allow that to happen when JAIL_DYING is specified. And if it does happen, call PR_METHOD_CREATE on it.	2021-01-28 21:51:09 -08:00
Bryan Drewery	c926114f2f	Fix getblk() with GB_NOCREAT returning false-negatives. It is possible for a buf to be reassigned between the dirty and clean lists while gbincore_unlocked() looks in each list. Avoid creating a buffer in that case and fall back to a locked lookup. This fixes a regression from r363482. More discussion on potential improvements to the clean and dirty lists handling is in the review. Reviewed by: cem, kib, markj, vangyzen, rlibby Reported by: Suraj.Raju at dell.com Submitted by: Suraj.Raju at dell.com, cem, [based on both] MFC after: 2 weeks Sponsored by: Dell EMC Differential Revision: https://reviews.freebsd.org/D28375	2021-01-28 11:24:24 -08:00
Mateusz Guzik	5c325977b1	cache: add missing MNT_NOSYMFOLLOW check to symlink traversal	2021-01-27 15:08:38 +00:00
Mateusz Guzik	5fc384d181	cache: fallback when encountering a mount point during .. lookup The current abort is overzealous.	2021-01-27 16:00:31 +01:00
Bjoern A. Zeeb	6f65b50546	firmware(9): extend firmware_get() by a "no warn" flag. With the upcoming usage from LinuxKPI but also from drivers ported natively we are seeing more probing of various firmware (names). Add the ability to firmware(9) to silence the "firmware image loading/registering errors" by adding a new firmware_get_flags() function extending firmware_get() and taking a flags argument as firmware_put() already does. Requested-by: zeising (for future LinuxKPI/DRM) Sponsored-by: The FreeBSD Foundation Sponsored-by: Rubicon Communications, LLC ("Netgate") MFC after: 3 days Reviewed-by: markj Differential Revision: https://reviews.freebsd.org/D27413	2021-01-27 13:51:26 +00:00
Mateusz Guzik	a098a831a1	cache: tidy up handling of foo/bar lookups where foo is not a directory The code was performing an avoidable check for doomed state to account for foo being a VDIR but turning VBAD. Now that dooming puts a vnode in a permanent "modify" state this is no longer necessary as the final status check will catch it.	2021-01-26 20:42:53 +00:00
Mateusz Guzik	a51eca7936	cache: stop referring to removing entries as invalidating them Said use is a remnant from the old code and clashes with the NCF_INVALID flag.	2021-01-26 20:42:53 +00:00
Brooks Davis	d89c1c461c	Reserve gaps in syscall numbers for local use It is best for auditing of syscalls.master if we only append to the file. Reserving unimplemented system call numbers for local use makes this policy and provides a large set of syscall numbers FreeBSD derivatives can use without risk of conflict. Reviewed by: jhb, kevans, kib Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D27988	2021-01-26 18:27:45 +00:00
Brooks Davis	119fa6ee8a	syscalls.master: Add a new syscall type: RESERVED RESERVED syscall numbers are reserved for local/vendor use. RESERVED is identical to UNIMPL except that comments are ignored. Reviewed by: kevans Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D27988	2021-01-26 18:27:44 +00:00
Brooks Davis	65a524b499	Remove documentation of unimplemented syscalls We have not been able to run binaries from other BSDs for well over a decade. There is no need to document their allocation decisions here. We also don't need to reserve syscall numbers of never-implemented syscalls. Reviewed by: jhb, kib Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D27988	2021-01-26 18:27:44 +00:00
Mateusz Guzik	6943671b48	cache: convert cache_fplookup_parse to void now that it always succeeds	2021-01-26 13:24:03 +01:00
Mateusz Guzik	e7cf562a40	cache: change ->v_cache_dd synchronisation rules Instead of resorting to seqc modification take advantage of immutability of entries and check if the entry still matches after everything got prepared.	2021-01-25 22:41:13 +00:00
Mateusz Guzik	6f08427649	cache: make ->v_cache_dd accesses atomic-clean for lockless usage	2021-01-25 22:41:13 +00:00
Mateusz Guzik	6ef8fede86	cache: make ->nc_flag accesses atomic-clean for lockless usage	2021-01-25 22:41:13 +00:00
Mateusz Guzik	ffcf8f97f8	cache: store vnodes in local vars in cache_zap_locked	2021-01-25 22:41:13 +00:00
Mateusz Guzik	868643e722	cache: assorted cleanups	2021-01-25 19:45:24 +00:00
Mateusz Guzik	1c7a65adb0	cache: track calls to cache_symlink_alloc with unsupported size While here assert on size passed to free.	2021-01-25 19:45:23 +00:00
Mateusz Guzik	02ec31bdf6	cache: add back target entry on rename	2021-01-23 18:10:16 +00:00
Mateusz Guzik	739ecbcf1c	cache: add symlink support to lockless lookup Reviewed by: kib (previous version) Tested by: pho (previous version) Differential Revision: https://reviews.freebsd.org/D27488	2021-01-23 15:04:43 +00:00
Jamie Gritton	195cd6ae24	jail: fix dangling reference bug from `6754ae2572` The change to use refcounts for pr_uref was mishandled in prison_proc_free, so killing a jail's last process could add an extra reference, leaving it an unkillable zombie.	2021-01-22 10:56:24 -08:00
Jamie Gritton	39c8ef90f6	jail: A jail could be removed without calling OSD methods Fix a long-standing bug where setting nopersist on a process-less jail would remove it without calling the OSD PR_METHOD_REMOVE methods.	2021-01-22 10:50:10 -08:00
Marius Strobl	679e4cdabd	kvprintf(9): add missing FALLTHROUGH Reported by: Coverity CID: 1005166	2021-01-22 00:18:40 +01:00
Konstantin Belousov	1ac7c34486	malloc_aligned: roundup allocation size up to next power of two to make it use the right aligned zone. Reported by: melifaro Reviewed by: alc, markj (previous version) Discussed with: jrtc27 Tested by: pho (previous version) MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D28219	2021-01-21 23:34:10 +02:00
Konstantin Belousov	0781c79d48	Restrict supported alignment for malloc_domainset_aligned(9) to PAGE_SIZE. UMA page_alloc() does not take an alignment, so UMA can only handle alignment less than page size. Noted by: alc Reviewed by: alc, markj (previous version) Discussed with: jrtc27 Tested by: pho (previous version) MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D28219	2021-01-21 23:34:10 +02:00
Jamie Gritton	6754ae2572	jail: Use refcount(9) for prison references. Use refcount(9) for both pr_ref and pr_uref in struct prison. This allows prisons to be held and freed without requiring the prison mutex. An exception to this is that dropping the last reference will still lock the prison, to keep the guarantee that a locked prison remains valid and alive (provided it was at the time it was locked). Among other things, this honors the promise made in a comment in crcopy(9), that it will not block, which hasn't been true for two decades.	2021-01-20 15:08:27 -08:00
Vladimir Kondratyev	e3dd8ed77b	devinfo sysctl handler: Do not write zero-length strings in to sbuf twice This fixes missing PnPinfo and location strings in devinfo(8) output for devices with no attached drivers.	2021-01-21 02:06:16 +03:00
Alan Somers	2247f48941	aio: micro-optimize the lio_opcode assignments This allows slightly more efficient opcode testing in-kernel. It is transparent to userland, except to applications that sneakily submit aio fsync or aio mlock operations via lio_listio, which has never been documented, requires the use of deliberately undefined constants (LIO_SYNC and LIO_MLOCK), and is arguably a bug. Reviewed by: jhb Differential Revision: https://reviews.freebsd.org/D27942	2021-01-20 09:02:25 -07:00
Alex Richardson	7e99c034f7	Emit uprintf() output for initproc if there is no controlling terminal This patch helped me debug why /sbin/init was not being loaded after making changes to the image activator in CheriBSD. Reviewed By: jhb (earlier version), kib Differential Revision: https://reviews.freebsd.org/D28121	2021-01-20 09:54:46 +00:00
Mateusz Guzik	2171b8e8a2	cache: augment sdt probe in cache_fplookup_dot Same as `6d386b4c` ("cache: save a branch in cache_fplookup_next")	2021-01-20 07:23:14 +00:00
Mateusz Guzik	aae03cfe64	cache: whitespace nit in cache_fplookup_modifying	2021-01-20 07:22:04 +00:00
Mark Johnston	4dc1b17dbb	ktls: Improve handling of the bind_threads tunable a bit - Only check for empty domains if we actually tried to configure domain affinity in the first place. Otherwise setting bind_threads=1 will always cause the sysctl value to be reported as zero. This is harmless since the threads end up being bound, but it's confusing. - Try to improve the sysctl description a bit. Reviewed by: gallatin, jhb Submitted by: Klara, Inc. Sponsored by: Ampere Computing Differential Revision: https://reviews.freebsd.org/D28161	2021-01-19 21:32:33 -05:00
Mateusz Guzik	38baca17e0	lockmgr: fix upgrade TRYUPGRADE requests kept failing when they should not have due to wrong macro used to count readers. Fixes: `f6b091fbbd` ("lockmgr: rewrite upgrade to stop always dropping the lock") Noted by: asomers Differential Revision: https://reviews.freebsd.org/D27947	2021-01-19 12:21:38 +00:00
Mateusz Guzik	57dab0292a	cache: fix some typos	2021-01-19 10:17:14 +01:00
Mateusz Guzik	84ab77ad27	cache: drop write-only var from cache_fplookup_preparse	2021-01-19 10:13:30 +01:00
Mateusz Guzik	6d386b4c8a	cache: save a branch in cache_fplookup_next Previously the code would branch on top to find out whether it should branch on SDT probe and bumping the numposhits counter, depending on cache_fplookup_cross_mount. Arguably it should be done regardless of what said function returns.	2021-01-19 10:08:24 +01:00
Jamie Gritton	effad35ed1	jail: Clean up some function placement and improve comments. Move prison_hold, prison_hold_locked, prison_proc_hold, and prison_proc_free to a more intuitive part of the file (together with prison_free and prison_free_locked), and add or improve comments to these and others, to better describe what's going on in the prison reference cycle. No functional changes.	2021-01-18 17:23:51 -08:00
Oleksandr Tymoshenko	248f0cabca	make maximum interrupt number tunable on ARM, ARM64, MIPS, and RISC-V Use a machdep.nirq tunable instead of compile-time constant NIRQ as a value for maximum number of interrupts. It allows keeping the system footprint small by default with an option to increase the limit for large systems like server-grade ARM64. Reviewed by: mhorne Differential Revision: https://reviews.freebsd.org/D27844 Submitted by: Klara, Inc. Sponsored by: Ampere Computing	2021-01-18 16:36:39 -08:00
Jamie Gritton	83bc72a04e	jail: Fix a stray mutex from `76ad42abf9`.	2021-01-18 15:47:09 -08:00
Jamie Gritton	76ad42abf9	jail: Add prison_isvalid() and prison_isalive() prison_isvalid() checks if a prison record can be used at all, i.e. pr_ref > 0. This filters out prisons that aren't fully created, and those that are either in the process of being dismantled, or will be at the next opportunity. While the check for pr_ref > 0 is simple enough to make without a convenience function, this prepares the way for other measures of prison validity. prison_isalive() checks not only validity as far as the usability of the prison structure, but also whether the prison is visible to user space. It replaces a test for pr_uref > 0, which is currently only used within kern_jail.c, and not often there. Both of these functions also assert that either the prison mutex or allprison_lock is held, since it's generally the case that unlocked prisons aren't guaranteed to remain useable for any length of time. This isn't entirely true, for example a thread can assume its own prison is good, but most exceptions will exist inside of kern_jail.c.	2021-01-18 10:56:20 -08:00
Konstantin Belousov	36bcc44e2c	Add ddb 'show timecounter' command. MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-01-18 09:51:48 +02:00
Jamie Gritton	25c2c952e3	jail: Add proper prison locking in mqfs_prison_remove.	2021-01-17 17:41:09 -08:00
Konstantin Belousov	3b15beb30b	Implement malloc_domainset_aligned(9). Change the power-of-two malloc zones to require alignment equal to the size []. Current uma allocator already provides such alignment, so in fact this change does not change anything except providing future-proof setup. Suggested by: markj [] Reviewed by: andrew, jah, markj Tested by: pho MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D28147	2021-01-17 19:29:05 +02:00
Mateusz Guzik	fe258f23ef	Save on getpid in setproctitle by supporting -1 as curproc.	2021-01-16 09:36:54 +01:00
Kirk McKusick	79a5c790bd	Eliminate a locking panic when cleaning up UFS snapshots after a disk failure. Each vnode has an embedded lock that controls access to its contents. However vnodes describing a UFS snapshot all share a single snapshot lock to coordinate their access and update. As part of mounting a UFS filesystem with snapshots, each of the vnodes describing a snapshot has its individual lock replaced with the snapshot lock. When the filesystem is unmounted the vnode's original lock is returned replacing the snapshot lock. When a disk fails while the UFS filesystem it contains is still mounted (for example when a thumb drive is removed) UFS forcibly unmounts the filesystem. The loss of the drive causes the GEOM subsystem to orphan the provider, but the consumer remains until the filesystem has finished with the unmount. Information describing the snapshot locks was being prematurely cleared during the orphaning causing the return of the snapshot vnode's original locks to fail. The fix is to not clear the needed information prematurely. Sponsored by: Netflix	2021-01-15 16:36:42 -08:00
Mitchell Horne	818390ce0c	arm64: fix early devmap assertion The purpose of this KASSERT is to ensure that we do not run out of space in the early devmap. However, the devmap grew beyond its initial size of 2MB in r336519, and this assertion did not grow with it. A devmap mapping of a 1080p framebuffer requires 1920x1080 bytes, or 1.977 MB, so it is just barely able to fit without triggering the assertion, provided no other devices are mapped before it. With the addition of `options GDB` in GENERIC by `bbfa199cbc`, the uart is now mapped for the purposes of a debug port, before mapping the framebuffer. The presence of both these conditions pushes the selected virtual address just below the threshold, triggering the assertion. To fix this, use the correct size of the devmap, defined by PMAP_MAPDEV_EARLY_SIZE. Since this code is shared with RISC-V, define it for that platform as well (although it is a different size). PR: 25241 Reported by: gbe MFC after: 3 days Sponsored by: The FreeBSD Foundation	2021-01-13 17:27:44 -04:00
Mateusz Guzik	ef23df1354	vfs: set NC_KEEPPOSENTRY alongside NOCACHE when creating a file Arguably the entire NOCACHE logic should get retired, in the meantime at least prevent the code from evicting existing entries.	2021-01-13 15:29:34 +00:00
Mateusz Guzik	5753be8e43	fd: add refcount argument to falloc_noinstall This lets callers avoid atomic ops by initializing the count to required value from the get go. While here add falloc_abort to backpedal from this without having to fdrop.	2021-01-13 15:29:34 +00:00
Mateusz Guzik	5171310e66	vfs: use finstall_refed in openat This avoids 2 atomic ops in the common case: 1 to grab an extra reference and 1 to release it.	2021-01-13 03:30:38 +00:00
Mateusz Guzik	530b699a62	fd: add finstall_refed Can be used to consume an already existing reference and consequently avoid atomic ops.	2021-01-13 03:27:03 +01:00
Mateusz Guzik	4faa375cdd	fd: provide a dedicated closef variant for unix socket code This avoids testing for td != NULL.	2021-01-13 03:27:03 +01:00
Konstantin Belousov	0659df6fad	vm_map_protect: allow to set prot and max_prot in one go. This prevents a situation where other thread modifies map entries permissions between setting max_prot, then relocking, then setting prot, confusing the operation outcome. E.g. you can get an error that is not possible if operation is performed atomically. Also enable setting rwx for max_prot even if map does not allow to set effective rwx protection. Reviewed by: brooks, markj (previous version) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D28117	2021-01-13 01:35:22 +02:00
Mateusz Guzik	70ba77706d	vfs: extend vfs:namei:lookup:return probe with nameidata	2021-01-12 13:35:27 +00:00
Mateusz Guzik	cdb62ab74e	vfs: add NDFREE_NOTHING and convert several NDFREE_PNBUF callers Check the comment above the routine for reasoning.	2021-01-12 13:16:10 +00:00
Mateusz Guzik	6b3a9a0f3d	Convert remaining cap_rights_init users to cap_rights_init_one semantic patch: @@ expression rights, r; @@ - cap_rights_init(&rights, r) + cap_rights_init_one(&rights, r)	2021-01-12 13:16:10 +00:00
Konstantin Belousov	57f22c828e	sigfastblock: do not skip cursig/postsig loop in ast() Even if sigfastblock block is non-zero, non-blockable signals must be checked on ast and delivered now. This also affects debugger ability to attach, because issignal() also calls ptracestop() if there is a pending stop for debuggee. Instead of checking for sigfastblock, and either setting PENDING flag for usermode or doing signal delivery loop, always do the loop after checking, and then handle PENDING bit. issignal() already does the right thing for fast-blocked case, allowing only STOPs and SIGKILL delivery to happen. Reported by: Vasily Postnicov <shamaz.mazum@gmail.com>, markj Reviewed by: markj Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D28089	2021-01-12 12:45:26 +02:00
Konstantin Belousov	513320c0f1	sigfastblock_setpend(): do not set PEND user flag unless TDP_SIGFASTPENDING is set. User pending bit should not be set if kernel did not note a pending signal. Reviewed by: markj Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D28089	2021-01-12 12:43:34 +02:00
Alan Somers	ff1a307801	lio_listio: validate aio_lio_opcode Previously, we would accept any kind of LIO_* opcode, including ones that were intended for in-kernel use only like LIO_SYNC (which is not defined in userland). The situation became more serious with `022ca2fc7f`. After that revision, setting aio_lio_opcode to LIO_WRITEV or LIO_READV would trigger an assertion. Note that POSIX does not specify what should happen if aio_lio_opcode is invalid. MFC-with: `022ca2fc7f` Reviewed by: jhb, tmunro, 0mp Differential Revision: https://reviews.freebsd.org/D28078	2021-01-11 19:53:01 -07:00
Jason A. Harmening	e8a5a1ad71	rctl(4): support throttling resource usage to 0 For rate-based resources that support throttling (e.g. readiops/writeiops), this fixes a divide-by-zero panic when rctl(8) passes 0 as the throttle value. For these resources, treat zero-throttle requests as requests to suspend forward progress as long as possible using the duration specified in kern.racct.rctl.throttle_max. PR: 251803 Reported by: chris@cretaforce.gr Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D27858	2021-01-11 15:36:57 -08:00
Konstantin Belousov	4ea65707d3	exec_new_vmspace: print useful error message on ctty if stack cannot be mapped. After old vmspace is destroyed during execve(2), but before the new space is fully constructed, an error during image activation cannot be returned because there is no executing program to receive it. In the relatively common case of failure to map stack, print some hints on the control terminal. Note that user has enough knobs to cause stack mapping error, and this is the most common reason for execve(2) aborting the process. Requested by: jhb Reviewed by: emaste, jhb Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D28050	2021-01-12 01:15:43 +02:00
Konstantin Belousov	2e1c94aa1f	Implement enforcing write XOR execute mapping policy. It is checked in vm_map_insert() and vm_map_protect() that PROT_WRITE | PROT_EXEC are never specified together, if vm_map has MAP_WX flag set. FreeBSD control flag allows specific binary to request WX exempt, and there are per ABI boolean sysctls kern.elf{32,64}.allow_wx to enable/disable globally. Reviewed by: emaste, jhb Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D28050	2021-01-12 01:15:43 +02:00
Robert Watson	30b68ecda8	Changes that improve DTrace FBT reliability on freebsd/arm64: - Implement a dtrace_getnanouptime(), matching the existing dtrace_getnanotime(), to avoid DTrace calling out to a potentially instrumentable function. (These should probably both be under KDTRACE_HOOKS. Also, it's not clear to me that they are correct implementations for the DTrace thread time functions they are used in .. fixes for another commit.) - Don't allow FBT to instrument functions involved in EL1 exception handling that are involved in FBT trap processing: handle_el1h_sync() and do_el1h_sync(). - Don't allow FBT to instrument DDB and KDB functions, as that makes it rather harder to debug FBT problems. Prior to these changes, use of FBT on FreeBSD/arm64 rapidly led to kernel panics due to recursion in DTrace. Reliable FBT on FreeBSD/arm64 is reliant on another change from @andrew to have the aarch64 instrumentor more carefully check that instructions it replaces are against the stack pointer, which can otherwise lead to memory corruption. That change remains under review. MFC after: 2 weeks Reviewed by: andrew, kp, markj (earlier version), jrtc27 (earlier version) Differential revision: https://reviews.freebsd.org/D27766	2021-01-11 15:42:22 +00:00
Robert Watson	4f2cbaf3cd	Track pipe(2) reads and writes as rusage message receives and sends, a feature misplaced during the transition from BSD 4.4's socket implementation to the optimised FreeBSD pipe implementation. MFC after: 1 week Reviewed by: arichardson, imp Differential Revision: https://reviews.freebsd.org/D27878	2021-01-10 12:16:39 +00:00
Jamie Gritton	2a4b225146	jail: Simplify handling of prison_deref() Track the current lock/reference state in a single variable, rather than deducing the proper prison_deref() flags from a combination of equations and hard-coded values.	2021-01-09 21:05:06 -08:00
Konstantin Belousov	5844bd058a	jobc: rework detection of orphaned groups. Instead of trying to maintain pg_jobc counter on each process group update (and sometimes before), just calculate the counter when needed. Still, for the benefit of the signal delivery code, explicitly mark orphaned groups as such with the new process group flag. This way we prevent bugs in the corner cases where updates to the counter were missed due to complicated configuration of p_pptr/p_opptr/real_parent (debugger). Since we need to iterate over all children of the process on exit, this change mostly affects the process group entry and leave, where we need to iterate all process group members to detect orphaned status. (For MFC, keep pg_jobc around but unused). Reported by: jhb Reviewed by: jilles Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D27871	2021-01-10 04:41:20 +02:00
Konstantin Belousov	cf4f802e77	kinfo_proc: move job-control related data collection into a new helper. This improves code structure and allows to put the lock asserts right into place where the locks are needed. Also move zeroing of the kinfo_proc structure from fill_kinfo_proc_only() to fill_kinfo_proc(), this looks more symmetrical. Reviewed by: jilles Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D27871	2021-01-10 04:41:20 +02:00
Konstantin Belousov	4daea93813	Lock proctree in around fill_kinfo_proc(). Proctree lock is needed for correct calculation and collection of the job-control related data in kinfo_proc. There was even an XXX comment about it. Satisfy locking and lock ordering requirements by taking proctree lock around pass over each bucket in proc_iterate(), and in sysctl_kern_proc() and note_procstat_proc() for individual process reporting. Reviewed by: jilles Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D27871	2021-01-10 04:41:20 +02:00
Konstantin Belousov	a008bdeda3	tty_wait_background: improve locking. Increase the scope of the process group lock ownership. This ensures that we are consistent in returning EIO for tty write from an orphan and delivery of TTYOUT signals. Reviewed by: jilles Tested by: pho MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D27871	2021-01-10 04:41:20 +02:00
Konstantin Belousov	ef739c7373	pgrp: Prevent use after free. Often, we have a process locked and need to get locked process group. In this case, because process group lock is before process lock, unlocking process allows the group to be freed. See for instance tty_wait_background(). Make pgrp structures allocated from nofree zone, and ensure type stability of the pgrp mutex. Reviewed by: jilles Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D27871	2021-01-10 04:41:19 +02:00
Konstantin Belousov	e0d83cd3e4	issignal(): when handling STOP-like signals, drop sigacts mutex earlier. Reviewed by: jilles Tested by: pho MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D27871	2021-01-10 04:41:19 +02:00
Konstantin Belousov	993a1699b1	Style. Improve some KASSERTs messages. Reviewed by: jilles Tested by: pho MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D27871	2021-01-10 04:41:19 +02:00
Michael Tuexen	6685e259e3	tcp: don't use KTLS socket option on listening sockets KTLS socket options make use of socket buffers, which are not available for listening sockets. Reported by: syzbot+a8829e888a93a4a04619@syzkaller.appspotmail.com Reviewed by: jhb@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D27948	2021-01-08 08:57:11 +01:00
Jan Kokemüller	4d0c33be63	kevent(2): Bugfix for wrong EVFILT_TIMER timeouts When using NOTE_NSECONDS in the kevent(2) API, NS_TO_SBT should be used instead of US_TO_SBT, otherwise the timeout results are misleading. PR: 252539 Reviewed by: kevans, kib Approved by: kevans MFC after: 3 weeks	2021-01-09 20:00:25 +01:00
Warner Losh	40e6e2c2f7	sysctl: improve debug.kdb.panic_str description Improve the wording for this sysctl. Submitted by: rpokala@	2021-01-09 11:10:42 -07:00
Warner Losh	936440560b	sysctl: implement debug.kdb.panic_str This is just like debug.kdb.panic, except the string that's passed in is reported in the panic message. This allows people with automated systems to collect kernel panics over a large fleet of machines to flag panics better. Strings like "Warner look at this hang" or "see JIRA ABC-1234 for details" allow these automated systems to route the forced panic to the appropriate engineers like you can with other types of panics. Other users are likely possible. Relnotes: Yes Sponsored by: Netflix Reviewed by: allanjude (earlier version) Suggestions from review folded in by: 0mp, emaste, lwhsu Differential Revision: https://reviews.freebsd.org/D28041	2021-01-08 14:30:28 -07:00
Andrew Gallatin	52cd25eb1a	mbuf: enable ext_pgs ("unmapped") mbufs by default Ext_pg mbufs allow carrying multiple pages per mbuf. This reduces mbuf linked list traversals, especially in socket buffers, thereby reducing cache misses and CPU use for applications using sendfile. Note that ext_pages use unmapped pages, eliminating KVA mapping costs on 32-bit platforms. Ext_pg mbufs are also required for ktls (KERN_TLS), and having them disabled by default is a stumbling block for those wishing to enable ktls. Reviewed-by: jhb, glebius Sponsored by: Netflix	2021-01-08 13:43:30 -05:00
Mateusz Guzik	8ddea0b127	cache: just assign ni_resflags = NIRES_ABS It is guaranteed to be 0 on entry.	2021-01-08 13:57:10 +00:00
Toomas Soome	742653ebd5	sysctl debug.dump_modinfo should recognize font module Add MODINFOMD_FONT to dump list.	2021-01-08 09:24:49 +02:00
Alan Somers	20321e6225	Regenerate syscall files after reallocation of aio_writev/aio_readv	2021-01-07 19:50:32 -07:00
Alan Somers	b3286afae3	Reallocate syscall numbers for aio_writev and aio_readv The originally chosen numbers interfere with downstream projects' syscalls. Move them to the end of the syscall table instead. Reported by: jrtc27 Reviewed by: brooks MFC-With: `022ca2fc7f` Differential Revision: `022ca2fc7f`	2021-01-07 19:49:27 -07:00
Thomas Munro	801ac943ea	aio_fsync(2): Support O_DSYNC. aio_fsync(O_DSYNC, ...) is the asynchronous version of fdatasync(2). Reviewed by: kib, asomers, jhb Differential Review: https://reviews.freebsd.org/D25071	2021-01-08 13:15:56 +13:00
Thomas Munro	a5e284038e	open(2): Add O_DSYNC flag. POSIX O_DSYNC means that writes include an implicit fdatasync(2), just as O_SYNC implies fsync(2). VOP_WRITE() functions that understand the new IO_DATASYNC flag can act accordingly, but we'll still pass down IO_SYNC so that file systems that don't understand it will continue to provide the stronger O_SYNC behaviour. Flag also applies to fcntl(2). Reviewed by: kib, delphij Differential Revision: https://reviews.freebsd.org/D25090	2021-01-08 13:15:56 +13:00
Mateusz Guzik	71bd18d373	fd: use seqc_read_notmodify when translating fds	2021-01-07 23:30:04 +00:00
Mateusz Guzik	20ac5cda96	fd: make fd/fp mandatory They are both always passed anyway.	2021-01-07 23:30:04 +00:00
Mateusz Guzik	fee405e057	cache: stop checkpointing cn_flags They are only modified, if ever, for the last component.	2021-01-07 23:29:52 +00:00
Mateusz Guzik	ac7715471c	cache: stop checkpointing cn_nameptr For aborts cn_nameptr is the same as cn_pnbuf. For partial results the same cn_nameptr is to be used.	2021-01-07 23:29:38 +00:00
Mateusz Guzik	0f1fc3a31f	cache: stop manipulating pathlen It is a copy-pasto from regular lookup. Add debug to ensure the result is the same.	2021-01-07 23:26:53 +00:00
Chuck Silvers	11403bdeb4	vfs: fix rangelock range in vn_rdwr() for IO_APPEND vn_rdwr() must lock the entire file range for IO_APPEND just like vn_io_fault() does for O_APPEND. Reviewed by: kib, imp, mckusick Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D28008	2021-01-07 13:37:35 -08:00
Mateusz Guzik	f2b794e1e9	cache: unengrish the comment in previous commit Reported by: rpokala, brd	2021-01-06 23:46:05 +00:00
Mateusz Guzik	deabdc6868	cache: stop pre-checking seqc when starting the lookup Tested by: pho	2021-01-06 07:28:07 +00:00
Mateusz Guzik	71a6a0b545	cache: skip checking for spurious slashes if possible Tested by: pho	2021-01-06 07:28:06 +00:00
Mateusz Guzik	33f3e81df5	cache: combine fast path enabled status into one flag Tested by: pho	2021-01-06 07:28:06 +00:00
Mateusz Guzik	dbbbc07cc3	cache: split handling of 0 and non-0 error codes Tested by: pho	2021-01-06 07:07:24 +01:00
Mateusz Guzik	a1a8f8ada1	cache: deinline state handling The intent is to reduce branchfest when finishing the lookup. Tested by: pho	2021-01-06 07:05:22 +01:00
Mateusz Guzik	05803be000	cache: stop setting cn_nameptr on entry as matches cn_pnbuf already While here tidy up other asserts.	2021-01-06 07:03:41 +01:00
Mateusz Guzik	3814bea00a	cache: drop the now spurious doomed check when crossing a mount point	2021-01-03 21:22:16 +00:00
Mateusz Guzik	33a195baf3	vfs: keep seqc unchanged as long as the vnode is accessible via SMR	2021-01-03 21:22:16 +00:00
Mark Johnston	214257da3a	sendfile: Clear page pointers when handling a pager error When INVARIANTS is configured, the sendfile_iodone() callback verifies that pages attached to the sendfile header are wired, but we unwire all such pages after a synchronous pager error, before calling sendfile_iodone(). Reported by: pho Tested by: pho Sponsored by: The FreeBSD Foundation	2021-01-03 11:50:31 -05:00
Mark Johnston	90f580b954	Ensure that dirent's d_off field is initialized We have the d_off field in struct dirent for providing the seek offset of the next directory entry. Several filesystems were not initializing the field, which ends up being copied out to userland. Reported by: Syed Faraz Abrar <faraz@elttam.com> Reviewed by: kib MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D27792	2021-01-03 11:50:31 -05:00
Mateusz Guzik	82397d7919	vfs: denote vnode being a mount point with VIRF_MOUNTPOINT Reviewed by: kib (previous version) Differential Revision: https://reviews.freebsd.org/D27794	2021-01-03 06:50:06 +00:00
Mateusz Guzik	3e506a67bb	vfs: add v_irflag accessors Reviewed by: kib (previous version) Differential Revision: https://reviews.freebsd.org/D27793	2021-01-03 06:50:06 +00:00
Mateusz Guzik	51bf55fa6c	cache: stop checkpointing cn_namelen The variable is recomputed by regular lookup from the get go.	2021-01-03 06:50:06 +00:00
Mateusz Guzik	7220a10b5b	cache: predict on no spurious slashes in cache_fpl_handle_root This is a step towards speculatively not handling them.	2021-01-03 06:50:06 +00:00
Mateusz Guzik	30a2fc91fa	cache: postpone NAME_MAX check as it may be unnecessary	2021-01-03 06:50:06 +00:00
Mateusz Guzik	eca899bd5d	cache: remove spurious null check in sdt probe	2021-01-03 06:50:06 +00:00
Alan Somers	1868a91fac	Regenerate syscall files after addition of aio_writev/aio_readv	2021-01-02 19:57:58 -07:00
Alan Somers	022ca2fc7f	Add aio_writev and aio_readv POSIX AIO is great, but it lacks vectored I/O functions. This commit fixes that shortcoming by adding aio_writev and aio_readv. They aren't part of the standard, but they're an obvious extension. They work just like their synchronous equivalents pwritev and preadv. It isn't yet possible to use vectored aiocbs with lio_listio, but that could be added in the future. Reviewed by: jhb, kib, bcr Relnotes: yes Differential Revision: https://reviews.freebsd.org/D27743	2021-01-02 19:57:58 -07:00
Jamie Gritton	b58a46347c	jail: revert the attachment part of `b4e87a6329` The change to kern_jail_set that was supposed to "also properly clean up when attachment fails" didn't fix a memory leak but actually caused a double free. Back that part out, and leave the part that manages allprison_lock state.	2020-12-31 19:55:49 -08:00
Mateusz Guzik	1365b5f86f	cache: fold NCF_WHITE check into the rest Tested by: pho	2021-01-01 00:10:43 +00:00
Mateusz Guzik	d7c62d98c9	cache: call cache_fplookup_modifying in neg Tested by: pho	2021-01-01 00:10:43 +00:00
Mateusz Guzik	6fe7de1a25	cache: refactor cache_fpl_handle_root to fit the rest of the code better Tested by: pho	2021-01-01 00:10:43 +00:00
Mateusz Guzik	e17e01bd0e	cache: refactor dot handling Tested by: pho	2021-01-01 00:10:43 +00:00
Mateusz Guzik	4651db56c7	cache: remove a branch from mount point checking Tested by: pho	2021-01-01 00:10:42 +00:00
Mateusz Guzik	0b5bd1afd8	cache: support lockless lookup of degenerate paths Tested by: pho	2021-01-01 00:10:42 +00:00
Mateusz Guzik	1d6eb97677	cache: save on branching when parsing the path by inserting a sentinel Tested by: pho	2021-01-01 00:10:42 +00:00
Mateusz Guzik	67297766b5	cache: hoist trailing slash and degenerate path handling out of the loop Tested by: pho	2021-01-01 00:10:42 +00:00
Mateusz Guzik	bb3a12f0e5	fd: inline pwd_get_smr Tested by: pho	2021-01-01 00:10:42 +00:00

18364 Commits