freebsd-dev

Author	SHA1	Message	Date
Kyle Evans	0cd95859c8	[2/3] Add an initial seal argument to kern_shm_open() Now that flags may be set on posixshm, add an argument to kern_shm_open() for the initial seals. To maintain past behavior where callers of shm_open(2) are guaranteed to not have any seals applied to the fd they're given, apply F_SEAL_SEAL for existing callers of kern_shm_open. A special flag could be opened later for shm_open(2) to indicate that sealing should be allowed. We currently restrict initial seals to F_SEAL_SEAL. We cannot error out if F_SEAL_SEAL is re-applied, as this would easily break shm_open() twice to a shmfd that already existed. A note's been added about the assumptions we've made here as a hint towards anyone wanting to allow other seals to be applied at creation. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D21392	2019-09-25 17:35:03 +00:00
Kyle Evans	af755d3e48	[1/3] Add mostly Linux-compatible file sealing support File sealing applies protections against certain actions (currently: write, growth, shrink) at the inode level. New fileops are added to accommodate seals - EINVAL is returned by fcntl(2) if they are not implemented. Reviewed by: markj, kib Differential Revision: https://reviews.freebsd.org/D21391	2019-09-25 17:32:43 +00:00
Kyle Evans	85c5f3cb57	Add COMPAT12 support to makesyscalls.sh Reviewed by: kib, imp, brooks (all without syscalls.master edits) Differential Revision: https://reviews.freebsd.org/D21366	2019-09-25 17:29:45 +00:00
Toomas Soome	3001e0c942	kernel: terminal_init() should check for teken colors from kenv Check for teken.fg_color and teken.bg_color and prepare the color attributes accordingly. When white background is used, make it light to improve visibility. When black background is used, make kernel messages light.	2019-09-25 13:21:07 +00:00
Alexander Motin	bb3dfc6ae9	Fix wrong assertion in r352658. MFC after: 1 month	2019-09-25 11:58:54 +00:00
Alexander Motin	c9205e3500	Fix/improve interrupt threads scheduling. Doing some tests with very high interrupt rates I've noticed that one of conditions I added in r232207 to make interrupt threads in most cases run on local CPU never worked as expected (worked only if previous time it was executed on some other CPU, that is quite opposite). It caused additional CPU usage to run full CPU search and could schedule interrupt threads to some other CPU. This patch removes that code and instead reuses existing non-interrupt code path with some tweaks for interrupt case: - On SMT systems, if current thread is idle, don't look on other threads. Even if they are busy, it may take more time to do fill search and bounce the interrupt thread to other core then execute it locally, even sharing CPU resources. It is other threads should migrate, not bound interrupts. - Try hard to keep interrupt threads within LLC of their original CPU. This improves scheduling cost and supposedly cache and memory locality. On a test system with 72 threads doing 2.2M IOPS to NVMe this saves few percents of CPU time while adding few percents to IOPS. MFC after: 1 month Sponsored by: iXsystems, Inc.	2019-09-24 20:01:20 +00:00
Randall Stewart	35c7bb3407	This commit adds BBR (Bottleneck Bandwidth and RTT) congestion control. This is a completely separate TCP stack (tcp_bbr.ko) that will be built only if you add the make options WITH_EXTRA_TCP_STACKS=1 and also include the option TCPHPTS. You can also include the RATELIMIT option if you have a NIC interface that supports hardware pacing, BBR understands how to use such a feature. Note that this commit also adds in a general purpose time-filter which allows you to have a min-filter or max-filter. A filter allows you to have a low (or high) value for some period of time and degrade slowly to another value has time passes. You can find out the details of BBR by looking at the original paper at: https://queue.acm.org/detail.cfm?id=3022184 or consult many other web resources you can find on the web referenced by "BBR congestion control". It should be noted that BBRv1 (which this is) does tend to unfairness in cases of small buffered paths, and it will usually get less bandwidth in the case of large BDP paths(when competing with new-reno or cubic flows). BBR is still an active research area and we do plan on implementing V2 of BBR to see if it is an improvement over V1. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D21582	2019-09-24 18:18:11 +00:00
Mateusz Guzik	93a85508ad	cache: tidy up handling of negative entries - track the total count of hot entries - pre-read the lock when shrinking since it is typically already taken - place the lock in its own cacheline - shorten the hold time of hot lock list when zapping Sponsored by: The FreeBSD Foundation	2019-09-23 20:50:04 +00:00
Mark Johnston	38dae42c26	Use elf_relocaddr() when handling R_X86_64_RELATIVE relocations. This is required for DPCPU and VNET data variable definitions to work when KLDs are linked as DSOs. R_X86_64_RELATIVE relocations should not appear in object files, so assert this in elf_relocaddr(). Reviewed by: kib MFC after: 1 month Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21755	2019-09-23 14:14:43 +00:00
Mateusz Guzik	afe257e3ca	cache: count evictions of negatve entries Sponsored by: The FreeBSD Foundation	2019-09-23 08:53:14 +00:00
Sean Eric Fagan	ba7a55d934	Add two options to allow mount to avoid covering up existing mount points. The two options are * nocover/cover: Prevent/allow mounting over an existing root mountpoint. E.g., "mount -t ufs -o nocover /dev/sd1a /usr/local" will fail if /usr/local is already a mountpoint. * emptydir/noemptydir: Prevent/allow mounting on a non-empty directory. E.g., "mount -t ufs -o emptydir /dev/sd1a /usr" will fail. Neither of these options is intended to be a default, for historical and compatibility reasons. Reviewed by: allanjude, kib Differential Revision: https://reviews.freebsd.org/D21458	2019-09-23 04:28:07 +00:00
Mateusz Guzik	7505cffa56	cache: try to avoid vhold if locks held Sponsored by: The FreeBSD Foundation	2019-09-22 20:50:24 +00:00
Mateusz Guzik	cd2112c305	cache: jump in negative success instead of positive Sponsored by: The FreeBSD Foundation	2019-09-22 20:49:17 +00:00
Mateusz Guzik	d2be3ef05c	lockprof: move per-cpu data to dpcpu Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21747	2019-09-22 20:44:24 +00:00
Konstantin Belousov	f33533da8c	kern.elf{32,64}.pie_base sysctl: enforce page alignment. Requested by: rstone Sponsored by: The FreeBSD Foundation MFC after: 1 week	2019-09-21 20:03:17 +00:00
Mateusz Guzik	cbba2cb367	lockprof: use CPUFOREACH and drop always false lp_cpu NULL checks Sponsored by: The FreeBSD Foundation	2019-09-21 19:05:38 +00:00
Konstantin Belousov	95aafd6900	Make non-ASLR pie base tunable. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2019-09-21 18:00:23 +00:00
Alexander Motin	36d151a237	Allocate callout wheel from the respective memory domain. MFC after: 1 week	2019-09-21 15:38:08 +00:00
Andrew Gallatin	61b8a4af71	remove redundant "ktls" in KTLS thr name This reducesthe string width of the ktls thread name and improves "ps" output. Glanced at by: jhb Event: EuroBSDCon hackathon Sponsored by: Netflix	2019-09-20 09:36:07 +00:00
Mateusz Guzik	b488246b45	vfs: group fields used for per-cpu ops in one cacheline Sponsored by: The FreeBSD Foundation	2019-09-19 21:23:14 +00:00
Konstantin Belousov	382e01c8dc	sysctl: use names instead of magic numbers. Replace magic numbers with symbols for internal sysctl operations. Convert in-kernel and libc consumers. Submitted by: Pawel Biernacki MFC after: 1 week Differential revision: https://reviews.freebsd.org/D21693	2019-09-18 16:13:10 +00:00
Konstantin Belousov	55894117b1	Return EISDIR when directory is opened with O_CREAT without O_DIRECTORY. Reviewed by: bcr (man page), emaste (previous version) PR: 240452 Sponsored by: The FreeBSD Foundation MFC after: 1 week DIfferential revision: https://reviews.freebsd.org/D21634	2019-09-17 18:32:18 +00:00
Kirk McKusick	100369071d	The VFS-level clustering code collects together sequential blocks by issuing delayed-writes (bdwrite()) until a non-sequential block is written or the maximum cluster size is reached. At that point it collects the delayed buffers together (using bread()) to write them in a single operation. The assumption was that since we just looked at them they will still be in memory so there is no need to check for a read error from bread(). Very occationally (apparently every 10-hours or so when being pounded by Peter Holm's tests) this assumption is wrong. The fix is to check for errors from bread() and fail the cluster write thus falling back to the default individual flushing of any still dirty buffers. Reported by: Peter Holm and Chuck Silvers Reviewed by: kib MFC after: 3 days	2019-09-17 17:44:50 +00:00
Mateusz Guzik	d245aa1e72	vfs: apply r352437 to the fast path as well This one is very hard to run into. If the filesystem is being unmounted or the mount point is freed the vfs_op_thread_enter will fail. For it to succeed the mount point itself would have to be reallocated in the time window between the initial read and the attempt to enter. Sponsored by: The FreeBSD Foundation	2019-09-17 15:53:40 +00:00
Mateusz Guzik	7f65185940	vfs: fix braino resulting in NULL pointer deref in r352424 The breakage was added after all the testing and the testing which followed was not sufficient to find it. Reported by: pho Sponsored by: The FreeBSD Foundation	2019-09-17 08:09:39 +00:00
Mateusz Guzik	4cace859c2	vfs: convert struct mount counters to per-cpu There are 3 counters modified all the time in this structure - one for keeping the structure alive, one for preventing unmount and one for tracking active writers. Exact values of these counters are very rarely needed, which makes them a prime candidate for conversion to a per-cpu scheme, resulting in much better performance. Sample benchmark performing fstatfs (modifying 2 out of 3 counters) on a 104-way 2 socket Skylake system: before: 852393 ops/s after: 76682077 ops/s Reviewed by: kib, jeff Tested by: pho Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21637	2019-09-16 21:37:47 +00:00
Mateusz Guzik	e87f3f72f1	vfs: manage mnt_writeopcount with atomics See r352424. Reviewed by: kib, jeff Tested by: pho Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21575	2019-09-16 21:33:16 +00:00
Mateusz Guzik	ee831b2543	vfs: manage mnt_lockref with atomics See r352424. Reviewed by: kib, jeff Tested by: pho Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21574	2019-09-16 21:32:21 +00:00
Mateusz Guzik	a8c8e44bf0	vfs: manage mnt_ref with atomics New primitive is introduced to denote sections can operate locklessly on aspects of struct mount, but which can also be disabled if necessary. This provides an opportunity to start scaling common case modifications while providing stable state of the struct when facing unmount, write suspendion or other events. mnt_ref is the first counter to start being managed in this manner with the intent to make it per-cpu. Reviewed by: kib, jeff Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21425	2019-09-16 21:31:02 +00:00
Kyle Evans	3155f2f0e2	rangelock: add rangelock_cookie_assert A future change to posixshm to add file sealing (in DIFF_21391[0] and child) will move locking out of shm_dotruncate as kern_shm_open() will require the lock to be held across the dotruncate until the seal is actually applied. For this, the cookie is passed into shm_dotruncate_locked which asserts RCA_WLOCKED. [0] Name changed to protect the innocent, hopefully, from getting autoclosed due to this reference... Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D21628	2019-09-15 02:59:53 +00:00
Mateusz Guzik	ce3ba63f67	vfs: release usecount using fetchadd 1. If we release the last usecount we take ownership of the hold count, which means the vnode will remain allocated until we vdrop it. 2. If someone else vrefs they will find no usecount and will proceed to add their own hold count. 3. No code has a problem with v_usecount transitioning to 0 without the interlock These facts combined mean we can fetchadd instead of having a cmpset loop. Reviewed by: kib (previous version) Tested by: pho Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21528	2019-09-13 15:49:04 +00:00
Mark Johnston	45cdd437ae	Remove a redundant NULL pointer check in cpuset_modify_domain(). cpuset_getroot() is guaranteed to return a non-NULL pointer. Reported by: Mark Millard <marklmi@yahoo.com> MFC after: 1 week Sponsored by: The FreeBSD Foundation	2019-09-12 16:47:38 +00:00
Hans Petter Selasky	11b57401e6	Use REFCOUNT_COUNT() to obtain refcount where appropriate. Refcount waiting will set some flag bits in the refcount value. Make sure these bits get cleared by using the REFCOUNT_COUNT() macro to obtain the actual refcount. Differential Revision: https://reviews.freebsd.org/D21620 Reviewed by: kib@, markj@ MFC after: 1 week Sponsored by: Mellanox Technologies	2019-09-12 16:26:59 +00:00
Kyle Evans	5163b1a75c	Follow up r352244: kenv: tighten up assertions As I like to forget: static kenv var formatting is actually such that an empty environment would be double null bytes. We should make sure that a non-zero buffer has at least enough for this, though most of the current usage is with a 4k buffer.	2019-09-12 14:34:46 +00:00
Kyle Evans	436c46875d	kenv: assert that an empty static buffer passed in is "empty" Garbage in the passed-in buffer can cause problems if any attempts to read the kenv are inadvertently made between init_static_kenv and the first kern_setenv -- assuming there is one. This is cheap and easy, so do it. This also helps rule out some class of bugs as one tries to debug; tunables fetch from the static environment up until SI_SUB_KMEM + 1, and many of these buffers are global ~4k buffers that rely on BSS clearing while others just grab a page of free memory and use it (e.g. xen).	2019-09-12 13:51:43 +00:00
Conrad Meyer	aaa3852435	buf: Add B_INVALONERR flag to discard data Setting the B_INVALONERR flag before a synchronous write causes the buf cache to forcibly invalidate contents if the write fails (BIO_ERROR). This is intended to be used to allow layers above the buffer cache to make more informed decisions about when discarding dirty buffers without successful write is acceptable. As a proof of concept, use in msdosfs to handle failures to mark the on-disk 'dirty' bit during rw mount or ro->rw update. Extending this to other filesystems is left as future work. PR: 210316 Reviewed by: kib (with objections) Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D21539	2019-09-11 21:24:14 +00:00
Mateusz Guzik	b088a4d6f9	cache: avoid excessive relocking on entry removal during lookup Due to lock ordering issues (bucket lock held, vnode locks wanted) the code starts with trylocking which in face of contention often fails. Prior to the change it would loop back with a possible yield. Instead note we know what locks are needed and can take them in the right order, avoiding retries. Then we can safely re-lookup and see if the entry we are looking for is still there. On a 104-way box poudriere would result in constant retries during an 11h run as seen in the vfs.cache.zap_and_exit_bucket_fail counter. before: 408866592 after : 0 However, a new stat reports: vfs.cache.zap_and_exit_bucket_relock_success: 32638 Note this is only a bandaid over current design issues. Tested by: pho Sponsored by: The FreeBSD Foundation	2019-09-10 20:19:29 +00:00
Mateusz Guzik	a6cacb0dca	cache: change the formula for calculating lock array sizes It used to be mp_ncpus * 64, but this gives unnecessarily big values for small machines and at the same time constraints bigger ones. In particular this helps on a 104-way box for which the count is now doubled. While here make cache_purgevfs less likely. Currently it is not efficient in face of contention due to lock ordering issues. These are fixable but not worth it at the moment. Sponsored by: The FreeBSD Foundation	2019-09-10 20:11:00 +00:00
Mateusz Guzik	1214618c05	cache: assorted cleanups Sponsored by: The FreeBSD Foundation	2019-09-10 20:08:24 +00:00
Jeff Roberson	c75757481f	Replace redundant code with a few new vm_page_grab facilities: - VM_ALLOC_NOCREAT will grab without creating a page. - vm_page_grab_valid() will grab and page in if necessary. - vm_page_busy_acquire() automates some busy acquire loops. Discussed with: alc, kib, markj Tested by: pho (part of larger branch) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21546	2019-09-10 19:08:01 +00:00
Jeff Roberson	4cdea4a853	Use the sleepq lock rather than the page lock to protect against wakeup races with page busy state. The object lock is still used as an interlock to ensure that the identity stays valid. Most callers should use vm_page_sleep_if_busy() to handle the locking particulars. Reviewed by: alc, kib, markj Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21255	2019-09-10 18:27:45 +00:00
Mark Johnston	fee2a2fa39	Change synchonization rules for vm_page reference counting. There are several mechanisms by which a vm_page reference is held, preventing the page from being freed back to the page allocator. In particular, holding the page's object lock is sufficient to prevent the page from being freed; holding the busy lock or a wiring is sufficent as well. These references are protected by the page lock, which must therefore be acquired for many per-page operations. This results in false sharing since the page locks are external to the vm_page structures themselves and each lock protects multiple structures. Transition to using an atomically updated per-page reference counter. The object's reference is counted using a flag bit in the counter. A second flag bit is used to atomically block new references via pmap_extract_and_hold() while removing managed mappings of a page. Thus, the reference count of a page is guaranteed not to increase if the page is unbusied, unmapped, and the object's write lock is held. As a consequence of this, the page lock no longer protects a page's identity; operations which move pages between objects are now synchronized solely by the objects' locks. The vm_page_wire() and vm_page_unwire() KPIs are changed. The former requires that either the object lock or the busy lock is held. The latter no longer has a return value and may free the page if it releases the last reference to that page. vm_page_unwire_noq() behaves the same as before; the caller is responsible for checking its return value and freeing or enqueuing the page as appropriate. vm_page_wire_mapped() is introduced for use in pmap_extract_and_hold(). It fails if the page is concurrently being unmapped, typically triggering a fallback to the fault handler. vm_page_wire() no longer requires the page lock and vm_page_unwire() now internally acquires the page lock when releasing the last wiring of a page (since the page lock still protects a page's queue state). In particular, synchronization details are no longer leaked into the caller. The change excises the page lock from several frequently executed code paths. In particular, vm_object_terminate() no longer bounces between page locks as it releases an object's pages, and direct I/O and sendfile(SF_NOCACHE) completions no longer require the page lock. In these latter cases we now get linear scalability in the common scenario where different threads are operating on different files. __FreeBSD_version is bumped. The DRM ports have been updated to accomodate the KPI changes. Reviewed by: jeff (earlier version) Tested by: gallatin (earlier version), pho Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20486	2019-09-09 21:32:42 +00:00
Konstantin Belousov	6c46ce7ea3	Initialize timehands linkage much earlier. Reported and tested by: trasz Sponsored by: The FreeBSD Foundation MFC after: 1 week	2019-09-09 12:42:48 +00:00
Konstantin Belousov	4b23dec4c2	Make timehands count selectable at boottime. Tested by: O'Connor, Daniel <darius@dons.net.au> Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D21563	2019-09-09 11:29:58 +00:00
Konstantin Belousov	1040254b75	In do_execve(), use shared text vnode lock consistently. Reviewed by: markj MFC after: 1 week Differential revision: https://reviews.freebsd.org/D21560	2019-09-07 16:10:57 +00:00
Konstantin Belousov	1c36b72874	In do_execve(), clear imgp->textset when restarting for interpreter. Otherwise, we might left the boolean set, which would affect cleanup after an error on interpreter activation. Reviewed by: markj MFC after: 1 week Differential revision: https://reviews.freebsd.org/D21560	2019-09-07 16:05:17 +00:00
Konstantin Belousov	1073d17eeb	When loading ELF interpreter, initialize whole nested image_params with zero. Otherwise we could mishandle imgp->textset. Reviewed by: markj MFC after: 1 week Differential revision: https://reviews.freebsd.org/D21560	2019-09-07 16:03:26 +00:00
Philip Paeps	bdc786cc7c	riscv: restore default HZ=1000, keep QEMU at HZ=100 This reverts r351918 and r351919. Discussed with: br, ian, imp	2019-09-07 05:13:31 +00:00
Philip Paeps	7f0851ab19	riscv: default to HZ=100 Most current RISC-V development platforms are not fast enough to benefit from the increased granularity provided by HZ=1000. Sponsored by: Axiado	2019-09-06 01:19:31 +00:00
Conrad Meyer	a6935d085c	Remove long-dead BUF_ASSERT_{,UN}HELD assertions These were fully neutered in r177676 (2008), but not removed at the time for unclear reasons. They're totally dead code, so go ahead and yank them now. No functional change.	2019-09-05 21:43:33 +00:00
Mateusz Guzik	68c3c1abe1	vfs: temporarily revert r351825 There are 2 problems: - it introduces a funny bug where it can end up trylocking the same vnode [1] - it exposes a pre-existing softdep deadlock [2] Both are easier to run into that the bug which got fixed, so revert until a complete solution is worked out. Reported by: cy [1], pho [2] Sponsored by: The FreeBSD Foundation	2019-09-05 18:19:51 +00:00
Stephen J. Kiernan	d57cd5ccd3	Bump up the low range of cpuset numbers to account for the kernel cpuset. Reviewed by: jeff Obtained from: Juniper Networks, Inc.	2019-09-05 17:48:39 +00:00
Mateusz Guzik	c07d4a0a68	vfs: fully hold vnodes in vnlru_free_locked Currently the code only bumps holdcnt and clears the VI_FREE flag, not performing actual vhold. Since the vnode is still visible elsewhere, a potential new user can find it and incorrectly assume it is properly held. Use vholdl instead to correctly hold the vnode. Another place recycling (vlrureclaim) does this already. Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21522	2019-09-04 19:23:18 +00:00
Andriy Gapon	387df3b805	shutdown_halt: make sure that watchdog timer is stopped The point of halt is to keep the machine in limbo. Reviewed by: kib MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D21222	2019-09-04 13:26:59 +00:00
Kyle Evans	dca52ab480	posixshm: start counting writeable mappings r351650 switched posixshm to using OBJT_SWAP for shm_object r351795 added support to the swap_pager for tracking writeable mappings Take advantage of this and start tracking writeable mappings; fd sealing will use this to reject a seal on writing with EBUSY if any such mapping exist. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D21456	2019-09-03 20:33:38 +00:00
Kyle Evans	fe7bcbaf50	vm pager: writemapping accounting for OBJT_SWAP Currently writemapping accounting is only done for vnode_pager which does some accounting on the underlying vnode. Extend this to allow accounting to be possible for any of the pager types. New pageops are added to update/release writecount that need to be implemented for any pager wishing to do said accounting, and we implement these methods now for both vnode_pager (unchanged) and swap_pager. The primary motivation for this is to allow other systems with OBJT_SWAP objects to check if their objects have any write mappings and reject operations with EBUSY if so. posixshm will be the first to do so in order to reject adding write seals to the shmfd if any writable mappings exist. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D21456	2019-09-03 20:31:48 +00:00
Konstantin Belousov	fe69291ff4	Add procctl(PROC_STACKGAP_CTL) It allows a process to request that stack gap was not applied to its stacks, retroactively. Also it is possible to control the gaps in the process after exec. PR: 239894 Reviewed by: alc Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D21352	2019-09-03 18:56:25 +00:00
Mateusz Guzik	e3c3248cc7	vfs: implement usecount implying holdcnt vnodes have 2 reference counts - holdcnt to keep the vnode itself from getting freed and usecount to denote it is actively used. Previously all operations bumping usecount would also bump holdcnt, which is not necessary. We can detect if usecount is already > 1 (in which case holdcnt is also > 1) and utilize it to avoid bumping holdcnt on our own. This saves on atomic ops. Reviewed by: kib Tested by: pho (previous version) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21471	2019-09-03 15:42:11 +00:00
Mateusz Guzik	d05b53e0ba	Add sysctlbyname system call Previously userspace would issue one syscall to resolve the sysctl and then another one to actually use it. Do it all in one trip. Fallback is provided in case newer libc happens to be running on an older kernel. Submitted by: Pawel Biernacki Reported by: kib, brooks Differential Revision: https://reviews.freebsd.org/D17282	2019-09-03 04:16:30 +00:00
Mateusz Guzik	1874c61b90	vfs: restore mp null check in vop_stdgetwritemount The initially read mount point can already be NULL. Reported by: markj Fixes: r351656 ("vfs: stop refing freed mount points in vop_stdgetwritemount") Sponsored by: The FreeBSD Foundation	2019-09-02 15:24:25 +00:00
Mateusz Guzik	9576ff5803	proc: clear pid bitmap entry after dropping proctree lock There is no correctness change here, but the procid lock is contended in the fork path and taking it while holding proctree avoidably extends its hold time. Note that there are other ids which can end up getting cleared with the lock. Sponsored by: The FreeBSD Foundation	2019-09-02 12:46:43 +00:00
Mark Johnston	08cfa56ea3	Extend uma_reclaim() to permit different reclamation targets. The page daemon periodically invokes uma_reclaim() to reclaim cached items from each zone when the system is under memory pressure. This is important since the size of these caches is unbounded by default. However it also results in bursts of high latency when allocating from heavily used zones as threads miss in the per-CPU caches and must access the keg in order to allocate new items. With r340405 we maintain an estimate of each zone's usage of its (per-NUMA domain) cache of full buckets. Start making use of this estimate to avoid reclaiming the entire cache when under memory pressure. In particular, introduce TRIM, DRAIN and DRAIN_CPU verbs for uma_reclaim() and uma_zone_reclaim(). When trimming, only items in excess of the estimate are reclaimed. Draining a zone reclaims all of the cached full buckets (the previous behaviour of uma_reclaim()), and may further drain the per-CPU caches in extreme cases. Now, when under memory pressure, the page daemon will trim zones rather than draining them. As a result, heavily used zones do not incur bursts of bucket cache misses following reclamation, but large, unused caches will be reclaimed as before. Reviewed by: jeff Tested by: pho (an earlier version) MFC after: 2 months Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D16667	2019-09-01 22:22:43 +00:00
Mark Johnston	63cdd18e40	Restrict the input domain set in cpuset_setdomain(2) to all_domains. To permit larger values of MAXMEMDOM, which is currently 8 on amd64, cpuset_setdomain(2) accepts a mask of size 256. In the kernel, domain set masks are 64 bits wide, but can only represent a set of MAXMEMDOM domains due to the use of the ds_order table. Domain sets passed to cpuset_setdomain(2) are restricted to a subset of their parent set, which is typically the root set, but before this happens we modify the input set to exclude empty domains. domainset_empty_vm() and other code which manipulates domain sets expect the mask to be a subset of all_domains, so enforce that when performing validation of cpuset_setdomain(2) parameters. Reported and tested by: pho Reviewed by: kib MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21477	2019-09-01 21:38:08 +00:00
Mateusz Guzik	2796c209b0	vfs: stop refing freed mount points in vop_stdgetwritemount The code used blindly ref based on an unsafely red address and then would backpedal if necessary. This was safe in terms of memory access since mounts are type-stable, but made for a potential a bug where the mount was reused and had the count reset to 0 before this code decreased it. Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21411	2019-09-01 14:01:09 +00:00
Kyle Evans	32287ea72b	posixshm: switch to OBJT_SWAP in advance of other changes Future changes to posixshm will start tracking writeable mappings in order to support file sealing. Tracking writeable mappings for an OBJT_DEFAULT object is complicated as it may be swapped out and converted to an OBJT_SWAP. One may generically add this tracking for vm_object, but this is difficult to do without increasing memory footprint of vm_object and blowing up memory usage by a significant amount. On the other hand, the swap pager can be expanded to track writeable mappings without increasing vm_object size. This change is currently in D21456. Switch over to OBJT_SWAP in advance of the other changes to the swap pager and posixshm.	2019-09-01 00:33:16 +00:00
Mateusz Guzik	c2b600f98f	vfs: add a missing VNODE_REFCOUNT_FENCE_REL to v_incr_usecount_locked Sponsored by: The FreeBSD Foundation	2019-08-30 21:54:45 +00:00
Mateusz Guzik	3bb8d8d8c9	vfs: tidy up assertions in vfs_subr - assert unlocked vnode interlock in vref - assert right counts in vputx - print debug info for panic in vdrop Sponsored by: The FreeBSD Foundation	2019-08-30 00:45:53 +00:00
Konstantin Belousov	6470c8d3db	Rework v_object lifecycle for vnodes. Current implementation of vnode_create_vobject() and vnode_destroy_vobject() is written so that it prepared to handle the vm object destruction for live vnode. Practically, no filesystems use this, except for some remnants that were present in UFS till today. One of the consequences of that model is that each filesystem must call vnode_destroy_vobject() in VOP_RECLAIM() or earlier, as result all of them get rid of the v_object in reclaim. Move the call to vnode_destroy_vobject() to vgonel() before VOP_RECLAIM(). This makes v_object stable: either the object is NULL, or it is valid vm object till the vnode reclamation. Remove code from vnode_create_vobject() to handle races with the parallel destruction. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D21412	2019-08-29 07:50:25 +00:00
Mateusz Guzik	1e2f0ceb2f	vfs: add VOP_NEED_INACTIVE vnode usecount drops to 0 all the time (e.g. for directories during path lookup). When that happens the kernel would always lock the exclusive lock for the vnode in order to call vinactive(). This blocks other threads who want to use the vnode for looukp. vinactive is very rarely needed and can be tested for without the vnode lock held. This patch gives filesytems an opportunity to do it, sample total wait time for tmpfs over 500 minutes of poudriere -j 104: before: 557563641706 (lockmgr:tmpfs) after: 46309603301 (lockmgr:tmpfs) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21371	2019-08-28 20:34:24 +00:00
Mark Johnston	772dd133c6	Avoid direct accesses of the vm_page wire_count field. No functional change intended. Sponsored by: Netflix	2019-08-28 18:01:54 +00:00
Mateusz Guzik	88cc62e5a5	proc: eliminate the zombproc list It is not needed by anything in the kernel and it slightly drives up contention on both proctree and allproc locks. Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21447	2019-08-28 16:18:23 +00:00
Mark Johnston	b5d239cb97	Wire pages in vm_page_grab() when appropriate. uiomove_object_page() and exec_map_first_page() would previously wire a page after having grabbed it. Ask vm_page_grab() to perform the wiring instead: this removes some redundant code, and is cheaper in the case where the requested page is not resident since the page allocator can be asked to initialize the page as wired, whereas a separate vm_page_wire() call requires the page lock. In vm_imgact_hold_page(), use vm_page_unwire_noq() instead of vm_page_unwire(PQ_NONE). The latter ensures that the page is dequeued before returning, but this is unnecessary since vm_page_free() will trigger a batched dequeue of the page. Reviewed by: alc, kib Tested by: pho (part of a larger patch) MFC after: 1 week Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21440	2019-08-28 16:08:06 +00:00
Mateusz Guzik	2319489b6e	proc: remove zpfind It is not used by anything. If someone wants it back it should be reimplemented to use the proc hash. Sponsored by: The FreeBSD Foundation	2019-08-28 01:22:21 +00:00
John Baldwin	818d755318	Only define the 'tls' member of sfio in KERN_TLS is defined. This field was not initialized in the !KERN_TLS case triggering an assertion failure when using sendfile(2). Reported by: pho, asomers Sponsored by: Netflix	2019-08-27 22:21:18 +00:00
Mateusz Guzik	368cabbcb5	vfs: stop passing LK_INTERLOCK to VOP_UNLOCK The plan is to drop the flags argument. There is also a temporary bug now that nullfs ignores the flag. Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21252	2019-08-27 20:30:56 +00:00
Mark Johnston	44e4def73b	Remove an extraneous + 1 in _domainset_create(). DOMAINSET_FLS, like our fls(), is 1-indexed. Reported by: alc MFC after: 1 week Sponsored by: The FreeBSD Foundation	2019-08-27 15:42:08 +00:00
Mark Johnston	8e6975047e	Fix several logic issues in domainset_empty_vm(). - Don't add 1 to the result of DOMAINSET_FLS. - Do not modify domainsets containing only empty domains. - Always flatten a _PREFER policy to _ROUNDROBIN if the preferred domain is empty. Previously we were doing this only when ds_cnt > 1. These bugs could cause hangs during boot if a VM domain is empty. Tested by: hselasky Reviewed by: hselasky, kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21420	2019-08-27 14:06:34 +00:00
Konstantin Belousov	95acb40caa	vn_vget_ino_gen(): relock the lower vnode on error. The function' interface assumes that the lower vnode is passed and returned locked always. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2019-08-27 08:28:38 +00:00
John Baldwin	b2e60773c6	Add kernel-side support for in-kernel TLS. KTLS adds support for in-kernel framing and encryption of Transport Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports offload of TLS for transmitted data. Key negotation must still be performed in userland. Once completed, transmit session keys for a connection are provided to the kernel via a new TCP_TXTLS_ENABLE socket option. All subsequent data transmitted on the socket is placed into TLS frames and encrypted using the supplied keys. Any data written to a KTLS-enabled socket via write(2), aio_write(2), or sendfile(2) is assumed to be application data and is encoded in TLS frames with an application data type. Individual records can be sent with a custom type (e.g. handshake messages) via sendmsg(2) with a new control message (TLS_SET_RECORD_TYPE) specifying the record type. At present, rekeying is not supported though the in-kernel framework should support rekeying. KTLS makes use of the recently added unmapped mbufs to store TLS frames in the socket buffer. Each TLS frame is described by a single ext_pgs mbuf. The ext_pgs structure contains the header of the TLS record (and trailer for encrypted records) as well as references to the associated TLS session. KTLS supports two primary methods of encrypting TLS frames: software TLS and ifnet TLS. Software TLS marks mbufs holding socket data as not ready via M_NOTREADY similar to sendfile(2) when TLS framing information is added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then called to schedule TLS frames for encryption. In the case of sendfile_iodone() calls ktls_enqueue() instead of pru_ready() leaving the mbufs marked M_NOTREADY until encryption is completed. For other writes (vn_sendfile when pages are available, write(2), etc.), the PRUS_NOTREADY is set when invoking pru_send() along with invoking ktls_enqueue(). A pool of worker threads (the "KTLS" kernel process) encrypts TLS frames queued via ktls_enqueue(). Each TLS frame is temporarily mapped using the direct map and passed to a software encryption backend to perform the actual encryption. (Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if someone wished to make this work on architectures without a direct map.) KTLS supports pluggable software encryption backends. Internally, Netflix uses proprietary pure-software backends. This commit includes a simple backend in a new ktls_ocf.ko module that uses the kernel's OpenCrypto framework to provide AES-GCM encryption of TLS frames. As a result, software TLS is now a bit of a misnomer as it can make use of hardware crypto accelerators. Once software encryption has finished, the TLS frame mbufs are marked ready via pru_ready(). At this point, the encrypted data appears as regular payload to the TCP stack stored in unmapped mbufs. ifnet TLS permits a NIC to offload the TLS encryption and TCP segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS) is allocated on the interface a socket is routed over and associated with a TLS session. TLS records for a TLS session using ifnet TLS are not marked M_NOTREADY but are passed down the stack unencrypted. The ip_output_send() and ip6_output_send() helper functions that apply send tags to outbound IP packets verify that the send tag of the TLS record matches the outbound interface. If so, the packet is tagged with the TLS send tag and sent to the interface. The NIC device driver must recognize packets with the TLS send tag and schedule them for TLS encryption and TCP segmentation. If the the outbound interface does not match the interface in the TLS send tag, the packet is dropped. In addition, a task is scheduled to refresh the TLS send tag for the TLS session. If a new TLS send tag cannot be allocated, the connection is dropped. If a new TLS send tag is allocated, however, subsequent packets will be tagged with the correct TLS send tag. (This latter case has been tested by configuring both ports of a Chelsio T6 in a lagg and failing over from one port to another. As the connections migrated to the new port, new TLS send tags were allocated for the new port and connections resumed without being dropped.) ifnet TLS can be enabled and disabled on supported network interfaces via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported across both vlan devices and lagg interfaces using failover, lacp with flowid enabled, or lacp with flowid enabled. Applications may request the current KTLS mode of a connection via a new TCP_TXTLS_MODE socket option. They can also use this socket option to toggle between software and ifnet TLS modes. In addition, a testing tool is available in tools/tools/switch_tls. This is modeled on tcpdrop and uses similar syntax. However, instead of dropping connections, -s is used to force KTLS connections to switch to software TLS and -i is used to switch to ifnet TLS. Various sysctls and counters are available under the kern.ipc.tls sysctl node. The kern.ipc.tls.enable node must be set to true to enable KTLS (it is off by default). The use of unmapped mbufs must also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS. KTLS is enabled via the KERN_TLS kernel option. This patch is the culmination of years of work by several folks including Scott Long and Randall Stewart for the original design and implementation; Drew Gallatin for several optimizations including the use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records awaiting software encryption, and pluggable software crypto backends; and John Baldwin for modifications to support hardware TLS offload. Reviewed by: gallatin, hselasky, rrs Obtained from: Netflix Sponsored by: Netflix, Chelsio Communications Differential Revision: https://reviews.freebsd.org/D21277	2019-08-27 00:01:56 +00:00
Xin LI	4e8671dd78	GZIO: Update to use zlib 1.2.11. PR: 229763 Submitted by: Yoshihiro Ota <ota j email ne jp> Differential Revision: https://reviews.freebsd.org/D21408	2019-08-25 07:50:44 +00:00
Mateusz Guzik	0256405e98	vfs: add vholdnz (for already held vnodes) Reviewed by: kib (previous version) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21358	2019-08-25 05:11:43 +00:00
Mateusz Guzik	5b596b9fa5	Remove the obsolete pcpu_zone_ptr zone. It was only used by flowtable (removed in r321618). Sponsored by: The FreeBSD Foundation	2019-08-24 00:01:19 +00:00
Konstantin Belousov	e671edac06	De-commision the MNTK_NOINSMNTQ kernel mount flag. After all the changes, its dynamic scope is same as for MNTK_UNMOUNT, but to allow the syncer vnode to be re-installed on unmount failure. But the case of syncer was already handled by using the VV_FORCEINSMQ flag for quite some time. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2019-08-23 19:40:10 +00:00
Xin LI	a11bf9a49b	INVARIANTS: treat LA_LOCKED as the same of LA_XLOCKED in mtx_assert. The Linux lockdep API assumes LA_LOCKED semantic in lockdep_assert_held(), meaning that either a shared lock or write lock is Ok. On the other hand, the timeout code uses lc_assert() with LA_XLOCKED, and we need both to work. For mutexes, because they can not be shared (this is unique among all lock classes, and it is unlikely that we would add new lock class anytime soon), it is easier to simply extend mtx_assert to handle LA_LOCKED there, despite the change itself can be viewed as a slight abstraction violation. Reviewed by: mjg, cem, jhb MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D21362	2019-08-23 06:39:40 +00:00
Brooks Davis	075ac3b446	Reorganise conditionals to reduce duplication. No functional change. Obtained from: CheriBSD MFC after: 3 days Sponsored by: DARPA, AFRL	2019-08-22 10:21:07 +00:00
Rick Macklem	df9bc7df42	Map ENOTTY to EINVAL for lseek(SEEK_DATA/SEEK_HOLE). Without this patch, when an application performed lseek(SEEK_DATA/SEEK_HOLE) on a file in a file system that does not have its own VOP_IOCTL(), the lseek(2) fails with errno ENOTTY. This didn't seem appropriate, since ENOTTY is not listed as an error return by either the lseek(2) man page nor the POSIX draft for lseek(2). This was discussed on freebsd-current@ here: http://docs.FreeBSD.org/cgi/mid.cgi?CAOtMX2iiQdv1+15e1N_r7V6aCx_VqAJCTP1AW+qs3Yg7sPg9wA This trivial patch maps ENOTTY to EINVAL for lseek(SEEK_DATA/SEEK_HOLE). Reviewed by: markj Relnotes: yes Differential Revision: https://reviews.freebsd.org/D21300	2019-08-22 01:15:06 +00:00
Mark Johnston	5b699f1614	Add lockmgr(9) probes to the lockstat DTrace provider. They follow the conventions set by rw and sx lock probes. There is an additional lockstat:::lockmgr-disown probe. Update lockstat(1) to report on contention and hold events for lockmgr locks. Document the new probes in dtrace_lockstat.4, and deduplicate some of the existing probe descriptions. Reviewed by: mjg MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21355	2019-08-21 23:43:58 +00:00
Mark Johnston	9fb7c918ef	Remove manual wire_count adjustments from the unmapped mbuf code. The original code came from a desire to minimize the number of updates to v_wire_count, which prior to r329187 was updated using atomics. However, there is no significant benefit to batching today, so simply allocate pages using VM_ALLOC_WIRED and rely on system accounting. Reviewed by: jhb Differential Revision: https://reviews.freebsd.org/D21323	2019-08-21 20:01:52 +00:00
Mark Johnston	6bc13e042f	Modify pipe_poll() to properly check for pending direct writes. With r349546, it is a responsibility of the writer to clear PIPE_DIRECTW after pinned data has been read. In particular, once a reader has drained this data, there is a small window where the pipe is empty but PIPE_DIRECTW is set. pipe_poll() was using the presence of PIPE_DIRECTW to determine whether to return POLLIN, so in this window it would claim that data was available to read when this was not the case. Fix this by modifying several checks for PIPE_DIRECTW to instead look at the number of residual bytes in data pinned by a direct writer. In some cases we really do want to check for PIPE_DIRECTW, since the presence of this flag indicates that any attempt to write to the pipe will block on the existing direct writer. Bisected and test case provided by: mav Tested by: pho Reviewed by: kib MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21333	2019-08-21 19:35:04 +00:00
Ed Maste	f37192064a	mqueuefs: fix compat32 struct file leak In a compat32 error case we previously leaked a struct file. Submitted by: Karsten König, Secfault Security Security: CVE-2019-5603	2019-08-20 17:44:03 +00:00
Jeff Roberson	cf27e0d125	Use an atomic reference count for paging in progress so that callers do not require the object lock. Reviewed by: markj Tested by: pho (as part of a larger branch) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21311	2019-08-19 23:09:38 +00:00
Mateusz Guzik	4b3f767340	vfs: fix up r351193 ("stop always overwriting ->mnt_stat in VFS_STATFS") fs-specific part of vfs_statfs routines only fill in small portion of the structure. Previous code was always copying everything at a higher layer to acoomodate it and this patch does the same. 'df' (no arguments) worked fine because the caller uses mnt_stat itself as the target buffer, making all the copying a no-op for its own case. 'df /' and similar use a different consumer which passes its own buffer and this is where you can run into trouble. Reported by: cy Fixes: r351193 Sponsored by: The FreeBSD Foundation	2019-08-19 14:11:54 +00:00
Andrey V. Elsukov	75697b16b6	Use TAILQ_FOREACH_SAFE() macro to avoid use after free in soclose(). PR: 239893 MFC after: 1 week	2019-08-19 12:42:03 +00:00
Andriy Gapon	0db7afd0ae	assert that td_lk_slocks is not leaked upon return from kernel This is similar to checks for td_sx_slocks and td_rw_rlocks. Although td_lk_slocks is an implementation detail, it still makes sense to validate it. MFC after: 1 week Sponsored by: Panzura	2019-08-19 11:18:36 +00:00
Rick Macklem	2e1b32c0e3	Add a vop_stdioctl() that performs a trivial FIOSEEKDATA/FIOSEEKHOLE. Without this patch, when an application performed lseek(SEEK_DATA/SEEK_HOLE) on a file in a file system that does not have its own VOP_IOCTL(), the lseek(2) fails with errno ENOTTY. This didn't seem appropriate, since ENOTTY is not listed as an error return by either the lseek(2) man page nor the POSIX draft for lseek(2). A discussion on freebsd-current@ seemed to indicate that implementing a trivial algorithm that returns the offset argument for FIOSEEKDATA and returns the file's size for FIOSEEKHOLE was the preferred fix. http://docs.FreeBSD.org/cgi/mid.cgi?CAOtMX2iiQdv1+15e1N_r7V6aCx_VqAJCTP1AW+qs3Yg7sPg9wA The Linux kernel appears to implement this trivial algorithm as well. This patch adds a vop_stdioctl() that implements this trivial algorithm. It returns errors consistent with vn_bmap_seekhole() and, as such, will still return ENOTTY for non-regular files. I have proposed a separate patch that maps errors not described by the lseek(2) man page nor POSIX draft to EINVAL. This patch is under separate review. Reviewed by: kib Relnotes: yes Differential Revision: https://reviews.freebsd.org/D21299	2019-08-19 00:29:05 +00:00
Konstantin Belousov	de4e1aeb21	Fix an issue with executing tmpfs binary. Suppose that a binary was executed from tmpfs mount, and the text vnode was reclaimed while the binary was still running. It is possible during even the normal operations since tmpfs vnode' vm_object has swap type, and no references on the vnode is held. Also assume that the text vnode was revived for some reason. Then, on the process exit or exec, unmapping of the text mapping tries to remove the text reference from the vnode, but since it went from recycle/instantiation cycle, there is no reference kept, and assertion in VOP_UNSET_TEXT_CHECKED() triggers. Fix this by keeping a use reference on the tmpfs vnode for each exec reference. This prevents the vnode reclamation while executable map entry is active. Do it by adding per-mount flag MNTK_TEXT_REFS that directs vop_stdset_text() to add use ref on first vnode text use, and per-vnode VI_TEXT_REF flag, to record the need on unref in vop_stdunset_text() on last vnode text use going away. Set MNTK_TEXT_REFS for tmpfs mounts. Reported by: bdrewery Tested by: sbruno, pho (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2019-08-18 20:36:11 +00:00
Konstantin Belousov	bb9e2184f0	Change locking requirements for VOP_UNSET_TEXT(). Require the vnode to be locked for the VOP_UNSET_TEXT() call. This will be used by the following bug fix for a tmpfs issue. Tested by: sbruno, pho (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2019-08-18 20:24:52 +00:00
Mateusz Guzik	e7c1709aaf	vfs: stop always overwriting ->mnt_stat in VFS_STATFS The struct is already populated on each mount (and remount). Fields are either constant or not used by filesystem in the first place. Some infrequently used functions use it to avoid having to allocate a new buffer and are left alone. The current code results in an avoidable copying single-threaded and significant cache line bouncing multithreaded While here deduplicate initial filling of the struct. Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21317	2019-08-18 18:40:12 +00:00
Jeff Roberson	33205c60e7	Add a blocking wait bit to refcount. This allows refs to be used as a simple barrier. Reviewed by: markj, kib Discussed with: jhb Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21254	2019-08-18 11:43:58 +00:00
Mateusz Guzik	50c7615fb0	fork: rework locking around do_fork - move allproc lock into the func, it is of no use prior to it - the code would lock p1 and p2 while holding allproc to partially construct it after it gets added to the list. instead we can do the work prior to adding anything. - protect lastpid with procid_lock As a side effect we do less work with allproc held. Sponsored by: The FreeBSD Foundation	2019-08-17 18:19:49 +00:00
Mateusz Guzik	60cdcb644d	fork: bump process count before checking for permission to cross the limit The limit is almost never reached. Do the check only on failure to see if we can override it. No change in user-visible behavior. Sponsored by: The FreeBSD Foundation	2019-08-17 17:56:43 +00:00
Mateusz Guzik	b05641b6bd	fork: stop skipping < 100 ids on wrap around Code doing this is commented with a claim that these IDs are occupied by daemons, but that's demonstrably false. To an extent the range is used by init and kernel processes (and on sufficiently big machines it indeed is fully populated). On a sample box 40-way box the highest id in the range is 63. On a different one it is 23. Just use the range. Sponsored by: The FreeBSD Foundation	2019-08-17 17:42:01 +00:00
Alexander Motin	3a60f3dad0	Add support for 'j', 't' and 'z' flags to kernel sscanf(). MFC after: 2 weeks	2019-08-16 19:46:22 +00:00
Jeff Roberson	2194393787	Move phys_avail definition into MI code. It is consumed in the MI layer and doing so adds more flexibility with less redundant code. Reviewed by: jhb, markj, kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21250	2019-08-16 00:45:14 +00:00
Rick Macklem	c61b14315f	Fix copy_file_range(2) so that unneeded blocks are not allocated to the output file. When the byte range for copy_file_range(2) doesn't go to EOF on the output file and there is a hole in the input file, a hole must be "punched" in the output file. This is done by writing a block of bytes all set to 0. Without this patch, the write is done unconditionally which means that, if the output file already has a hole in that byte range, a unneeded data block of all 0 bytes would be allocated. This patch adds code to check for a hole in the output file, so that it can skip doing the write if there is already a hole in that byte range of the output file. This avoids unnecessary allocation of blocks to the output file. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D21155	2019-08-15 23:21:41 +00:00
Jeff Roberson	018ff6860f	Move scheduler state into the per-cpu area where it can be allocated on the correct NUMA domain. Reviewed by: markj, gallatin Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D19315	2019-08-13 04:54:02 +00:00
Konstantin Belousov	7e097daa93	Only enable COMPAT_43 changes for syscalls ABI for a.out processes. Reviewed by: imp, jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D21200	2019-08-11 19:16:07 +00:00
Jonathan T. Looney	afd959f332	In m_pulldown(), before trying to prepend bytes to the subsequent mbuf, ensure that the subsequent mbuf contains the remainder of the bytes the caller sought. If this is not the case, fall through to the code which gathers the bytes in a new mbuf. This fixes a bug where m_pulldown() could fail to gather all the desired bytes into consecutive memory. PR: 238787 Reported by: A reddit user Discussed with: emaste Obtained from: NetBSD MFC after: 3 days	2019-08-09 05:18:59 +00:00
Rick Macklem	6b1bc6f7dd	Remove some harmless cruft from vn_generic_copy_file_range(). An earlier version of the patch had code that set "error" between line#s 2797-2799. When that code was moved, the second check for "error != 0" could never be true and the check became harmless cruft. This patch removes the cruft, mainly to make Coverity happy. Reported by: asomers, cem	2019-08-08 20:07:38 +00:00
Rick Macklem	614633146f	Fix copy_file_range(2) for an unlikely race during hole finding. Since the VOP_IOCTL(FIOSEEKDATA/FIOSEEKHOLE) calls are done with the vnode unlocked, it is possible for another thread to do: - truncate(), lseek(), write() between the two calls and create a hole where FIOSEEKDATA returned the start of data. For this case, VOP_IOCTL(FIOSEEKHOLE) will return the same offset for the hole location. This could result in an infinite loop in the copy code, since copylen is set to 0 and the copy doesn't advance. Usually, this race is avoided because of the use of rangelocks, but the NFS server does not do range locking and could do a sequence like the above to create the hole. This patch checks for this case and makes the hole search fail, to avoid the infinite loop. At this time, it is an open question as to whether or not the NFS server should do range locking to avoid this race.	2019-08-08 19:53:07 +00:00
Konstantin Belousov	b706be23b4	Update comment explaining create_init(). Sponsored by: The FreeBSD Foundation MFC after: 3 days	2019-08-08 16:42:53 +00:00
Xin LI	22bbc4b242	Convert DDB_CTF to use newer version of ZLIB. PR: 229763 Submitted by: Yoshihiro Ota <ota j email ne jp> Differential Revision: https://reviews.freebsd.org/D21176	2019-08-08 07:27:49 +00:00
Conrad Meyer	7d0658ad55	Fix !DDB kernel configurations after r350713 KDB is standard and the kdb_active variable is always available. So, de-conditionalize inclusion of sys/kdb.h in kern_sysctl.c. Reported by: Michael Butler <imb AT protected-networks.net> X-MFC-With: r350713 Sponsored by: Dell EMC Isilon	2019-08-08 01:37:41 +00:00
Conrad Meyer	088c17b46b	ddb(4): Add 'sysctl' command Implement `sysctl` in `ddb` by overriding `SYSCTL_OUT`. When handling the req, we install custom ddb in/out handlers. The out handler prints straight to the debugger, while the in handler ignores all input. This is intended to allow us to print just about any sysctl. There is a known issue when used from ddb(4) entered via 'sysctl debug.kdb.enter=1'. The DDB mode does not quite prevent all lock interactions, and it is possible for the recursive Giant lock to be unlocked when the ddb(4) 'sysctl' command is used. This may result in a panic on return from ddb(4) via 'c' (continue). Obviously, this is not a problem when debugging already-paniced systems. Submitted by: Travis Lane (formerly: <travis.lane AT isilon.com>) Reviewed by: vangyzen (earlier version), Don Morris <dgmorris AT earthlink.net> Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D20219	2019-08-08 00:42:29 +00:00
Conrad Meyer	76cb1112da	sbuf(9): Add sbuf_nl_terminate() API The API is used to gracefully terminate text line(s) with a single \n. If the formatted buffer was empty or already ended in \n, it is unmodified. Otherwise, a newline character is appended to it. The API, like other sbuf-modifying routines, is only valid while the sbuf is not FINISHED. Reviewed by: rlibby Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D21030	2019-08-07 19:27:14 +00:00
Conrad Meyer	d23813cdb9	sbuf(9): Refactor sbuf_newbuf into sbuf_new Code flow was somewhat difficult to read due to the combination of multiple return sites and the 4x possible dynamic constructions of an sbuf. (Future consideration: do we need all 4?) Refactored slightly to improve legibility. No functional change. Sponsored by: Dell EMC Isilon	2019-08-07 19:25:56 +00:00
Conrad Meyer	71db411eb6	sbuf(9): Add NOWAIT dynamic buffer extension mode The goal is to avoid some kinds of low-memory deadlock when formatting heap-allocated buffers. Reviewed by: vangyzen Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D21015	2019-08-07 19:23:07 +00:00
Gleb Smirnoff	814f33aafb	Since r350426 this KASSERT doesn't serve any useful purpose.	2019-08-06 16:11:00 +00:00
Mariusz Zaborski	c878d1eb45	procdesc: fix the function name I changed name of the function r350429 and forgot to update the r350612 patch. Reported by: jenkins MFC after: 1 month	2019-08-05 20:31:17 +00:00
Mariusz Zaborski	9f5103abab	process: style We don't need to check if the parent is already set. This is done already in the proc_reparent. No functional behaviour changes intended. MFC after: 1 month	2019-08-05 20:26:01 +00:00
Mariusz Zaborski	a05cfdf479	exit1: fix style nits MFC after: 1 month	2019-08-05 20:20:14 +00:00
Mariusz Zaborski	fd631bcd95	procdesc: fix reparenting when the debugger is attached The process is reparented to the debugger while it is attached. B B / ----> \| A A D Every time when the process is reparented, it is added to the orphan list of the previous parent: A->orphan = B D->orphan = NULL When the A process will close the process descriptor to the B process, the B process will be reparented to the init process. B B - init \| ----> A D A D A->orphan = B D->orphan = B In this scenario, the B process is in the orphan list of A and D. When the last process descriptor is closed instead of reparenting it to the reaper let it stay with the debugger process and set our previews parent to the reaper. Add test case for this situation. Notice that without this patch the kernel will crash with this test case: panic: orphan 0xfffff8000e990530 of 0xfffff8000e990000 has unexpected oppid 1 Reviewed by: markj, kib MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D20361	2019-08-05 20:15:46 +00:00
Mariusz Zaborski	799d92ab78	proc: introduce the proc_add_orphan function This API allows adding the process to its parent orphan list. Reviewed by: kib, markj MFC after: 1 month	2019-08-05 20:11:57 +00:00
Mariusz Zaborski	41fadb3fca	exit1: postpone clearing P_TRACED flag until the proctree lock is acquired In case of the process being debugged. The P_TRACED is cleared very early, which would make procdesc_close() not calling proc_clear_orphan(). That would result in the debugged process can not be able to collect status of the process with process descriptor. Reviewed by: markj, kib Tested by: pho MFC after: 1 month	2019-08-05 19:59:23 +00:00
Konstantin Belousov	a1549acbaf	Fix mis-merge. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2019-08-05 19:19:25 +00:00
Konstantin Belousov	01c3ba9752	Fix mis-merge Sponsored by: The FreeBSD Foundation MFC after: 1 week	2019-08-05 19:16:33 +00:00
Justin Hibbits	937a05ba81	Add necessary bits for Linux KPI to work correctly on powerpc PowerPC, and possibly other architectures, use different address ranges for PCI space vs physical address space, which is only mapped at resource activation time, when the BAR gets written. The DRM kernel modules do not activate the rman resources, soas not to waste KVA, instead only mapping parts of the PCI memory at a time. This introduces a BUS_TRANSLATE_RESOURCE() method, implemented in the Open Firmware/FDT PCI driver, to perform this necessary translation without activating the resource. In addition to system KPI changes, LinuxKPI is updated to handle a big-endian host, by adding proper endian swaps to the I/O functions. Submitted by: mmacy Reported by: hselasky Differential Revision: https://reviews.freebsd.org/D21096	2019-08-04 19:28:10 +00:00
John Baldwin	f422bc3092	Set ISOPEN in namei flags when opening executable interpreters. These vnodes are explicitly opened via VOP_OPEN via exec_check_permissions identical to the main exectuable image. Setting ISOPEN allows filesystems to perform suitable checks in VOP_LOOKUP (e.g. close-to-open consistency in the NFS client). Reviewed by: kib MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D21129	2019-08-03 01:02:52 +00:00
Mark Johnston	8675f5f776	Only check the blessings table for known LORs. Previously we would check for blessings before marking a given lock pair as reversed, so each "reversed" lock acquisition would require a linear scan of the table. Instead, check the table after marking the pair as reversed but before generating a report. Reviewed by: jhb MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21135	2019-08-02 18:01:47 +00:00
Konstantin Belousov	5cbdd18fd4	Make umtxq_check_susp() to correctly handle thread exit requests. The check for P_SINGLE_EXIT was shadowed by the (P_SHOULDSTOP \|\| traced) check. Reported by: bdrewery (might be) Reviewed by: markj Tested by: pho MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D21124	2019-08-01 14:34:27 +00:00
Konstantin Belousov	fc83c5a7d0	Make randomized stack gap between strings and pointers to argv/envs. This effectively makes the stack base on the csu _start entry randomized. The gap is enabled if ASLR is for the ABI is enabled, and then kern.elf{64,32}.aslr.stack_gap specify the max percentage of the initial stack size that can be wasted for gap. Setting it to zero disables the gap, and max is capped at 50%. Only amd64 for now. Reviewed by: cem, markj Discussed with: emaste MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D21081	2019-07-31 20:23:10 +00:00
Konstantin Belousov	fd336e2ac0	Fix handling of transient casueword(9) failures in do_sem_wait(). In particular, restart should be only done when the failure is transient. For this, recheck the count1 value after the operation. Note that do_sem_wait() is older usem interface. Reported and tested by: bdrewery Sponsored by: The FreeBSD Foundation MFC after: 1 week	2019-07-31 19:16:49 +00:00
Kyle Evans	b5a7ac997f	kern_shm_open: push O_CLOEXEC into caller control The motivation for this change is to allow wrappers around shm to be written that don't set CLOEXEC. kern_shm_open currently accepts O_CLOEXEC but sets it unconditionally. kern_shm_open is used by the shm_open(2) syscall, which is mandated by POSIX to set CLOEXEC, and CloudABI's sys_fd_create1(). Presumably O_CLOEXEC is intended in the latter caller, but it's unclear from the context. sys_shm_open() now unconditionally sets O_CLOEXEC to meet POSIX requirements, and a comment has been dropped in to kern_fd_open() to explain the situation and add a pointer to where O_CLOEXEC setting is maintained for shm_open(2) correctness. CloudABI's sys_fd_create1() also unconditionally sets O_CLOEXEC to match previous behavior. This also has the side-effect of making flags correctly reflect the O_CLOEXEC status on this fd for the rest of kern_shm_open(), but a glance-over leads me to believe that it didn't really matter. Reviewed by: kib, markj MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D21119	2019-07-31 15:16:51 +00:00
Mark Johnston	49c3e8c8d1	Enable witness(4) blessings. witness has long had a facility to "bless" designated lock pairs. Lock order reversals between a pair of blessed locks are not reported upon. We have a number of long-standing false positive LOR reports; start marking well-understood LORs as blessed. This change hides reports about UFS vnode locks and the UFS dirhash lock, and UFS vnode locks and buffer locks, since those are the two that I observe most often. In the long term it would be preferable to be able to limit blessings to a specific site where a lock is acquired, and/or extend witness to understand why some lock order reversals are valid (for example, if code paths with conflicting lock orders are serialized by a third lock), but in the meantime the false positives frequently confuse users and generate bug reports. Reviewed by: cem, kib, mckusick MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21039	2019-07-30 17:09:58 +00:00
Mark Johnston	ed13ff4549	Regenerate after r350447.	2019-07-30 16:01:16 +00:00
Mark Johnston	f30f7b9870	Enable copy_file_range(2) in capability mode. copy_file_range() operates on a pair of file descriptors; it requires CAP_READ for the source descriptor and CAP_WRITE for the destination descriptor. Reviewed by: kevans, oshogbo Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21113	2019-07-30 15:59:44 +00:00
Xin LI	d4565741c6	Remove gzip'ed a.out support. The current implementation of gzipped a.out support was based on a very old version of InfoZIP which ships with an ancient modified version of zlib, and was removed from the GENERIC kernel in 1999 when we moved to an ELF world. PR: 205822 Reviewed by: imp, kib, emaste, Yoshihiro Ota <ota at j.email.ne.jp> Relnotes: yes Differential Revision: https://reviews.freebsd.org/D21099	2019-07-30 05:13:16 +00:00
Mark Johnston	98549e2dc6	Centralize the logic in vfs_vmio_unwire() and sendfile_free_page(). Both of these functions atomically unwire a page, optionally attempt to free the page, and enqueue or requeue the page. Add functions vm_page_release() and vm_page_release_locked() to perform the same task. The latter must be called with the page's object lock held. As a side effect of this refactoring, the buffer cache will no longer attempt to free mapped pages when completing direct I/O. This is consistent with the handling of pages by sendfile(SF_NOCACHE). Reviewed by: alc, kib MFC after: 2 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20986	2019-07-29 22:01:28 +00:00
Mariusz Zaborski	9db97ca0bd	proc: make clear_orphan an public API This will be useful for other patches with process descriptors. Change its name as well. Reviewed by: markj, kib	2019-07-29 21:42:57 +00:00
Alan Somers	0367bca479	sendfile: don't panic when VOP_GETPAGES_ASYNC returns an error This is a partial merge of 350144 from projects/fuse2 PR: 236466 Reviewed by: markj MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21095	2019-07-29 20:50:26 +00:00
Mark Johnston	918988576c	Avoid relying on header pollution from sys/refcount.h. MFC after: 3 days Sponsored by: The FreeBSD Foundation	2019-07-29 20:26:01 +00:00
Alan Somers	e7d8ebc8ca	Better comments for vlrureclaim MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2019-07-28 16:07:27 +00:00
Alan Somers	2240d8c465	Add v_inval_buf_range, like vtruncbuf but for a range of a file v_inval_buf_range invalidates all buffers within a certain LBA range of a file. It will be used by fusefs(5). This commit is a partial merge of r346162, r346606, and r346756 from projects/fuse2. Reviewed by: kib MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21032	2019-07-28 00:48:28 +00:00
Rick Macklem	bf499e87f5	Update the generated syscall files for copy_file_range(2) added by r350315.	2019-07-25 05:55:55 +00:00
Rick Macklem	bbbbeca3e9	Add kernel support for a Linux compatible copy_file_range(2) syscall. This patch adds support to the kernel for a Linux compatible copy_file_range(2) syscall and the related VOP_COPY_FILE_RANGE(9). This syscall/VOP can be used by the NFSv4.2 client to implement the Copy operation against an NFSv4.2 server to do file copies locally on the server. The vn_generic_copy_file_range() function in this patch can be used by the NFSv4.2 server to implement the Copy operation. Fuse may also me able to use the VOP_COPY_FILE_RANGE() method. vn_generic_copy_file_range() attempts to maintain holes in the output file in the range to be copied, but may fail to do so if the input and output files are on different file systems with different _PC_MIN_HOLE_SIZE values. Separate commits will be done for the generated syscall files and userland changes. A commit for a compat32 syscall will be done later. Reviewed by: kib, asomers (plus comments by brooks, jilles) Relnotes: yes Differential Revision: https://reviews.freebsd.org/D20584	2019-07-25 05:46:16 +00:00
Mark Johnston	2fb62b1a46	Fix the turnstile_lock() KPI. turnstile_{lock,unlock}() were added for use in epoch. turnstile_lock() returned NULL to indicate that the calling thread had lost a race and the turnstile was no longer associated with the given lock, or the lock owner. However, reader-writer locks may not have a designated owner, in which case turnstile_lock() would return NULL and epoch_block_handler_preempt() would leak spinlocks as a result. Apply a minimal fix: return the lock owner as a separate return value. Reviewed by: kib MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21048	2019-07-24 23:04:59 +00:00
Mark Johnston	e020a35f5b	Remove a redundant offset computation in elf_load_section(). With r344705 the offset is always zero. Submitted by: Wuyang Chung <wuyang.chung1@gmail.com>	2019-07-24 15:18:05 +00:00
Ed Maste	051e692a99	mqueuefs: fix struct file leak In some error cases we previously leaked a stuct file. Submitted by: mjg, markj	2019-07-23 20:59:36 +00:00
Alan Somers	caaa7cee09	[skip ci] Fix the comment for cache_purge(9) This is a merge of r348738 from projects/fuse2 Reviewed by: kib MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2019-07-22 21:03:52 +00:00
Konstantin Belousov	f1cf2b9dcb	Check and avoid overflow when incrementing fp->f_count in fget_unlocked() and fhold(). On sufficiently large machine, f_count can be legitimately very large, e.g. malicious code can dup same fd up to the per-process filedescriptors limit, and then fork as much as it can. On some smaller machine, I see kern.maxfilesperproc: 939132 kern.maxprocperuid: 34203 which already overflows u_int. More, the malicious code can create transient references by sending fds over unix sockets. I realized that this check is missed after reading https://secfault-security.com/blog/FreeBSD-SA-1902.fd.html Reviewed by: markj (previous version), mjg Tested by: pho (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D20947	2019-07-21 15:07:12 +00:00
Konstantin Belousov	47c3450e50	Fix leak of memory and file refs with sendmsg(2) over unix domain sockets. When sendmsg(2) sucessfully internalized one SCM_RIGHTS control message, but failed to process some other control message later, both file references and filedescent memory needs to be freed. This was not done, only mbuf chain was freed. Noted, test case written, reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D21000	2019-07-19 20:51:39 +00:00
Alan Somers	fca79580be	sendfile: don't panic when VOP_GETPAGES_ASYNC returns an error PR: 236466 Sponsored by: The FreeBSD Foundation	2019-07-19 18:03:30 +00:00
Alan Somers	d26d63a4af	fusefs: multiple interruptility improvements 1) Don't explicitly not mask SIGKILL. kern_sigprocmask won't allow it to be masked, anyway. 2) Fix an infinite loop bug. If a process received both a maskable signal lower than 9 (like SIGINT) and then received SIGKILL, fticket_wait_answer would spin. msleep would immediately return EINTR, but cursig would return SIGINT, so the sleep would get retried. Fix it by explicitly checking whether SIGKILL has been received. 3) Abandon the sig_isfatal optimization introduced by r346357. That optimization would cause fticket_wait_answer to return immediately, without waiting for a response from the server, if the process were going to exit anyway. However, it's vulnerable to a race: 1) fatal signal is received while fticket_wait_answer is sleeping. 2) fticket_wait_answer sends the FUSE_INTERRUPT operation. 3) fticket_wait_answer determines that the signal was fatal and returns without waiting for a response. 4) Another thread changes the signal to non-fatal. 5) The first thread returns to userspace. Instead of exiting, the process continues. 6) The application receives EINTR, wrongly believes that the operation was successfully interrupted, and restarts it. This could cause problems for non-idempotent operations like FUSE_RENAME. Reported by: kib (the race part) Sponsored by: The FreeBSD Foundation	2019-07-17 22:45:43 +00:00
Alan Somers	0122532ee0	F_READAHEAD: Fix r349248's overflow protection, broken by r349391 I accidentally broke the main point of r349248 when making stylistic changes in r349391. Restore the original behavior, and also fix an additional overflow that was possible when uio->uio_resid was nearly SSIZE_MAX. Reported by: cem Reviewed by: bde MFC after: 2 weeks MFC-With: 349248 Sponsored by: The FreeBSD Foundation	2019-07-17 17:01:07 +00:00
Eric van Gyzen	9d3ecb7e62	Adds signal number format to kern.corefile Add format capability to core file names to include signal that generated the core. This can help various validation workflows where all cores should not be considered equally (SIGQUIT is often intentional and not an error unlike SIGSEGV or SIGBUS) Submitted by: David Leimbach (leimy2k@gmail.com) Reviewed by: markj MFC after: 1 week Relnotes: sysctl kern.corefile can now include the signal number Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D20970	2019-07-16 15:51:09 +00:00
John Baldwin	32451fb9fc	Add ptrace op PT_GET_SC_RET. This ptrace operation returns a structure containing the error and return values from the current system call. It is only valid when a thread is stopped during a system call exit (PL_FLAG_SCX is set). The sr_error member holds the error value from the system call. Note that this error value is the native FreeBSD error value that has _not_ been translated to an ABI-specific error value similar to the values logged to ktrace. If sr_error is zero, then the return values of the system call will be set in sr_retval[0] and sr_retval[1]. Reviewed by: kib MFC after: 1 month Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D20901	2019-07-15 21:48:02 +00:00
John Baldwin	c18ca74916	Don't pass error from syscallenter() to syscallret(). syscallret() doesn't use error anymore. Fix a few other places to permit removing the return value from syscallenter() entirely. - Remove a duplicated assertion from arm's syscall(). - Use td_errno for amd64_syscall_ret_flush_l1d. Reviewed by: kib MFC after: 1 month Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D2090	2019-07-15 21:25:16 +00:00
John Baldwin	1af9474b26	Always set td_errno to the error value of a system call. Early errors prior to a system call did not set td_errno. This commit sets td_errno for all errors during syscallenter(). As a result, syscallret() can now always use td_errno without checking TDP_NERRNO. Reviewed by: kib MFC after: 1 month Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D20898	2019-07-15 21:16:01 +00:00
Konstantin Belousov	9f7b7da5bf	In do_sem2_wait(), balance umtx_key_get() with umtx_key_release() on retry. Reported by: ler Bisected and reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 12 days	2019-07-15 19:18:25 +00:00
Konstantin Belousov	ad038195bd	In do_lock_pi(), do not return prematurely. If umtxq_check_susp() indicates an exit, we should clean the resources before returning. Do it by breaking out of the loop and relying on post-loop cleanup. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 12 days Differential revision: https://reviews.freebsd.org/D20949	2019-07-15 08:39:52 +00:00
Konstantin Belousov	40bd868ba7	Correctly check for casueword(9) success in do_set_ceiling(). After r349951, the return code must be checked instead of old == new comparision. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 12 days Differential revision: https://reviews.freebsd.org/D20949	2019-07-15 08:38:01 +00:00
Michael Tuexen	a85b7f125b	Improve the input validation for l_linger. When using the SOL_SOCKET level socket option SO_LINGER, the structure struct linger is used as the option value. The component l_linger is of type int, but internally copied to the field so_linger of the structure struct socket. The type of so_linger is short, but it is assumed to be non-negative and the value is used to compute ticks to be stored in a variable of type int. Therefore, perform input validation on l_linger similar to the one performed by NetBSD and OpenBSD. Thanks to syzkaller for making me aware of this issue. Thanks to markj@ for pointing out that a similar check should be added to so_linger_set(). Reviewed by: markj@ MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D20948	2019-07-14 21:44:18 +00:00
Konstantin Belousov	30b3018d48	Provide protection against starvation of the ll/sc loops when accessing userpace. Casueword(9) on ll/sc architectures must be prepared for userspace constantly modifying the same cache line as containing the CAS word, and not loop infinitely. Otherwise, rogue userspace livelocks the kernel. To fix the issue, change casueword(9) interface to return new value 1 indicating that either comparision or store failed, instead of relying on the oldval == *oldvalp comparison. The primitive no longer retries the operation if it failed spuriously. Modify callers of casueword(9), all in kern_umtx.c, to handle retries, and react to stops and requests to terminate between retries. On x86, despite cmpxchg should not return spurious failures, we can take advantage of the new interface and just return PSL.ZF. Reviewed by: andrew (arm64, previous version), markj Tested by: pho Reported by: https://xenbits.xen.org/xsa/advisory-295.txt Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D20772	2019-07-12 18:43:24 +00:00
Doug Moore	3f3f7c056f	Address problems in blist_alloc introduced in r349777. The swap block allocator could become corrupted if a retry to allocate swap space, after a larger allocation attempt failed, allocated a smaller set of free blocks that ended on a 32- or 64-block boundary. Add tests to detect this kind of failure-to-extend-at-boundary and prevent the associated accounting screwup. Reported by: pho Tested by: pho Reviewed by: alc Approved by: markj (mentor) Discussed with: kib Differential Revision: https://reviews.freebsd.org/D20893	2019-07-11 20:52:39 +00:00
Mark Johnston	2ffee5c1b2	Inherit P2_PROTMAX_{ENABLE,DISABLE} across fork(). Thus, when using proccontrol(1) to disable implicit application of PROT_MAX within a process, child processes will inherit this setting. Discussed with: kib MFC with: r349609 Sponsored by: The FreeBSD Foundation	2019-07-10 19:57:48 +00:00
John Baldwin	c26541e315	Use 'retval' label for first error in syscallenter(). This is more consistent with the rest of the function and lets us unindent most of the function. Reviewed by: kib MFC after: 1 month Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D20897	2019-07-09 23:58:12 +00:00
Mark Johnston	eeacb3b02f	Merge the vm_page hold and wire mechanisms. The hold_count and wire_count fields of struct vm_page are separate reference counters with similar semantics. The remaining essential differences are that holds are not counted as a reference with respect to LRU, and holds have an implicit free-on-last unhold semantic whereas vm_page_unwire() callers must explicitly determine whether to free the page once the last reference to the page is released. This change removes the KPIs which directly manipulate hold_count. Functions such as vm_fault_quick_hold_pages() now return wired pages instead. Since r328977 the overhead of maintaining LRU for wired pages is lower, and in many cases vm_fault_quick_hold_pages() callers would swap holds for wirings on the returned pages anyway, so with this change we remove a number of page lock acquisitions. No functional change is intended. __FreeBSD_version is bumped. Reviewed by: alc, kib Discussed with: jeff Discussed with: jhb, np (cxgbe) Tested by: pho (previous version) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D19247	2019-07-08 19:46:20 +00:00
Doug Moore	31c82722c1	Change blist_next_leaf_alloc so that it can examine more than one leaf after the one where the possible block allocation begins, and allocate a larger number of blocks than the current limit. This does not affect the limit on minimum allocation size, which still cannot exceed BLIST_MAX_ALLOC. Use this change to modify swp_pager_getswapspace and its callers, so that they can allocate more than BLIST_MAX_ALLOC blocks if they are available. Tested by: pho Approved by: markj (mentor) Differential Revision: https://reviews.freebsd.org/D20579	2019-07-06 06:15:03 +00:00
Mark Johnston	6a01874c5a	Defer funsetown() calls for a TTY to tty_rel_free(). We were otherwise failing to call funsetown() for some descriptors associated with a tty, such as pts descriptors. Then, if the descriptor is closed before the owner exits, we may get memory corruption. Reported by: syzbot+c9b6206303bf47bac87e@syzkaller.appspotmail.com Reviewed by: ed MFC after: 3 days Sponsored by: The FreeBSD Foundation	2019-07-04 15:42:02 +00:00
Eric van Gyzen	8c5a9161d1	Save the last callout function executed on each CPU Save the last callout function pointer (and its argument) executed on each CPU for inspection by a debugger. Add a ddb `show callout_last` command to show these pointers. Add a kernel module that I used for testing that command. Relocate `ce_migration_cpu` to reduce padding and therefore preserve the size of `struct callout_cpu` (320 bytes on amd64) despite the added members. This should help diagnose reference-after-free bugs where the callout's mutex has already been freed when `softclock_call_cc` tries to unlock it. You might hope that the pointer would still be available, but it isn't. The argument to that function is on the stack (because `softclock_call_cc` uses it later), and that might be enough in some cases, but even then, it's very laborious. A pointer to the callout is saved right before these newly added fields, but that callout might have been freed. We still have the pointer to its associated mutex, and the name within might be enough, but it might also have been freed. Reviewed by: markj jhb MFC after: 2 weeks Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D20794	2019-07-03 19:22:44 +00:00
John Baldwin	afa60c068e	Invoke ext_free function when freeing an unmapped mbuf. Fix a mis-merge when extracting the unmapped mbuf changes from Netflix's in-kernel TLS changes where the call to the function that freed the backing pages from an unmapped mbuf was missed. Sponsored by: Chelsio Communications	2019-07-02 22:58:21 +00:00
John Baldwin	9b2d70da33	Fix description of debug.obsolete_panic. MFC after: 1 week	2019-07-02 22:57:24 +00:00
Konstantin Belousov	7fde3c6b28	More style. Re-wrap long lines, reformat comments, remove excessive blank line. Sponsored by: The FreeBSD Foundation MFC after: 3 days	2019-07-02 21:03:06 +00:00
Konstantin Belousov	4b8b28e130	Style. Sponsored by: The FreeBSD Foundation MFC after: 3 days	2019-07-02 19:32:48 +00:00
Konstantin Belousov	5dc7e31a09	Control implicit PROT_MAX() using procctl(2) and the FreeBSD note feature bit. In particular, allocate the bit to opt-out the image from implicit PROTMAX enablement. Provide procctl(2) verbs to set and query implicit PROTMAX handling. The knobs mimic the same per-image flag and per-process controls for ASLR. Reviewed by: emaste, markj (previous version) Discussed with: brooks Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D20795	2019-07-02 19:07:17 +00:00
Mark Johnston	6d958292f3	Fix handling of errors from sblock() in soreceive_stream(). Previously we would attempt to unlock the socket buffer despite having failed to lock it. Simply return an error instead: no resources need to be released at this point, and doing so is consistent with soreceive_generic(). PR: 238789 Submitted by: Greg Becker <greg@codeconcepts.com> MFC after: 1 week	2019-07-02 14:24:42 +00:00
Rick Macklem	555d8f2859	Factor out the code that does a VOP_SETATTR(size) from vn_truncate(). This patch factors the code in vn_truncate() that does the actual VOP_SETATTR() of size into a separate function called vn_truncate_locked(). This will allow the NFS server and the patch that adds a copy_file_range(2) syscall to call this function instead of duplicating the code and carrying over changes, such as the recent r347151. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D20808	2019-07-01 20:41:43 +00:00
Mark Johnston	7c3703a694	Use a consistent snapshot of the fd's rights in fget_mmap(). fget_mmap() translates rights on the descriptor to a VM protection mask. It was doing so without holding any locks on the descriptor table, so a writer could simultaneously be modifying those rights. Such a situation would be detected using a sequence counter, but not before an inconsistency could trigger assertion failures in the capability code. Fix the problem by copying the fd's rights to a structure on the stack, and perform the translation only once we know that that snapshot is consistent. Reported by: syzbot+ae359438769fda1840f8@syzkaller.appspotmail.com Reviewed by: brooks, mjg MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D20800	2019-06-29 16:11:09 +00:00
Mark Johnston	02476c44c5	Fix mutual exclusion in pipe_direct_write(). We use PIPE_DIRECTW as a semaphore for direct writes to a pipe, where the reader copies data directly from pages mapped into the writer. However, when a reader finishes such a copy, it previously cleared PIPE_DIRECTW, allowing multiple writers to race and corrupt the state used to track wired pages belonging to the writer. Fix this by having the writer clear PIPE_DIRECTW and instead use the count of unread bytes to determine whether a write is finished. Reported by: syzbot+21811cc0a89b2a87a9e7@syzkaller.appspotmail.com Reviewed by: kib, mjg Tested by: pho MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D20784	2019-06-29 16:05:52 +00:00
John Baldwin	3807631b8e	Compress pending socket buffer data once it is marked ready. Apply similar logic from sbcompress to pending data in the socket buffer once it is marked ready via sbready. Normally sbcompress merges small mbufs to reduce the length of mbuf chains in the socket buffer. However, sbcompress cannot do this for mbufs marked M_NOTREADY. sbcompress_ready is now called from sbready when mbufs are marked ready to merge small mbuf chains once the data is available to copy. Submitted by: gallatin (earlier version) Reviewed by: gallatin, hselasky, rrs Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20616	2019-06-29 00:50:25 +00:00
John Baldwin	cec06a3edc	Add support for using unmapped mbufs with sendfile(2). This can be enabled at runtime via the kern.ipc.mb_use_ext_pgs sysctl. It is disabled by default. Submitted by: gallatin (earlier version) Reviewed by: gallatin, hselasky, rrs Relnotes: yes Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20616	2019-06-29 00:49:35 +00:00
John Baldwin	82334850ea	Add an external mbuf buffer type that holds multiple unmapped pages. Unmapped mbufs allow sendfile to carry multiple pages of data in a single mbuf, without mapping those pages. It is a requirement for Netflix's in-kernel TLS, and provides a 5-10% CPU savings on heavy web serving workloads when used by sendfile, due to effectively compressing socket buffers by an order of magnitude, and hence reducing cache misses. For this new external mbuf buffer type (EXT_PGS), the ext_buf pointer now points to a struct mbuf_ext_pgs structure instead of a data buffer. This structure contains an array of physical addresses (this reduces cache misses compared to an earlier version that stored an array of vm_page_t pointers). It also stores additional fields needed for in-kernel TLS such as the TLS header and trailer data that are currently unused. To more easily detect these mbufs, the M_NOMAP flag is set in m_flags in addition to M_EXT. Various functions like m_copydata() have been updated to safely access packet contents (using uiomove_fromphys()), to make things like BPF safe. NIC drivers advertise support for unmapped mbufs on transmit via a new IFCAP_NOMAP capability. This capability can be toggled via the new 'nomap' and '-nomap' ifconfig(8) commands. For NIC drivers that only transmit packet contents via DMA and use bus_dma, adding the capability to if_capabilities and if_capenable should be all that is required. If a NIC does not support unmapped mbufs, they are converted to a chain of mapped mbufs (using sf_bufs to provide the mapping) in ip_output or ip6_output. If an unmapped mbuf requires software checksums, it is also converted to a chain of mapped mbufs before computing the checksum. Submitted by: gallatin (earlier version) Reviewed by: gallatin, hselasky, rrs Discussed with: ae, kp (firewalls) Relnotes: yes Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20616	2019-06-29 00:48:33 +00:00
Konstantin Belousov	2d7a555294	Style. Sponsored by: The FreeBSD Foundation MFC after: 3 days	2019-06-28 20:40:54 +00:00
Hans Petter Selasky	131b2b7658	Implement API for draining EPOCH(9) callbacks. The epoch_drain_callbacks() function is used to drain all pending callbacks which have been invoked by prior epoch_call() function calls on the same epoch. This function is useful when there are shared memory structure(s) referred to by the epoch callback(s) which are not refcounted and are rarely freed. The typical place for calling this function is right before freeing or invalidating the shared resource(s) used by the epoch callback(s). This function can sleep and is not optimized for performance. Differential Revision: https://reviews.freebsd.org/D20109 MFC after: 1 week Sponsored by: Mellanox Technologies	2019-06-28 10:38:56 +00:00
Alan Somers	7f49ce7a0b	MFHead @349476 Sponsored by: The FreeBSD Foundation	2019-06-27 23:50:54 +00:00
Alan Somers	0cfc1ef38d	FIOBMAP2: inline vn_ioc_bmap2 Reported by: kib Reviewed by: kib MFC after: 2 weeks MFC-With: 349238 Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D20783	2019-06-27 23:39:06 +00:00
Rick Macklem	e368095437	Add non-blocking trylock variants for the rangelock functions. A future patch that will add a Linux compatible copy_file_range(2) syscall needs to be able to lock the byte ranges of two files concurrently. To do this without a risk of deadlock, a non-blocking variant of vn_rangelock_rlock() called vn_rangelock_tryrlock() was needed. This patch adds this, along with vn_rangelock_trywlock(), in order to do this. The patch also adds a couple of comments, that I hope clarify how the algorithm used in kern_rangelock.c works. Reviewed by: kib, asomers (previous version) Differential Revision: https://reviews.freebsd.org/D20645	2019-06-27 23:10:40 +00:00
John Baldwin	1db2626a9b	Fix comment in sofree() to reference sbdestroy(). r160875 added sbdestroy() as a wrapper around sbrelease_internal to be called from sofree(), yet the comment added in the same revision to sofree() still mentions sbrelease_internal(). Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20488	2019-06-27 22:50:11 +00:00
Alan Somers	4f53d57e8c	fcntl: style changes to r349248 Reported by: bde MFC after: 2 weeks MFC-With: 349248 Sponsored by: The FreeBSD Foundation	2019-06-25 19:44:22 +00:00
Warner Losh	7a3e3a2859	Remove a couple of harmless stray references to nandfs. Submitted by: tsoome@	2019-06-25 16:39:25 +00:00
Warner Losh	af9727f618	Add missing include of sys/boot.h This change was dropped out in a rebase and I didn't catch that before I committed.	2019-06-24 20:52:21 +00:00
Warner Losh	ec9abc1843	Move to using a common kernel path between the boot / laoder bits and the kernel.	2019-06-24 20:34:53 +00:00
Konstantin Belousov	89f2ab0608	Switch to check for effective user id in r349320, and disable dumping into existing files for sugid processes. Despite using real user id pronounces the intent, it actually breaks suid coredumps, while not making any difference for non-sugid processes. The reason for the breakage is that non-existent core file is created with the effective uid (unless weird hacks like SUIDDIR are configured). Then, if user enabled kern.sugid_coredump, core dumping should not overwrite core files owned by effective uid, but we cannot pretend to use real uid for dumping. PR: 68905 admbugs: 358 Sponsored by: The FreeBSD Foundation MFC after: 1 week	2019-06-23 21:15:31 +00:00
Konstantin Belousov	7a29e0bf96	coredump: avoid writing to core files not owned by the real user. Reported by: blake frantz <trew@hick.org> PR: 68905 admbugs: 358 Sponsored by: The FreeBSD Foundation MFC after: 1 week	2019-06-23 18:35:11 +00:00
Alan Somers	38b06f8ac4	fcntl: fix overflow when setting F_READAHEAD VOP_READ and VOP_WRITE take the seqcount in blocks in a 16-bit field. However, fcntl allows you to set the seqcount in bytes to any nonnegative 31-bit value. The result can be a 16-bit overflow, which will be sign-extended in functions like ffs_read. Fix this by sanitizing the argument in kern_fcntl. As a matter of policy, limit to IO_SEQMAX rather than INT16_MAX. Also, fifos have overloaded the f_seqcount field for a completely different purpose ever since r238936. Formalize that by using a union type. Reviewed by: cem MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D20710	2019-06-20 23:07:20 +00:00
Alan Somers	e532a99901	MFHead @349234 Sponsored by: The FreeBSD Foundation	2019-06-20 15:56:08 +00:00
Alan Somers	d49b446bfb	Add FIOBMAP2 ioctl This ioctl exposes VOP_BMAP information to userland. It can be used by programs like fragmentation analyzers and optimized cp implementations. But I'm using it to test fusefs's VOP_BMAP implementation. The "2" in the name distinguishes it from the similar but incompatible FIBMAP ioctls in NetBSD and Linux. FIOBMAP2 differs from FIBMAP in that it uses a 64-bit block number instead of 32-bit, and it also returns runp and runb. Reviewed by: mckusick MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D20705	2019-06-20 14:13:10 +00:00
Alan Somers	d01752c703	Add a VOP_BMAP(9) man page Reviewed by: mckusick MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D20704	2019-06-20 13:59:46 +00:00
Alexander Motin	f91aa773be	Add wakeup_any(), cheaper wakeup_one() for taskqueue(9). wakeup_one() and underlying sleepq_signal() spend additional time trying to be fair, waking thread with highest priority, sleeping longest time. But in case of taskqueue there are many absolutely identical threads, and any fairness between them is quite pointless. It makes even worse, since round-robin wakeups not only make previous CPU affinity in scheduler quite useless, but also hide from user chance to see CPU bottlenecks, when sequential workload with one request at a time looks evenly distributed between multiple threads. This change adds new SLEEPQ_UNFAIR flag to sleepq_signal(), making it wakeup thread that went to sleep last, but no longer in context switch (to avoid immediate spinning on the thread lock). On top of that new wakeup_any() function is added, equivalent to wakeup_one(), but setting the flag. On top of that taskqueue(9) is switchied to wakeup_any() to wakeup its threads. As result, on 72-core Xeon v4 machine sequential ZFS write to 12 ZVOLs with 16KB block size spend 34% less time in wakeup_any() and descendants then it was spending in wakeup_one(), and total write throughput increased by ~10% with the same as before CPU usage. Reviewed by: markj, mmacy MFC after: 2 weeks Sponsored by: iXsystems, Inc. Differential Revision: https://reviews.freebsd.org/D20669	2019-06-20 01:15:33 +00:00
Alexander Motin	49ee0fcea5	Use sbuf_cat() in GEOM confxml generation. When it comes to megabytes of text, difference between sbuf_printf() and sbuf_cat() becomes substantial. MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2019-06-19 15:36:02 +00:00
Alexander Motin	eeea0fcf0f	Fix typo in r349178. Reported by: ae MFC after: 1 week	2019-06-19 13:30:50 +00:00
Alexander Motin	5c32e9fcb2	Optimize kern.geom.conf* sysctls. On large systems those sysctls may generate megabytes of output. Before this change sbuf(9) code was resizing buffer by 4KB each time many times, generating tons of TLB shootdowns. Unfortunately in this case existing sbuf_new_for_sysctl() mechanism, supposed to help with this issue, is not applicable, since all the sbuf writes are done in different kernel thread. This change improves situation in two ways: - on first sysctl call, not providing any output buffer, it sets special sbuf drain function, just counting the data and so not needing big buffer; - on second sysctl call it uses as initial buffer size value saved on previous call, so that in most cases there will be no reallocation, unless GEOM topology changed significantly. MFC after: 1 week Sponsored by: iXsystems, Inc.	2019-06-18 21:05:10 +00:00
Xin LI	f89d207279	Separate kernel crc32() implementation to its own header (gsb_crc32.h) and rename the source to gsb_crc32.c. This is a prerequisite of unifying kernel zlib instances. PR: 229763 Submitted by: Yoshihiro Ota <ota at j.email.ne.jp> Differential Revision: https://reviews.freebsd.org/D20193	2019-06-17 19:49:08 +00:00
Alexander Motin	c80038a0a7	Update td_runtime of running thread on each statclock(). Normally td_runtime is updated on context switch, but there are some kernel threads that due to high absolute priority may run for many seconds without context switches (yes, that is bad, but that is true), which means their td_runtime was not updated all that time, that made them invisible for top other then as some general CPU usage. MFC after: 1 week Sponsored by: iXsystems, Inc.	2019-06-14 01:09:10 +00:00
Stephen Hurd	705aad98c6	Some devices take undesired actions when RTS and DTR are asserted. Some development boards for example will reset on DTR, and some radio interfaces will transmit on RTS. This patch allows "stty -f /dev/ttyu9.init -rtsdtr" to prevent RTS and DTR from being asserted on open(), allowing these devices to be used without problems. Reviewed by: imp Differential Revision: https://reviews.freebsd.org/D20031	2019-06-12 18:07:04 +00:00
John Baldwin	0f70218343	Make the warning intervals for deprecated crypto algorithms tunable. New sysctl/tunables can now set the interval (in seconds) between rate-limited crypto warnings. The new sysctls are: - kern.cryptodev_warn_interval for /dev/crypto - net.inet.ipsec.crypto_warn_interval for IPsec - kern.kgssapi_warn_interval for KGSSAPI Reviewed by: cem MFC after: 1 month Relnotes: yes Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D20555	2019-06-11 23:00:55 +00:00
John Baldwin	19c40c76e1	Trim an extra space.	2019-06-11 22:06:05 +00:00
Bjoern A. Zeeb	4c62bffef5	Fix dpcpu and vnet panics with complex types at the end of the section. Apply a linker script when linking i386 kernel modules to apply padding to a set_pcpu or set_vnet section. The padding value is kind-of random and is used to catch modules not compiled with the linker-script, so possibly still having problems leading to kernel panics. This is needed as the code generated on certain architectures for non-simple-types, e.g., an array can generate an absolute relocation on the edge (just outside) the section and thus will not be properly relocated. Adding the padding to the end of the section will ensure that even absolute relocations of complex types will be inside the section, if they are the last object in there and hence relocation will work properly and avoid panics such as observed with carp.ko or ipsec.ko. There is a rather lengthy discussion of various options to apply in the mentioned PRs and their depends/blocks, and the review. There seems no best solution working across multiple toolchains and multiple version of them, so I took the liberty of taking one, as currently our users (and our CI system) are hitting this on just i386 and we need some solution. I wish we would have a proper fix rather than another "hack". Also backout r340009 which manually, temporarily fixed CARP before 12.0-R "by chance" after a lead-up of various other link-elf.c and related fixes. PR: 230857,238012 With suggestions from: arichardson (originally last year) Tested by: lwhsu Event: Waterloo Hackathon 2019 Reported by: lwhsu, olivier MFC after: 6 weeks Differential Revision: https://reviews.freebsd.org/D17512	2019-06-08 17:44:42 +00:00
Alan Somers	0269ae4c19	MFHead @348740 Sponsored by: The FreeBSD Foundation	2019-06-06 16:20:50 +00:00
Alan Somers	d10b757886	[skip ci] Better comments for vlrureclaim Sponsored by: The FreeBSD Foundation	2019-06-06 15:11:36 +00:00
Alan Somers	571908e2b4	[skip ci] Fix the comment for cache_purge(9) Sponsored by: The FreeBSD Foundation	2019-06-06 15:07:49 +00:00
Alan Somers	46f8169aea	Add a testing facility to manually reclaim a vnode Add the debug.try_reclaim_vnode sysctl. When a pathname is written to it, it will be reclaimed, as long as it isn't already or doomed. The purpose is to gain test coverage for vnode reclamation, which is otherwise hard to achieve. Add the debug.ftry_reclaim_vnode sysctl. It does the same thing, except that its argument is a file descriptor instead of a pathname. Reviewed by: kib MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D20519	2019-06-06 15:04:50 +00:00
Ed Maste	004caac2d8	style(9) / tidying for r348611 MFC with: r348611 Event: Waterloo Hackathon 2019	2019-06-04 13:45:30 +00:00
Ed Maste	74cd06b42e	Expose the kernel's build-ID through sysctl After our migration (of certain architectures) to lld the kernel is built with a unique build-ID. Make it available via a sysctl and uname(1) to allow the user to identify their running kernel. Submitted by: Ali Mashtizadeh <ali_mashtizadeh.com> MFC after: 2 weeks Relnotes: Yes Event: Waterloo Hackathon 2019 Differential Revision: https://reviews.freebsd.org/D20326	2019-06-04 13:07:10 +00:00
John Baldwin	e8d8cc0139	Warn about deprecated features on all major OS versions. Reviewed by: imp MFC after: 3 days Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D20490	2019-06-03 15:43:40 +00:00
Konstantin Belousov	f6e5ddff6b	Remove dead check. We already handled the case when symstrindex < 0 at line 680. Reported by: danfe using PVS-studio Sponsored by: The FreeBSD Foundation MFC after: 1 week	2019-06-03 15:23:37 +00:00
Brooks Davis	4af6033324	makesyscalls.sh: always use absolute path for syscalls.conf syscalls.conf is included using "." which per the Open Group: If file does not contain a <slash>, the shell shall use the search path specified by PATH to find the directory containing file. POSIX shells don't fall back to the current working directory. Submitted by: Nathaniel Wesley Filardo <nwf20@cl.cam.ac.uk> Reviewed by: bdrewery Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D20476	2019-05-30 20:56:23 +00:00
Dmitry Chagin	c8124e20e5	Remove wrong inline keyword. Reported by: markj MFC after: 1 week	2019-05-30 16:11:20 +00:00
Konstantin Belousov	5c066cd2e2	Remove TODO comment after posixshmcontrol(1) added. Sponsored by: The FreeBSD Foundation MFC after: 3 days	2019-05-30 16:04:00 +00:00
Konstantin Belousov	5d993207da	Silence witness warning about duplicated mutex type. The order is correct, it is nullfs vnode interlock -> lower vnode interlock. vop_stdadd_writecount() is called from nullfs VOP_ADD_WRITECOUNT() and both take interlocks. Requested by: markj Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2019-05-30 15:04:09 +00:00
Dmitry Chagin	c5afec6e89	Complete LOCAL_PEERCRED support. Cache pid of the remote process in the struct xucred. Do not bump XUCRED_VERSION as struct layout is not changed. PR: 215202 Reviewed by: tijl MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D20415	2019-05-30 14:24:26 +00:00
Konstantin Belousov	ab74c84333	Do not go into sleep in sleepq_catch_signals() when SIGSTOP from PT_ATTACH was consumed. In particular, do not clear TDP_FSTP in ptracestop() if td_wchan is non-NULL. Leave it to sleepq_catch_signal() to clear and convert zero return code to EINTR. Otherwise, per submitter report, if the PT_ATTACH SIGSTOP was delivered right after the thread was added to the sleepqueue but not yet really sleep, and cursig() caused debugger attach, the thread sleeps instead of returning to the userspace boundary with EINTR. PR: 231445 Reported by: Efi Weiss <valmarelox@gmail.com> Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D20381	2019-05-29 14:05:27 +00:00
Andrew Turner	5393cbce3d	Teach the kernel KUBSAN runtime about alignment_assumption This checks the alignment of a given pointer is sufficient for the requested alignment asked for. This fixes the build with a recent llvm/clang. Sponsored by: DARPA, AFRL	2019-05-28 09:12:15 +00:00
Justin Hibbits	a5868885fa	kern/CTF: link_elf_ctf_get() on big endian platforms Check the CTF magic number in big endian platforms. This lets DTrace FBT handle types correctly on these platforms. Submitted by: Brandon Bergren MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D20413	2019-05-27 04:20:31 +00:00
Conrad Meyer	5d0e829978	Disable intr_storm_threshold mechanism by default The ixl.4 manual page has documented that the threshold falsely detects interrupt storms on 40Gbit NICs as long ago as 2015, and we have seen similar false positives with the ioat(4) DMA device (which can push GB/s). For example, synthetic load can be generated with tools/tools/ioat 'ioatcontrol 0 200 8192 1 1000' (allocate 200x8kB buffers, generate an interrupt for each one, and do this for 1000 milliseconds). With storm-detection disabled, the Broadwell-EP version of this device is capable of generating ~350k real interrupts per second. The following historical context comes from jhb@: Originally, the threshold worked around incorrect routing of PCI INTx interrupts on single-CPU systems which would end up in a hard hang during boot. Since the threshold was added, our PCI interrupt routing was improved, most PCI interrupts use edge-triggered MSI instead of level-triggered INTx, and typical systems have multiple CPUs available to service interrupts. On the off chance that the threshold is useful in the future, it remains available as a tunable and sysctl. Reviewed by: jhb Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D20401	2019-05-24 22:33:14 +00:00
John Baldwin	fb3bc59600	Restructure mbuf send tags to provide stronger guarantees. - Perform ifp mismatch checks (to determine if a send tag is allocated for a different ifp than the one the packet is being output on), in ip_output() and ip6_output(). This avoids sending packets with send tags to ifnet drivers that don't support send tags. Since we are now checking for ifp mismatches before invoking if_output, we can now try to allocate a new tag before invoking if_output sending the original packet on the new tag if allocation succeeds. To avoid code duplication for the fragment and unfragmented cases, add ip_output_send() and ip6_output_send() as wrappers around if_output and nd6_output_ifp, respectively. All of the logic for setting send tags and dealing with send tag-related errors is done in these wrapper functions. For pseudo interfaces that wrap other network interfaces (vlan and lagg), wrapper send tags are now allocated so that ip*_output see the wrapper ifp as the ifp in the send tag. The if_transmit routines rewrite the send tags after performing an ifp mismatch check. If an ifp mismatch is detected, the transmit routines fail with EAGAIN. - To provide clearer life cycle management of send tags, especially in the presence of vlan and lagg wrapper tags, add a reference count to send tags managed via m_snd_tag_ref() and m_snd_tag_rele(). Provide a helper function (m_snd_tag_init()) for use by drivers supporting send tags. m_snd_tag_init() takes care of the if_ref on the ifp meaning that code alloating send tags via if_snd_tag_alloc no longer has to manage that manually. Similarly, m_snd_tag_rele drops the refcount on the ifp after invoking if_snd_tag_free when the last reference to a send tag is dropped. This also closes use after free races if there are pending packets in driver tx rings after the socket is closed (e.g. from tcpdrop). In order for m_free to work reliably, add a new CSUM_SND_TAG flag in csum_flags to indicate 'snd_tag' is set (rather than 'rcvif'). Drivers now also check this flag instead of checking snd_tag against NULL. This avoids false positive matches when a forwarded packet has a non-NULL rcvif that was treated as a send tag. - cxgbe was relying on snd_tag_free being called when the inp was detached so that it could kick the firmware to flush any pending work on the flow. This is because the driver doesn't require ACK messages from the firmware for every request, but instead does a kind of manual interrupt coalescing by only setting a flag to request a completion on a subset of requests. If all of the in-flight requests don't have the flag when the tag is detached from the inp, the flow might never return the credits. The current snd_tag_free command issues a flush command to force the credits to return. However, the credit return is what also frees the mbufs, and since those mbufs now hold references on the tag, this meant that snd_tag_free would never be called. To fix, explicitly drop the mbuf's reference on the snd tag when the mbuf is queued in the firmware work queue. This means that once the inp's reference on the tag goes away and all in-flight mbufs have been queued to the firmware, tag's refcount will drop to zero and snd_tag_free will kick in and send the flush request. Note that we need to avoid doing this in the middle of ethofld_tx(), so the driver grabs a temporary reference on the tag around that loop to defer the free to the end of the function in case it sends the last mbuf to the queue after the inp has dropped its reference on the tag. - mlx5 preallocates send tags and was using the ifp pointer even when the send tag wasn't in use. Explicitly use the ifp from other data structures instead. - Sprinkle some assertions in various places to assert that received packets don't have a send tag, and that other places that overwrite rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer. Reviewed by: gallatin, hselasky, rgrimes, ae Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20117	2019-05-24 22:30:40 +00:00
Alan Somers	65417f5e27	Remove "struct ucred" argument from vtruncbuf vtruncbuf takes a "struct ucred" argument. AFAICT, it's been unused ever since that function was first added in r34611. Remove it. Also, remove some "struct ucred" arguments from fuse and nfs functions that were only used by vtruncbuf. Reviewed by: cem MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D20377	2019-05-24 20:27:50 +00:00
Conrad Meyer	8298529226	EKCD: Add Chacha20 encryption mode Add Chacha20 mode to Encrypted Kernel Crash Dumps. Chacha20 does not require messages to be multiples of block size, so it is valid to use the cipher on non-block-sized messages without the explicit padding AES-CBC would require. Therefore, allow use with simultaneous dump compression. (Continue to disallow use of AES-CBC EKCD with compression.) dumpon(8) gains a -C cipher flag to select between chacha and aes-cbc. It defaults to chacha if no -C option is provided. The man page documents this behavior. Relnotes: sure Sponsored by: Dell EMC Isilon	2019-05-23 20:12:24 +00:00
Konstantin Belousov	56d0e33e7a	Add a kern.ipc.posix_shm_list sysctl. The sysctl provides the listing on named linked posix shared memory segments existing in the system. Reuse shm_fill_kinfo() for filling individual struct kinfo_file. Remove unneeded lock around reading of shmfd->shm_mode. Reviewed by: jilles, tmunro Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D20258	2019-05-23 12:35:40 +00:00
Konstantin Belousov	e4b7754848	Report ref count of the backing object as st_nlink for posix shm fd. Unless there are transient references to the object, the ref count is equal to the number of the shared memory segment mappings plus one. Reviewed by: jilles, tmunro Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D20258	2019-05-23 12:27:45 +00:00
Konstantin Belousov	bc2d137acb	Make pack_kinfo() available for external callers. Reviewed by: jilles, tmunro Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D20258	2019-05-23 12:25:03 +00:00
Conrad Meyer	45d314c556	mqueuefs: Do not allow manipulation of the pseudo-dirents "." and ".." "." and ".." names are not maintained in the mqueuefs dirent datastructure and cannot be opened as mqueues. Creating or removing them is invalid; return EINVAL instead of crashing. PR: 236836 Submitted by: Torbjørn Birch Moltu <t.b.moltu AT lyse.net> Discussed with: jilles (earlier version)	2019-05-21 21:26:14 +00:00
Conrad Meyer	daec92844e	Include ktr.h in more compilation units Similar to r348026, exhaustive search for uses of CTRn() and cross reference ktr.h includes. Where it was obvious that an OS compat header of some kind included ktr.h indirectly, .c files were left alone. Some of these files clearly got ktr.h via header pollution in some scenarios, or tinderbox would not be passing prior to this revision, but go ahead and explicitly include it in files using it anyway. Like r348026, these CUs did not show up in tinderbox as missing the include. Reported by: peterj (arm64/mp_machdep.c) X-MFC-With: r347984 Sponsored by: Dell EMC Isilon	2019-05-21 20:38:48 +00:00
Konstantin Belousov	e28fa55a08	NDFREE(): Fix unlocking for LOCKPARENT\|LOCKLEAF and ndp->ni_dvp == ndp->ni_vp. NDFREE() calculates unlock_dvp after ndp->ni_vp is unlocked and zeroed out. This makes the comparision of ni_dvp with ni_vp always fail. Move the calculation of unlock_dvp right after unlock_vp, so that the code sees correct ni_vp value. Reproduced by chdir("/usr"); open("/..", O_BENEATH \| O_RDONLY); Reported by: syzkaller Reviewed by: markj, mckusick Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D20304	2019-05-21 15:12:13 +00:00
Stephen J. Kiernan	1177d38ce1	The older detection methods (smbios.bios.vendor and smbios.system.product) are able to determine some virtual machines, but the vm_guest variable was still only being set to VM_GUEST_VM. Since we do know what some of them specifically are, we can set vm_guest appropriately. Also, if we see the CPUID has the HV flag, but we were unable to find a definitive vendor in the Hypervisor CPUID Information Leaf, fall back to the older detection methods, as they may be able to determine a specific HV type. Add VM_GUEST_PARALLELS value to VM_GUEST for Parallels. Approved by: cem Differential Revision: https://reviews.freebsd.org/D20305	2019-05-21 13:29:53 +00:00
Mark Johnston	a1fa04c067	kcov depends on eventhandler.h. MFC after: 3 days	2019-05-20 19:14:07 +00:00
Conrad Meyer	e2e050c8ef	Extract eventfilter declarations to sys/_eventfilter.h This allows replacing "sys/eventfilter.h" includes with "sys/_eventfilter.h" in other header files (e.g., sys/{bus,conf,cpu}.h) and reduces header pollution substantially. EVENTHANDLER_DECLARE and EVENTHANDLER_LIST_DECLAREs were moved out of .c files into appropriate headers (e.g., sys/proc.h, powernv/opal.h). As a side effect of reduced header pollution, many .c files and headers no longer contain needed definitions. The remainder of the patch addresses adding appropriate includes to fix those files. LOCK_DEBUG and LOCK_FILE_LINE_ARG are moved to sys/_lock.h, as required by sys/mutex.h since r326106 (but silently protected by header pollution prior to this change). No functional change (intended). Of course, any out of tree modules that relied on header pollution for sys/eventhandler.h, sys/lock.h, or sys/mutex.h inclusion need to be fixed. __FreeBSD_version has been bumped.	2019-05-20 00:38:23 +00:00
Konstantin Belousov	5422b0e63d	Fix rw->ro remount when there is a text vnode mapping. Reported and tested by: hrs Sponsored by: The FreeBSD Foundation MFC after: 16 days	2019-05-19 09:18:09 +00:00
Mark Johnston	fd0be988cb	Update the DIAGNOSTIC-only vmem_check_sanity() after r347949. Cursor tags are special and shouldn't be subject to the existing checks. Reported by: kib, David Wolfskill MFC with: r347949	2019-05-18 14:19:23 +00:00
Mark Johnston	f1c592fb60	Implement the M_NEXTFIT allocation strategy for vmem(9). This is described in the vmem paper: "directs vmem to use the next free segment after the one previously allocated." The implementation adds a new boundary tag type, M_CURSOR, which is linked into the segment list and precedes the segment following the previous M_NEXTFIT allocation. The cursor is used to locate the next free segment satisfying the allocation constraints. This implementation isn't O(1) since busy tags aren't coalesced, and we may potentially scan the entire segment list during an M_NEXTFIT allocation. Reviewed by: alc MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D17226	2019-05-18 01:46:38 +00:00
Konstantin Belousov	f1f81d3b96	Grammar fixes for r347690. Submitted by: alc MFC after: 3 days	2019-05-17 21:18:11 +00:00
Stephen J. Kiernan	949f834a61	Instead of individual conditional statements to look for each hypervisor type, use a table to make it easier to add more in the future, if needed. Add VirtualBox detection to the table ("VBoxVBoxVBox" is the hypervisor vendor string to look for.) Also add VM_GUEST_VBOX to the VM_GUEST enumeration to indicate VirtualBox. Save the CPUID base for the hypervisor entry that we detected. Driver code may need to know about it in order to obtain additional CPUID features. Approved by: bryanv, jhb Differential Revision: https://reviews.freebsd.org/D16305	2019-05-17 17:21:32 +00:00
Konstantin Belousov	4d3b28bcdc	amd64 pmap: rework delayed invalidation, removing global mutex. For machines having cmpxcgh16b instruction, i.e. everything but very early Athlons, provide lockless implementation of delayed invalidation. The implementation maintains lock-less single-linked list with the trick from the T.L. Harris article about volatile mark of the elements being removed. Double-CAS is used to atomically update both link and generation. New thread starting DI appends itself to the end of the queue, setting the generation to the generation of the last element +1. On DI finish, thread donates its generation to the previous element. The generation of the fake head of the list is the last passed DI generation. Basically, the implementation is a queued spinlock but without spinlock. Many thanks both to Peter Holm and Mark Johnson for keeping with me while I produced intermediate versions of the patch. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 month MFC note: td_md.md_invl_gen should go to the end of struct thread Differential revision: https://reviews.freebsd.org/D19630	2019-05-16 13:28:48 +00:00
Konstantin Belousov	a9fd669b4a	subr_turnstile: Extract some common code to a helper. Code walks the list of contested turnstiles to calculate the priority to unlend. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2019-05-16 13:17:57 +00:00
Konstantin Belousov	0ddfdc60f8	imgact_elf.c: Add comment explaining the malloc/VOP_UNLOCK() dance from r347148. Requested by: alc Sponsored by: The FreeBSD Foundation MFC after: 3 days	2019-05-16 13:03:54 +00:00
Andrey V. Elsukov	2317067c31	Remove bpf interface lock, it is no longer exist.	2019-05-14 10:21:28 +00:00
Conrad Meyer	e199792d23	Revert r346292 (permit_nonrandom_stackcookies) We have a better, more comprehensive knob for this now: kern.random.initial_seeding.bypass_before_seeding=1. Requested by: delphij Sponsored by: Dell EMC Isilon	2019-05-13 23:37:44 +00:00
Alan Somers	7648bc9fee	MFHead @347527 Sponsored by: The FreeBSD Foundation	2019-05-13 18:25:55 +00:00
Mateusz Guzik	8ba6c1391b	cache: fix a brainfart in r347505 If bumping over the counter goes over the limit we have to decrement it back. Previous code would only bump the counter after adding the entry (thus allowing the cache to go over the limit). Sponsored by: The FreeBSD Foundation	2019-05-12 07:56:01 +00:00
Mateusz Guzik	5bf50787e6	cache: bump numcache on entry, while here fix lnumcache type Sponsored by: The FreeBSD Foundation	2019-05-12 06:59:22 +00:00

... 3 4 5 6 7 ...

17044 Commits