freebsd-skq

Author	SHA1	Message	Date
Kyle Evans	af755d3e48	[1/3] Add mostly Linux-compatible file sealing support File sealing applies protections against certain actions (currently: write, growth, shrink) at the inode level. New fileops are added to accommodate seals - EINVAL is returned by fcntl(2) if they are not implemented. Reviewed by: markj, kib Differential Revision: https://reviews.freebsd.org/D21391	2019-09-25 17:32:43 +00:00
Kyle Evans	85c5f3cb57	Add COMPAT12 support to makesyscalls.sh Reviewed by: kib, imp, brooks (all without syscalls.master edits) Differential Revision: https://reviews.freebsd.org/D21366	2019-09-25 17:29:45 +00:00
Toomas Soome	3001e0c942	kernel: terminal_init() should check for teken colors from kenv Check for teken.fg_color and teken.bg_color and prepare the color attributes accordingly. When white background is used, make it light to improve visibility. When black background is used, make kernel messages light.	2019-09-25 13:21:07 +00:00
Alexander Motin	bb3dfc6ae9	Fix wrong assertion in r352658. MFC after: 1 month	2019-09-25 11:58:54 +00:00
Alexander Motin	c9205e3500	Fix/improve interrupt threads scheduling. Doing some tests with very high interrupt rates I've noticed that one of conditions I added in r232207 to make interrupt threads in most cases run on local CPU never worked as expected (worked only if previous time it was executed on some other CPU, that is quite opposite). It caused additional CPU usage to run full CPU search and could schedule interrupt threads to some other CPU. This patch removes that code and instead reuses existing non-interrupt code path with some tweaks for interrupt case: - On SMT systems, if current thread is idle, don't look on other threads. Even if they are busy, it may take more time to do fill search and bounce the interrupt thread to other core then execute it locally, even sharing CPU resources. It is other threads should migrate, not bound interrupts. - Try hard to keep interrupt threads within LLC of their original CPU. This improves scheduling cost and supposedly cache and memory locality. On a test system with 72 threads doing 2.2M IOPS to NVMe this saves few percents of CPU time while adding few percents to IOPS. MFC after: 1 month Sponsored by: iXsystems, Inc.	2019-09-24 20:01:20 +00:00
Randall Stewart	35c7bb3407	This commit adds BBR (Bottleneck Bandwidth and RTT) congestion control. This is a completely separate TCP stack (tcp_bbr.ko) that will be built only if you add the make options WITH_EXTRA_TCP_STACKS=1 and also include the option TCPHPTS. You can also include the RATELIMIT option if you have a NIC interface that supports hardware pacing, BBR understands how to use such a feature. Note that this commit also adds in a general purpose time-filter which allows you to have a min-filter or max-filter. A filter allows you to have a low (or high) value for some period of time and degrade slowly to another value has time passes. You can find out the details of BBR by looking at the original paper at: https://queue.acm.org/detail.cfm?id=3022184 or consult many other web resources you can find on the web referenced by "BBR congestion control". It should be noted that BBRv1 (which this is) does tend to unfairness in cases of small buffered paths, and it will usually get less bandwidth in the case of large BDP paths(when competing with new-reno or cubic flows). BBR is still an active research area and we do plan on implementing V2 of BBR to see if it is an improvement over V1. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D21582	2019-09-24 18:18:11 +00:00
Mateusz Guzik	93a85508ad	cache: tidy up handling of negative entries - track the total count of hot entries - pre-read the lock when shrinking since it is typically already taken - place the lock in its own cacheline - shorten the hold time of hot lock list when zapping Sponsored by: The FreeBSD Foundation	2019-09-23 20:50:04 +00:00
Mark Johnston	38dae42c26	Use elf_relocaddr() when handling R_X86_64_RELATIVE relocations. This is required for DPCPU and VNET data variable definitions to work when KLDs are linked as DSOs. R_X86_64_RELATIVE relocations should not appear in object files, so assert this in elf_relocaddr(). Reviewed by: kib MFC after: 1 month Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21755	2019-09-23 14:14:43 +00:00
Mateusz Guzik	afe257e3ca	cache: count evictions of negatve entries Sponsored by: The FreeBSD Foundation	2019-09-23 08:53:14 +00:00
Sean Eric Fagan	ba7a55d934	Add two options to allow mount to avoid covering up existing mount points. The two options are * nocover/cover: Prevent/allow mounting over an existing root mountpoint. E.g., "mount -t ufs -o nocover /dev/sd1a /usr/local" will fail if /usr/local is already a mountpoint. * emptydir/noemptydir: Prevent/allow mounting on a non-empty directory. E.g., "mount -t ufs -o emptydir /dev/sd1a /usr" will fail. Neither of these options is intended to be a default, for historical and compatibility reasons. Reviewed by: allanjude, kib Differential Revision: https://reviews.freebsd.org/D21458	2019-09-23 04:28:07 +00:00
Mateusz Guzik	7505cffa56	cache: try to avoid vhold if locks held Sponsored by: The FreeBSD Foundation	2019-09-22 20:50:24 +00:00
Mateusz Guzik	cd2112c305	cache: jump in negative success instead of positive Sponsored by: The FreeBSD Foundation	2019-09-22 20:49:17 +00:00
Mateusz Guzik	d2be3ef05c	lockprof: move per-cpu data to dpcpu Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21747	2019-09-22 20:44:24 +00:00
Konstantin Belousov	f33533da8c	kern.elf{32,64}.pie_base sysctl: enforce page alignment. Requested by: rstone Sponsored by: The FreeBSD Foundation MFC after: 1 week	2019-09-21 20:03:17 +00:00
Mateusz Guzik	cbba2cb367	lockprof: use CPUFOREACH and drop always false lp_cpu NULL checks Sponsored by: The FreeBSD Foundation	2019-09-21 19:05:38 +00:00
Konstantin Belousov	95aafd6900	Make non-ASLR pie base tunable. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2019-09-21 18:00:23 +00:00
Alexander Motin	36d151a237	Allocate callout wheel from the respective memory domain. MFC after: 1 week	2019-09-21 15:38:08 +00:00
Andrew Gallatin	61b8a4af71	remove redundant "ktls" in KTLS thr name This reducesthe string width of the ktls thread name and improves "ps" output. Glanced at by: jhb Event: EuroBSDCon hackathon Sponsored by: Netflix	2019-09-20 09:36:07 +00:00
Mateusz Guzik	b488246b45	vfs: group fields used for per-cpu ops in one cacheline Sponsored by: The FreeBSD Foundation	2019-09-19 21:23:14 +00:00
Konstantin Belousov	382e01c8dc	sysctl: use names instead of magic numbers. Replace magic numbers with symbols for internal sysctl operations. Convert in-kernel and libc consumers. Submitted by: Pawel Biernacki MFC after: 1 week Differential revision: https://reviews.freebsd.org/D21693	2019-09-18 16:13:10 +00:00
Konstantin Belousov	55894117b1	Return EISDIR when directory is opened with O_CREAT without O_DIRECTORY. Reviewed by: bcr (man page), emaste (previous version) PR: 240452 Sponsored by: The FreeBSD Foundation MFC after: 1 week DIfferential revision: https://reviews.freebsd.org/D21634	2019-09-17 18:32:18 +00:00
Kirk McKusick	100369071d	The VFS-level clustering code collects together sequential blocks by issuing delayed-writes (bdwrite()) until a non-sequential block is written or the maximum cluster size is reached. At that point it collects the delayed buffers together (using bread()) to write them in a single operation. The assumption was that since we just looked at them they will still be in memory so there is no need to check for a read error from bread(). Very occationally (apparently every 10-hours or so when being pounded by Peter Holm's tests) this assumption is wrong. The fix is to check for errors from bread() and fail the cluster write thus falling back to the default individual flushing of any still dirty buffers. Reported by: Peter Holm and Chuck Silvers Reviewed by: kib MFC after: 3 days	2019-09-17 17:44:50 +00:00
Mateusz Guzik	d245aa1e72	vfs: apply r352437 to the fast path as well This one is very hard to run into. If the filesystem is being unmounted or the mount point is freed the vfs_op_thread_enter will fail. For it to succeed the mount point itself would have to be reallocated in the time window between the initial read and the attempt to enter. Sponsored by: The FreeBSD Foundation	2019-09-17 15:53:40 +00:00
Mateusz Guzik	7f65185940	vfs: fix braino resulting in NULL pointer deref in r352424 The breakage was added after all the testing and the testing which followed was not sufficient to find it. Reported by: pho Sponsored by: The FreeBSD Foundation	2019-09-17 08:09:39 +00:00
Mateusz Guzik	4cace859c2	vfs: convert struct mount counters to per-cpu There are 3 counters modified all the time in this structure - one for keeping the structure alive, one for preventing unmount and one for tracking active writers. Exact values of these counters are very rarely needed, which makes them a prime candidate for conversion to a per-cpu scheme, resulting in much better performance. Sample benchmark performing fstatfs (modifying 2 out of 3 counters) on a 104-way 2 socket Skylake system: before: 852393 ops/s after: 76682077 ops/s Reviewed by: kib, jeff Tested by: pho Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21637	2019-09-16 21:37:47 +00:00
Mateusz Guzik	e87f3f72f1	vfs: manage mnt_writeopcount with atomics See r352424. Reviewed by: kib, jeff Tested by: pho Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21575	2019-09-16 21:33:16 +00:00
Mateusz Guzik	ee831b2543	vfs: manage mnt_lockref with atomics See r352424. Reviewed by: kib, jeff Tested by: pho Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21574	2019-09-16 21:32:21 +00:00
Mateusz Guzik	a8c8e44bf0	vfs: manage mnt_ref with atomics New primitive is introduced to denote sections can operate locklessly on aspects of struct mount, but which can also be disabled if necessary. This provides an opportunity to start scaling common case modifications while providing stable state of the struct when facing unmount, write suspendion or other events. mnt_ref is the first counter to start being managed in this manner with the intent to make it per-cpu. Reviewed by: kib, jeff Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21425	2019-09-16 21:31:02 +00:00
Kyle Evans	3155f2f0e2	rangelock: add rangelock_cookie_assert A future change to posixshm to add file sealing (in DIFF_21391[0] and child) will move locking out of shm_dotruncate as kern_shm_open() will require the lock to be held across the dotruncate until the seal is actually applied. For this, the cookie is passed into shm_dotruncate_locked which asserts RCA_WLOCKED. [0] Name changed to protect the innocent, hopefully, from getting autoclosed due to this reference... Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D21628	2019-09-15 02:59:53 +00:00
Mateusz Guzik	ce3ba63f67	vfs: release usecount using fetchadd 1. If we release the last usecount we take ownership of the hold count, which means the vnode will remain allocated until we vdrop it. 2. If someone else vrefs they will find no usecount and will proceed to add their own hold count. 3. No code has a problem with v_usecount transitioning to 0 without the interlock These facts combined mean we can fetchadd instead of having a cmpset loop. Reviewed by: kib (previous version) Tested by: pho Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21528	2019-09-13 15:49:04 +00:00
Mark Johnston	45cdd437ae	Remove a redundant NULL pointer check in cpuset_modify_domain(). cpuset_getroot() is guaranteed to return a non-NULL pointer. Reported by: Mark Millard <marklmi@yahoo.com> MFC after: 1 week Sponsored by: The FreeBSD Foundation	2019-09-12 16:47:38 +00:00
Hans Petter Selasky	11b57401e6	Use REFCOUNT_COUNT() to obtain refcount where appropriate. Refcount waiting will set some flag bits in the refcount value. Make sure these bits get cleared by using the REFCOUNT_COUNT() macro to obtain the actual refcount. Differential Revision: https://reviews.freebsd.org/D21620 Reviewed by: kib@, markj@ MFC after: 1 week Sponsored by: Mellanox Technologies	2019-09-12 16:26:59 +00:00
Kyle Evans	5163b1a75c	Follow up r352244: kenv: tighten up assertions As I like to forget: static kenv var formatting is actually such that an empty environment would be double null bytes. We should make sure that a non-zero buffer has at least enough for this, though most of the current usage is with a 4k buffer.	2019-09-12 14:34:46 +00:00
Kyle Evans	436c46875d	kenv: assert that an empty static buffer passed in is "empty" Garbage in the passed-in buffer can cause problems if any attempts to read the kenv are inadvertently made between init_static_kenv and the first kern_setenv -- assuming there is one. This is cheap and easy, so do it. This also helps rule out some class of bugs as one tries to debug; tunables fetch from the static environment up until SI_SUB_KMEM + 1, and many of these buffers are global ~4k buffers that rely on BSS clearing while others just grab a page of free memory and use it (e.g. xen).	2019-09-12 13:51:43 +00:00
Conrad Meyer	aaa3852435	buf: Add B_INVALONERR flag to discard data Setting the B_INVALONERR flag before a synchronous write causes the buf cache to forcibly invalidate contents if the write fails (BIO_ERROR). This is intended to be used to allow layers above the buffer cache to make more informed decisions about when discarding dirty buffers without successful write is acceptable. As a proof of concept, use in msdosfs to handle failures to mark the on-disk 'dirty' bit during rw mount or ro->rw update. Extending this to other filesystems is left as future work. PR: 210316 Reviewed by: kib (with objections) Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D21539	2019-09-11 21:24:14 +00:00
Mateusz Guzik	b088a4d6f9	cache: avoid excessive relocking on entry removal during lookup Due to lock ordering issues (bucket lock held, vnode locks wanted) the code starts with trylocking which in face of contention often fails. Prior to the change it would loop back with a possible yield. Instead note we know what locks are needed and can take them in the right order, avoiding retries. Then we can safely re-lookup and see if the entry we are looking for is still there. On a 104-way box poudriere would result in constant retries during an 11h run as seen in the vfs.cache.zap_and_exit_bucket_fail counter. before: 408866592 after : 0 However, a new stat reports: vfs.cache.zap_and_exit_bucket_relock_success: 32638 Note this is only a bandaid over current design issues. Tested by: pho Sponsored by: The FreeBSD Foundation	2019-09-10 20:19:29 +00:00
Mateusz Guzik	a6cacb0dca	cache: change the formula for calculating lock array sizes It used to be mp_ncpus * 64, but this gives unnecessarily big values for small machines and at the same time constraints bigger ones. In particular this helps on a 104-way box for which the count is now doubled. While here make cache_purgevfs less likely. Currently it is not efficient in face of contention due to lock ordering issues. These are fixable but not worth it at the moment. Sponsored by: The FreeBSD Foundation	2019-09-10 20:11:00 +00:00
Mateusz Guzik	1214618c05	cache: assorted cleanups Sponsored by: The FreeBSD Foundation	2019-09-10 20:08:24 +00:00
Jeff Roberson	c75757481f	Replace redundant code with a few new vm_page_grab facilities: - VM_ALLOC_NOCREAT will grab without creating a page. - vm_page_grab_valid() will grab and page in if necessary. - vm_page_busy_acquire() automates some busy acquire loops. Discussed with: alc, kib, markj Tested by: pho (part of larger branch) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21546	2019-09-10 19:08:01 +00:00
Jeff Roberson	4cdea4a853	Use the sleepq lock rather than the page lock to protect against wakeup races with page busy state. The object lock is still used as an interlock to ensure that the identity stays valid. Most callers should use vm_page_sleep_if_busy() to handle the locking particulars. Reviewed by: alc, kib, markj Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21255	2019-09-10 18:27:45 +00:00
Mark Johnston	fee2a2fa39	Change synchonization rules for vm_page reference counting. There are several mechanisms by which a vm_page reference is held, preventing the page from being freed back to the page allocator. In particular, holding the page's object lock is sufficient to prevent the page from being freed; holding the busy lock or a wiring is sufficent as well. These references are protected by the page lock, which must therefore be acquired for many per-page operations. This results in false sharing since the page locks are external to the vm_page structures themselves and each lock protects multiple structures. Transition to using an atomically updated per-page reference counter. The object's reference is counted using a flag bit in the counter. A second flag bit is used to atomically block new references via pmap_extract_and_hold() while removing managed mappings of a page. Thus, the reference count of a page is guaranteed not to increase if the page is unbusied, unmapped, and the object's write lock is held. As a consequence of this, the page lock no longer protects a page's identity; operations which move pages between objects are now synchronized solely by the objects' locks. The vm_page_wire() and vm_page_unwire() KPIs are changed. The former requires that either the object lock or the busy lock is held. The latter no longer has a return value and may free the page if it releases the last reference to that page. vm_page_unwire_noq() behaves the same as before; the caller is responsible for checking its return value and freeing or enqueuing the page as appropriate. vm_page_wire_mapped() is introduced for use in pmap_extract_and_hold(). It fails if the page is concurrently being unmapped, typically triggering a fallback to the fault handler. vm_page_wire() no longer requires the page lock and vm_page_unwire() now internally acquires the page lock when releasing the last wiring of a page (since the page lock still protects a page's queue state). In particular, synchronization details are no longer leaked into the caller. The change excises the page lock from several frequently executed code paths. In particular, vm_object_terminate() no longer bounces between page locks as it releases an object's pages, and direct I/O and sendfile(SF_NOCACHE) completions no longer require the page lock. In these latter cases we now get linear scalability in the common scenario where different threads are operating on different files. __FreeBSD_version is bumped. The DRM ports have been updated to accomodate the KPI changes. Reviewed by: jeff (earlier version) Tested by: gallatin (earlier version), pho Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20486	2019-09-09 21:32:42 +00:00
Konstantin Belousov	6c46ce7ea3	Initialize timehands linkage much earlier. Reported and tested by: trasz Sponsored by: The FreeBSD Foundation MFC after: 1 week	2019-09-09 12:42:48 +00:00
Konstantin Belousov	4b23dec4c2	Make timehands count selectable at boottime. Tested by: O'Connor, Daniel <darius@dons.net.au> Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D21563	2019-09-09 11:29:58 +00:00
Konstantin Belousov	1040254b75	In do_execve(), use shared text vnode lock consistently. Reviewed by: markj MFC after: 1 week Differential revision: https://reviews.freebsd.org/D21560	2019-09-07 16:10:57 +00:00
Konstantin Belousov	1c36b72874	In do_execve(), clear imgp->textset when restarting for interpreter. Otherwise, we might left the boolean set, which would affect cleanup after an error on interpreter activation. Reviewed by: markj MFC after: 1 week Differential revision: https://reviews.freebsd.org/D21560	2019-09-07 16:05:17 +00:00
Konstantin Belousov	1073d17eeb	When loading ELF interpreter, initialize whole nested image_params with zero. Otherwise we could mishandle imgp->textset. Reviewed by: markj MFC after: 1 week Differential revision: https://reviews.freebsd.org/D21560	2019-09-07 16:03:26 +00:00
Philip Paeps	bdc786cc7c	riscv: restore default HZ=1000, keep QEMU at HZ=100 This reverts r351918 and r351919. Discussed with: br, ian, imp	2019-09-07 05:13:31 +00:00
Philip Paeps	7f0851ab19	riscv: default to HZ=100 Most current RISC-V development platforms are not fast enough to benefit from the increased granularity provided by HZ=1000. Sponsored by: Axiado	2019-09-06 01:19:31 +00:00
Conrad Meyer	a6935d085c	Remove long-dead BUF_ASSERT_{,UN}HELD assertions These were fully neutered in r177676 (2008), but not removed at the time for unclear reasons. They're totally dead code, so go ahead and yank them now. No functional change.	2019-09-05 21:43:33 +00:00
Mateusz Guzik	68c3c1abe1	vfs: temporarily revert r351825 There are 2 problems: - it introduces a funny bug where it can end up trylocking the same vnode [1] - it exposes a pre-existing softdep deadlock [2] Both are easier to run into that the bug which got fixed, so revert until a complete solution is worked out. Reported by: cy [1], pho [2] Sponsored by: The FreeBSD Foundation	2019-09-05 18:19:51 +00:00

1 2 3 4 5 ...

16843 Commits