freebsd-skq

Author	SHA1	Message	Date
vangyzen	580abaccd7	Add CTLFLAG_STATS to some vfs sysctl OIDs Add CTLFLAG_STATS to the following OIDs: vfs.altbufferflushes vfs.recursiveflushes vfs.barrierwrites vfs.flushwithdeps vfs.reassignbufcalls Refer to r353111. MFC after: 2 weeks Sponsored by: Dell EMC Isilon	2019-10-04 21:43:43 +00:00
emaste	3913895c3d	simplify path handling in sysctl_try_reclaim_vnode MAXPATHLEN / PATH_MAX includes space for the terminating NUL, and namei verifies the presence of the NUL. Thus there is no need to increase the buffer size here. The sysctl passes the string excluding the NUL, so req->newlen equal to PATH_MAX is too long. Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21876	2019-10-02 21:01:23 +00:00
markj	2b0a2a713e	Use OBJT_PHYS VM objects for kernel modules. OBJT_DEFAULT incurs some unnecessary overhead given that kernel module pages cannot be paged out. Reviewed by: alc, kib MFC after: 1 week Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21862	2019-10-02 16:34:42 +00:00
markj	20c5c1bf19	Disallow fcntl(F_READAHEAD) when the vnode is not a regular file. The mountpoint may not have defined an iosize parameter, so an attempt to configure readahead on a device file can lead to a divide-by-zero crash. The sequential heuristic is not applied to I/O to or from device files, and posix_fadvise(2) returns an error when v_type != VREG, so perform the same check here. Reported by: syzbot+e4b682208761aa5bc53a@syzkaller.appspotmail.com Reviewed by: kib MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21864	2019-10-02 15:45:49 +00:00
kevans	e3f43eab06	shm_open2(2): completely unbreak kern_shm_open2(), since conception, completely fails to pass the mode along to kern_shm_open(). This breaks most uses of it. Add tests alongside this that actually check the mode of the returned files. PR: 240934 [pulseaudio breakage] Reported by: ler, Andrew Gierth [postgres breakage] Diagnosed by: Andrew Gierth (great catch) Tested by: ler, tmunro Pointy hat to: kevans	2019-10-02 02:37:34 +00:00
emaste	a079beb4de	syscalls.master: remove superfluous ellipsis in comment A single period is sufficient in this comment, and making this change lets us find references to varargs syscalls by searching for ...	2019-10-01 17:05:21 +00:00
brooks	1f0b5f9a65	Restore the ability to set capenabled directly in syscalls.conf. This fixes generation of cloudabi syscall tables broken in r340424. Reviewed by: kevans, emaste MFC after: 3 days Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D21821	2019-09-30 20:58:29 +00:00
kevans	0c6303d3e7	syscalls.master: consistency, move ); to newline (no functional change)	2019-09-30 13:26:16 +00:00
markj	d46f64003d	Fix some problems with the SPARSE_MAPPING option in the kernel linker. - Ensure that the end of the mapping passed to vm_page_wire() is page-aligned. vm_page_wire() expects this. - Wire pages before reading data into them. - Apply protections specified in the segment descriptor using vm_map_protect() once relocation processing is done. - On amd64, ensure that we load KLDs above KERNBASE, since they are compiled with the "kernel" memory model by default. Reviewed by: kib MFC after: 2 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21756	2019-09-28 01:42:59 +00:00
gallatin	db240be427	kTLS: Fix a bug where we would not encrypt anon data inplace. Software Kernel TLS needs to allocate a new destination crypto buffer when encrypting data from the page cache, so as to avoid overwriting shared clear-text file data with encrypted data specific to a single socket. When the data is anonymous, e.g., not tied to a file, then we can encrypt in place and avoid allocating a new page. This fixes a bug where the existing code always assumes the data is private, and never encrypts in place. This results in unneeded page allocations and potentially more memory bandwidth consumption when doing socket writes. When the code was written at Netflix, ktls_encrypt() looked at private sendfile flags to determine if the pages being encrypted were part of the page cache (coming from sendfile) or anonymous (coming from sosend). This was broken internally at Netflix when the sendfile flags were made private, and the M_WRITABLE() check was added. Unfortunately, M_WRITABLE() will always be false for M_NOMAP mbufs, since one cannot just mtod() them. This change introduces a new flags field to the mbuf_ext_pgs struct by stealing a byte from the tls hdr. Note that the current header is still 2 bytes larger than the largest header we support: AES-CBC with explicit IV. We set MBUF_PEXT_FLAG_ANON when creating an unmapped mbuf in m_uiotombuf_nomap() (which is the path that socket writes take), and we check for that flag in ktls_encrypt() when looking for anon pages. Reviewed by: jhb Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21796	2019-09-27 20:08:19 +00:00
gallatin	74ca068530	kTLS support for TLS 1.3 TLS 1.3 requires a few changes because 1.3 pretends to be 1.2 with a record type of application data. The "real" record type is then included at the end of the user-supplied plaintext data. This required adding a field to the mbuf_ext_pgs struct to save the record type, and passing the real record type to the sw_encrypt() ktls backend functions. Reviewed by: jhb, hselasky Sponsored by: Netflix Differential Revision: D21801	2019-09-27 19:17:40 +00:00
mjg	99543c9107	cache: decrease ncnegfactor to 5 The current mechanism is bogus in several ways: - the limit is a percentage of total entries added, which means negative entries get evicted all the time even if there are plenty of resources - evicting code is almost not concurrent, which makes it unable to remove entries fast enough when doing something as simple as -j 104 buildworld - there is no support for performing mass removal if necessary Vast majority of negative entries never get any hits. Only evicting them when the filesystem demands it results in a significant growth of the namecache with almost no improvement in the hit ratio. Sample result after about 90 minutes of poudriere -j 104 (current / no evict / % of the original): numneg 219737 / 2013157 / 916; numneghits 266711906 / 263544562 / 98 [1] [1] this may look funny but there is a certain dose of variation to the build The number was chosen as something which mostly eliminates spurious evictions during lighter workloads but still keeps the total at bay. Sponsored by: The FreeBSD Foundation	2019-09-27 19:14:03 +00:00
mjg	306794b4a0	cache: stop requeuing negative entries on the hot list Turns out it does not improve hit ratio, but it does come with a cost stemming from dirtying hit entries. Sample result: hit counts of evicted entries after 2 buildworlds before: value ------------- Distribution ------------- count -1 \| 0 0 \|@@@@@@@@@@@@@@@@@@@@@@@@@ 180865 1 \|@@@@@@@ 49150 2 \|@@@ 19067 4 \|@ 9825 8 \|@ 7340 16 \|@ 5952 32 \|@ 5243 64 \|@ 4446 128 \| 3556 256 \| 3035 512 \| 1705 1024 \| 1078 2048 \| 365 4096 \| 95 8192 \| 34 16384 \| 26 32768 \| 23 65536 \| 8 131072 \| 6 262144 \| 0 after: value ------------- Distribution ------------- count -1 \| 0 0 \|@@@@@@@@@@@@@@@@@@@@@@@@@ 184004 1 \|@@@@@@ 47577 2 \|@@@ 19446 4 \|@ 10093 8 \|@ 7470 16 \|@ 5544 32 \|@ 5475 64 \|@ 5011 128 \| 3451 256 \| 3002 512 \| 1729 1024 \| 1086 2048 \| 363 4096 \| 86 8192 \| 26 16384 \| 25 32768 \| 24 65536 \| 7 131072 \| 5 262144 \| 0 Sponsored by: The FreeBSD Foundation	2019-09-27 19:13:22 +00:00
mjg	a67ed57908	cache: make negative list shrinking a little bit concurrent Continue protecting demotion from the hotlist and selection of the target list with the ncneg_shrink_lock lock, but drop it before relocking to zap the node. While here count how many times we skipped shrinking due to the lock being already taken. Sponsored by: The FreeBSD Foundation	2019-09-27 19:12:43 +00:00
mjg	0c1eba6ea2	cache: stop recalculating upper limit each time a new entry is added Sponsored by: The FreeBSD Foundation	2019-09-27 19:12:20 +00:00
kib	957270782d	Improve MD page fault handlers. Centralize calculation of signal and ucode delivered on unhandled page fault in new function vm_fault_trap(). MD trap_pfault() now almost always uses the signal numbers and error codes calculated in consistent MI way. This introduces the protection fault compatibility sysctls to all non-x86 architectures which did not have that bug, but apparently they were already much more wrong in selecting delivered signals on protection violations. Change the delivered signal for accesses to mapped area after the backing object was truncated. According to POSIX description for mmap(2): The system shall always zero-fill any partial page at the end of an object. Further, the system shall never write out any modified portions of the last page of an object which are beyond its end. References within the address range starting at pa and continuing for len bytes to whole pages following the end of an object shall result in delivery of a SIGBUS signal. An implementation may generate SIGBUS signals when a reference would cause an error in the mapped object, such as out-of-space condition. Adjust according to the description, keeping the existing compatibility code for SIGSEGV/SIGBUS on protection failures. For situations where kernel cannot handle page fault due to resource limit enforcement, SIGBUS with a new error code BUS_OBJERR is delivered. Also, provide a new error code SEGV_PKUERR for SIGSEGV on amd64 due to protection key access violation. vm_fault_hold() is renamed to vm_fault(). Fixed some nits in trap_pfault()s like mis-interpreting Mach errors as errnos. Removed unneeded truncations of the fault addresses reported by hardware. PR: 211924 Reviewed by: alc Discussed with: jilles, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D21566	2019-09-27 18:43:36 +00:00
andrew	fc91d16ffa	Check the vfs option length is valid before accessing through When a VFS option passed to nmount is present but NULL the kernel will place an empty option in its internal list. This will have a NULL pointer and a length of 0. When we come to read one of these the kernel will try to load from the last address of virtual memory. This is normally invalid so will fault resulting in a kernel panic. Fix this by checking if the length is valid before dereferencing. MFC after: 3 days Sponsored by: DARPA, AFRL	2019-09-27 16:22:28 +00:00
dab	c7fb4709b2	sysent: regenerate after r352747. Sponsored by: Dell EMC Isilon	2019-09-26 15:41:10 +00:00
markj	e97c4bdc7f	Fix handling of invalid pages in exec_map_first_page(). exec_map_first_page() would unconditionally free an unbacked, invalid page from the executable image. However, it is possible that the page is wired, in which case it is incorrect to free the page, so check for additional wirings first. Reported by: syzkaller Tested by: pho Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21767	2019-09-26 15:35:35 +00:00
dab	edad331b44	Add an shm_rename syscall Add an atomic shm rename operation, similar in spirit to a file rename. Atomically unlink an shm from a source path and link it to a destination path. If an existing shm is linked at the destination path, unlink it as part of the same atomic operation. The caller needs the same permissions as shm_unlink to the shm being renamed, and the same permissions for the shm at the destination which is being unlinked, if it exists. If those fail, EACCES is returned, as with the other shm_* syscalls. truss support is included; audit support will come later. This commit includes only the implementation; the sysent-generated bits will come in a follow-on commit. Submitted by: Matthew Bryan <matthew.bryan@isilon.com> Reviewed by: jilles (earlier revision) Reviewed by: brueffer (manpages, earlier revision) Relnotes: yes Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D21423	2019-09-26 15:32:28 +00:00
tsoome	64ad58b605	kernel terminal should initialize fg and bg variables before calling TUNABLE_INT_FETCH We have two ways to check if kenv variable exists - either we check return value from TUNABLE_INT_FETCH, or we pre-initialize the variable and check if this value did change. In terminal_init() it is more convenient to use pre-initialized variables. Problem was revealed by older loader.efi, which did not set teken.* variables. Reported by: tuexen	2019-09-26 07:19:26 +00:00
mav	19cf804564	Microoptimize sched_pickcpu() CPU affinity on SMT. Use of CPU_FFS() to implement CPUSET_FOREACH() allows to save up to ~0.5% of CPU time on 72-thread SMT system doing 80K IOPS to NVMe from one thread. MFC after: 1 month Sponsored by: iXsystems, Inc.	2019-09-26 00:35:06 +00:00
mav	bd176423ea	Microoptimize sched_pickcpu() after r352658. I've noticed that I missed intr check at one more SCHED_AFFINITY(), so instead of adding one more branching I prefer to remove few. Profiler shows the function CPU time reduction from 0.24% to 0.16%. MFC after: 1 month Sponsored by: iXsystems, Inc.	2019-09-25 19:29:09 +00:00
kevans	245d8426fc	rfork(2): add RFSPAWN flag When RFSPAWN is passed, rfork exhibits vfork(2) semantics but also resets signal handlers in the child during creation to avoid a point of corruption of parent state from the child. This flag will be used by posix_spawn(3) to handle potential signal issues. Reviewed by: jilles, kib Differential Revision: https://reviews.freebsd.org/D19058	2019-09-25 19:20:41 +00:00
glebius	e581476e1b	Add debugging facility EPOCH_TRACE that checks that epochs entered are properly nested and warns about recursive entrances. Unlike with locks, there is nothing fundamentally wrong with such use, the intent of tracer is to help to review complex epoch-protected code paths, and we mean the network stack here. Reviewed by: hselasky Sponsored by: Netflix Pull Request: https://reviews.freebsd.org/D21610	2019-09-25 18:26:31 +00:00
kevans	1d9983b221	sysent: regenerate after r352705 This also implements it, fixes kdump, and removes no longer needed bits from lib/libc/sys/shm_open.c for the interim.	2019-09-25 18:09:19 +00:00
kevans	df8ec3c155	Mark shm_open(2) as COMPAT12, succeeded by shm_open2 Implementation and regenerated files will follow.	2019-09-25 18:06:48 +00:00
kevans	dd20cd52c2	sysent: regenerate after r352700	2019-09-25 17:59:58 +00:00
kevans	61785cc3d4	Add a shm_open2 syscall to support upcoming memfd_create shm_open2 allows a little more flexibility than the original shm_open. shm_open2 doesn't enforce CLOEXEC on its callers, and it has a separate shmflag argument that can be expanded later. Currently the only shmflag is to allow file sealing on the returned fd. shm_open and memfd_create will both be implemented in libc to use this new syscall. __FreeBSD_version is bumped to indicate the presence. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D21393	2019-09-25 17:59:15 +00:00
kevans	48e2e866d5	[2/3] Add an initial seal argument to kern_shm_open() Now that flags may be set on posixshm, add an argument to kern_shm_open() for the initial seals. To maintain past behavior where callers of shm_open(2) are guaranteed to not have any seals applied to the fd they're given, apply F_SEAL_SEAL for existing callers of kern_shm_open. A special flag could be opened later for shm_open(2) to indicate that sealing should be allowed. We currently restrict initial seals to F_SEAL_SEAL. We cannot error out if F_SEAL_SEAL is re-applied, as this would easily break shm_open() twice to a shmfd that already existed. A note's been added about the assumptions we've made here as a hint towards anyone wanting to allow other seals to be applied at creation. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D21392	2019-09-25 17:35:03 +00:00
kevans	13d4dfe478	[1/3] Add mostly Linux-compatible file sealing support File sealing applies protections against certain actions (currently: write, growth, shrink) at the inode level. New fileops are added to accommodate seals - EINVAL is returned by fcntl(2) if they are not implemented. Reviewed by: markj, kib Differential Revision: https://reviews.freebsd.org/D21391	2019-09-25 17:32:43 +00:00
kevans	12110b8085	Add COMPAT12 support to makesyscalls.sh Reviewed by: kib, imp, brooks (all without syscalls.master edits) Differential Revision: https://reviews.freebsd.org/D21366	2019-09-25 17:29:45 +00:00
tsoome	ff5d84bffd	kernel: terminal_init() should check for teken colors from kenv Check for teken.fg_color and teken.bg_color and prepare the color attributes accordingly. When white background is used, make it light to improve visibility. When black background is used, make kernel messages light.	2019-09-25 13:21:07 +00:00
mav	967c071884	Fix wrong assertion in r352658. MFC after: 1 month	2019-09-25 11:58:54 +00:00
mav	1520ff79f0	Fix/improve interrupt threads scheduling. Doing some tests with very high interrupt rates I've noticed that one of conditions I added in r232207 to make interrupt threads in most cases run on local CPU never worked as expected (worked only if previous time it was executed on some other CPU, that is quite opposite). It caused additional CPU usage to run full CPU search and could schedule interrupt threads to some other CPU. This patch removes that code and instead reuses existing non-interrupt code path with some tweaks for interrupt case: - On SMT systems, if current thread is idle, don't look on other threads. Even if they are busy, it may take more time to do full search and bounce the interrupt thread to other core than execute it locally, even sharing CPU resources. It is other threads that should migrate, not bound interrupts. - Try hard to keep interrupt threads within LLC of their original CPU. This improves scheduling cost and supposedly cache and memory locality. On a test system with 72 threads doing 2.2M IOPS to NVMe this saves few percents of CPU time while adding few percents to IOPS. MFC after: 1 month Sponsored by: iXsystems, Inc.	2019-09-24 20:01:20 +00:00
rrs	7648feb4d9	This commit adds BBR (Bottleneck Bandwidth and RTT) congestion control. This is a completely separate TCP stack (tcp_bbr.ko) that will be built only if you add the make options WITH_EXTRA_TCP_STACKS=1 and also include the option TCPHPTS. You can also include the RATELIMIT option if you have a NIC interface that supports hardware pacing, BBR understands how to use such a feature. Note that this commit also adds in a general purpose time-filter which allows you to have a min-filter or max-filter. A filter allows you to have a low (or high) value for some period of time and degrade slowly to another value as time passes. You can find out the details of BBR by looking at the original paper at: https://queue.acm.org/detail.cfm?id=3022184 or consult many other web resources you can find on the web referenced by "BBR congestion control". It should be noted that BBRv1 (which this is) does tend to unfairness in cases of small buffered paths, and it will usually get less bandwidth in the case of large BDP paths (when competing with new-reno or cubic flows). BBR is still an active research area and we do plan on implementing V2 of BBR to see if it is an improvement over V1. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D21582	2019-09-24 18:18:11 +00:00
mjg	65396efbad	cache: tidy up handling of negative entries - track the total count of hot entries - pre-read the lock when shrinking since it is typically already taken - place the lock in its own cacheline - shorten the hold time of hot lock list when zapping Sponsored by: The FreeBSD Foundation	2019-09-23 20:50:04 +00:00
markj	9d221cb662	Use elf_relocaddr() when handling R_X86_64_RELATIVE relocations. This is required for DPCPU and VNET data variable definitions to work when KLDs are linked as DSOs. R_X86_64_RELATIVE relocations should not appear in object files, so assert this in elf_relocaddr(). Reviewed by: kib MFC after: 1 month Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21755	2019-09-23 14:14:43 +00:00
mjg	58ff07f147	cache: count evictions of negative entries Sponsored by: The FreeBSD Foundation	2019-09-23 08:53:14 +00:00
sef	f0e5ce5f10	Add two options to allow mount to avoid covering up existing mount points. The two options are * nocover/cover: Prevent/allow mounting over an existing root mountpoint. E.g., "mount -t ufs -o nocover /dev/sd1a /usr/local" will fail if /usr/local is already a mountpoint. * emptydir/noemptydir: Prevent/allow mounting on a non-empty directory. E.g., "mount -t ufs -o emptydir /dev/sd1a /usr" will fail. Neither of these options is intended to be a default, for historical and compatibility reasons. Reviewed by: allanjude, kib Differential Revision: https://reviews.freebsd.org/D21458	2019-09-23 04:28:07 +00:00
mjg	cb473cf663	cache: try to avoid vhold if locks held Sponsored by: The FreeBSD Foundation	2019-09-22 20:50:24 +00:00
mjg	3275a01ba8	cache: jump in negative success instead of positive Sponsored by: The FreeBSD Foundation	2019-09-22 20:49:17 +00:00
mjg	441ee498a7	lockprof: move per-cpu data to dpcpu Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21747	2019-09-22 20:44:24 +00:00
kib	5fcd6fab09	kern.elf{32,64}.pie_base sysctl: enforce page alignment. Requested by: rstone Sponsored by: The FreeBSD Foundation MFC after: 1 week	2019-09-21 20:03:17 +00:00
mjg	7c14989404	lockprof: use CPUFOREACH and drop always false lp_cpu NULL checks Sponsored by: The FreeBSD Foundation	2019-09-21 19:05:38 +00:00
kib	ac8689b233	Make non-ASLR pie base tunable. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2019-09-21 18:00:23 +00:00
mav	7df90c3c5d	Allocate callout wheel from the respective memory domain. MFC after: 1 week	2019-09-21 15:38:08 +00:00
gallatin	34e9ee548b	remove redundant "ktls" in KTLS thr name This reduces the string width of the ktls thread name and improves "ps" output. Glanced at by: jhb Event: EuroBSDCon hackathon Sponsored by: Netflix	2019-09-20 09:36:07 +00:00
mjg	843a2cac89	vfs: group fields used for per-cpu ops in one cacheline Sponsored by: The FreeBSD Foundation	2019-09-19 21:23:14 +00:00
kib	4e6c9714d2	sysctl: use names instead of magic numbers. Replace magic numbers with symbols for internal sysctl operations. Convert in-kernel and libc consumers. Submitted by: Pawel Biernacki MFC after: 1 week Differential revision: https://reviews.freebsd.org/D21693	2019-09-18 16:13:10 +00:00
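
The length check described in fc91d16ffa above can be illustrated with a minimal userland sketch in C. The struct and function names (opt_sketch, opt_is_terminated) are hypothetical and are not the kernel's actual vfs option types; the sketch only assumes, as the commit message states, that an empty option carries a NULL pointer and a length of 0.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one entry in the kernel's mount option list. */
struct opt_sketch {
	const char *value;	/* NULL when the option was passed empty */
	size_t len;		/* length in bytes including the NUL; 0 if empty */
};

/*
 * Return true only if the option holds a NUL-terminated string.
 * The emptiness test must run before the dereference: with len == 0,
 * value[len - 1] wraps around to value[SIZE_MAX], which for a NULL
 * pointer is the last address of the address space.
 */
static bool
opt_is_terminated(const struct opt_sketch *opt)
{
	if (opt->value == NULL || opt->len == 0)
		return (false);
	return (opt->value[opt->len - 1] == '\0');
}
```

A caller would reject the option (for example with EINVAL) whenever opt_is_terminated() returns false, rather than reading through the pointer.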


16887 Commits