freebsd-skq

Author	SHA1	Message	Date
Mark Johnston	f822c9e287	Apply mapping protections to preloaded kernel modules on amd64. With an upcoming change the amd64 kernel will map preloaded files RW instead of RWX, so the kernel linker must adjust protections appropriately using pmap_change_prot(). Reviewed by: kib MFC after: 1 month Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21860	2019-10-18 13:56:45 +00:00
Mark Johnston	1d9eae9fb2	Apply mapping protections to .o kernel modules. Use the section flags to derive mapping protections. When multiple sections overlap within a page, the union of their protections must be applied. With r353701 the .text and .rodata sections are padded to ensure that this does not happen on amd64. Reviewed by: kib MFC after: 1 month Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21896	2019-10-18 13:53:14 +00:00
Conrad Meyer	dda17b3672	Implement NetGDB(4) NetGDB(4) is a component of a system using a panic-time network stack to remotely debug crashed FreeBSD kernels over the network, instead of traditional serial interfaces. There are three pieces in the complete NetGDB system. First, a dedicated proxy server must be running to accept connections from both NetGDB and gdb(1), and pass bidirectional traffic between the two protocols. Second, the NetGDB client is activated much like ordinary 'gdb' and similarly to 'netdump' in ddb(4) after a panic. Like other debugnet(4) clients (netdump(4)), the network interface on the route to the proxy server must be online and support debugnet(4). Finally, the remote (k)gdb(1) uses 'target remote <proxy>:<port>' (like any other TCP remote) to connect to the proxy server. The NetGDB v1 protocol speaks the literal GDB remote serial protocol, and uses a 1:1 relationship between GDB packets and sequences of debugnet packets (fragmented by MTU). There is no encryption utilized to keep debugging sessions private, so this is only appropriate for local segments or trusted networks. Submitted by: John Reimer <john.reimer AT emc.com> (earlier version) Discussed some with: emaste, markj Relnotes: sure Differential Revision: https://reviews.freebsd.org/D21568	2019-10-17 21:33:01 +00:00
Mark Johnston	092bacb2c4	Clean up some nits in link_elf_(un)load_file(). - Remove a redundant assignment of ef->address. - Don't return a Mach error number to the caller if vm_map_find() fails. - Use ptoa() and fix style. MFC after: 2 weeks Sponsored by: Netflix	2019-10-17 21:25:50 +00:00
Conrad Meyer	addccb8c51	Add a very limited DDB dumpon(8)-alike to MI dumper code This allows ddb(4) commands to construct a static dumperinfo during panic/debug and invoke doadump(false) using the provided dumper configuration (always inserted first in the list). The intended usecase is a ddb(4)-time netdump(4) command. Reviewed by: markj (earlier version) Differential Revision: https://reviews.freebsd.org/D21448	2019-10-17 18:29:44 +00:00
Conrad Meyer	7790c8c199	Split out a more generic debugnet(4) from netdump(4) Debugnet is a simplistic and specialized panic- or debug-time reliable datagram transport. It can drive a single connection at a time and is currently unidirectional (debug/panic machine transmit to remote server only). It is mostly a verbatim code lift from netdump(4). Netdump(4) remains the only consumer (until the rest of this patch series lands). The INET-specific logic has been extracted somewhat more thoroughly than previously in netdump(4), into debugnet_inet.c. UDP-layer logic and up, as much as possible as is protocol-independent, remains in debugnet.c. The separation is not perfect and future improvement is welcome. Supporting INET6 is a long-term goal. Much of the diff is "gratuitous" renaming from 'netdump_' or 'nd_' to 'debugnet_' or 'dn_' -- sorry. I thought keeping the netdump name on the generic module would be more confusing than the refactoring. The only functional change here is the mbuf allocation / tracking. Instead of initiating solely on netdump-configured interface(s) at dumpon(8) configuration time, we watch for any debugnet-enabled NIC for link activation and query it for mbuf parameters at that time. If they exceed the existing high-water mark allocation, we re-allocate and track the new high-water mark. Otherwise, we leave the pre-panic mbuf allocation alone. In a future patch in this series, this will allow initiating netdump from panic ddb(4) without pre-panic configuration. No other functional change intended. Reviewed by: markj (earlier version) Some discussion with: emaste, jhb Objection from: marius Differential Revision: https://reviews.freebsd.org/D21421	2019-10-17 16:23:03 +00:00
Andriy Gapon	5fdc2c044e	provide a way to assign taskqueue threads to a kernel process This can be used to group all threads belonging to a single logical entity under a common kernel process. I am planning to use the new interface for ZFS threads. MFC after: 4 weeks	2019-10-17 06:32:34 +00:00
Mark Johnston	6d775f0ba1	Use KOBJMETHOD_END in the kernel linker. MFC after: 1 week	2019-10-16 22:06:19 +00:00
Mark Johnston	01cef4caa7	Remove page locking from pmap_mincore(). After r352110 the page lock no longer protects a page's identity, so there is no purpose in locking the page in pmap_mincore(). Instead, if vm.mincore_mapped is set to the non-default value of 0, re-lookup the page after acquiring its object lock, which holds the page's identity stable. The change removes the last callers of vm_page_pa_tryrelock(), so remove it. Reviewed by: kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21823	2019-10-16 22:03:27 +00:00
Andrew Turner	9bb37c03fb	Stop leaking information from the kernel through timespec The timespec struct holds a seconds value in a time_t and a nanoseconds value in a long. On most architectures these are the same size, however on 32-bit architectures other than i386 time_t is 8 bytes and long is 4 bytes. Most ABIs will then pad a struct holding an 8 byte and 4 byte value to 16 bytes with 4 bytes of padding. When copying one of these structs the compiler is free to copy the padding if it wishes. In this case the padding may contain kernel data that is then leaked to userspace. Fix this by copying the timespec elements rather than the entire struct. This doesn't affect Tier-1 architectures so no SA is expected. admbugs: 651 MFC after: 1 week Sponsored by: DARPA, AFRL	2019-10-16 13:21:01 +00:00
Kristof Provost	1d95443818	Generalize ARM specific comments in devmap The comments in devmap are very ARM specific, this generalizes them for other architectures. Submitted by: Nicholas O'Brien <nickisobrien_gmail.com> Reviewed by: manu, philip Sponsored by: Axiado Differential Revision: https://reviews.freebsd.org/D22035	2019-10-15 23:21:52 +00:00
Gleb Smirnoff	4b25d1f2e3	Missing from r353596.	2019-10-15 21:32:38 +00:00
Gleb Smirnoff	bac060388f	When assertion for a thread not being in an epoch fails also print all entered epochs. Works with EPOCH_TRACE only. Reviewed by: hselasky Differential Revision: https://reviews.freebsd.org/D22017	2019-10-15 21:24:25 +00:00
Gleb Smirnoff	237c1f932b	Remove pfctlinput2(). It came from KAME and had never ever been in use.	2019-10-15 15:40:03 +00:00
Jeff Roberson	0012f373e4	(4/6) Protect page valid with the busy lock. Atomics are used for page busy and valid state when the shared busy is held. The details of the locking protocol and valid and dirty synchronization are in the updated vm_page.h comments. Reviewed by: kib, markj Tested by: pho Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D21594	2019-10-15 03:45:41 +00:00
Jeff Roberson	63e9755548	(1/6) Replace busy checks with acquires where it is trival to do so. This is the first in a series of patches that promotes the page busy field to a first class lock that no longer requires the object lock for consistency. Reviewed by: kib, markj Tested by: pho Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D21548	2019-10-15 03:35:11 +00:00
Leandro Lupori	0ecc478b74	[PPC64] Initial kernel minidump implementation Based on POWER9BSD implementation, with all POWER9 specific code removed and addition of new methods in PPC64 MMU interface, to isolate platform specific code. Currently, the new methods are implemented on pseries and PowerNV (D21643). Reviewed by: jhibbits Differential Revision: https://reviews.freebsd.org/D21551	2019-10-14 13:04:04 +00:00
Gleb Smirnoff	f6eccf96a0	Since EPOCH_TRACE had been moved to opt_global.h, we don't need to waste extra space in struct thread.	2019-10-14 04:17:56 +00:00
Mateusz Guzik	d1cbf3eeea	vfs: add MNTK_NOMSYNC On many filesystems the traversal is effectively a no-op. Add a way to avoid the overhead. Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D22009	2019-10-13 15:40:34 +00:00
Mateusz Guzik	737241cd51	vfs: return free vnode batches in sync instead of vfs_msync It is a more natural fit. vfs_msync only deals with active vnodes. Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D22008	2019-10-13 15:39:11 +00:00
Alexander Motin	a89a562b60	Allocate device softc from the device domain. Since we are trying to bind device interrupt threads to the device domain, it should have sense to make memory often accessed by them local. If domain is not known, fall back to round-robin. MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2019-10-12 19:03:07 +00:00
Kristof Provost	85d1151f96	mountroot: run statfs after mounting devfs The usual flow for mounting a file system is to VFS_MOUNT() and then immediately VFS_STATFS(). That's not done in vfs_mountroot_devfs(), which means the mp->mnt_stat.f_iosize field is not correctly populated, which in turn causes us to mark valid aio operations as unsafe (because the io size is set to 0), ultimately causing the aio_test:md_waitcomplete test to fail. Reviewed by: mckusick MFC after: 1 week Sponsored by: Axiado Differential Revision: https://reviews.freebsd.org/D21897	2019-10-11 17:04:38 +00:00
Conrad Meyer	46d70077be	ddb: Add CSV option, sorting to 'show (malloc\|uma)' Add /i option for machine-parseable CSV output. This allows ready copy/ pasting into more sophisticated tooling outside of DDB. Add total zone size ("Memory Use") as a new column for UMA. For both, sort the displayed list on size (print the largest zones/types first). This is handy for quickly diagnosing "where has my memory gone?" at a high level. Submitted by: Emily Pettigrew <Emily.Pettigrew AT isilon.com> (earlier version) Sponsored by: Dell EMC Isilon	2019-10-11 01:31:31 +00:00
John Baldwin	97ecf6efa0	Don't free the cursor boundary tag during vmem_destroy(). The cursor boundary tag is statically allocated in the vmem instead of from the vmem_bt_zone. Explicitly remove it from the vmem's segment list in vmem_destroy before freeing all the segments from the vmem. Reviewed by: markj MFC after: 1 week Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D21953	2019-10-09 21:20:39 +00:00
Gleb Smirnoff	975b8f8462	Cleanup unneeded includes that crept in with r353292.	2019-10-09 16:59:42 +00:00
Gleb Smirnoff	ff3cfc330e	Enter network epoch in domain callouts.	2019-10-09 16:21:05 +00:00
Mark Johnston	4013d72684	Fix handling of empty SCM_RIGHTS messages. As unp_internalize() processes the input control messages, it builds an output mbuf chain containing the internalized representations of those messages. In one special case, that of an empty SCM_RIGHTS message, the message is simply discarded. However, the loop which appends mbufs to the output chain assumed that each iteration would produce an mbuf, resulting in a null pointer dereference if an empty SCM_RIGHTS message was followed by a non-empty message. Fix this by advancing the output mbuf chain tail pointer only if an internalized control message was produced. Reported by: syzbot+1b5cced0f7fad26ae382@syzkaller.appspotmail.com MFC after: 1 week Sponsored by: The FreeBSD Foundation	2019-10-08 23:34:48 +00:00
John Baldwin	9e14430d46	Add a TOE KTLS mode and a TOE hook for allocating TLS sessions. This adds the glue to allocate TLS sessions and invokes it from the TLS enable socket option handler. This also adds some counters for active TOE sessions. The TOE KTLS mode is returned by getsockopt(TLSTX_TLS_MODE) when TOE KTLS is in use on a socket, but cannot be set via setsockopt(). To simplify various checks, a TLS session now includes an explicit 'mode' member set to the value returned by TLSTX_TLS_MODE. Various places that used to check 'sw_encrypt' against NULL to determine software vs ifnet (NIC) TLS now check 'mode' instead. Reviewed by: np, gallatin Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D21891	2019-10-08 21:34:06 +00:00
Doug Moore	2288078c5e	Define macro VM_MAP_ENTRY_FOREACH for enumerating the entries in a vm_map. In case the implementation ever changes from using a chain of next pointers, then changing the macro definition will be necessary, but changing all the files that iterate over vm_map entries will not. Drop a counter in vm_object.c that would have an effect only if the vm_map entry count was wrong. Discussed with: alc Reviewed by: markj Tested by: pho (earlier version) Differential Revision: https://reviews.freebsd.org/D21882	2019-10-08 07:14:21 +00:00
Gleb Smirnoff	b8a6e03fac	Widen NET_EPOCH coverage. When epoch(9) was introduced to network stack, it was basically dropped in place of existing locking, which was mutexes and rwlocks. For the sake of performance mutex covered areas were as small as possible, so became epoch covered areas. However, epoch doesn't introduce any contention, it just delays memory reclaim. So, there is no point to minimise epoch covered areas in sense of performance. Meanwhile entering/exiting epoch also has non-zero CPU usage, so doing this less often is a win. Not the least is also code maintainability. In the new paradigm we can assume that at any stage of processing a packet, we are inside network epoch. This makes coding both input and output path way easier. On output path we already enter epoch quite early - in the ip_output(), in the ip6_output(). This patch does the same for the input path. All ISR processing, network related callouts, other ways of packet injection to the network stack shall be performed in net_epoch. Any leaf function that walks network configuration now asserts epoch. Tricky part is configuration code paths - ioctls, sysctls. They also call into leaf functions, so some need to be changed. This patch would introduce more epoch recursions (see EPOCH_TRACE) than we had before. They will be cleaned up separately, as several of them aren't trivial. Note, that unlike a lock recursion the epoch recursion is safe and just wastes a bit of resources. Reviewed by: gallatin, hselasky, cy, adrian, kristof Differential Revision: https://reviews.freebsd.org/D19111	2019-10-07 22:40:05 +00:00
Edward Tomasz Napierala	1a13f2e6b4	Introduce stats(3), a flexible statistics gathering API. This provides a framework to define a template describing a set of "variables of interest" and the intended way for the framework to maintain them (for example the maximum, sum, t-digest, or a combination thereof). Afterwards the user code feeds in the raw data, and the framework maintains these variables inside a user-provided, opaque stats blobs. The framework also provides a way to selectively extract the stats from the blobs. The stats(3) framework can be used in both userspace and the kernel. See the stats(3) manual page for details. This will be used by the upcoming TCP statistics gathering code, https://reviews.freebsd.org/D20655. The stats(3) framework is disabled by default for now, except in the NOTES kernel (for QA); it is expected to be enabled in amd64 GENERIC after a cool down period. Reviewed by: sef (earlier version) Obtained from: Netflix Relnotes: yes Sponsored by: Klara Inc, Netflix Differential Revision: https://reviews.freebsd.org/D20477	2019-10-07 19:05:05 +00:00
Mateusz Guzik	dc20b834ca	vfs: add optional root vnode caching Root vnodes looekd up all the time, e.g. when crossing a mount point. Currently used routines always perform a costly lookup which can be trivially avoided. Reviewed by: jeff (previous version), kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21646	2019-10-06 22:14:32 +00:00
Kyle Evans	e3f35d562f	Remove the remnants of SI_CHEAPCLONE SI_CHEAPCLONE was introduced in r66067 for use with cloned bpfs. It was later also used in tty, tun, tap at points. The rough timeline for being removed in each of these is as follows: - r181690: bpf switched to use cdevpriv API by ed@ - r181905: ed@ rewrote the TTY later to be mpsafe - r204464: kib@ removes it from tun/tap, declaring it unused I've not yet been able to dig up any other consumers in the intervening 9 years. It is no longer set on any devices in the tree and leaves an interesting situation in make_dev_sv where we're ok with the device already being set SI_NAMED.	2019-10-05 21:52:06 +00:00
Kyle Evans	d42fecb5c1	kern_conf: fully initialize cloned devices with make_dev_args, too Attempting to initialize si_drv{1,2} with mda_si_drv{1,2} does not work if you are operating on cloned devices. clone_create must be called prior to the make_dev* family to create/return the device on the clonelist as needed. This device is later returned early in newdev(), prior to si_drv{0,1,2} initialization. This patch simply breaks out of the loop if we've found a device and finishes init. Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D21904	2019-10-05 21:44:18 +00:00
Mateusz Guzik	dfa8dae493	devfs: plug redundant bwillwrite avoidance vn_write already checks for vnode type to see if bwillwrite should be called. This effectively reverts r244643. Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21905	2019-10-05 17:44:33 +00:00
Eric van Gyzen	e61e783b83	Add CTLFLAG_STATS to some vfs sysctl OIDs Add CTLFLAG_STATS to the following OIDs: vfs.altbufferflushes vfs.recursiveflushes vfs.barrierwrites vfs.flushwithdeps vfs.reassignbufcalls Refer to r353111. MFC after: 2 weeks Sponsored by: Dell EMC Isilon	2019-10-04 21:43:43 +00:00
Ed Maste	f91dd6091b	simplify path handling in sysctl_try_reclaim_vnode MAXPATHLEN / PATH_MAX includes space for the terminating NUL, and namei verifies the presence of the NUL. Thus there is no need to increase the buffer size here. The sysctl passes the string excluding the NUL, so req->newlen equal to PATH_MAX is too long. Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21876	2019-10-02 21:01:23 +00:00
Mark Johnston	5131cba6d6	Use OBJT_PHYS VM objects for kernel modules. OBJT_DEFAULT incurs some unnecessary overhead given that kernel module pages cannot be paged out. Reviewed by: alc, kib MFC after: 1 week Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21862	2019-10-02 16:34:42 +00:00
Mark Johnston	4a7b33ecf4	Disallow fcntl(F_READAHEAD) when the vnode is not a regular file. The mountpoint may not have defined an iosize parameter, so an attempt to configure readahead on a device file can lead to a divide-by-zero crash. The sequential heuristic is not applied to I/O to or from device files, and posix_fadvise(2) returns an error when v_type != VREG, so perform the same check here. Reported by: syzbot+e4b682208761aa5bc53a@syzkaller.appspotmail.com Reviewed by: kib MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21864	2019-10-02 15:45:49 +00:00
Kyle Evans	5a391b572b	shm_open2(2): completely unbreak kern_shm_open2(), since conception, completely fails to pass the mode along to kern_shm_open(). This breaks most uses of it. Add tests alongside this that actually check the mode of the returned files. PR: 240934 [pulseaudio breakage] Reported by: ler, Andrew Gierth [postgres breakage] Diagnosed by: Andrew Gierth (great catch) Tested by: ler, tmunro Pointy hat to: kevans	2019-10-02 02:37:34 +00:00
Ed Maste	f403831e6c	sysalls.master: remove superfluous ellipsis in comment A single period is sufficient in this comment, and making this change lets us find references to varargs syscalls by searching for ...	2019-10-01 17:05:21 +00:00
Brooks Davis	3a94552174	Restore the ability to set capenabled directly in syscalls.conf. This fixes generation of cloudabi syscall tables broken in r340424. Reviewed by: kevans, emaste MFC after: 3 days Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D21821	2019-09-30 20:58:29 +00:00
Kyle Evans	11fd6a60e7	syscalls.master: consistency, move ); to newline (no functional change)	2019-09-30 13:26:16 +00:00
Mark Johnston	1aa696babc	Fix some problems with the SPARSE_MAPPING option in the kernel linker. - Ensure that the end of the mapping passed to vm_page_wire() is page-aligned. vm_page_wire() expects this. - Wire pages before reading data into them. - Apply protections specified in the segment descriptor using vm_map_protect() once relocation processing is done. - On amd64, ensure that we load KLDs above KERNBASE, since they are compiled with the "kernel" memory model by default. Reviewed by: kib MFC after: 2 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21756	2019-09-28 01:42:59 +00:00
Andrew Gallatin	b2dba6634b	kTLS: Fix a bug where we would not encrypt anon data inplace. Software Kernel TLS needs to allocate a new destination crypto buffer when encrypting data from the page cache, so as to avoid overwriting shared clear-text file data with encrypted data specific to a single socket. When the data is anonymous, eg, not tied to a file, then we can encrypt in place and avoid allocating a new page. This fixes a bug where the existing code always assumes the data is private, and never encrypts in place. This results in unneeded page allocations and potentially more memory bandwidth consumption when doing socket writes. When the code was written at Netflix, ktls_encrypt() looked at private sendfile flags to determine if the pages being encrypted where part of the page cache (coming from sendfile) or anonymous (coming from sosend). This was broken internally at Netflix when the sendfile flags were made private, and the M_WRITABLE() check was added. Unfortunately, M_WRITABLE() will always be false for M_NOMAP mbufs, since one cannot just mtod() them. This change introduces a new flags field to the mbuf_ext_pgs struct by stealing a byte from the tls hdr. Note that the current header is still 2 bytes larger than the largest header we support: AES-CBC with explicit IV. We set MBUF_PEXT_FLAG_ANON when creating an unmapped mbuf in m_uiotombuf_nomap() (which is the path that socket writes take), and we check for that flag in ktls_encrypt() when looking for anon pages. Reviewed by: jhb Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21796	2019-09-27 20:08:19 +00:00
Andrew Gallatin	6554362c66	kTLS support for TLS 1.3 TLS 1.3 requires a few changes because 1.3 pretends to be 1.2 with a record type of application data. The "real" record type is then included at the end of the user-supplied plaintext data. This required adding a field to the mbuf_ext_pgs struct to save the record type, and passing the real record type to the sw_encrypt() ktls backend functions. Reviewed by: jhb, hselasky Sponsored by: Netflix Differential Revision: D21801	2019-09-27 19:17:40 +00:00
Mateusz Guzik	708cf7eb6c	cache: decrease ncnegfactor to 5 The current mechanism is bogus in several ways: - the limit is a percentage of total entries added, which means negative entries get evicted all the time even if there are plenty of resources - evicting code is almost not concurrent, which makes it unable to remove entries fast enough when doing something as simple as -j 104 buildworld - there is no support for performing mass removal if necessary Vast majority of negative entries never get any hits. Only evicting them when the filesystem demands it results in a significant growth of the namecache with almost no improvement in the hit ratio. Sample result about afer 90 minutes of poudriere -j 104: current no evict % of the original numneg 219737 2013157 916 numneghits 266711906 263544562 98 [1] [1] this may look funny but there is a certain dose of variation to the build The number was chosen as something which mostly eliminates spurious evictions during lighter workloads but still keeps the total at bay. Sponsored by: The FreeBSD Foundation	2019-09-27 19:14:03 +00:00
Mateusz Guzik	e643141838	cache: stop requeuing negative entries on the hot list Turns out it does not improve hit ratio, but it does come with a cost induces stemming from dirtying hit entries. Sample result: hit counts of evicted entries after 2 buildworlds before: value ------------- Distribution ------------- count -1 \| 0 0 \|@@@@@@@@@@@@@@@@@@@@@@@@@ 180865 1 \|@@@@@@@ 49150 2 \|@@@ 19067 4 \|@ 9825 8 \|@ 7340 16 \|@ 5952 32 \|@ 5243 64 \|@ 4446 128 \| 3556 256 \| 3035 512 \| 1705 1024 \| 1078 2048 \| 365 4096 \| 95 8192 \| 34 16384 \| 26 32768 \| 23 65536 \| 8 131072 \| 6 262144 \| 0 after: value ------------- Distribution ------------- count -1 \| 0 0 \|@@@@@@@@@@@@@@@@@@@@@@@@@ 184004 1 \|@@@@@@ 47577 2 \|@@@ 19446 4 \|@ 10093 8 \|@ 7470 16 \|@ 5544 32 \|@ 5475 64 \|@ 5011 128 \| 3451 256 \| 3002 512 \| 1729 1024 \| 1086 2048 \| 363 4096 \| 86 8192 \| 26 16384 \| 25 32768 \| 24 65536 \| 7 131072 \| 5 262144 \| 0 Sponsored by: The FreeBSD Foundation	2019-09-27 19:13:22 +00:00
Mateusz Guzik	312196df0f	cache: make negative list shrinking a little bit concurrent Continue protecting demotion from the hotlist and selection of the target list with the ncneg_shrink_lock lock, but drop it before relocking to zap the node. While here count how many times we skipped shrinking due to the lock being already taken. Sponsored by: The FreeBSD Foundation	2019-09-27 19:12:43 +00:00
Mateusz Guzik	95c6dd890a	cache: stop recalculating upper limit each time a new entry is added Sponsored by: The FreeBSD Foundation	2019-09-27 19:12:20 +00:00
Konstantin Belousov	df08823d07	Improve MD page fault handlers. Centralize calculation of signal and ucode delivered on unhandled page fault in new function vm_fault_trap(). MD trap_pfault() now almost always uses the signal numbers and error codes calculated in consistent MI way. This introduces the protection fault compatibility sysctls to all non-x86 architectures which did not have that bug, but apparently they were already much more wrong in selecting delivered signals on protection violations. Change the delivered signal for accesses to mapped area after the backing object was truncated. According to POSIX description for mmap(2): The system shall always zero-fill any partial page at the end of an object. Further, the system shall never write out any modified portions of the last page of an object which are beyond its end. References within the address range starting at pa and continuing for len bytes to whole pages following the end of an object shall result in delivery of a SIGBUS signal. An implementation may generate SIGBUS signals when a reference would cause an error in the mapped object, such as out-of-space condition. Adjust according to the description, keeping the existing compatibility code for SIGSEGV/SIGBUS on protection failures. For situations where kernel cannot handle page fault due to resource limit enforcement, SIGBUS with a new error code BUS_OBJERR is delivered. Also, provide a new error code SEGV_PKUERR for SIGSEGV on amd64 due to protection key access violation. vm_fault_hold() is renamed to vm_fault(). Fixed some nits in trap_pfault()s like mis-interpreting Mach errors as errnos. Removed unneeded truncations of the fault addresses reported by hardware. PR: 211924 Reviewed by: alc Discussed with: jilles, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D21566	2019-09-27 18:43:36 +00:00
Andrew Turner	50bb04b750	Check the vfs option length is valid before accessing through When a VFS option passed to nmount is present but NULL the kernel will place an empty option in its internal list. This will have a NULL pointer and a length of 0. When we come to read one of these the kernel will try to load from the last address of virtual memory. This is normally invalid so will fault resulting in a kernel panic. Fix this by checking if the length is valid before dereferencing. MFC after: 3 days Sponsored by: DARPA, AFRL	2019-09-27 16:22:28 +00:00
David Bright	c4571256af	sysent: regenerate after r352747. Sponsored by: Dell EMC Isilon	2019-09-26 15:41:10 +00:00
Mark Johnston	55248d32f2	Fix handling of invalid pages in exec_map_first_page(). exec_map_first_page() would unconditionally free an unbacked, invalid page from the executable image. However, it is possible that the page is wired, in which case it is incorrect to free the page, so check for additional wirings first. Reported by: syzkaller Tested by: pho Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21767	2019-09-26 15:35:35 +00:00
David Bright	9afb12bab4	Add an shm_rename syscall Add an atomic shm rename operation, similar in spirit to a file rename. Atomically unlink an shm from a source path and link it to a destination path. If an existing shm is linked at the destination path, unlink it as part of the same atomic operation. The caller needs the same permissions as shm_unlink to the shm being renamed, and the same permissions for the shm at the destination which is being unlinked, if it exists. If those fail, EACCES is returned, as with the other shm_* syscalls. truss support is included; audit support will come later. This commit includes only the implementation; the sysent-generated bits will come in a follow-on commit. Submitted by: Matthew Bryan <matthew.bryan@isilon.com> Reviewed by: jilles (earlier revision) Reviewed by: brueffer (manpages, earlier revision) Relnotes: yes Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D21423	2019-09-26 15:32:28 +00:00
Toomas Soome	11fc80a098	kernel terminal should initialize fg and bg variables before calling TUNABLE_INT_FETCH We have two ways to check if kenv variable exists - either we check return value from TUNABLE_INT_FETCH, or we pre-initialize the variable and check if this value did change. In terminal_init() it is more convinient to use pre-initialized variables. Problem was revealed by older loader.efi, which did not set teken.* variables. Reported by: tuexen	2019-09-26 07:19:26 +00:00
Alexander Motin	176dd236dc	Microoptimize sched_pickcpu() CPU affinity on SMT. Use of CPU_FFS() to implement CPUSET_FOREACH() allows to save up to ~0.5% of CPU time on 72-thread SMT system doing 80K IOPS to NVMe from one thread. MFC after: 1 month Sponsored by: iXsystems, Inc.	2019-09-26 00:35:06 +00:00
Alexander Motin	c55dc51c37	Microoptimize sched_pickcpu() after r352658. I've noticed that I missed intr check at one more SCHED_AFFINITY(), so instead of adding one more branching I prefer to remove few. Profiler shows the function CPU time reduction from 0.24% to 0.16%. MFC after: 1 month Sponsored by: iXsystems, Inc.	2019-09-25 19:29:09 +00:00
Kyle Evans	079c5b9ed8	rfork(2): add RFSPAWN flag When RFSPAWN is passed, rfork exhibits vfork(2) semantics but also resets signal handlers in the child during creation to avoid a point of corruption of parent state from the child. This flag will be used by posix_spawn(3) to handle potential signal issues. Reviewed by: jilles, kib Differential Revision: https://reviews.freebsd.org/D19058	2019-09-25 19:20:41 +00:00
Gleb Smirnoff	dd902d015a	Add debugging facility EPOCH_TRACE that checks that epochs entered are properly nested and warns about recursive entrances. Unlike with locks, there is nothing fundamentally wrong with such use, the intent of tracer is to help to review complex epoch-protected code paths, and we mean the network stack here. Reviewed by: hselasky Sponsored by: Netflix Pull Request: https://reviews.freebsd.org/D21610	2019-09-25 18:26:31 +00:00
Kyle Evans	a9ac5e1424	sysent: regenerate after r352705 This also implements it, fixes kdump, and removes no longer needed bits from lib/libc/sys/shm_open.c for the interim.	2019-09-25 18:09:19 +00:00
Kyle Evans	234879a7e3	Mark shm_open(2) as COMPAT12, succeeded by shm_open2 Implementation and regenerated files will follow.	2019-09-25 18:06:48 +00:00
Kyle Evans	460211e730	sysent: regenerate after r352700	2019-09-25 17:59:58 +00:00
Kyle Evans	20f7057685	Add a shm_open2 syscall to support upcoming memfd_create shm_open2 allows a little more flexibility than the original shm_open. shm_open2 doesn't enforce CLOEXEC on its callers, and it has a separate shmflag argument that can be expanded later. Currently the only shmflag is to allow file sealing on the returned fd. shm_open and memfd_create will both be implemented in libc to use this new syscall. __FreeBSD_version is bumped to indicate the presence. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D21393	2019-09-25 17:59:15 +00:00
Kyle Evans	0cd95859c8	[2/3] Add an initial seal argument to kern_shm_open() Now that flags may be set on posixshm, add an argument to kern_shm_open() for the initial seals. To maintain past behavior where callers of shm_open(2) are guaranteed to not have any seals applied to the fd they're given, apply F_SEAL_SEAL for existing callers of kern_shm_open. A special flag could be opened later for shm_open(2) to indicate that sealing should be allowed. We currently restrict initial seals to F_SEAL_SEAL. We cannot error out if F_SEAL_SEAL is re-applied, as this would easily break shm_open() twice to a shmfd that already existed. A note's been added about the assumptions we've made here as a hint towards anyone wanting to allow other seals to be applied at creation. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D21392	2019-09-25 17:35:03 +00:00
Kyle Evans	af755d3e48	[1/3] Add mostly Linux-compatible file sealing support File sealing applies protections against certain actions (currently: write, growth, shrink) at the inode level. New fileops are added to accommodate seals - EINVAL is returned by fcntl(2) if they are not implemented. Reviewed by: markj, kib Differential Revision: https://reviews.freebsd.org/D21391	2019-09-25 17:32:43 +00:00
Kyle Evans	85c5f3cb57	Add COMPAT12 support to makesyscalls.sh Reviewed by: kib, imp, brooks (all without syscalls.master edits) Differential Revision: https://reviews.freebsd.org/D21366	2019-09-25 17:29:45 +00:00
Toomas Soome	3001e0c942	kernel: terminal_init() should check for teken colors from kenv Check for teken.fg_color and teken.bg_color and prepare the color attributes accordingly. When white background is used, make it light to improve visibility. When black background is used, make kernel messages light.	2019-09-25 13:21:07 +00:00
Alexander Motin	bb3dfc6ae9	Fix wrong assertion in r352658. MFC after: 1 month	2019-09-25 11:58:54 +00:00
Alexander Motin	c9205e3500	Fix/improve interrupt threads scheduling. Doing some tests with very high interrupt rates I've noticed that one of conditions I added in r232207 to make interrupt threads in most cases run on local CPU never worked as expected (worked only if previous time it was executed on some other CPU, that is quite opposite). It caused additional CPU usage to run full CPU search and could schedule interrupt threads to some other CPU. This patch removes that code and instead reuses existing non-interrupt code path with some tweaks for interrupt case: - On SMT systems, if current thread is idle, don't look on other threads. Even if they are busy, it may take more time to do fill search and bounce the interrupt thread to other core then execute it locally, even sharing CPU resources. It is other threads should migrate, not bound interrupts. - Try hard to keep interrupt threads within LLC of their original CPU. This improves scheduling cost and supposedly cache and memory locality. On a test system with 72 threads doing 2.2M IOPS to NVMe this saves few percents of CPU time while adding few percents to IOPS. MFC after: 1 month Sponsored by: iXsystems, Inc.	2019-09-24 20:01:20 +00:00
Randall Stewart	35c7bb3407	This commit adds BBR (Bottleneck Bandwidth and RTT) congestion control. This is a completely separate TCP stack (tcp_bbr.ko) that will be built only if you add the make options WITH_EXTRA_TCP_STACKS=1 and also include the option TCPHPTS. You can also include the RATELIMIT option if you have a NIC interface that supports hardware pacing, BBR understands how to use such a feature. Note that this commit also adds in a general purpose time-filter which allows you to have a min-filter or max-filter. A filter allows you to have a low (or high) value for some period of time and degrade slowly to another value has time passes. You can find out the details of BBR by looking at the original paper at: https://queue.acm.org/detail.cfm?id=3022184 or consult many other web resources you can find on the web referenced by "BBR congestion control". It should be noted that BBRv1 (which this is) does tend to unfairness in cases of small buffered paths, and it will usually get less bandwidth in the case of large BDP paths(when competing with new-reno or cubic flows). BBR is still an active research area and we do plan on implementing V2 of BBR to see if it is an improvement over V1. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D21582	2019-09-24 18:18:11 +00:00
Mateusz Guzik	93a85508ad	cache: tidy up handling of negative entries - track the total count of hot entries - pre-read the lock when shrinking since it is typically already taken - place the lock in its own cacheline - shorten the hold time of hot lock list when zapping Sponsored by: The FreeBSD Foundation	2019-09-23 20:50:04 +00:00
Mark Johnston	38dae42c26	Use elf_relocaddr() when handling R_X86_64_RELATIVE relocations. This is required for DPCPU and VNET data variable definitions to work when KLDs are linked as DSOs. R_X86_64_RELATIVE relocations should not appear in object files, so assert this in elf_relocaddr(). Reviewed by: kib MFC after: 1 month Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21755	2019-09-23 14:14:43 +00:00
Mateusz Guzik	afe257e3ca	cache: count evictions of negatve entries Sponsored by: The FreeBSD Foundation	2019-09-23 08:53:14 +00:00
Sean Eric Fagan	ba7a55d934	Add two options to allow mount to avoid covering up existing mount points. The two options are * nocover/cover: Prevent/allow mounting over an existing root mountpoint. E.g., "mount -t ufs -o nocover /dev/sd1a /usr/local" will fail if /usr/local is already a mountpoint. * emptydir/noemptydir: Prevent/allow mounting on a non-empty directory. E.g., "mount -t ufs -o emptydir /dev/sd1a /usr" will fail. Neither of these options is intended to be a default, for historical and compatibility reasons. Reviewed by: allanjude, kib Differential Revision: https://reviews.freebsd.org/D21458	2019-09-23 04:28:07 +00:00
Mateusz Guzik	7505cffa56	cache: try to avoid vhold if locks held Sponsored by: The FreeBSD Foundation	2019-09-22 20:50:24 +00:00
Mateusz Guzik	cd2112c305	cache: jump in negative success instead of positive Sponsored by: The FreeBSD Foundation	2019-09-22 20:49:17 +00:00
Mateusz Guzik	d2be3ef05c	lockprof: move per-cpu data to dpcpu Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21747	2019-09-22 20:44:24 +00:00
Konstantin Belousov	f33533da8c	kern.elf{32,64}.pie_base sysctl: enforce page alignment. Requested by: rstone Sponsored by: The FreeBSD Foundation MFC after: 1 week	2019-09-21 20:03:17 +00:00
Mateusz Guzik	cbba2cb367	lockprof: use CPUFOREACH and drop always false lp_cpu NULL checks Sponsored by: The FreeBSD Foundation	2019-09-21 19:05:38 +00:00
Konstantin Belousov	95aafd6900	Make non-ASLR pie base tunable. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2019-09-21 18:00:23 +00:00
Alexander Motin	36d151a237	Allocate callout wheel from the respective memory domain. MFC after: 1 week	2019-09-21 15:38:08 +00:00
Andrew Gallatin	61b8a4af71	remove redundant "ktls" in KTLS thr name This reducesthe string width of the ktls thread name and improves "ps" output. Glanced at by: jhb Event: EuroBSDCon hackathon Sponsored by: Netflix	2019-09-20 09:36:07 +00:00
Mateusz Guzik	b488246b45	vfs: group fields used for per-cpu ops in one cacheline Sponsored by: The FreeBSD Foundation	2019-09-19 21:23:14 +00:00
Konstantin Belousov	382e01c8dc	sysctl: use names instead of magic numbers. Replace magic numbers with symbols for internal sysctl operations. Convert in-kernel and libc consumers. Submitted by: Pawel Biernacki MFC after: 1 week Differential revision: https://reviews.freebsd.org/D21693	2019-09-18 16:13:10 +00:00
Konstantin Belousov	55894117b1	Return EISDIR when directory is opened with O_CREAT without O_DIRECTORY. Reviewed by: bcr (man page), emaste (previous version) PR: 240452 Sponsored by: The FreeBSD Foundation MFC after: 1 week DIfferential revision: https://reviews.freebsd.org/D21634	2019-09-17 18:32:18 +00:00
Kirk McKusick	100369071d	The VFS-level clustering code collects together sequential blocks by issuing delayed-writes (bdwrite()) until a non-sequential block is written or the maximum cluster size is reached. At that point it collects the delayed buffers together (using bread()) to write them in a single operation. The assumption was that since we just looked at them they will still be in memory so there is no need to check for a read error from bread(). Very occationally (apparently every 10-hours or so when being pounded by Peter Holm's tests) this assumption is wrong. The fix is to check for errors from bread() and fail the cluster write thus falling back to the default individual flushing of any still dirty buffers. Reported by: Peter Holm and Chuck Silvers Reviewed by: kib MFC after: 3 days	2019-09-17 17:44:50 +00:00
Mateusz Guzik	d245aa1e72	vfs: apply r352437 to the fast path as well This one is very hard to run into. If the filesystem is being unmounted or the mount point is freed the vfs_op_thread_enter will fail. For it to succeed the mount point itself would have to be reallocated in the time window between the initial read and the attempt to enter. Sponsored by: The FreeBSD Foundation	2019-09-17 15:53:40 +00:00
Mateusz Guzik	7f65185940	vfs: fix braino resulting in NULL pointer deref in r352424 The breakage was added after all the testing and the testing which followed was not sufficient to find it. Reported by: pho Sponsored by: The FreeBSD Foundation	2019-09-17 08:09:39 +00:00
Mateusz Guzik	4cace859c2	vfs: convert struct mount counters to per-cpu There are 3 counters modified all the time in this structure - one for keeping the structure alive, one for preventing unmount and one for tracking active writers. Exact values of these counters are very rarely needed, which makes them a prime candidate for conversion to a per-cpu scheme, resulting in much better performance. Sample benchmark performing fstatfs (modifying 2 out of 3 counters) on a 104-way 2 socket Skylake system: before: 852393 ops/s after: 76682077 ops/s Reviewed by: kib, jeff Tested by: pho Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21637	2019-09-16 21:37:47 +00:00
Mateusz Guzik	e87f3f72f1	vfs: manage mnt_writeopcount with atomics See r352424. Reviewed by: kib, jeff Tested by: pho Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21575	2019-09-16 21:33:16 +00:00
Mateusz Guzik	ee831b2543	vfs: manage mnt_lockref with atomics See r352424. Reviewed by: kib, jeff Tested by: pho Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21574	2019-09-16 21:32:21 +00:00
Mateusz Guzik	a8c8e44bf0	vfs: manage mnt_ref with atomics New primitive is introduced to denote sections can operate locklessly on aspects of struct mount, but which can also be disabled if necessary. This provides an opportunity to start scaling common case modifications while providing stable state of the struct when facing unmount, write suspendion or other events. mnt_ref is the first counter to start being managed in this manner with the intent to make it per-cpu. Reviewed by: kib, jeff Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21425	2019-09-16 21:31:02 +00:00
Kyle Evans	3155f2f0e2	rangelock: add rangelock_cookie_assert A future change to posixshm to add file sealing (in DIFF_21391[0] and child) will move locking out of shm_dotruncate as kern_shm_open() will require the lock to be held across the dotruncate until the seal is actually applied. For this, the cookie is passed into shm_dotruncate_locked which asserts RCA_WLOCKED. [0] Name changed to protect the innocent, hopefully, from getting autoclosed due to this reference... Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D21628	2019-09-15 02:59:53 +00:00
Mateusz Guzik	ce3ba63f67	vfs: release usecount using fetchadd 1. If we release the last usecount we take ownership of the hold count, which means the vnode will remain allocated until we vdrop it. 2. If someone else vrefs they will find no usecount and will proceed to add their own hold count. 3. No code has a problem with v_usecount transitioning to 0 without the interlock These facts combined mean we can fetchadd instead of having a cmpset loop. Reviewed by: kib (previous version) Tested by: pho Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21528	2019-09-13 15:49:04 +00:00
Mark Johnston	45cdd437ae	Remove a redundant NULL pointer check in cpuset_modify_domain(). cpuset_getroot() is guaranteed to return a non-NULL pointer. Reported by: Mark Millard <marklmi@yahoo.com> MFC after: 1 week Sponsored by: The FreeBSD Foundation	2019-09-12 16:47:38 +00:00
Hans Petter Selasky	11b57401e6	Use REFCOUNT_COUNT() to obtain refcount where appropriate. Refcount waiting will set some flag bits in the refcount value. Make sure these bits get cleared by using the REFCOUNT_COUNT() macro to obtain the actual refcount. Differential Revision: https://reviews.freebsd.org/D21620 Reviewed by: kib@, markj@ MFC after: 1 week Sponsored by: Mellanox Technologies	2019-09-12 16:26:59 +00:00
Kyle Evans	5163b1a75c	Follow up r352244: kenv: tighten up assertions As I like to forget: static kenv var formatting is actually such that an empty environment would be double null bytes. We should make sure that a non-zero buffer has at least enough for this, though most of the current usage is with a 4k buffer.	2019-09-12 14:34:46 +00:00
Kyle Evans	436c46875d	kenv: assert that an empty static buffer passed in is "empty" Garbage in the passed-in buffer can cause problems if any attempts to read the kenv are inadvertently made between init_static_kenv and the first kern_setenv -- assuming there is one. This is cheap and easy, so do it. This also helps rule out some class of bugs as one tries to debug; tunables fetch from the static environment up until SI_SUB_KMEM + 1, and many of these buffers are global ~4k buffers that rely on BSS clearing while others just grab a page of free memory and use it (e.g. xen).	2019-09-12 13:51:43 +00:00
Conrad Meyer	aaa3852435	buf: Add B_INVALONERR flag to discard data Setting the B_INVALONERR flag before a synchronous write causes the buf cache to forcibly invalidate contents if the write fails (BIO_ERROR). This is intended to be used to allow layers above the buffer cache to make more informed decisions about when discarding dirty buffers without successful write is acceptable. As a proof of concept, use in msdosfs to handle failures to mark the on-disk 'dirty' bit during rw mount or ro->rw update. Extending this to other filesystems is left as future work. PR: 210316 Reviewed by: kib (with objections) Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D21539	2019-09-11 21:24:14 +00:00

1 2 3 4 5 ...

16958 Commits