freebsd-dev

Author	SHA1	Message	Date
Hans Petter Selasky	aa4612d133	Fix panic when loading kernel modules before root file system is mounted. Make sure the rootvnode is always NULL checked. Differential Revision: https://reviews.freebsd.org/D22545 PR: 241639 MFC after: 1 week Sponsored by: Mellanox Technologies	2019-11-26 12:20:44 +00:00
Mariusz Zaborski	8e49361164	procdesc: allow collecting status through wait(2) if the process is traced. A debugger like truss(1) depends on the wait(2) syscall. This syscall waits for ALL children. When it is waiting for ALL children, the children created by process descriptors are not returned. This behavior was introduced because we want to implement libraries which may pdfork(2). The behavior of process descriptors breaks truss(1) because it will not be able to collect the status of processes with process descriptors. To address this problem the status is returned to the parent when the child is traced. While the process is traced the debugger is the new parent. In case the original parent and debugger are the same process it means the debugger explicitly used pdfork() to create the child. In that case the debugger should be using kqueue()/pdwait() instead of wait(). Add test case to verify that. The test case was implemented by markj@. Reviewed by: kib, markj Discussed with: jhb MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D20362	2019-11-25 18:33:21 +00:00
Ryan Libby	43cefe8b19	sysctl sysctls: wire old buf before output with sysctl lock Several sysctl sysctls output to a user buffer while holding a non-sleepable lock that protects the sysctl topology. They need to wire the output buffer, or else they may try to sleep on a page fault. Reviewed by: cem, markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D22528	2019-11-25 07:38:27 +00:00
Konstantin Belousov	b631c36f0d	Record part of the owner struct thread pointer into busy_lock. Record as many bits from curthread into busy_lock as fit. Low bits for struct thread * representation are zero due to struct and zone alignment, and they leave space for busy flags (perhaps except statically allocated thread0). Upper bits are not very interesting for assert, and in most practical situations the recorded value should allow manually identifying the owner with certainty. Assert that unbusy is performed by the owner, except in a few places where unbusy is done in an io completion handler. For this case, add _unchecked variants of asserts and unbusy primitives. Reviewed by: markj (previous version) Tested by: pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D22298	2019-11-24 19:12:23 +00:00
Warner Losh	a921c2003f	Add a warning about Giant Locked devices Add a warning when a device registers with devfs and requests D_NEEDGIANT. The warning says the device will go away before 13.0. This is needed to flush out the devices in the tree that are still Giant locked. This warning, or some variant of it, should have gone into the tree a long time ago... The intention is to require all devices be converted to not use automatic giant in this way, or remove any such devices that remain that we don't have the hardware to test a conversion of. kbd so far is the only device that can't leave the tree, yet needs something sensible done to avoid the auto giant lock (even if it is just doing the wrapping itself). There may be others added to this list... Any discussions of this topic will take place on arch@.	2019-11-23 23:57:26 +00:00
Conrad Meyer	7993a104a1	Add explicit SI_SUB_EPOCH Add explicit SI_SUB_EPOCH, after SI_SUB_TASKQ and before SI_SUB_SMP (EARLY_AP_STARTUP). Rename existing "SI_SUB_TASKQ + 1" to SI_SUB_EPOCH. epoch(9) consumers cannot epoch_alloc() before SI_SUB_EPOCH:SI_ORDER_SECOND, but likely should allocate before SI_SUB_SMP. Prior to this change, consumers (well, epoch itself, and net/if.c) just open-coded the SI_SUB_TASKQ + 1 order to match epoch.c, but this was fragile. Reviewed by: mmacy Differential Revision: https://reviews.freebsd.org/D22503	2019-11-22 23:23:40 +00:00
Gleb Smirnoff	329377f44b	cc_ktr_event_name is used only with KTR	2019-11-21 23:55:43 +00:00
Alexander Motin	130fffa2a3	Add variant of root_mount_hold() without allocation. It allows to use this KPI in non-sleepable contexts. MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2019-11-21 21:59:35 +00:00
Andrew Turner	a27ac4644a	Disable KCSAN within a panic. The kernel is single threaded at this point and the panic is more important. Sponsored by: DARPA, AFRL	2019-11-21 13:59:01 +00:00
Andrew Turner	68cad68149	Add kcsan_md_unsupported from NetBSD. It's used to ignore virtual addresses that may have a different physical address depending on the CPU. Sponsored by: DARPA, AFRL	2019-11-21 13:22:23 +00:00
Andrew Turner	bba0065f0d	Fix the bus_space functions with KCSAN on arm64. Arm64 doesn't define the bus_space_set_multi_stream and bus_space_set_region_stream functions. Don't try to define them there. Sponsored by: DARPA, AFRL	2019-11-21 13:12:58 +00:00
Andrew Turner	849aef496d	Port the NetBSD KCSAN runtime to FreeBSD. Update the NetBSD Kernel Concurrency Sanitizer (KCSAN) runtime to work in the FreeBSD kernel. It is a useful tool for finding data races between threads executing on different CPUs. This can be enabled by enabling KCSAN in the kernel config, or by using the GENERIC-KCSAN amd64 kernel. It works on amd64 and arm64, however the latter needs a compiler change to allow -fsanitize=thread that KCSAN uses. Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D22315	2019-11-21 11:22:08 +00:00
Andrew Turner	0cb5357037	Import the NetBSD Kernel Concurrency Sanitizer (KCSAN) runtime. KCSAN is a tool to find concurrent memory access that may race each other. After a determined number of memory accesses a cell is created, this describes the current access. It will then delay for a short period to allow other CPUs a chance to race. If another CPU performs a memory access to an overlapping region during this delay the race is reported. This is a straight import of the NetBSD code, it will be adapted to FreeBSD in a future commit. Sponsored by: DARPA, AFRL	2019-11-20 14:37:48 +00:00
Mateusz Guzik	d578a4256e	cache: minor stat cleanup Remove duplicated stats and move numcachehv from debug to vfs.cache.	2019-11-20 12:08:32 +00:00
Mateusz Guzik	d957f3a4f0	vfs: perform a more racy check in vfs_notify_upper Locking mp does not buy anything in terms of correctness and only contributes to contention.	2019-11-20 12:07:54 +00:00
Mateusz Guzik	1fccb43c39	vfs: change si_usecount management to count used vnodes Currently si_usecount is effectively a sum of usecounts from all associated vnodes. This is maintained by special-casing for VCHR every time usecount is modified. Apart from complicating the code a little bit, it has a scalability impact since it forces a read from a cacheline shared with said count. There are no consumers of the feature in the ports tree. In head there are only 2: revoke and devfs_close. Both can get away with a weaker requirement than the exact usecount, namely just the count of active vnodes. Changing the meaning to the latter means we only need to modify it on 0<->1 transitions, avoiding the check plenty of times (and entirely in something like vrefact). Reviewed by: kib, jeff Tested by: pho Differential Revision: https://reviews.freebsd.org/D22202	2019-11-20 12:05:59 +00:00
Jeff Roberson	639676877b	Simplify anonymous memory handling with an OBJ_ANON flag. This eliminates redundant complicated checks and additional locking required only for anonymous memory. Introduce vm_object_allocate_anon() to create these objects. DEFAULT and SWAP objects now have the correct settings for non-anonymous consumers and so individual consumers need not modify the default flags to create super-pages and avoid ONEMAPPING/NOSPLIT. Reviewed by: alc, dougm, kib, markj Tested by: pho Differential Revision: https://reviews.freebsd.org/D22119	2019-11-19 23:19:43 +00:00
Kyle Evans	4cc12fb848	sysent: regenerate after r354835 The lua-based makesyscalls produces slightly different output than its makesyscalls.sh predecessor, all whitespace differences more closely matching the source syscalls.master.	2019-11-18 23:31:12 +00:00
Kyle Evans	f22a592111	Convert in-tree sysent targets to use new makesyscalls.lua flua is bootstrapped as part of the build for those on older versions/revisions that don't yet have flua installed. Once upgraded past r354833, "make sysent" will again naturally work as expected. Reviewed by: brooks Differential Revision: https://reviews.freebsd.org/D21894	2019-11-18 23:28:23 +00:00
John Baldwin	03b0d68c72	Check for errors from copyout() and suword*() in sv_copyout_args/strings. Reviewed by: brooks, kib Tested on: amd64 (amd64, i386, linux64), i386 (i386, linux) Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D22401	2019-11-18 20:07:43 +00:00
David Bright	2d5603fe65	Jail and capability mode for shm_rename; add audit support for shm_rename Co-mingling two things here: * Addressing some feedback from Konstantin and Kyle re: jail, capability mode, and a few other things * Adding audit support as promised. The audit support change includes a partial refresh of OpenBSM from upstream, where the change to add shm_rename has already been accepted. Matthew doesn't plan to work on refreshing anything else to support audit for those new event types. Submitted by: Matthew Bryan <matthew.bryan@isilon.com> Reviewed by: kib Relnotes: Yes Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D22083	2019-11-18 13:31:16 +00:00
Konstantin Belousov	01a2b5679b	kern_exec: p_osrel and p_fctl0 were obliterated by failed execve(2) attempt. Zeroing of them is needed so that an image activator can update the values as appropriate (or not set at all). Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D22379	2019-11-17 14:52:45 +00:00
Scott Long	de890ea465	Create a new sysctl subtree, machdep.mitigations. Its purpose is to organize knobs and indicators for code that mitigates functional and security issues in the architecture/platform. Controls for regular operational policy should still go into places security, hw, kern, etc. The machdep root node is inherently architecture dependent, but mitigations tend to be architecture dependent as well. Some cases like Spectre do cross architectural boundaries, but the mitigation code for them tends to be architecture dependent anyways, and multiple architectures won't be active in the same image of the kernel. Many mitigation knobs already exist in the system, and they will be moved with compat naming in the future. Going forward, mitigations should collect in machdep.mitigations. Reviewed by: imp, brooks, rwatson, emaste, jhb Sponsored by: Intel	2019-11-15 23:27:17 +00:00
John Baldwin	e353233118	Add a sv_copyout_auxargs() hook in sysentvec. Change the FreeBSD ELF ABIs to use this new hook to copyout ELF auxv instead of doing it in the sv_fixup hook. In particular, this new hook allows the stack space to be allocated at the same time the auxv values are copied out to userland. This allows us to avoid wasting space for unused auxv entries as well as not having to recalculate where the auxv vector is by walking back up over the argv and environment vectors. Reviewed by: brooks, emaste Tested on: amd64 (amd64 and i386 binaries), i386, mips, mips64 Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D22355	2019-11-15 18:42:13 +00:00
Brooks Davis	96c914ee97	Tidy syscall declarations. Pointer arguments should be of the form "<type> *..." and not "<type>* ...". No functional change. Reviewed by: kevans Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D22373	2019-11-14 17:11:52 +00:00
Mark Johnston	1cbfe73da5	Fix handling of PIPE_EOF in the direct write path. Suppose a writing thread has pinned its pages and gone to sleep with pipe_map.cnt > 0. Suppose that the thread is woken up by a signal (so error != 0) and the other end of the pipe has simultaneously been closed. In this case, to satisfy the assertion about pipe_map.cnt in pipe_destroy_write_buffer(), we must mark the buffer as empty. Reported by: syzbot+5cce271bf2cb1b1e1876@syzkaller.appspotmail.com Reviewed by: kib Tested by: pho MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D22261	2019-11-11 20:44:30 +00:00
Rick Macklem	48e4857859	Update copy_file_range(2) to be Linux5 compatible. The current linux man page and testing done on a fairly recent linux5.n kernel have identified two changes to the semantics of the linux copy_file_range system call. Since the copy_file_range(2) system call is intended to be linux compatible and is only currently in head/current and not used by any commands, it seems appropriate to update the system call to be compatible with the current linux one. The first of these semantic changes was changed to be compatible with linux5.n by r354564. For the second semantic change, the old linux man page stated that, if infd and outfd referred to the same file, EBADF should be returned. Now, the semantics is to allow infd and outfd to refer to the same file so long as the byte ranges defined by the input file offset, output file offset and len do not overlap. If the byte ranges do overlap, EINVAL should be returned. This patch modifies copy_file_range(2) to be linux5.n compatible for this semantic change.	2019-11-10 01:08:14 +00:00
Rick Macklem	15930ae180	Update copy_file_range(2) to be Linux5 compatible. The current linux man page and testing done on a fairly recent linux5.n kernel have identified two changes to the semantics of the linux copy_file_range system call. Since the copy_file_range(2) system call is intended to be linux compatible and is only currently in head/current and not used by any commands, it seems appropriate to update the system call to be compatible with the current linux one. The old linux man page stated that, if the offset + len exceeded file_size for the input file, EINVAL should be returned. Now, the semantics is to copy up to at most file_size bytes and return that number of bytes copied. If the offset is at or beyond file_size, a return of 0 bytes is done. This patch modifies copy_file_range(2) to be linux compatible for this semantic change. A separate patch will change copy_file_range(2) for the other semantic change, which allows the infd and outfd to refer to the same file, so long as the byte ranges do not overlap.	2019-11-08 23:39:17 +00:00
Gleb Smirnoff	1a49612526	Mechanically convert INP_INFO_RLOCK() to NET_EPOCH_ENTER(). Remove few outdated comments and extraneous assertions. No functional change here.	2019-11-07 00:08:34 +00:00
Gleb Smirnoff	b8c923032f	If vm_pager_get_pages_async() returns an error synchronously we leak wired and busy pages. Add code that carefully cleans up the state in case of a synchronous error return. Cover the case when the first I/O went on asynchronously, but the second or N-th returned an error synchronously. In collaboration with: chs Reviewed by: jtl, kib	2019-11-06 23:45:43 +00:00
Bjoern A. Zeeb	28d7601989	m_pulldown(): Change an if () panic() into a KASSERT(). If we pass in a NULL mbuf to m_pulldown() we are in a bad situation already. There is no point in doing that check for production code. Change the if () panic() into a KASSERT. MFC after: 3 weeks Sponsored by: Netflix	2019-11-06 22:40:19 +00:00
Brooks Davis	89f34d4611	libstats: Improve ABI assertion. On platforms where pointers are larger than 64-bits, struct statsblob may be harmlessly padded out such that opaque[] always has some included space. Make the assertion more general by comparing to the offset of opaque rather than the size of struct statsblob. Discussed with: jhb, James Clarke Reviewed by: trasz, lstewart Obtained from: CheriBSD Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D22188	2019-11-06 19:44:44 +00:00
Alexander Motin	3db35ffa2a	Some more taskqueue optimizations. - Optimize enqueue for two task priority values by adding new tq_hint field, pointing to the last task inserted into the middle of the list. In case of more than two priority values it should halve average search. - Move tq_active insert/remove out of the taskqueue_run_locked loop. Instead of dirtying few shared cache lines per task introduce different mechanism to drain active tasks, based on task sequence number counter, that uses only cache lines already present in cache. Since the new mechanism does not need ordering, switch tq_active from TAILQ to LIST. - Move static and dynamic struct taskqueue fields into different cache lines. Move lock into its own cache line, so that heavy lock spinning by multiple waiting threads would not affect the running thread. - While there, correct some TQ_SLEEP() wait messages. This change fixes certain ZFS write workloads, causing huge congestion on taskqueue lock. Those workloads combine some large block writes to saturate the pool and trigger allocation throttling, which uses higher priority tasks to requeue the delayed I/Os, with many small blocks to generate deep queue of small tasks for taskqueue to sort. MFC after: 1 week Sponsored by: iXsystems, Inc.	2019-11-01 22:49:44 +00:00
Ed Maste	2e5f9189bb	avoid kernel stack data leak in core dump thrmisc note bzero the entire thrmisc struct, not just the padding. Other core dump notes are already done this way. Reported by: Ilja Van Sprundel <ivansprundel@ioactive.com> Reviewed by: markj MFC after: 3 days Sponsored by: The FreeBSD Foundation	2019-10-31 20:42:36 +00:00
Jeff Roberson	67d0e29304	Replace OBJ_MIGHTBEDIRTY with a system using atomics. Remove the TMPFS_DIRTY flag and use the same system. This enables further fault locking improvements by allowing more faults to proceed with a shared lock. Reviewed by: kib Tested by: pho Differential Revision: https://reviews.freebsd.org/D22116	2019-10-29 21:06:34 +00:00
Jeff Roberson	6ee653cfeb	Drop the object lock in vfs_bio and cluster where it is now safe to do so. Recent changes to busy/valid/dirty have enabled page based synchronization and the object lock is no longer required in many cases. Reviewed by: kib Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D21597	2019-10-29 20:37:59 +00:00
Gleb Smirnoff	5757b59f3e	Merge td_epochnest with td_no_sleeping. Epoch itself doesn't rely on the counter and it is provided merely for sleeping subsystems to check it. - In functions that sleep use THREAD_CAN_SLEEP() to assert correctness. With EPOCH_TRACE compiled print epoch info. - _sleep() was the wrong place to put the assertion for epoch, the right place is sleepq_add(), as there are ways to call the latter bypassing _sleep(). - Do not increase td_no_sleeping in non-preemptible epochs. The critical section would trigger all possible safeguards, no sleeping counter is extraneous. Reviewed by: kib	2019-10-29 17:28:25 +00:00
Konstantin Belousov	5e921ff49e	amd64: move pcb out of kstack to struct thread. This saves 320 bytes of the precious stack space. The only negative aspect of the change I can think of is that the struct thread increased by 320 bytes obviously, and that 320 bytes are not swapped out anymore. I believe the freed stack space is much more important than that. Also, current struct thread size is 1392 bytes on amd64, so UMA will allocate two thread structures per (4KB) slab, which leaves a space for pcb without increasing zone memory use. Reviewed by: alc, markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D22138	2019-10-25 20:09:42 +00:00
Gleb Smirnoff	ed9d69b5e8	Use THREAD_CAN_SLEEP() macro to check if thread can sleep. There is no functional change. Discussed with: kib	2019-10-24 21:55:19 +00:00
John Baldwin	7d29eb9a91	Use a counter with a random base for explicit IVs in GCM. This permits constructing the entire TLS header in ktls_frame() rather than ktls_seq(). This also matches the approach used by OpenSSL which uses an incrementing nonce as the explicit IV rather than the sequence number. Reviewed by: gallatin Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D22117	2019-10-24 18:13:26 +00:00
Konstantin Belousov	c92f130498	Fix undefined behavior. Create a sequence point by ending a full expression for call to vspace() and use of the globals which are modified by vspace(). Reported and reviewed by: imp Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D22126	2019-10-23 16:06:47 +00:00
Konstantin Belousov	8076c4e7d1	vn_printf(): Decode VI_TEXT_REF. Sponsored by: The FreeBSD Foundation MFC after: 3 days	2019-10-23 15:51:26 +00:00
Gleb Smirnoff	080e9496b8	Allow epoch tracker to use the very last byte of the stack. Not sure this will help to avoid panic in this function, since it will also use some stack, but makes code more strict. Submitted by: hselasky	2019-10-22 18:05:15 +00:00
Gleb Smirnoff	77d70e515f	Assert that any epoch tracker belongs to the thread stack. Reviewed by: kib	2019-10-21 23:12:14 +00:00
Gleb Smirnoff	279b9aabe3	Remove epoch tracker from struct thread. It was an ugly crutch to emulate locking semantics for if_addr_rlock() and if_maddr_rlock().	2019-10-21 18:19:32 +00:00
Andriy Gapon	3ad1ce46d3	debug.kassert.warnings is a statistic, not a tunable MFC after: 1 week	2019-10-21 12:21:56 +00:00
Mark Johnston	f822c9e287	Apply mapping protections to preloaded kernel modules on amd64. With an upcoming change the amd64 kernel will map preloaded files RW instead of RWX, so the kernel linker must adjust protections appropriately using pmap_change_prot(). Reviewed by: kib MFC after: 1 month Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21860	2019-10-18 13:56:45 +00:00
Mark Johnston	1d9eae9fb2	Apply mapping protections to .o kernel modules. Use the section flags to derive mapping protections. When multiple sections overlap within a page, the union of their protections must be applied. With r353701 the .text and .rodata sections are padded to ensure that this does not happen on amd64. Reviewed by: kib MFC after: 1 month Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21896	2019-10-18 13:53:14 +00:00
Conrad Meyer	dda17b3672	Implement NetGDB(4) NetGDB(4) is a component of a system using a panic-time network stack to remotely debug crashed FreeBSD kernels over the network, instead of traditional serial interfaces. There are three pieces in the complete NetGDB system. First, a dedicated proxy server must be running to accept connections from both NetGDB and gdb(1), and pass bidirectional traffic between the two protocols. Second, the NetGDB client is activated much like ordinary 'gdb' and similarly to 'netdump' in ddb(4) after a panic. Like other debugnet(4) clients (netdump(4)), the network interface on the route to the proxy server must be online and support debugnet(4). Finally, the remote (k)gdb(1) uses 'target remote <proxy>:<port>' (like any other TCP remote) to connect to the proxy server. The NetGDB v1 protocol speaks the literal GDB remote serial protocol, and uses a 1:1 relationship between GDB packets and sequences of debugnet packets (fragmented by MTU). There is no encryption utilized to keep debugging sessions private, so this is only appropriate for local segments or trusted networks. Submitted by: John Reimer <john.reimer AT emc.com> (earlier version) Discussed some with: emaste, markj Relnotes: sure Differential Revision: https://reviews.freebsd.org/D21568	2019-10-17 21:33:01 +00:00
Mark Johnston	092bacb2c4	Clean up some nits in link_elf_(un)load_file(). - Remove a redundant assignment of ef->address. - Don't return a Mach error number to the caller if vm_map_find() fails. - Use ptoa() and fix style. MFC after: 2 weeks Sponsored by: Netflix	2019-10-17 21:25:50 +00:00
Conrad Meyer	addccb8c51	Add a very limited DDB dumpon(8)-alike to MI dumper code This allows ddb(4) commands to construct a static dumperinfo during panic/debug and invoke doadump(false) using the provided dumper configuration (always inserted first in the list). The intended usecase is a ddb(4)-time netdump(4) command. Reviewed by: markj (earlier version) Differential Revision: https://reviews.freebsd.org/D21448	2019-10-17 18:29:44 +00:00
Conrad Meyer	7790c8c199	Split out a more generic debugnet(4) from netdump(4) Debugnet is a simplistic and specialized panic- or debug-time reliable datagram transport. It can drive a single connection at a time and is currently unidirectional (debug/panic machine transmit to remote server only). It is mostly a verbatim code lift from netdump(4). Netdump(4) remains the only consumer (until the rest of this patch series lands). The INET-specific logic has been extracted somewhat more thoroughly than previously in netdump(4), into debugnet_inet.c. UDP-layer logic and up, as much as possible as is protocol-independent, remains in debugnet.c. The separation is not perfect and future improvement is welcome. Supporting INET6 is a long-term goal. Much of the diff is "gratuitous" renaming from 'netdump_' or 'nd_' to 'debugnet_' or 'dn_' -- sorry. I thought keeping the netdump name on the generic module would be more confusing than the refactoring. The only functional change here is the mbuf allocation / tracking. Instead of initiating solely on netdump-configured interface(s) at dumpon(8) configuration time, we watch for any debugnet-enabled NIC for link activation and query it for mbuf parameters at that time. If they exceed the existing high-water mark allocation, we re-allocate and track the new high-water mark. Otherwise, we leave the pre-panic mbuf allocation alone. In a future patch in this series, this will allow initiating netdump from panic ddb(4) without pre-panic configuration. No other functional change intended. Reviewed by: markj (earlier version) Some discussion with: emaste, jhb Objection from: marius Differential Revision: https://reviews.freebsd.org/D21421	2019-10-17 16:23:03 +00:00
Andriy Gapon	5fdc2c044e	provide a way to assign taskqueue threads to a kernel process This can be used to group all threads belonging to a single logical entity under a common kernel process. I am planning to use the new interface for ZFS threads. MFC after: 4 weeks	2019-10-17 06:32:34 +00:00
Mark Johnston	6d775f0ba1	Use KOBJMETHOD_END in the kernel linker. MFC after: 1 week	2019-10-16 22:06:19 +00:00
Mark Johnston	01cef4caa7	Remove page locking from pmap_mincore(). After r352110 the page lock no longer protects a page's identity, so there is no purpose in locking the page in pmap_mincore(). Instead, if vm.mincore_mapped is set to the non-default value of 0, re-lookup the page after acquiring its object lock, which holds the page's identity stable. The change removes the last callers of vm_page_pa_tryrelock(), so remove it. Reviewed by: kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21823	2019-10-16 22:03:27 +00:00
Andrew Turner	9bb37c03fb	Stop leaking information from the kernel through timespec The timespec struct holds a seconds value in a time_t and a nanoseconds value in a long. On most architectures these are the same size, however on 32-bit architectures other than i386 time_t is 8 bytes and long is 4 bytes. Most ABIs will then pad a struct holding an 8 byte and 4 byte value to 16 bytes with 4 bytes of padding. When copying one of these structs the compiler is free to copy the padding if it wishes. In this case the padding may contain kernel data that is then leaked to userspace. Fix this by copying the timespec elements rather than the entire struct. This doesn't affect Tier-1 architectures so no SA is expected. admbugs: 651 MFC after: 1 week Sponsored by: DARPA, AFRL	2019-10-16 13:21:01 +00:00
Kristof Provost	1d95443818	Generalize ARM specific comments in devmap The comments in devmap are very ARM specific, this generalizes them for other architectures. Submitted by: Nicholas O'Brien <nickisobrien_gmail.com> Reviewed by: manu, philip Sponsored by: Axiado Differential Revision: https://reviews.freebsd.org/D22035	2019-10-15 23:21:52 +00:00
Gleb Smirnoff	4b25d1f2e3	Missing from r353596.	2019-10-15 21:32:38 +00:00
Gleb Smirnoff	bac060388f	When assertion for a thread not being in an epoch fails also print all entered epochs. Works with EPOCH_TRACE only. Reviewed by: hselasky Differential Revision: https://reviews.freebsd.org/D22017	2019-10-15 21:24:25 +00:00
Gleb Smirnoff	237c1f932b	Remove pfctlinput2(). It came from KAME and had never ever been in use.	2019-10-15 15:40:03 +00:00
Jeff Roberson	0012f373e4	(4/6) Protect page valid with the busy lock. Atomics are used for page busy and valid state when the shared busy is held. The details of the locking protocol and valid and dirty synchronization are in the updated vm_page.h comments. Reviewed by: kib, markj Tested by: pho Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D21594	2019-10-15 03:45:41 +00:00
Jeff Roberson	63e9755548	(1/6) Replace busy checks with acquires where it is trivial to do so. This is the first in a series of patches that promotes the page busy field to a first class lock that no longer requires the object lock for consistency. Reviewed by: kib, markj Tested by: pho Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D21548	2019-10-15 03:35:11 +00:00
Leandro Lupori	0ecc478b74	[PPC64] Initial kernel minidump implementation Based on POWER9BSD implementation, with all POWER9 specific code removed and addition of new methods in PPC64 MMU interface, to isolate platform specific code. Currently, the new methods are implemented on pseries and PowerNV (D21643). Reviewed by: jhibbits Differential Revision: https://reviews.freebsd.org/D21551	2019-10-14 13:04:04 +00:00
Gleb Smirnoff	f6eccf96a0	Since EPOCH_TRACE had been moved to opt_global.h, we don't need to waste extra space in struct thread.	2019-10-14 04:17:56 +00:00
Mateusz Guzik	d1cbf3eeea	vfs: add MNTK_NOMSYNC On many filesystems the traversal is effectively a no-op. Add a way to avoid the overhead. Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D22009	2019-10-13 15:40:34 +00:00
Mateusz Guzik	737241cd51	vfs: return free vnode batches in sync instead of vfs_msync It is a more natural fit. vfs_msync only deals with active vnodes. Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D22008	2019-10-13 15:39:11 +00:00
Alexander Motin	a89a562b60	Allocate device softc from the device domain. Since we are trying to bind device interrupt threads to the device domain, it should make sense to make memory often accessed by them local. If domain is not known, fall back to round-robin. MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2019-10-12 19:03:07 +00:00
Kristof Provost	85d1151f96	mountroot: run statfs after mounting devfs The usual flow for mounting a file system is to VFS_MOUNT() and then immediately VFS_STATFS(). That's not done in vfs_mountroot_devfs(), which means the mp->mnt_stat.f_iosize field is not correctly populated, which in turn causes us to mark valid aio operations as unsafe (because the io size is set to 0), ultimately causing the aio_test:md_waitcomplete test to fail. Reviewed by: mckusick MFC after: 1 week Sponsored by: Axiado Differential Revision: https://reviews.freebsd.org/D21897	2019-10-11 17:04:38 +00:00
Conrad Meyer	46d70077be	ddb: Add CSV option, sorting to 'show (malloc|uma)' Add /i option for machine-parseable CSV output. This allows ready copy/pasting into more sophisticated tooling outside of DDB. Add total zone size ("Memory Use") as a new column for UMA. For both, sort the displayed list on size (print the largest zones/types first). This is handy for quickly diagnosing "where has my memory gone?" at a high level. Submitted by: Emily Pettigrew <Emily.Pettigrew AT isilon.com> (earlier version) Sponsored by: Dell EMC Isilon	2019-10-11 01:31:31 +00:00
John Baldwin	97ecf6efa0	Don't free the cursor boundary tag during vmem_destroy(). The cursor boundary tag is statically allocated in the vmem instead of from the vmem_bt_zone. Explicitly remove it from the vmem's segment list in vmem_destroy before freeing all the segments from the vmem. Reviewed by: markj MFC after: 1 week Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D21953	2019-10-09 21:20:39 +00:00
Gleb Smirnoff	975b8f8462	Cleanup unneeded includes that crept in with r353292.	2019-10-09 16:59:42 +00:00
Gleb Smirnoff	ff3cfc330e	Enter network epoch in domain callouts.	2019-10-09 16:21:05 +00:00
Mark Johnston	4013d72684	Fix handling of empty SCM_RIGHTS messages. As unp_internalize() processes the input control messages, it builds an output mbuf chain containing the internalized representations of those messages. In one special case, that of an empty SCM_RIGHTS message, the message is simply discarded. However, the loop which appends mbufs to the output chain assumed that each iteration would produce an mbuf, resulting in a null pointer dereference if an empty SCM_RIGHTS message was followed by a non-empty message. Fix this by advancing the output mbuf chain tail pointer only if an internalized control message was produced. Reported by: syzbot+1b5cced0f7fad26ae382@syzkaller.appspotmail.com MFC after: 1 week Sponsored by: The FreeBSD Foundation	2019-10-08 23:34:48 +00:00
John Baldwin	9e14430d46	Add a TOE KTLS mode and a TOE hook for allocating TLS sessions. This adds the glue to allocate TLS sessions and invokes it from the TLS enable socket option handler. This also adds some counters for active TOE sessions. The TOE KTLS mode is returned by getsockopt(TLSTX_TLS_MODE) when TOE KTLS is in use on a socket, but cannot be set via setsockopt(). To simplify various checks, a TLS session now includes an explicit 'mode' member set to the value returned by TLSTX_TLS_MODE. Various places that used to check 'sw_encrypt' against NULL to determine software vs ifnet (NIC) TLS now check 'mode' instead. Reviewed by: np, gallatin Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D21891	2019-10-08 21:34:06 +00:00
Doug Moore	2288078c5e	Define macro VM_MAP_ENTRY_FOREACH for enumerating the entries in a vm_map. In case the implementation ever changes from using a chain of next pointers, then changing the macro definition will be necessary, but changing all the files that iterate over vm_map entries will not. Drop a counter in vm_object.c that would have an effect only if the vm_map entry count was wrong. Discussed with: alc Reviewed by: markj Tested by: pho (earlier version) Differential Revision: https://reviews.freebsd.org/D21882	2019-10-08 07:14:21 +00:00
Gleb Smirnoff	b8a6e03fac	Widen NET_EPOCH coverage. When epoch(9) was introduced to network stack, it was basically dropped in place of existing locking, which was mutexes and rwlocks. For the sake of performance mutex covered areas were as small as possible, so became epoch covered areas. However, epoch doesn't introduce any contention, it just delays memory reclaim. So, there is no point to minimise epoch covered areas in sense of performance. Meanwhile entering/exiting epoch also has non-zero CPU usage, so doing this less often is a win. Not the least is also code maintainability. In the new paradigm we can assume that at any stage of processing a packet, we are inside network epoch. This makes coding both input and output path way easier. On output path we already enter epoch quite early - in the ip_output(), in the ip6_output(). This patch does the same for the input path. All ISR processing, network related callouts, other ways of packet injection to the network stack shall be performed in net_epoch. Any leaf function that walks network configuration now asserts epoch. Tricky part is configuration code paths - ioctls, sysctls. They also call into leaf functions, so some need to be changed. This patch would introduce more epoch recursions (see EPOCH_TRACE) than we had before. They will be cleaned up separately, as several of them aren't trivial. Note, that unlike a lock recursion the epoch recursion is safe and just wastes a bit of resources. Reviewed by: gallatin, hselasky, cy, adrian, kristof Differential Revision: https://reviews.freebsd.org/D19111	2019-10-07 22:40:05 +00:00
Edward Tomasz Napierala	1a13f2e6b4	Introduce stats(3), a flexible statistics gathering API. This provides a framework to define a template describing a set of "variables of interest" and the intended way for the framework to maintain them (for example the maximum, sum, t-digest, or a combination thereof). Afterwards the user code feeds in the raw data, and the framework maintains these variables inside a user-provided, opaque stats blobs. The framework also provides a way to selectively extract the stats from the blobs. The stats(3) framework can be used in both userspace and the kernel. See the stats(3) manual page for details. This will be used by the upcoming TCP statistics gathering code, https://reviews.freebsd.org/D20655. The stats(3) framework is disabled by default for now, except in the NOTES kernel (for QA); it is expected to be enabled in amd64 GENERIC after a cool down period. Reviewed by: sef (earlier version) Obtained from: Netflix Relnotes: yes Sponsored by: Klara Inc, Netflix Differential Revision: https://reviews.freebsd.org/D20477	2019-10-07 19:05:05 +00:00
Mateusz Guzik	dc20b834ca	vfs: add optional root vnode caching Root vnodes are looked up all the time, e.g. when crossing a mount point. Currently used routines always perform a costly lookup which can be trivially avoided. Reviewed by: jeff (previous version), kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21646	2019-10-06 22:14:32 +00:00
Kyle Evans	e3f35d562f	Remove the remnants of SI_CHEAPCLONE SI_CHEAPCLONE was introduced in r66067 for use with cloned bpfs. It was later also used in tty, tun, tap at points. The rough timeline for being removed in each of these is as follows: - r181690: bpf switched to use cdevpriv API by ed@ - r181905: ed@ rewrote the TTY layer to be mpsafe - r204464: kib@ removes it from tun/tap, declaring it unused I've not yet been able to dig up any other consumers in the intervening 9 years. It is no longer set on any devices in the tree and leaves an interesting situation in make_dev_sv where we're ok with the device already being set SI_NAMED.	2019-10-05 21:52:06 +00:00
Kyle Evans	d42fecb5c1	kern_conf: fully initialize cloned devices with make_dev_args, too Attempting to initialize si_drv{1,2} with mda_si_drv{1,2} does not work if you are operating on cloned devices. clone_create must be called prior to the make_dev* family to create/return the device on the clonelist as needed. This device is later returned early in newdev(), prior to si_drv{0,1,2} initialization. This patch simply breaks out of the loop if we've found a device and finishes init. Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D21904	2019-10-05 21:44:18 +00:00
Mateusz Guzik	dfa8dae493	devfs: plug redundant bwillwrite avoidance vn_write already checks for vnode type to see if bwillwrite should be called. This effectively reverts r244643. Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21905	2019-10-05 17:44:33 +00:00
Eric van Gyzen	e61e783b83	Add CTLFLAG_STATS to some vfs sysctl OIDs Add CTLFLAG_STATS to the following OIDs: vfs.altbufferflushes vfs.recursiveflushes vfs.barrierwrites vfs.flushwithdeps vfs.reassignbufcalls Refer to r353111. MFC after: 2 weeks Sponsored by: Dell EMC Isilon	2019-10-04 21:43:43 +00:00
Ed Maste	f91dd6091b	simplify path handling in sysctl_try_reclaim_vnode MAXPATHLEN / PATH_MAX includes space for the terminating NUL, and namei verifies the presence of the NUL. Thus there is no need to increase the buffer size here. The sysctl passes the string excluding the NUL, so req->newlen equal to PATH_MAX is too long. Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21876	2019-10-02 21:01:23 +00:00
Mark Johnston	5131cba6d6	Use OBJT_PHYS VM objects for kernel modules. OBJT_DEFAULT incurs some unnecessary overhead given that kernel module pages cannot be paged out. Reviewed by: alc, kib MFC after: 1 week Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21862	2019-10-02 16:34:42 +00:00
Mark Johnston	4a7b33ecf4	Disallow fcntl(F_READAHEAD) when the vnode is not a regular file. The mountpoint may not have defined an iosize parameter, so an attempt to configure readahead on a device file can lead to a divide-by-zero crash. The sequential heuristic is not applied to I/O to or from device files, and posix_fadvise(2) returns an error when v_type != VREG, so perform the same check here. Reported by: syzbot+e4b682208761aa5bc53a@syzkaller.appspotmail.com Reviewed by: kib MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21864	2019-10-02 15:45:49 +00:00
Kyle Evans	5a391b572b	shm_open2(2): completely unbreak kern_shm_open2(), since conception, completely fails to pass the mode along to kern_shm_open(). This breaks most uses of it. Add tests alongside this that actually check the mode of the returned files. PR: 240934 [pulseaudio breakage] Reported by: ler, Andrew Gierth [postgres breakage] Diagnosed by: Andrew Gierth (great catch) Tested by: ler, tmunro Pointy hat to: kevans	2019-10-02 02:37:34 +00:00
Ed Maste	f403831e6c	syscalls.master: remove superfluous ellipsis in comment A single period is sufficient in this comment, and making this change lets us find references to varargs syscalls by searching for ...	2019-10-01 17:05:21 +00:00
Brooks Davis	3a94552174	Restore the ability to set capenabled directly in syscalls.conf. This fixes generation of cloudabi syscall tables broken in r340424. Reviewed by: kevans, emaste MFC after: 3 days Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D21821	2019-09-30 20:58:29 +00:00
Kyle Evans	11fd6a60e7	syscalls.master: consistency, move ); to newline (no functional change)	2019-09-30 13:26:16 +00:00
Mark Johnston	1aa696babc	Fix some problems with the SPARSE_MAPPING option in the kernel linker. - Ensure that the end of the mapping passed to vm_page_wire() is page-aligned. vm_page_wire() expects this. - Wire pages before reading data into them. - Apply protections specified in the segment descriptor using vm_map_protect() once relocation processing is done. - On amd64, ensure that we load KLDs above KERNBASE, since they are compiled with the "kernel" memory model by default. Reviewed by: kib MFC after: 2 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21756	2019-09-28 01:42:59 +00:00
Andrew Gallatin	b2dba6634b	kTLS: Fix a bug where we would not encrypt anon data in place. Software Kernel TLS needs to allocate a new destination crypto buffer when encrypting data from the page cache, so as to avoid overwriting shared clear-text file data with encrypted data specific to a single socket. When the data is anonymous, e.g., not tied to a file, then we can encrypt in place and avoid allocating a new page. This fixes a bug where the existing code always assumes the data is private, and never encrypts in place. This results in unneeded page allocations and potentially more memory bandwidth consumption when doing socket writes. When the code was written at Netflix, ktls_encrypt() looked at private sendfile flags to determine if the pages being encrypted were part of the page cache (coming from sendfile) or anonymous (coming from sosend). This was broken internally at Netflix when the sendfile flags were made private, and the M_WRITABLE() check was added. Unfortunately, M_WRITABLE() will always be false for M_NOMAP mbufs, since one cannot just mtod() them. This change introduces a new flags field to the mbuf_ext_pgs struct by stealing a byte from the tls hdr. Note that the current header is still 2 bytes larger than the largest header we support: AES-CBC with explicit IV. We set MBUF_PEXT_FLAG_ANON when creating an unmapped mbuf in m_uiotombuf_nomap() (which is the path that socket writes take), and we check for that flag in ktls_encrypt() when looking for anon pages. Reviewed by: jhb Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21796	2019-09-27 20:08:19 +00:00
Andrew Gallatin	6554362c66	kTLS support for TLS 1.3 TLS 1.3 requires a few changes because 1.3 pretends to be 1.2 with a record type of application data. The "real" record type is then included at the end of the user-supplied plaintext data. This required adding a field to the mbuf_ext_pgs struct to save the record type, and passing the real record type to the sw_encrypt() ktls backend functions. Reviewed by: jhb, hselasky Sponsored by: Netflix Differential Revision: D21801	2019-09-27 19:17:40 +00:00
Mateusz Guzik	708cf7eb6c	cache: decrease ncnegfactor to 5 The current mechanism is bogus in several ways: - the limit is a percentage of total entries added, which means negative entries get evicted all the time even if there are plenty of resources - evicting code is almost not concurrent, which makes it unable to remove entries fast enough when doing something as simple as -j 104 buildworld - there is no support for performing mass removal if necessary The vast majority of negative entries never get any hits. Only evicting them when the filesystem demands it results in a significant growth of the namecache with almost no improvement in the hit ratio. Sample result after about 90 minutes of poudriere -j 104: numneg: current 219737, no evict 2013157 (916% of the original); numneghits: current 266711906, no evict 263544562 (98% of the original) [1] [1] this may look funny but there is a certain dose of variation to the build The number was chosen as something which mostly eliminates spurious evictions during lighter workloads but still keeps the total at bay. Sponsored by: The FreeBSD Foundation	2019-09-27 19:14:03 +00:00
Mateusz Guzik	e643141838	cache: stop requeuing negative entries on the hot list Turns out it does not improve hit ratio, but it does come with a cost stemming from dirtying hit entries. Sample result: hit counts of evicted entries after 2 buildworlds before: value ------------- Distribution ------------- count -1 | 0 0 |@@@@@@@@@@@@@@@@@@@@@@@@@ 180865 1 |@@@@@@@ 49150 2 |@@@ 19067 4 |@ 9825 8 |@ 7340 16 |@ 5952 32 |@ 5243 64 |@ 4446 128 | 3556 256 | 3035 512 | 1705 1024 | 1078 2048 | 365 4096 | 95 8192 | 34 16384 | 26 32768 | 23 65536 | 8 131072 | 6 262144 | 0 after: value ------------- Distribution ------------- count -1 | 0 0 |@@@@@@@@@@@@@@@@@@@@@@@@@ 184004 1 |@@@@@@ 47577 2 |@@@ 19446 4 |@ 10093 8 |@ 7470 16 |@ 5544 32 |@ 5475 64 |@ 5011 128 | 3451 256 | 3002 512 | 1729 1024 | 1086 2048 | 363 4096 | 86 8192 | 26 16384 | 25 32768 | 24 65536 | 7 131072 | 5 262144 | 0 Sponsored by: The FreeBSD Foundation	2019-09-27 19:13:22 +00:00
Mateusz Guzik	312196df0f	cache: make negative list shrinking a little bit concurrent Continue protecting demotion from the hotlist and selection of the target list with the ncneg_shrink_lock lock, but drop it before relocking to zap the node. While here count how many times we skipped shrinking due to the lock being already taken. Sponsored by: The FreeBSD Foundation	2019-09-27 19:12:43 +00:00
Mateusz Guzik	95c6dd890a	cache: stop recalculating upper limit each time a new entry is added Sponsored by: The FreeBSD Foundation	2019-09-27 19:12:20 +00:00
Konstantin Belousov	df08823d07	Improve MD page fault handlers. Centralize calculation of signal and ucode delivered on unhandled page fault in new function vm_fault_trap(). MD trap_pfault() now almost always uses the signal numbers and error codes calculated in consistent MI way. This introduces the protection fault compatibility sysctls to all non-x86 architectures which did not have that bug, but apparently they were already much more wrong in selecting delivered signals on protection violations. Change the delivered signal for accesses to mapped area after the backing object was truncated. According to POSIX description for mmap(2): The system shall always zero-fill any partial page at the end of an object. Further, the system shall never write out any modified portions of the last page of an object which are beyond its end. References within the address range starting at pa and continuing for len bytes to whole pages following the end of an object shall result in delivery of a SIGBUS signal. An implementation may generate SIGBUS signals when a reference would cause an error in the mapped object, such as out-of-space condition. Adjust according to the description, keeping the existing compatibility code for SIGSEGV/SIGBUS on protection failures. For situations where kernel cannot handle page fault due to resource limit enforcement, SIGBUS with a new error code BUS_OBJERR is delivered. Also, provide a new error code SEGV_PKUERR for SIGSEGV on amd64 due to protection key access violation. vm_fault_hold() is renamed to vm_fault(). Fixed some nits in trap_pfault()s like mis-interpreting Mach errors as errnos. Removed unneeded truncations of the fault addresses reported by hardware. PR: 211924 Reviewed by: alc Discussed with: jilles, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D21566	2019-09-27 18:43:36 +00:00
Andrew Turner	50bb04b750	Check the vfs option length is valid before accessing through When a VFS option passed to nmount is present but NULL the kernel will place an empty option in its internal list. This will have a NULL pointer and a length of 0. When we come to read one of these the kernel will try to load from the last address of virtual memory. This is normally invalid so will fault resulting in a kernel panic. Fix this by checking if the length is valid before dereferencing. MFC after: 3 days Sponsored by: DARPA, AFRL	2019-09-27 16:22:28 +00:00
David Bright	c4571256af	sysent: regenerate after r352747. Sponsored by: Dell EMC Isilon	2019-09-26 15:41:10 +00:00
Mark Johnston	55248d32f2	Fix handling of invalid pages in exec_map_first_page(). exec_map_first_page() would unconditionally free an unbacked, invalid page from the executable image. However, it is possible that the page is wired, in which case it is incorrect to free the page, so check for additional wirings first. Reported by: syzkaller Tested by: pho Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21767	2019-09-26 15:35:35 +00:00
David Bright	9afb12bab4	Add an shm_rename syscall Add an atomic shm rename operation, similar in spirit to a file rename. Atomically unlink an shm from a source path and link it to a destination path. If an existing shm is linked at the destination path, unlink it as part of the same atomic operation. The caller needs the same permissions as shm_unlink to the shm being renamed, and the same permissions for the shm at the destination which is being unlinked, if it exists. If those fail, EACCES is returned, as with the other shm_* syscalls. truss support is included; audit support will come later. This commit includes only the implementation; the sysent-generated bits will come in a follow-on commit. Submitted by: Matthew Bryan <matthew.bryan@isilon.com> Reviewed by: jilles (earlier revision) Reviewed by: brueffer (manpages, earlier revision) Relnotes: yes Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D21423	2019-09-26 15:32:28 +00:00
Toomas Soome	11fc80a098	kernel terminal should initialize fg and bg variables before calling TUNABLE_INT_FETCH We have two ways to check if a kenv variable exists - either we check the return value from TUNABLE_INT_FETCH, or we pre-initialize the variable and check if this value did change. In terminal_init() it is more convenient to use pre-initialized variables. Problem was revealed by older loader.efi, which did not set teken.* variables. Reported by: tuexen	2019-09-26 07:19:26 +00:00
Alexander Motin	176dd236dc	Microoptimize sched_pickcpu() CPU affinity on SMT. Use of CPU_FFS() to implement CPUSET_FOREACH() allows to save up to ~0.5% of CPU time on 72-thread SMT system doing 80K IOPS to NVMe from one thread. MFC after: 1 month Sponsored by: iXsystems, Inc.	2019-09-26 00:35:06 +00:00
Alexander Motin	c55dc51c37	Microoptimize sched_pickcpu() after r352658. I've noticed that I missed intr check at one more SCHED_AFFINITY(), so instead of adding one more branching I prefer to remove few. Profiler shows the function CPU time reduction from 0.24% to 0.16%. MFC after: 1 month Sponsored by: iXsystems, Inc.	2019-09-25 19:29:09 +00:00
Kyle Evans	079c5b9ed8	rfork(2): add RFSPAWN flag When RFSPAWN is passed, rfork exhibits vfork(2) semantics but also resets signal handlers in the child during creation to avoid a point of corruption of parent state from the child. This flag will be used by posix_spawn(3) to handle potential signal issues. Reviewed by: jilles, kib Differential Revision: https://reviews.freebsd.org/D19058	2019-09-25 19:20:41 +00:00
Gleb Smirnoff	dd902d015a	Add debugging facility EPOCH_TRACE that checks that epochs entered are properly nested and warns about recursive entrances. Unlike with locks, there is nothing fundamentally wrong with such use, the intent of tracer is to help to review complex epoch-protected code paths, and we mean the network stack here. Reviewed by: hselasky Sponsored by: Netflix Pull Request: https://reviews.freebsd.org/D21610	2019-09-25 18:26:31 +00:00
Kyle Evans	a9ac5e1424	sysent: regenerate after r352705 This also implements it, fixes kdump, and removes no longer needed bits from lib/libc/sys/shm_open.c for the interim.	2019-09-25 18:09:19 +00:00
Kyle Evans	234879a7e3	Mark shm_open(2) as COMPAT12, succeeded by shm_open2 Implementation and regenerated files will follow.	2019-09-25 18:06:48 +00:00
Kyle Evans	460211e730	sysent: regenerate after r352700	2019-09-25 17:59:58 +00:00
Kyle Evans	20f7057685	Add a shm_open2 syscall to support upcoming memfd_create shm_open2 allows a little more flexibility than the original shm_open. shm_open2 doesn't enforce CLOEXEC on its callers, and it has a separate shmflag argument that can be expanded later. Currently the only shmflag is to allow file sealing on the returned fd. shm_open and memfd_create will both be implemented in libc to use this new syscall. __FreeBSD_version is bumped to indicate the presence. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D21393	2019-09-25 17:59:15 +00:00
Kyle Evans	0cd95859c8	[2/3] Add an initial seal argument to kern_shm_open() Now that flags may be set on posixshm, add an argument to kern_shm_open() for the initial seals. To maintain past behavior where callers of shm_open(2) are guaranteed to not have any seals applied to the fd they're given, apply F_SEAL_SEAL for existing callers of kern_shm_open. A special flag could be opened later for shm_open(2) to indicate that sealing should be allowed. We currently restrict initial seals to F_SEAL_SEAL. We cannot error out if F_SEAL_SEAL is re-applied, as this would easily break shm_open() twice to a shmfd that already existed. A note's been added about the assumptions we've made here as a hint towards anyone wanting to allow other seals to be applied at creation. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D21392	2019-09-25 17:35:03 +00:00
Kyle Evans	af755d3e48	[1/3] Add mostly Linux-compatible file sealing support File sealing applies protections against certain actions (currently: write, growth, shrink) at the inode level. New fileops are added to accommodate seals - EINVAL is returned by fcntl(2) if they are not implemented. Reviewed by: markj, kib Differential Revision: https://reviews.freebsd.org/D21391	2019-09-25 17:32:43 +00:00
Kyle Evans	85c5f3cb57	Add COMPAT12 support to makesyscalls.sh Reviewed by: kib, imp, brooks (all without syscalls.master edits) Differential Revision: https://reviews.freebsd.org/D21366	2019-09-25 17:29:45 +00:00
Toomas Soome	3001e0c942	kernel: terminal_init() should check for teken colors from kenv Check for teken.fg_color and teken.bg_color and prepare the color attributes accordingly. When white background is used, make it light to improve visibility. When black background is used, make kernel messages light.	2019-09-25 13:21:07 +00:00
Alexander Motin	bb3dfc6ae9	Fix wrong assertion in r352658. MFC after: 1 month	2019-09-25 11:58:54 +00:00
Alexander Motin	c9205e3500	Fix/improve interrupt threads scheduling. Doing some tests with very high interrupt rates I've noticed that one of the conditions I added in r232207 to make interrupt threads in most cases run on local CPU never worked as expected (worked only if previous time it was executed on some other CPU, that is quite opposite). It caused additional CPU usage to run full CPU search and could schedule interrupt threads to some other CPU. This patch removes that code and instead reuses existing non-interrupt code path with some tweaks for interrupt case: - On SMT systems, if current thread is idle, don't look on other threads. Even if they are busy, it may take more time to do full search and bounce the interrupt thread to other core than execute it locally, even sharing CPU resources. It is the other threads that should migrate, not bound interrupts. - Try hard to keep interrupt threads within LLC of their original CPU. This improves scheduling cost and supposedly cache and memory locality. On a test system with 72 threads doing 2.2M IOPS to NVMe this saves a few percent of CPU time while adding a few percent to IOPS. MFC after: 1 month Sponsored by: iXsystems, Inc.	2019-09-24 20:01:20 +00:00
Randall Stewart	35c7bb3407	This commit adds BBR (Bottleneck Bandwidth and RTT) congestion control. This is a completely separate TCP stack (tcp_bbr.ko) that will be built only if you add the make options WITH_EXTRA_TCP_STACKS=1 and also include the option TCPHPTS. You can also include the RATELIMIT option if you have a NIC interface that supports hardware pacing, BBR understands how to use such a feature. Note that this commit also adds in a general purpose time-filter which allows you to have a min-filter or max-filter. A filter allows you to have a low (or high) value for some period of time and degrade slowly to another value as time passes. You can find out the details of BBR by looking at the original paper at: https://queue.acm.org/detail.cfm?id=3022184 or consult many other web resources you can find on the web referenced by "BBR congestion control". It should be noted that BBRv1 (which this is) does tend to unfairness in cases of small buffered paths, and it will usually get less bandwidth in the case of large BDP paths (when competing with new-reno or cubic flows). BBR is still an active research area and we do plan on implementing V2 of BBR to see if it is an improvement over V1. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D21582	2019-09-24 18:18:11 +00:00
Mateusz Guzik	93a85508ad	cache: tidy up handling of negative entries - track the total count of hot entries - pre-read the lock when shrinking since it is typically already taken - place the lock in its own cacheline - shorten the hold time of hot lock list when zapping Sponsored by: The FreeBSD Foundation	2019-09-23 20:50:04 +00:00
Mark Johnston	38dae42c26	Use elf_relocaddr() when handling R_X86_64_RELATIVE relocations. This is required for DPCPU and VNET data variable definitions to work when KLDs are linked as DSOs. R_X86_64_RELATIVE relocations should not appear in object files, so assert this in elf_relocaddr(). Reviewed by: kib MFC after: 1 month Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21755	2019-09-23 14:14:43 +00:00
Mateusz Guzik	afe257e3ca	cache: count evictions of negative entries Sponsored by: The FreeBSD Foundation	2019-09-23 08:53:14 +00:00
Sean Eric Fagan	ba7a55d934	Add two options to allow mount to avoid covering up existing mount points. The two options are * nocover/cover: Prevent/allow mounting over an existing root mountpoint. E.g., "mount -t ufs -o nocover /dev/sd1a /usr/local" will fail if /usr/local is already a mountpoint. * emptydir/noemptydir: Prevent/allow mounting on a non-empty directory. E.g., "mount -t ufs -o emptydir /dev/sd1a /usr" will fail. Neither of these options is intended to be a default, for historical and compatibility reasons. Reviewed by: allanjude, kib Differential Revision: https://reviews.freebsd.org/D21458	2019-09-23 04:28:07 +00:00
Mateusz Guzik	7505cffa56	cache: try to avoid vhold if locks held Sponsored by: The FreeBSD Foundation	2019-09-22 20:50:24 +00:00
Mateusz Guzik	cd2112c305	cache: jump in negative success instead of positive Sponsored by: The FreeBSD Foundation	2019-09-22 20:49:17 +00:00
Mateusz Guzik	d2be3ef05c	lockprof: move per-cpu data to dpcpu Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21747	2019-09-22 20:44:24 +00:00
Konstantin Belousov	f33533da8c	kern.elf{32,64}.pie_base sysctl: enforce page alignment. Requested by: rstone Sponsored by: The FreeBSD Foundation MFC after: 1 week	2019-09-21 20:03:17 +00:00
Mateusz Guzik	cbba2cb367	lockprof: use CPU_FOREACH and drop always false lp_cpu NULL checks Sponsored by: The FreeBSD Foundation	2019-09-21 19:05:38 +00:00
Konstantin Belousov	95aafd6900	Make non-ASLR pie base tunable. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2019-09-21 18:00:23 +00:00
Alexander Motin	36d151a237	Allocate callout wheel from the respective memory domain. MFC after: 1 week	2019-09-21 15:38:08 +00:00
Andrew Gallatin	61b8a4af71	remove redundant "ktls" in KTLS thr name This reduces the string width of the ktls thread name and improves "ps" output. Glanced at by: jhb Event: EuroBSDCon hackathon Sponsored by: Netflix	2019-09-20 09:36:07 +00:00
Mateusz Guzik	b488246b45	vfs: group fields used for per-cpu ops in one cacheline Sponsored by: The FreeBSD Foundation	2019-09-19 21:23:14 +00:00
Konstantin Belousov	382e01c8dc	sysctl: use names instead of magic numbers. Replace magic numbers with symbols for internal sysctl operations. Convert in-kernel and libc consumers. Submitted by: Pawel Biernacki MFC after: 1 week Differential revision: https://reviews.freebsd.org/D21693	2019-09-18 16:13:10 +00:00
Konstantin Belousov	55894117b1	Return EISDIR when directory is opened with O_CREAT without O_DIRECTORY. Reviewed by: bcr (man page), emaste (previous version) PR: 240452 Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D21634	2019-09-17 18:32:18 +00:00
Kirk McKusick	100369071d	The VFS-level clustering code collects together sequential blocks by issuing delayed-writes (bdwrite()) until a non-sequential block is written or the maximum cluster size is reached. At that point it collects the delayed buffers together (using bread()) to write them in a single operation. The assumption was that since we just looked at them they will still be in memory so there is no need to check for a read error from bread(). Very occasionally (apparently every 10 hours or so when being pounded by Peter Holm's tests) this assumption is wrong. The fix is to check for errors from bread() and fail the cluster write, thus falling back to the default individual flushing of any still dirty buffers. Reported by: Peter Holm and Chuck Silvers Reviewed by: kib MFC after: 3 days	2019-09-17 17:44:50 +00:00
Mateusz Guzik	d245aa1e72	vfs: apply r352437 to the fast path as well This one is very hard to run into. If the filesystem is being unmounted or the mount point is freed the vfs_op_thread_enter will fail. For it to succeed the mount point itself would have to be reallocated in the time window between the initial read and the attempt to enter. Sponsored by: The FreeBSD Foundation	2019-09-17 15:53:40 +00:00
Mateusz Guzik	7f65185940	vfs: fix braino resulting in NULL pointer deref in r352424 The breakage was added after all the testing and the testing which followed was not sufficient to find it. Reported by: pho Sponsored by: The FreeBSD Foundation	2019-09-17 08:09:39 +00:00
Mateusz Guzik	4cace859c2	vfs: convert struct mount counters to per-cpu There are 3 counters modified all the time in this structure - one for keeping the structure alive, one for preventing unmount and one for tracking active writers. Exact values of these counters are very rarely needed, which makes them a prime candidate for conversion to a per-cpu scheme, resulting in much better performance. Sample benchmark performing fstatfs (modifying 2 out of 3 counters) on a 104-way 2 socket Skylake system: before: 852393 ops/s after: 76682077 ops/s Reviewed by: kib, jeff Tested by: pho Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21637	2019-09-16 21:37:47 +00:00
Mateusz Guzik	e87f3f72f1	vfs: manage mnt_writeopcount with atomics See r352424. Reviewed by: kib, jeff Tested by: pho Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21575	2019-09-16 21:33:16 +00:00
Mateusz Guzik	ee831b2543	vfs: manage mnt_lockref with atomics See r352424. Reviewed by: kib, jeff Tested by: pho Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21574	2019-09-16 21:32:21 +00:00
Mateusz Guzik	a8c8e44bf0	vfs: manage mnt_ref with atomics New primitive is introduced to denote sections which can operate locklessly on aspects of struct mount, but which can also be disabled if necessary. This provides an opportunity to start scaling common case modifications while providing stable state of the struct when facing unmount, write suspension or other events. mnt_ref is the first counter to start being managed in this manner with the intent to make it per-cpu. Reviewed by: kib, jeff Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21425	2019-09-16 21:31:02 +00:00
Kyle Evans	3155f2f0e2	rangelock: add rangelock_cookie_assert A future change to posixshm to add file sealing (in DIFF_21391[0] and child) will move locking out of shm_dotruncate as kern_shm_open() will require the lock to be held across the dotruncate until the seal is actually applied. For this, the cookie is passed into shm_dotruncate_locked which asserts RCA_WLOCKED. [0] Name changed to protect the innocent, hopefully, from getting autoclosed due to this reference... Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D21628	2019-09-15 02:59:53 +00:00
Mateusz Guzik	ce3ba63f67	vfs: release usecount using fetchadd 1. If we release the last usecount we take ownership of the hold count, which means the vnode will remain allocated until we vdrop it. 2. If someone else vrefs they will find no usecount and will proceed to add their own hold count. 3. No code has a problem with v_usecount transitioning to 0 without the interlock These facts combined mean we can fetchadd instead of having a cmpset loop. Reviewed by: kib (previous version) Tested by: pho Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21528	2019-09-13 15:49:04 +00:00
Mark Johnston	45cdd437ae	Remove a redundant NULL pointer check in cpuset_modify_domain(). cpuset_getroot() is guaranteed to return a non-NULL pointer. Reported by: Mark Millard <marklmi@yahoo.com> MFC after: 1 week Sponsored by: The FreeBSD Foundation	2019-09-12 16:47:38 +00:00
Hans Petter Selasky	11b57401e6	Use REFCOUNT_COUNT() to obtain refcount where appropriate. Refcount waiting will set some flag bits in the refcount value. Make sure these bits get cleared by using the REFCOUNT_COUNT() macro to obtain the actual refcount. Differential Revision: https://reviews.freebsd.org/D21620 Reviewed by: kib@, markj@ MFC after: 1 week Sponsored by: Mellanox Technologies	2019-09-12 16:26:59 +00:00
Kyle Evans	5163b1a75c	Follow up r352244: kenv: tighten up assertions As I like to forget: static kenv var formatting is actually such that an empty environment would be double null bytes. We should make sure that a non-zero buffer has at least enough for this, though most of the current usage is with a 4k buffer.	2019-09-12 14:34:46 +00:00
Kyle Evans	436c46875d	kenv: assert that an empty static buffer passed in is "empty" Garbage in the passed-in buffer can cause problems if any attempts to read the kenv are inadvertently made between init_static_kenv and the first kern_setenv -- assuming there is one. This is cheap and easy, so do it. This also helps rule out some class of bugs as one tries to debug; tunables fetch from the static environment up until SI_SUB_KMEM + 1, and many of these buffers are global ~4k buffers that rely on BSS clearing while others just grab a page of free memory and use it (e.g. xen).	2019-09-12 13:51:43 +00:00
Conrad Meyer	aaa3852435	buf: Add B_INVALONERR flag to discard data Setting the B_INVALONERR flag before a synchronous write causes the buf cache to forcibly invalidate contents if the write fails (BIO_ERROR). This is intended to be used to allow layers above the buffer cache to make more informed decisions about when discarding dirty buffers without successful write is acceptable. As a proof of concept, use in msdosfs to handle failures to mark the on-disk 'dirty' bit during rw mount or ro->rw update. Extending this to other filesystems is left as future work. PR: 210316 Reviewed by: kib (with objections) Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D21539	2019-09-11 21:24:14 +00:00
Mateusz Guzik	b088a4d6f9	cache: avoid excessive relocking on entry removal during lookup Due to lock ordering issues (bucket lock held, vnode locks wanted) the code starts with trylocking which in face of contention often fails. Prior to the change it would loop back with a possible yield. Instead note we know what locks are needed and can take them in the right order, avoiding retries. Then we can safely re-lookup and see if the entry we are looking for is still there. On a 104-way box poudriere would result in constant retries during an 11h run as seen in the vfs.cache.zap_and_exit_bucket_fail counter. before: 408866592 after : 0 However, a new stat reports: vfs.cache.zap_and_exit_bucket_relock_success: 32638 Note this is only a bandaid over current design issues. Tested by: pho Sponsored by: The FreeBSD Foundation	2019-09-10 20:19:29 +00:00
Mateusz Guzik	a6cacb0dca	cache: change the formula for calculating lock array sizes It used to be mp_ncpus * 64, but this gives unnecessarily big values for small machines and at the same time constrains bigger ones. In particular this helps on a 104-way box for which the count is now doubled. While here make cache_purgevfs less likely. Currently it is not efficient in face of contention due to lock ordering issues. These are fixable but not worth it at the moment. Sponsored by: The FreeBSD Foundation	2019-09-10 20:11:00 +00:00
Mateusz Guzik	1214618c05	cache: assorted cleanups Sponsored by: The FreeBSD Foundation	2019-09-10 20:08:24 +00:00
Jeff Roberson	c75757481f	Replace redundant code with a few new vm_page_grab facilities: - VM_ALLOC_NOCREAT will grab without creating a page. - vm_page_grab_valid() will grab and page in if necessary. - vm_page_busy_acquire() automates some busy acquire loops. Discussed with: alc, kib, markj Tested by: pho (part of larger branch) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21546	2019-09-10 19:08:01 +00:00
Jeff Roberson	4cdea4a853	Use the sleepq lock rather than the page lock to protect against wakeup races with page busy state. The object lock is still used as an interlock to ensure that the identity stays valid. Most callers should use vm_page_sleep_if_busy() to handle the locking particulars. Reviewed by: alc, kib, markj Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21255	2019-09-10 18:27:45 +00:00
Mark Johnston	fee2a2fa39	Change synchronization rules for vm_page reference counting. There are several mechanisms by which a vm_page reference is held, preventing the page from being freed back to the page allocator. In particular, holding the page's object lock is sufficient to prevent the page from being freed; holding the busy lock or a wiring is sufficient as well. These references are protected by the page lock, which must therefore be acquired for many per-page operations. This results in false sharing since the page locks are external to the vm_page structures themselves and each lock protects multiple structures. Transition to using an atomically updated per-page reference counter. The object's reference is counted using a flag bit in the counter. A second flag bit is used to atomically block new references via pmap_extract_and_hold() while removing managed mappings of a page. Thus, the reference count of a page is guaranteed not to increase if the page is unbusied, unmapped, and the object's write lock is held. As a consequence of this, the page lock no longer protects a page's identity; operations which move pages between objects are now synchronized solely by the objects' locks. The vm_page_wire() and vm_page_unwire() KPIs are changed. The former requires that either the object lock or the busy lock is held. The latter no longer has a return value and may free the page if it releases the last reference to that page. vm_page_unwire_noq() behaves the same as before; the caller is responsible for checking its return value and freeing or enqueuing the page as appropriate. vm_page_wire_mapped() is introduced for use in pmap_extract_and_hold(). It fails if the page is concurrently being unmapped, typically triggering a fallback to the fault handler. vm_page_wire() no longer requires the page lock and vm_page_unwire() now internally acquires the page lock when releasing the last wiring of a page (since the page lock still protects a page's queue state). 
In particular, synchronization details are no longer leaked into the caller. The change excises the page lock from several frequently executed code paths. In particular, vm_object_terminate() no longer bounces between page locks as it releases an object's pages, and direct I/O and sendfile(SF_NOCACHE) completions no longer require the page lock. In these latter cases we now get linear scalability in the common scenario where different threads are operating on different files. __FreeBSD_version is bumped. The DRM ports have been updated to accommodate the KPI changes. Reviewed by: jeff (earlier version) Tested by: gallatin (earlier version), pho Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20486	2019-09-09 21:32:42 +00:00
Konstantin Belousov	6c46ce7ea3	Initialize timehands linkage much earlier. Reported and tested by: trasz Sponsored by: The FreeBSD Foundation MFC after: 1 week	2019-09-09 12:42:48 +00:00
Konstantin Belousov	4b23dec4c2	Make timehands count selectable at boottime. Tested by: O'Connor, Daniel <darius@dons.net.au> Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D21563	2019-09-09 11:29:58 +00:00
Konstantin Belousov	1040254b75	In do_execve(), use shared text vnode lock consistently. Reviewed by: markj MFC after: 1 week Differential revision: https://reviews.freebsd.org/D21560	2019-09-07 16:10:57 +00:00
Konstantin Belousov	1c36b72874	In do_execve(), clear imgp->textset when restarting for interpreter. Otherwise, we might leave the boolean set, which would affect cleanup after an error on interpreter activation. Reviewed by: markj MFC after: 1 week Differential revision: https://reviews.freebsd.org/D21560	2019-09-07 16:05:17 +00:00
Konstantin Belousov	1073d17eeb	When loading ELF interpreter, initialize whole nested image_params with zero. Otherwise we could mishandle imgp->textset. Reviewed by: markj MFC after: 1 week Differential revision: https://reviews.freebsd.org/D21560	2019-09-07 16:03:26 +00:00
Philip Paeps	bdc786cc7c	riscv: restore default HZ=1000, keep QEMU at HZ=100 This reverts r351918 and r351919. Discussed with: br, ian, imp	2019-09-07 05:13:31 +00:00
Philip Paeps	7f0851ab19	riscv: default to HZ=100 Most current RISC-V development platforms are not fast enough to benefit from the increased granularity provided by HZ=1000. Sponsored by: Axiado	2019-09-06 01:19:31 +00:00
Conrad Meyer	a6935d085c	Remove long-dead BUF_ASSERT_{,UN}HELD assertions These were fully neutered in r177676 (2008), but not removed at the time for unclear reasons. They're totally dead code, so go ahead and yank them now. No functional change.	2019-09-05 21:43:33 +00:00
Mateusz Guzik	68c3c1abe1	vfs: temporarily revert r351825 There are 2 problems: - it introduces a funny bug where it can end up trylocking the same vnode [1] - it exposes a pre-existing softdep deadlock [2] Both are easier to run into than the bug which got fixed, so revert until a complete solution is worked out. Reported by: cy [1], pho [2] Sponsored by: The FreeBSD Foundation	2019-09-05 18:19:51 +00:00
Stephen J. Kiernan	d57cd5ccd3	Bump up the low range of cpuset numbers to account for the kernel cpuset. Reviewed by: jeff Obtained from: Juniper Networks, Inc.	2019-09-05 17:48:39 +00:00
Mateusz Guzik	c07d4a0a68	vfs: fully hold vnodes in vnlru_free_locked Currently the code only bumps holdcnt and clears the VI_FREE flag, not performing actual vhold. Since the vnode is still visible elsewhere, a potential new user can find it and incorrectly assume it is properly held. Use vholdl instead to correctly hold the vnode. Another place recycling (vlrureclaim) does this already. Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21522	2019-09-04 19:23:18 +00:00
Andriy Gapon	387df3b805	shutdown_halt: make sure that watchdog timer is stopped The point of halt is to keep the machine in limbo. Reviewed by: kib MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D21222	2019-09-04 13:26:59 +00:00
Kyle Evans	dca52ab480	posixshm: start counting writeable mappings r351650 switched posixshm to using OBJT_SWAP for shm_object r351795 added support to the swap_pager for tracking writeable mappings Take advantage of this and start tracking writeable mappings; fd sealing will use this to reject a seal on writing with EBUSY if any such mappings exist. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D21456	2019-09-03 20:33:38 +00:00
Kyle Evans	fe7bcbaf50	vm pager: writemapping accounting for OBJT_SWAP Currently writemapping accounting is only done for vnode_pager which does some accounting on the underlying vnode. Extend this to allow accounting to be possible for any of the pager types. New pageops are added to update/release writecount that need to be implemented for any pager wishing to do said accounting, and we implement these methods now for both vnode_pager (unchanged) and swap_pager. The primary motivation for this is to allow other systems with OBJT_SWAP objects to check if their objects have any write mappings and reject operations with EBUSY if so. posixshm will be the first to do so in order to reject adding write seals to the shmfd if any writable mappings exist. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D21456	2019-09-03 20:31:48 +00:00
Konstantin Belousov	fe69291ff4	Add procctl(PROC_STACKGAP_CTL) It allows a process to request that stack gap was not applied to its stacks, retroactively. Also it is possible to control the gaps in the process after exec. PR: 239894 Reviewed by: alc Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D21352	2019-09-03 18:56:25 +00:00
Mateusz Guzik	e3c3248cc7	vfs: implement usecount implying holdcnt vnodes have 2 reference counts - holdcnt to keep the vnode itself from getting freed and usecount to denote it is actively used. Previously all operations bumping usecount would also bump holdcnt, which is not necessary. We can detect if usecount is already > 1 (in which case holdcnt is also > 1) and utilize it to avoid bumping holdcnt on our own. This saves on atomic ops. Reviewed by: kib Tested by: pho (previous version) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21471	2019-09-03 15:42:11 +00:00
Mateusz Guzik	d05b53e0ba	Add sysctlbyname system call Previously userspace would issue one syscall to resolve the sysctl and then another one to actually use it. Do it all in one trip. Fallback is provided in case newer libc happens to be running on an older kernel. Submitted by: Pawel Biernacki Reported by: kib, brooks Differential Revision: https://reviews.freebsd.org/D17282	2019-09-03 04:16:30 +00:00
Mateusz Guzik	1874c61b90	vfs: restore mp null check in vop_stdgetwritemount The initially read mount point can already be NULL. Reported by: markj Fixes: r351656 ("vfs: stop refing freed mount points in vop_stdgetwritemount") Sponsored by: The FreeBSD Foundation	2019-09-02 15:24:25 +00:00
Mateusz Guzik	9576ff5803	proc: clear pid bitmap entry after dropping proctree lock There is no correctness change here, but the procid lock is contended in the fork path and taking it while holding proctree avoidably extends its hold time. Note that there are other ids which can end up getting cleared with the lock. Sponsored by: The FreeBSD Foundation	2019-09-02 12:46:43 +00:00
Mark Johnston	08cfa56ea3	Extend uma_reclaim() to permit different reclamation targets. The page daemon periodically invokes uma_reclaim() to reclaim cached items from each zone when the system is under memory pressure. This is important since the size of these caches is unbounded by default. However it also results in bursts of high latency when allocating from heavily used zones as threads miss in the per-CPU caches and must access the keg in order to allocate new items. With r340405 we maintain an estimate of each zone's usage of its (per-NUMA domain) cache of full buckets. Start making use of this estimate to avoid reclaiming the entire cache when under memory pressure. In particular, introduce TRIM, DRAIN and DRAIN_CPU verbs for uma_reclaim() and uma_zone_reclaim(). When trimming, only items in excess of the estimate are reclaimed. Draining a zone reclaims all of the cached full buckets (the previous behaviour of uma_reclaim()), and may further drain the per-CPU caches in extreme cases. Now, when under memory pressure, the page daemon will trim zones rather than draining them. As a result, heavily used zones do not incur bursts of bucket cache misses following reclamation, but large, unused caches will be reclaimed as before. Reviewed by: jeff Tested by: pho (an earlier version) MFC after: 2 months Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D16667	2019-09-01 22:22:43 +00:00
Mark Johnston	63cdd18e40	Restrict the input domain set in cpuset_setdomain(2) to all_domains. To permit larger values of MAXMEMDOM, which is currently 8 on amd64, cpuset_setdomain(2) accepts a mask of size 256. In the kernel, domain set masks are 64 bits wide, but can only represent a set of MAXMEMDOM domains due to the use of the ds_order table. Domain sets passed to cpuset_setdomain(2) are restricted to a subset of their parent set, which is typically the root set, but before this happens we modify the input set to exclude empty domains. domainset_empty_vm() and other code which manipulates domain sets expect the mask to be a subset of all_domains, so enforce that when performing validation of cpuset_setdomain(2) parameters. Reported and tested by: pho Reviewed by: kib MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21477	2019-09-01 21:38:08 +00:00
Mateusz Guzik	2796c209b0	vfs: stop refing freed mount points in vop_stdgetwritemount The code used blindly ref based on an unsafely read address and then would backpedal if necessary. This was safe in terms of memory access since mounts are type-stable, but made for a potential bug where the mount was reused and had the count reset to 0 before this code decreased it. Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21411	2019-09-01 14:01:09 +00:00
Kyle Evans	32287ea72b	posixshm: switch to OBJT_SWAP in advance of other changes Future changes to posixshm will start tracking writeable mappings in order to support file sealing. Tracking writeable mappings for an OBJT_DEFAULT object is complicated as it may be swapped out and converted to an OBJT_SWAP. One may generically add this tracking for vm_object, but this is difficult to do without increasing memory footprint of vm_object and blowing up memory usage by a significant amount. On the other hand, the swap pager can be expanded to track writeable mappings without increasing vm_object size. This change is currently in D21456. Switch over to OBJT_SWAP in advance of the other changes to the swap pager and posixshm.	2019-09-01 00:33:16 +00:00
Mateusz Guzik	c2b600f98f	vfs: add a missing VNODE_REFCOUNT_FENCE_REL to v_incr_usecount_locked Sponsored by: The FreeBSD Foundation	2019-08-30 21:54:45 +00:00
Mateusz Guzik	3bb8d8d8c9	vfs: tidy up assertions in vfs_subr - assert unlocked vnode interlock in vref - assert right counts in vputx - print debug info for panic in vdrop Sponsored by: The FreeBSD Foundation	2019-08-30 00:45:53 +00:00
Konstantin Belousov	6470c8d3db	Rework v_object lifecycle for vnodes. Current implementation of vnode_create_vobject() and vnode_destroy_vobject() is written so that it is prepared to handle the vm object destruction for live vnode. Practically, no filesystems use this, except for some remnants that were present in UFS till today. One of the consequences of that model is that each filesystem must call vnode_destroy_vobject() in VOP_RECLAIM() or earlier, as a result all of them get rid of the v_object in reclaim. Move the call to vnode_destroy_vobject() to vgonel() before VOP_RECLAIM(). This makes v_object stable: either the object is NULL, or it is valid vm object till the vnode reclamation. Remove code from vnode_create_vobject() to handle races with the parallel destruction. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D21412	2019-08-29 07:50:25 +00:00
Mateusz Guzik	1e2f0ceb2f	vfs: add VOP_NEED_INACTIVE vnode usecount drops to 0 all the time (e.g. for directories during path lookup). When that happens the kernel would always lock the exclusive lock for the vnode in order to call vinactive(). This blocks other threads who want to use the vnode for lookup. vinactive is very rarely needed and can be tested for without the vnode lock held. This patch gives filesystems an opportunity to do it, sample total wait time for tmpfs over 500 minutes of poudriere -j 104: before: 557563641706 (lockmgr:tmpfs) after: 46309603301 (lockmgr:tmpfs) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21371	2019-08-28 20:34:24 +00:00
Mark Johnston	772dd133c6	Avoid direct accesses of the vm_page wire_count field. No functional change intended. Sponsored by: Netflix	2019-08-28 18:01:54 +00:00
Mateusz Guzik	88cc62e5a5	proc: eliminate the zombproc list It is not needed by anything in the kernel and it slightly drives up contention on both proctree and allproc locks. Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21447	2019-08-28 16:18:23 +00:00
Mark Johnston	b5d239cb97	Wire pages in vm_page_grab() when appropriate. uiomove_object_page() and exec_map_first_page() would previously wire a page after having grabbed it. Ask vm_page_grab() to perform the wiring instead: this removes some redundant code, and is cheaper in the case where the requested page is not resident since the page allocator can be asked to initialize the page as wired, whereas a separate vm_page_wire() call requires the page lock. In vm_imgact_hold_page(), use vm_page_unwire_noq() instead of vm_page_unwire(PQ_NONE). The latter ensures that the page is dequeued before returning, but this is unnecessary since vm_page_free() will trigger a batched dequeue of the page. Reviewed by: alc, kib Tested by: pho (part of a larger patch) MFC after: 1 week Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21440	2019-08-28 16:08:06 +00:00
Mateusz Guzik	2319489b6e	proc: remove zpfind It is not used by anything. If someone wants it back it should be reimplemented to use the proc hash. Sponsored by: The FreeBSD Foundation	2019-08-28 01:22:21 +00:00
John Baldwin	818d755318	Only define the 'tls' member of sfio if KERN_TLS is defined. This field was not initialized in the !KERN_TLS case triggering an assertion failure when using sendfile(2). Reported by: pho, asomers Sponsored by: Netflix	2019-08-27 22:21:18 +00:00
Mateusz Guzik	368cabbcb5	vfs: stop passing LK_INTERLOCK to VOP_UNLOCK The plan is to drop the flags argument. There is also a temporary bug now that nullfs ignores the flag. Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21252	2019-08-27 20:30:56 +00:00
Mark Johnston	44e4def73b	Remove an extraneous + 1 in _domainset_create(). DOMAINSET_FLS, like our fls(), is 1-indexed. Reported by: alc MFC after: 1 week Sponsored by: The FreeBSD Foundation	2019-08-27 15:42:08 +00:00
Mark Johnston	8e6975047e	Fix several logic issues in domainset_empty_vm(). - Don't add 1 to the result of DOMAINSET_FLS. - Do not modify domainsets containing only empty domains. - Always flatten a _PREFER policy to _ROUNDROBIN if the preferred domain is empty. Previously we were doing this only when ds_cnt > 1. These bugs could cause hangs during boot if a VM domain is empty. Tested by: hselasky Reviewed by: hselasky, kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21420	2019-08-27 14:06:34 +00:00
Konstantin Belousov	95acb40caa	vn_vget_ino_gen(): relock the lower vnode on error. The function's interface assumes that the lower vnode is passed and returned locked always. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2019-08-27 08:28:38 +00:00
John Baldwin	b2e60773c6	Add kernel-side support for in-kernel TLS. KTLS adds support for in-kernel framing and encryption of Transport Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports offload of TLS for transmitted data. Key negotiation must still be performed in userland. Once completed, transmit session keys for a connection are provided to the kernel via a new TCP_TXTLS_ENABLE socket option. All subsequent data transmitted on the socket is placed into TLS frames and encrypted using the supplied keys. Any data written to a KTLS-enabled socket via write(2), aio_write(2), or sendfile(2) is assumed to be application data and is encoded in TLS frames with an application data type. Individual records can be sent with a custom type (e.g. handshake messages) via sendmsg(2) with a new control message (TLS_SET_RECORD_TYPE) specifying the record type. At present, rekeying is not supported though the in-kernel framework should support rekeying. KTLS makes use of the recently added unmapped mbufs to store TLS frames in the socket buffer. Each TLS frame is described by a single ext_pgs mbuf. The ext_pgs structure contains the header of the TLS record (and trailer for encrypted records) as well as references to the associated TLS session. KTLS supports two primary methods of encrypting TLS frames: software TLS and ifnet TLS. Software TLS marks mbufs holding socket data as not ready via M_NOTREADY similar to sendfile(2) when TLS framing information is added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then called to schedule TLS frames for encryption. In the sendfile case, sendfile_iodone() calls ktls_enqueue() instead of pru_ready(), leaving the mbufs marked M_NOTREADY until encryption is completed. For other writes (vn_sendfile when pages are available, write(2), etc.), the PRUS_NOTREADY is set when invoking pru_send() along with invoking ktls_enqueue(). A pool of worker threads (the "KTLS" kernel process) encrypts TLS frames queued via ktls_enqueue(). 
Each TLS frame is temporarily mapped using the direct map and passed to a software encryption backend to perform the actual encryption. (Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if someone wished to make this work on architectures without a direct map.) KTLS supports pluggable software encryption backends. Internally, Netflix uses proprietary pure-software backends. This commit includes a simple backend in a new ktls_ocf.ko module that uses the kernel's OpenCrypto framework to provide AES-GCM encryption of TLS frames. As a result, software TLS is now a bit of a misnomer as it can make use of hardware crypto accelerators. Once software encryption has finished, the TLS frame mbufs are marked ready via pru_ready(). At this point, the encrypted data appears as regular payload to the TCP stack stored in unmapped mbufs. ifnet TLS permits a NIC to offload the TLS encryption and TCP segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS) is allocated on the interface a socket is routed over and associated with a TLS session. TLS records for a TLS session using ifnet TLS are not marked M_NOTREADY but are passed down the stack unencrypted. The ip_output_send() and ip6_output_send() helper functions that apply send tags to outbound IP packets verify that the send tag of the TLS record matches the outbound interface. If so, the packet is tagged with the TLS send tag and sent to the interface. The NIC device driver must recognize packets with the TLS send tag and schedule them for TLS encryption and TCP segmentation. If the outbound interface does not match the interface in the TLS send tag, the packet is dropped. In addition, a task is scheduled to refresh the TLS send tag for the TLS session. If a new TLS send tag cannot be allocated, the connection is dropped. If a new TLS send tag is allocated, however, subsequent packets will be tagged with the correct TLS send tag. 
(This latter case has been tested by configuring both ports of a Chelsio T6 in a lagg and failing over from one port to another. As the connections migrated to the new port, new TLS send tags were allocated for the new port and connections resumed without being dropped.) ifnet TLS can be enabled and disabled on supported network interfaces via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported across both vlan devices and lagg interfaces using failover, lacp with flowid enabled, or lacp with flowid enabled. Applications may request the current KTLS mode of a connection via a new TCP_TXTLS_MODE socket option. They can also use this socket option to toggle between software and ifnet TLS modes. In addition, a testing tool is available in tools/tools/switch_tls. This is modeled on tcpdrop and uses similar syntax. However, instead of dropping connections, -s is used to force KTLS connections to switch to software TLS and -i is used to switch to ifnet TLS. Various sysctls and counters are available under the kern.ipc.tls sysctl node. The kern.ipc.tls.enable node must be set to true to enable KTLS (it is off by default). The use of unmapped mbufs must also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS. KTLS is enabled via the KERN_TLS kernel option. This patch is the culmination of years of work by several folks including Scott Long and Randall Stewart for the original design and implementation; Drew Gallatin for several optimizations including the use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records awaiting software encryption, and pluggable software crypto backends; and John Baldwin for modifications to support hardware TLS offload. Reviewed by: gallatin, hselasky, rrs Obtained from: Netflix Sponsored by: Netflix, Chelsio Communications Differential Revision: https://reviews.freebsd.org/D21277	2019-08-27 00:01:56 +00:00
Xin LI	4e8671dd78	GZIO: Update to use zlib 1.2.11. PR: 229763 Submitted by: Yoshihiro Ota <ota j email ne jp> Differential Revision: https://reviews.freebsd.org/D21408	2019-08-25 07:50:44 +00:00
Mateusz Guzik	0256405e98	vfs: add vholdnz (for already held vnodes) Reviewed by: kib (previous version) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21358	2019-08-25 05:11:43 +00:00
Mateusz Guzik	5b596b9fa5	Remove the obsolete pcpu_zone_ptr zone. It was only used by flowtable (removed in r321618). Sponsored by: The FreeBSD Foundation	2019-08-24 00:01:19 +00:00
Konstantin Belousov	e671edac06	De-commission the MNTK_NOINSMNTQ kernel mount flag. After all the changes, its dynamic scope is same as for MNTK_UNMOUNT, but to allow the syncer vnode to be re-installed on unmount failure. But the case of syncer was already handled by using the VV_FORCEINSMQ flag for quite some time. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2019-08-23 19:40:10 +00:00
Xin LI	a11bf9a49b	INVARIANTS: treat LA_LOCKED as the same of LA_XLOCKED in mtx_assert. The Linux lockdep API assumes LA_LOCKED semantic in lockdep_assert_held(), meaning that either a shared lock or write lock is Ok. On the other hand, the timeout code uses lc_assert() with LA_XLOCKED, and we need both to work. For mutexes, because they can not be shared (this is unique among all lock classes, and it is unlikely that we would add new lock class anytime soon), it is easier to simply extend mtx_assert to handle LA_LOCKED there, despite the change itself can be viewed as a slight abstraction violation. Reviewed by: mjg, cem, jhb MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D21362	2019-08-23 06:39:40 +00:00
Brooks Davis	075ac3b446	Reorganise conditionals to reduce duplication. No functional change. Obtained from: CheriBSD MFC after: 3 days Sponsored by: DARPA, AFRL	2019-08-22 10:21:07 +00:00
Rick Macklem	df9bc7df42	Map ENOTTY to EINVAL for lseek(SEEK_DATA/SEEK_HOLE). Without this patch, when an application performed lseek(SEEK_DATA/SEEK_HOLE) on a file in a file system that does not have its own VOP_IOCTL(), the lseek(2) fails with errno ENOTTY. This didn't seem appropriate, since ENOTTY is not listed as an error return by either the lseek(2) man page nor the POSIX draft for lseek(2). This was discussed on freebsd-current@ here: http://docs.FreeBSD.org/cgi/mid.cgi?CAOtMX2iiQdv1+15e1N_r7V6aCx_VqAJCTP1AW+qs3Yg7sPg9wA This trivial patch maps ENOTTY to EINVAL for lseek(SEEK_DATA/SEEK_HOLE). Reviewed by: markj Relnotes: yes Differential Revision: https://reviews.freebsd.org/D21300	2019-08-22 01:15:06 +00:00
Mark Johnston	5b699f1614	Add lockmgr(9) probes to the lockstat DTrace provider. They follow the conventions set by rw and sx lock probes. There is an additional lockstat:::lockmgr-disown probe. Update lockstat(1) to report on contention and hold events for lockmgr locks. Document the new probes in dtrace_lockstat.4, and deduplicate some of the existing probe descriptions. Reviewed by: mjg MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21355	2019-08-21 23:43:58 +00:00
Mark Johnston	9fb7c918ef	Remove manual wire_count adjustments from the unmapped mbuf code. The original code came from a desire to minimize the number of updates to v_wire_count, which prior to r329187 was updated using atomics. However, there is no significant benefit to batching today, so simply allocate pages using VM_ALLOC_WIRED and rely on system accounting. Reviewed by: jhb Differential Revision: https://reviews.freebsd.org/D21323	2019-08-21 20:01:52 +00:00
Mark Johnston	6bc13e042f	Modify pipe_poll() to properly check for pending direct writes. With r349546, it is a responsibility of the writer to clear PIPE_DIRECTW after pinned data has been read. In particular, once a reader has drained this data, there is a small window where the pipe is empty but PIPE_DIRECTW is set. pipe_poll() was using the presence of PIPE_DIRECTW to determine whether to return POLLIN, so in this window it would claim that data was available to read when this was not the case. Fix this by modifying several checks for PIPE_DIRECTW to instead look at the number of residual bytes in data pinned by a direct writer. In some cases we really do want to check for PIPE_DIRECTW, since the presence of this flag indicates that any attempt to write to the pipe will block on the existing direct writer. Bisected and test case provided by: mav Tested by: pho Reviewed by: kib MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21333	2019-08-21 19:35:04 +00:00
Ed Maste	f37192064a	mqueuefs: fix compat32 struct file leak In a compat32 error case we previously leaked a struct file. Submitted by: Karsten König, Secfault Security Security: CVE-2019-5603	2019-08-20 17:44:03 +00:00
Jeff Roberson	cf27e0d125	Use an atomic reference count for paging in progress so that callers do not require the object lock. Reviewed by: markj Tested by: pho (as part of a larger branch) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21311	2019-08-19 23:09:38 +00:00
Mateusz Guzik	4b3f767340	vfs: fix up r351193 ("stop always overwriting ->mnt_stat in VFS_STATFS") fs-specific part of vfs_statfs routines only fill in small portion of the structure. Previous code was always copying everything at a higher layer to accommodate it and this patch does the same. 'df' (no arguments) worked fine because the caller uses mnt_stat itself as the target buffer, making all the copying a no-op for its own case. 'df /' and similar use a different consumer which passes its own buffer and this is where you can run into trouble. Reported by: cy Fixes: r351193 Sponsored by: The FreeBSD Foundation	2019-08-19 14:11:54 +00:00
Andrey V. Elsukov	75697b16b6	Use TAILQ_FOREACH_SAFE() macro to avoid use after free in soclose(). PR: 239893 MFC after: 1 week	2019-08-19 12:42:03 +00:00
Andriy Gapon	0db7afd0ae	assert that td_lk_slocks is not leaked upon return from kernel This is similar to checks for td_sx_slocks and td_rw_rlocks. Although td_lk_slocks is an implementation detail, it still makes sense to validate it. MFC after: 1 week Sponsored by: Panzura	2019-08-19 11:18:36 +00:00
Rick Macklem	2e1b32c0e3	Add a vop_stdioctl() that performs a trivial FIOSEEKDATA/FIOSEEKHOLE. Without this patch, when an application performed lseek(SEEK_DATA/SEEK_HOLE) on a file in a file system that does not have its own VOP_IOCTL(), the lseek(2) fails with errno ENOTTY. This didn't seem appropriate, since ENOTTY is not listed as an error return by either the lseek(2) man page nor the POSIX draft for lseek(2). A discussion on freebsd-current@ seemed to indicate that implementing a trivial algorithm that returns the offset argument for FIOSEEKDATA and returns the file's size for FIOSEEKHOLE was the preferred fix. http://docs.FreeBSD.org/cgi/mid.cgi?CAOtMX2iiQdv1+15e1N_r7V6aCx_VqAJCTP1AW+qs3Yg7sPg9wA The Linux kernel appears to implement this trivial algorithm as well. This patch adds a vop_stdioctl() that implements this trivial algorithm. It returns errors consistent with vn_bmap_seekhole() and, as such, will still return ENOTTY for non-regular files. I have proposed a separate patch that maps errors not described by the lseek(2) man page nor POSIX draft to EINVAL. This patch is under separate review. Reviewed by: kib Relnotes: yes Differential Revision: https://reviews.freebsd.org/D21299	2019-08-19 00:29:05 +00:00
Konstantin Belousov	de4e1aeb21	Fix an issue with executing tmpfs binary. Suppose that a binary was executed from tmpfs mount, and the text vnode was reclaimed while the binary was still running. It is possible during even the normal operations since tmpfs vnode's vm_object has swap type, and no references on the vnode are held. Also assume that the text vnode was revived for some reason. Then, on the process exit or exec, unmapping of the text mapping tries to remove the text reference from the vnode, but since it went from recycle/instantiation cycle, there is no reference kept, and assertion in VOP_UNSET_TEXT_CHECKED() triggers. Fix this by keeping a use reference on the tmpfs vnode for each exec reference. This prevents the vnode reclamation while executable map entry is active. Do it by adding per-mount flag MNTK_TEXT_REFS that directs vop_stdset_text() to add use ref on first vnode text use, and per-vnode VI_TEXT_REF flag, to record the need on unref in vop_stdunset_text() on last vnode text use going away. Set MNTK_TEXT_REFS for tmpfs mounts. Reported by: bdrewery Tested by: sbruno, pho (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2019-08-18 20:36:11 +00:00
Konstantin Belousov	bb9e2184f0	Change locking requirements for VOP_UNSET_TEXT(). Require the vnode to be locked for the VOP_UNSET_TEXT() call. This will be used by the following bug fix for a tmpfs issue. Tested by: sbruno, pho (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2019-08-18 20:24:52 +00:00
Mateusz Guzik	e7c1709aaf	vfs: stop always overwriting ->mnt_stat in VFS_STATFS The struct is already populated on each mount (and remount). Fields are either constant or not used by filesystem in the first place. Some infrequently used functions use it to avoid having to allocate a new buffer and are left alone. The current code results in an avoidable copying single-threaded and significant cache line bouncing multithreaded While here deduplicate initial filling of the struct. Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21317	2019-08-18 18:40:12 +00:00
Jeff Roberson	33205c60e7	Add a blocking wait bit to refcount. This allows refs to be used as a simple barrier. Reviewed by: markj, kib Discussed with: jhb Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21254	2019-08-18 11:43:58 +00:00
Mateusz Guzik	50c7615fb0	fork: rework locking around do_fork - move allproc lock into the func, it is of no use prior to it - the code would lock p1 and p2 while holding allproc to partially construct it after it gets added to the list. instead we can do the work prior to adding anything. - protect lastpid with procid_lock As a side effect we do less work with allproc held. Sponsored by: The FreeBSD Foundation	2019-08-17 18:19:49 +00:00
Mateusz Guzik	60cdcb644d	fork: bump process count before checking for permission to cross the limit The limit is almost never reached. Do the check only on failure to see if we can override it. No change in user-visible behavior. Sponsored by: The FreeBSD Foundation	2019-08-17 17:56:43 +00:00
Mateusz Guzik	b05641b6bd	fork: stop skipping < 100 ids on wrap around Code doing this is commented with a claim that these IDs are occupied by daemons, but that's demonstrably false. To an extent the range is used by init and kernel processes (and on sufficiently big machines it indeed is fully populated). On a sample box 40-way box the highest id in the range is 63. On a different one it is 23. Just use the range. Sponsored by: The FreeBSD Foundation	2019-08-17 17:42:01 +00:00
Alexander Motin	3a60f3dad0	Add support for 'j', 't' and 'z' flags to kernel sscanf(). MFC after: 2 weeks	2019-08-16 19:46:22 +00:00
Jeff Roberson	2194393787	Move phys_avail definition into MI code. It is consumed in the MI layer and doing so adds more flexibility with less redundant code. Reviewed by: jhb, markj, kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21250	2019-08-16 00:45:14 +00:00
Rick Macklem	c61b14315f	Fix copy_file_range(2) so that unneeded blocks are not allocated to the output file. When the byte range for copy_file_range(2) doesn't go to EOF on the output file and there is a hole in the input file, a hole must be "punched" in the output file. This is done by writing a block of bytes all set to 0. Without this patch, the write is done unconditionally which means that, if the output file already has a hole in that byte range, an unneeded data block of all 0 bytes would be allocated. This patch adds code to check for a hole in the output file, so that it can skip doing the write if there is already a hole in that byte range of the output file. This avoids unnecessary allocation of blocks to the output file. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D21155	2019-08-15 23:21:41 +00:00
Jeff Roberson	018ff6860f	Move scheduler state into the per-cpu area where it can be allocated on the correct NUMA domain. Reviewed by: markj, gallatin Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D19315	2019-08-13 04:54:02 +00:00
Konstantin Belousov	7e097daa93	Only enable COMPAT_43 changes for syscalls ABI for a.out processes. Reviewed by: imp, jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D21200	2019-08-11 19:16:07 +00:00
Jonathan T. Looney	afd959f332	In m_pulldown(), before trying to prepend bytes to the subsequent mbuf, ensure that the subsequent mbuf contains the remainder of the bytes the caller sought. If this is not the case, fall through to the code which gathers the bytes in a new mbuf. This fixes a bug where m_pulldown() could fail to gather all the desired bytes into consecutive memory. PR: 238787 Reported by: A reddit user Discussed with: emaste Obtained from: NetBSD MFC after: 3 days	2019-08-09 05:18:59 +00:00
Rick Macklem	6b1bc6f7dd	Remove some harmless cruft from vn_generic_copy_file_range(). An earlier version of the patch had code that set "error" between line#s 2797-2799. When that code was moved, the second check for "error != 0" could never be true and the check became harmless cruft. This patch removes the cruft, mainly to make Coverity happy. Reported by: asomers, cem	2019-08-08 20:07:38 +00:00
Rick Macklem	614633146f	Fix copy_file_range(2) for an unlikely race during hole finding. Since the VOP_IOCTL(FIOSEEKDATA/FIOSEEKHOLE) calls are done with the vnode unlocked, it is possible for another thread to do: - truncate(), lseek(), write() between the two calls and create a hole where FIOSEEKDATA returned the start of data. For this case, VOP_IOCTL(FIOSEEKHOLE) will return the same offset for the hole location. This could result in an infinite loop in the copy code, since copylen is set to 0 and the copy doesn't advance. Usually, this race is avoided because of the use of rangelocks, but the NFS server does not do range locking and could do a sequence like the above to create the hole. This patch checks for this case and makes the hole search fail, to avoid the infinite loop. At this time, it is an open question as to whether or not the NFS server should do range locking to avoid this race.	2019-08-08 19:53:07 +00:00
Konstantin Belousov	b706be23b4	Update comment explaining create_init(). Sponsored by: The FreeBSD Foundation MFC after: 3 days	2019-08-08 16:42:53 +00:00
Xin LI	22bbc4b242	Convert DDB_CTF to use newer version of ZLIB. PR: 229763 Submitted by: Yoshihiro Ota <ota j email ne jp> Differential Revision: https://reviews.freebsd.org/D21176	2019-08-08 07:27:49 +00:00
Conrad Meyer	7d0658ad55	Fix !DDB kernel configurations after r350713 KDB is standard and the kdb_active variable is always available. So, de-conditionalize inclusion of sys/kdb.h in kern_sysctl.c. Reported by: Michael Butler <imb AT protected-networks.net> X-MFC-With: r350713 Sponsored by: Dell EMC Isilon	2019-08-08 01:37:41 +00:00
Conrad Meyer	088c17b46b	ddb(4): Add 'sysctl' command Implement `sysctl` in `ddb` by overriding `SYSCTL_OUT`. When handling the req, we install custom ddb in/out handlers. The out handler prints straight to the debugger, while the in handler ignores all input. This is intended to allow us to print just about any sysctl. There is a known issue when used from ddb(4) entered via 'sysctl debug.kdb.enter=1'. The DDB mode does not quite prevent all lock interactions, and it is possible for the recursive Giant lock to be unlocked when the ddb(4) 'sysctl' command is used. This may result in a panic on return from ddb(4) via 'c' (continue). Obviously, this is not a problem when debugging already-panicked systems. Submitted by: Travis Lane (formerly: <travis.lane AT isilon.com>) Reviewed by: vangyzen (earlier version), Don Morris <dgmorris AT earthlink.net> Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D20219	2019-08-08 00:42:29 +00:00
Conrad Meyer	76cb1112da	sbuf(9): Add sbuf_nl_terminate() API The API is used to gracefully terminate text line(s) with a single \n. If the formatted buffer was empty or already ended in \n, it is unmodified. Otherwise, a newline character is appended to it. The API, like other sbuf-modifying routines, is only valid while the sbuf is not FINISHED. Reviewed by: rlibby Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D21030	2019-08-07 19:27:14 +00:00
Conrad Meyer	d23813cdb9	sbuf(9): Refactor sbuf_newbuf into sbuf_new Code flow was somewhat difficult to read due to the combination of multiple return sites and the 4x possible dynamic constructions of an sbuf. (Future consideration: do we need all 4?) Refactored slightly to improve legibility. No functional change. Sponsored by: Dell EMC Isilon	2019-08-07 19:25:56 +00:00
Conrad Meyer	71db411eb6	sbuf(9): Add NOWAIT dynamic buffer extension mode The goal is to avoid some kinds of low-memory deadlock when formatting heap-allocated buffers. Reviewed by: vangyzen Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D21015	2019-08-07 19:23:07 +00:00
Gleb Smirnoff	814f33aafb	Since r350426 this KASSERT doesn't serve any useful purpose.	2019-08-06 16:11:00 +00:00
Mariusz Zaborski	c878d1eb45	procdesc: fix the function name I changed name of the function r350429 and forgot to update the r350612 patch. Reported by: jenkins MFC after: 1 month	2019-08-05 20:31:17 +00:00
Mariusz Zaborski	9f5103abab	process: style We don't need to check if the parent is already set. This is done already in the proc_reparent. No functional behaviour changes intended. MFC after: 1 month	2019-08-05 20:26:01 +00:00
Mariusz Zaborski	a05cfdf479	exit1: fix style nits MFC after: 1 month	2019-08-05 20:20:14 +00:00
Mariusz Zaborski	fd631bcd95	procdesc: fix reparenting when the debugger is attached The process is reparented to the debugger while it is attached. B B / ----> \| A A D Every time when the process is reparented, it is added to the orphan list of the previous parent: A->orphan = B D->orphan = NULL When the A process will close the process descriptor to the B process, the B process will be reparented to the init process. B B - init \| ----> A D A D A->orphan = B D->orphan = B In this scenario, the B process is in the orphan list of A and D. When the last process descriptor is closed instead of reparenting it to the reaper let it stay with the debugger process and set our previous parent to the reaper. Add test case for this situation. Notice that without this patch the kernel will crash with this test case: panic: orphan 0xfffff8000e990530 of 0xfffff8000e990000 has unexpected oppid 1 Reviewed by: markj, kib MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D20361	2019-08-05 20:15:46 +00:00
Mariusz Zaborski	799d92ab78	proc: introduce the proc_add_orphan function This API allows adding the process to its parent orphan list. Reviewed by: kib, markj MFC after: 1 month	2019-08-05 20:11:57 +00:00
Mariusz Zaborski	41fadb3fca	exit1: postpone clearing P_TRACED flag until the proctree lock is acquired In case of the process being debugged, the P_TRACED flag is cleared very early, which would make procdesc_close() not call proc_clear_orphan(). That would leave the debugged process unable to collect the status of the process with a process descriptor. Reviewed by: markj, kib Tested by: pho MFC after: 1 month	2019-08-05 19:59:23 +00:00
Konstantin Belousov	a1549acbaf	Fix mis-merge. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2019-08-05 19:19:25 +00:00
Konstantin Belousov	01c3ba9752	Fix mis-merge Sponsored by: The FreeBSD Foundation MFC after: 1 week	2019-08-05 19:16:33 +00:00
Justin Hibbits	937a05ba81	Add necessary bits for Linux KPI to work correctly on powerpc PowerPC, and possibly other architectures, use different address ranges for PCI space vs physical address space, which is only mapped at resource activation time, when the BAR gets written. The DRM kernel modules do not activate the rman resources, so as not to waste KVA, instead only mapping parts of the PCI memory at a time. This introduces a BUS_TRANSLATE_RESOURCE() method, implemented in the Open Firmware/FDT PCI driver, to perform this necessary translation without activating the resource. In addition to system KPI changes, LinuxKPI is updated to handle a big-endian host, by adding proper endian swaps to the I/O functions. Submitted by: mmacy Reported by: hselasky Differential Revision: https://reviews.freebsd.org/D21096	2019-08-04 19:28:10 +00:00
John Baldwin	f422bc3092	Set ISOPEN in namei flags when opening executable interpreters. These vnodes are explicitly opened via VOP_OPEN via exec_check_permissions identical to the main executable image. Setting ISOPEN allows filesystems to perform suitable checks in VOP_LOOKUP (e.g. close-to-open consistency in the NFS client). Reviewed by: kib MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D21129	2019-08-03 01:02:52 +00:00
Mark Johnston	8675f5f776	Only check the blessings table for known LORs. Previously we would check for blessings before marking a given lock pair as reversed, so each "reversed" lock acquisition would require a linear scan of the table. Instead, check the table after marking the pair as reversed but before generating a report. Reviewed by: jhb MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21135	2019-08-02 18:01:47 +00:00
Konstantin Belousov	5cbdd18fd4	Make umtxq_check_susp() to correctly handle thread exit requests. The check for P_SINGLE_EXIT was shadowed by the (P_SHOULDSTOP \|\| traced) check. Reported by: bdrewery (might be) Reviewed by: markj Tested by: pho MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D21124	2019-08-01 14:34:27 +00:00
Konstantin Belousov	fc83c5a7d0	Make randomized stack gap between strings and pointers to argv/envs. This effectively makes the stack base on the csu _start entry randomized. The gap is enabled if ASLR is for the ABI is enabled, and then kern.elf{64,32}.aslr.stack_gap specify the max percentage of the initial stack size that can be wasted for gap. Setting it to zero disables the gap, and max is capped at 50%. Only amd64 for now. Reviewed by: cem, markj Discussed with: emaste MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D21081	2019-07-31 20:23:10 +00:00
Konstantin Belousov	fd336e2ac0	Fix handling of transient casueword(9) failures in do_sem_wait(). In particular, restart should be only done when the failure is transient. For this, recheck the count1 value after the operation. Note that do_sem_wait() is older usem interface. Reported and tested by: bdrewery Sponsored by: The FreeBSD Foundation MFC after: 1 week	2019-07-31 19:16:49 +00:00
Kyle Evans	b5a7ac997f	kern_shm_open: push O_CLOEXEC into caller control The motivation for this change is to allow wrappers around shm to be written that don't set CLOEXEC. kern_shm_open currently accepts O_CLOEXEC but sets it unconditionally. kern_shm_open is used by the shm_open(2) syscall, which is mandated by POSIX to set CLOEXEC, and CloudABI's sys_fd_create1(). Presumably O_CLOEXEC is intended in the latter caller, but it's unclear from the context. sys_shm_open() now unconditionally sets O_CLOEXEC to meet POSIX requirements, and a comment has been dropped in to kern_fd_open() to explain the situation and add a pointer to where O_CLOEXEC setting is maintained for shm_open(2) correctness. CloudABI's sys_fd_create1() also unconditionally sets O_CLOEXEC to match previous behavior. This also has the side-effect of making flags correctly reflect the O_CLOEXEC status on this fd for the rest of kern_shm_open(), but a glance-over leads me to believe that it didn't really matter. Reviewed by: kib, markj MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D21119	2019-07-31 15:16:51 +00:00
Mark Johnston	49c3e8c8d1	Enable witness(4) blessings. witness has long had a facility to "bless" designated lock pairs. Lock order reversals between a pair of blessed locks are not reported upon. We have a number of long-standing false positive LOR reports; start marking well-understood LORs as blessed. This change hides reports about UFS vnode locks and the UFS dirhash lock, and UFS vnode locks and buffer locks, since those are the two that I observe most often. In the long term it would be preferable to be able to limit blessings to a specific site where a lock is acquired, and/or extend witness to understand why some lock order reversals are valid (for example, if code paths with conflicting lock orders are serialized by a third lock), but in the meantime the false positives frequently confuse users and generate bug reports. Reviewed by: cem, kib, mckusick MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21039	2019-07-30 17:09:58 +00:00
Mark Johnston	ed13ff4549	Regenerate after r350447.	2019-07-30 16:01:16 +00:00
Mark Johnston	f30f7b9870	Enable copy_file_range(2) in capability mode. copy_file_range() operates on a pair of file descriptors; it requires CAP_READ for the source descriptor and CAP_WRITE for the destination descriptor. Reviewed by: kevans, oshogbo Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21113	2019-07-30 15:59:44 +00:00
Xin LI	d4565741c6	Remove gzip'ed a.out support. The current implementation of gzipped a.out support was based on a very old version of InfoZIP which ships with an ancient modified version of zlib, and was removed from the GENERIC kernel in 1999 when we moved to an ELF world. PR: 205822 Reviewed by: imp, kib, emaste, Yoshihiro Ota <ota at j.email.ne.jp> Relnotes: yes Differential Revision: https://reviews.freebsd.org/D21099	2019-07-30 05:13:16 +00:00
Mark Johnston	98549e2dc6	Centralize the logic in vfs_vmio_unwire() and sendfile_free_page(). Both of these functions atomically unwire a page, optionally attempt to free the page, and enqueue or requeue the page. Add functions vm_page_release() and vm_page_release_locked() to perform the same task. The latter must be called with the page's object lock held. As a side effect of this refactoring, the buffer cache will no longer attempt to free mapped pages when completing direct I/O. This is consistent with the handling of pages by sendfile(SF_NOCACHE). Reviewed by: alc, kib MFC after: 2 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20986	2019-07-29 22:01:28 +00:00
Mariusz Zaborski	9db97ca0bd	proc: make clear_orphan an public API This will be useful for other patches with process descriptors. Change its name as well. Reviewed by: markj, kib	2019-07-29 21:42:57 +00:00
Alan Somers	0367bca479	sendfile: don't panic when VOP_GETPAGES_ASYNC returns an error This is a partial merge of 350144 from projects/fuse2 PR: 236466 Reviewed by: markj MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21095	2019-07-29 20:50:26 +00:00
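
Commit 2e1b32c0e3 in the listing above describes a trivial FIOSEEKDATA/FIOSEEKHOLE algorithm: return the offset argument for a data seek and the file's size for a hole seek. The following is a minimal userland sketch of that idea, not the kernel's vop_stdioctl() code. The names `trivial_seek` and `seek_kind` are hypothetical, and returning ENXIO for an offset outside the file is an assumption about what "errors consistent with vn_bmap_seekhole()" means.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the FIOSEEKDATA / FIOSEEKHOLE requests. */
enum seek_kind { SEEK_KIND_DATA, SEEK_KIND_HOLE };

/*
 * Treat the whole file as one data region followed by a single
 * implicit hole at end-of-file.
 *
 * On success returns 0 and updates *offp; on failure returns an
 * errno value and leaves *offp unchanged.
 */
static int
trivial_seek(enum seek_kind kind, int64_t filesize, int64_t *offp)
{
	assert(offp != NULL);

	/* Assumption: offsets outside [0, filesize) fail with ENXIO. */
	if (*offp < 0 || *offp >= filesize)
		return (ENXIO);

	/* The only hole is at EOF; data starts at the given offset. */
	if (kind == SEEK_KIND_HOLE)
		*offp = filesize;
	return (0);
}
```

With this model, a data seek at offset 10 in a 100-byte file leaves the offset at 10, while a hole seek from the same offset moves it to 100. An application that only uses the two requests to skip holes therefore sees the file as fully allocated, which stays correct even though it finds no sparse regions.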

17154 Commits