freebsd-skq

Author	SHA1	Message	Date
kevans	77fb93e1b7	tty_pts: don't rely on tty header pollution for sys/mutex.h tty_pts.c relies on sys/tty.h for sys/mutex.h. Include it directly instead of relying on this pollution to ease the diff for anyone that wants to try converting the tty lock to anything other than a mutex.	2019-11-29 03:56:01 +00:00
jeff	a65d31ef2d	Handle large mallocs by going directly to kmem. Taking a detour through UMA does not provide any additional value. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D22563	2019-11-29 03:14:10 +00:00
jeff	6e09ead90c	Fix DEBUG_REDZONE build after r355169	2019-11-28 08:56:14 +00:00
hselasky	79dc3a05bf	Factor out check for mounted root file system. Differential Revision: https://reviews.freebsd.org/D22571 PR: 241639 MFC after: 1 week Sponsored by: Mellanox Technologies	2019-11-28 08:47:36 +00:00
jeff	049ad3955f	Garbage collect the mostly unused us_keg field. Use appropriately named union members in vm_page.h to store the zone and slab. Remove some nearby dead code. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D22564	2019-11-28 07:49:25 +00:00
kib	60d99c176d	Requested and tested by: kevans Reviewed by: kevans (previous version), markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D22546	2019-11-27 20:33:53 +00:00
rlibby	b5630a819f	witness: sleepable rm locks are not sleepable in read mode There are two classes of rm lock, one "sleepable" and one not. But even a "sleepable" rm lock is only sleepable in write mode, and is non-sleepable when taken in read mode. Warn about sleepable rm locks in read mode as non-sleepable locks. Do this by defining a new lock operation flag, LOP_NOSLEEP, to indicate that a lock is non-sleepable despite what the LO_SLEEPABLE flag would indicate, and defining a new witness lock instance flag, LI_SLEEPABLE, to track the product of LO_SLEEPABLE and LOP_NOSLEEP on the lock instance. Reviewed by: markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D22527	2019-11-27 01:54:39 +00:00
mjg	d21b67186e	cache: stop reusing .. entries on enter It almost never happens in practice anyway. With this eliminated ->nc_vp cannot change vnodes, removing an obstacle on the road to lockless lookup.	2019-11-27 01:21:42 +00:00
mjg	5e8cfe32e0	cache: fix numcache accounting on entry . entries are never created and .. can reuse existing entries, meaning the early count bump is both spurious and leading to overcounting in certain cases.	2019-11-27 01:20:55 +00:00
mjg	a93204e206	cache: hide "doingcache" behind DEBUG_CACHE	2019-11-27 01:20:21 +00:00
hselasky	abea55f57f	Fix panic when loading kernel modules before root file system is mounted. Make sure the rootvnode is always NULL checked. Differential Revision: https://reviews.freebsd.org/D22545 PR: 241639 MFC after: 1 week Sponsored by: Mellanox Technologies	2019-11-26 12:20:44 +00:00
oshogbo	4ae67fb7ab	procdesc: allow to collect status through wait(2) if process is traced A debugger like truss(1) depends on the wait(2) syscall. This syscall waits for ALL children. When it is waiting for ALL children, the children created by process descriptors are not returned. This behavior was introduced because we want to implement libraries which may pdfork(2). The behavior of process descriptors breaks truss(1) because it will not be able to collect the status of processes with process descriptors. To address this problem the status is returned to parent when the child is traced. While the process is traced the debugger is the new parent. In case the original parent and debugger are the same process it means the debugger explicitly used pdfork() to create the child. In that case the debugger should be using kqueue()/pdwait() instead of wait(). Add test case to verify that. The test case was implemented by markj@. Reviewed by: kib, markj Discussed with: jhb MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D20362	2019-11-25 18:33:21 +00:00
rlibby	32e5f65de4	sysctl sysctls: wire old buf before output with sysctl lock Several sysctl sysctls output to a user buffer while holding a non-sleepable lock that protects the sysctl topology. They need to wire the output buffer, or else they may try to sleep on a page fault. Reviewed by: cem, markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D22528	2019-11-25 07:38:27 +00:00
kib	404183d739	Record part of the owner struct thread pointer into busy_lock. Record as many bits from curthread into busy_lock as fit. Low bits for struct thread * representation are zero due to struct and zone alignment, and they leave space for busy flags (perhaps except statically allocated thread0). Upper bits are not very interesting for assert, and in most practical situations the recorded value should allow to manually identify the owner with certainty. Assert that unbusy is performed by the owner, except in a few places where unbusy is done in the io completion handler. For this case, add _unchecked variants of asserts and unbusy primitives. Reviewed by: markj (previous version) Tested by: pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D22298	2019-11-24 19:12:23 +00:00
imp	3193ed06d2	Add a warning about Giant Locked devices Add a warning when a device registers with devfs and requests D_NEEDGIANT. The warning says the device will go away before 13.0. This is needed to flush out the devices in the tree that are still Giant locked. This warning, or some variant of it, should have gone into the tree a long time ago... The intention is to require all devices be converted to not use automatic giant in this way, or remove any such devices that remain that we don't have the hardware to test a conversion of. kbd so far is the only device that can't leave the tree, yet needs something sensible done to avoid the auto giant lock (even if it is just doing the wrapping itself). There may be others added to this list... Any discussions of this topic will take place on arch@.	2019-11-23 23:57:26 +00:00
cem	35d496b56a	Add explicit SI_SUB_EPOCH Add explicit SI_SUB_EPOCH, after SI_SUB_TASKQ and before SI_SUB_SMP (EARLY_AP_STARTUP). Rename existing "SI_SUB_TASKQ + 1" to SI_SUB_EPOCH. epoch(9) consumers cannot epoch_alloc() before SI_SUB_EPOCH:SI_ORDER_SECOND, but likely should allocate before SI_SUB_SMP. Prior to this change, consumers (well, epoch itself, and net/if.c) just open-coded the SI_SUB_TASKQ + 1 order to match epoch.c, but this was fragile. Reviewed by: mmacy Differential Revision: https://reviews.freebsd.org/D22503	2019-11-22 23:23:40 +00:00
glebius	63e627ce4f	cc_ktr_event_name is used only with KTR	2019-11-21 23:55:43 +00:00
mav	7484143fd8	Add variant of root_mount_hold() without allocation. It allows to use this KPI in non-sleepable contexts. MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2019-11-21 21:59:35 +00:00
andrew	d5bfef0bc3	Disable KCSAN within a panic. The kernel is single threaded at this point and the panic is more important. Sponsored by: DARPA, AFRL	2019-11-21 13:59:01 +00:00
andrew	e95c204297	Add kcsan_md_unsupported from NetBSD. It's used to ignore virtual addresses that may have a different physical address depending on the CPU. Sponsored by: DARPA, AFRL	2019-11-21 13:22:23 +00:00
andrew	34537aa902	Fix the bus_space functions with KCSAN on arm64. Arm64 doesn't define the bus_space_set_multi_stream and bus_space_set_region_stream functions. Don't try to define them there. Sponsored by: DARPA, AFRL	2019-11-21 13:12:58 +00:00
andrew	6e5970c8f4	Port the NetBSD KCSAN runtime to FreeBSD. Update the NetBSD Kernel Concurrency Sanitizer (KCSAN) runtime to work in the FreeBSD kernel. It is a useful tool for finding data races between threads executing on different CPUs. This can be enabled by enabling KCSAN in the kernel config, or by using the GENERIC-KCSAN amd64 kernel. It works on amd64 and arm64, however the latter needs a compiler change to allow -fsanitize=thread that KCSAN uses. Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D22315	2019-11-21 11:22:08 +00:00
andrew	b2251a42aa	Import the NetBSD Kernel Concurrency Sanitizer (KCSAN) runtime. KCSAN is a tool to find concurrent memory access that may race each other. After a determined number of memory accesses a cell is created, this describes the current access. It will then delay for a short period to allow other CPUs a chance to race. If another CPU performs a memory access to an overlapping region during this delay the race is reported. This is a straight import of the NetBSD code, it will be adapted to FreeBSD in a future commit. Sponsored by: DARPA, AFRL	2019-11-20 14:37:48 +00:00
mjg	5f4e2edeab	cache: minor stat cleanup Remove duplicated stats and move numcachehv from debug to vfs.cache.	2019-11-20 12:08:32 +00:00
mjg	41890de334	vfs: perform a more racy check in vfs_notify_upper Locking mp does not buy anything in terms of correctness and only contributes to contention.	2019-11-20 12:07:54 +00:00
mjg	b1e239e6e2	vfs: change si_usecount management to count used vnodes Currently si_usecount is effectively a sum of usecounts from all associated vnodes. This is maintained by special-casing for VCHR every time usecount is modified. Apart from complicating the code a little bit, it has a scalability impact since it forces a read from a cacheline shared with said count. There are no consumers of the feature in the ports tree. In head there are only 2: revoke and devfs_close. Both can get away with a weaker requirement than the exact usecount, namely just the count of active vnodes. Changing the meaning to the latter means we only need to modify it on 0<->1 transitions, avoiding the check plenty of times (and entirely in something like vrefact). Reviewed by: kib, jeff Tested by: pho Differential Revision: https://reviews.freebsd.org/D22202	2019-11-20 12:05:59 +00:00
jeff	be1b482c07	Simplify anonymous memory handling with an OBJ_ANON flag. This eliminates redundant complicated checks and additional locking required only for anonymous memory. Introduce vm_object_allocate_anon() to create these objects. DEFAULT and SWAP objects now have the correct settings for non-anonymous consumers and so individual consumers need not modify the default flags to create super-pages and avoid ONEMAPPING/NOSPLIT. Reviewed by: alc, dougm, kib, markj Tested by: pho Differential Revision: https://reviews.freebsd.org/D22119	2019-11-19 23:19:43 +00:00
kevans	b317a3c030	sysent: regenerate after r354835 The lua-based makesyscalls produces slightly different output than its makesyscalls.sh predecessor, all whitespace differences more closely matching the source syscalls.master.	2019-11-18 23:31:12 +00:00
kevans	60027726b9	Convert in-tree sysent targets to use new makesyscalls.lua flua is bootstrapped as part of the build for those on older versions/revisions that don't yet have flua installed. Once upgraded past r354833, "make sysent" will again naturally work as expected. Reviewed by: brooks Differential Revision: https://reviews.freebsd.org/D21894	2019-11-18 23:28:23 +00:00
jhb	81f62ee15e	Check for errors from copyout() and suword*() in sv_copyout_args/strings. Reviewed by: brooks, kib Tested on: amd64 (amd64, i386, linux64), i386 (i386, linux) Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D22401	2019-11-18 20:07:43 +00:00
dab	4faee8fc9d	Jail and capability mode for shm_rename; add audit support for shm_rename Co-mingling two things here: * Addressing some feedback from Konstantin and Kyle re: jail, capability mode, and a few other things * Adding audit support as promised. The audit support change includes a partial refresh of OpenBSM from upstream, where the change to add shm_rename has already been accepted. Matthew doesn't plan to work on refreshing anything else to support audit for those new event types. Submitted by: Matthew Bryan <matthew.bryan@isilon.com> Reviewed by: kib Relnotes: Yes Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D22083	2019-11-18 13:31:16 +00:00
kib	11ac3a4ad9	kern_exec: p_osrel and p_fctl0 were obliterated by failed execve(2) attempt. Zeroing of them is needed so that an image activator can update the values as appropriate (or not set at all). Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D22379	2019-11-17 14:52:45 +00:00
scottl	f7f4a6b4a2	Create a new sysctl subtree, machdep.mitigations. Its purpose is to organize knobs and indicators for code that mitigates functional and security issues in the architecture/platform. Controls for regular operational policy should still go into places security, hw, kern, etc. The machdep root node is inherently architecture dependent, but mitigations tend to be architecture dependent as well. Some cases like Spectre do cross architectural boundaries, but the mitigation code for them tends to be architecture dependent anyways, and multiple architectures won't be active in the same image of the kernel. Many mitigation knobs already exist in the system, and they will be moved with compat naming in the future. Going forward, mitigations should collect in machdep.mitigations. Reviewed by: imp, brooks, rwatson, emaste, jhb Sponsored by: Intel	2019-11-15 23:27:17 +00:00
jhb	3f50cb7491	Add a sv_copyout_auxargs() hook in sysentvec. Change the FreeBSD ELF ABIs to use this new hook to copyout ELF auxv instead of doing it in the sv_fixup hook. In particular, this new hook allows the stack space to be allocated at the same time the auxv values are copied out to userland. This allows us to avoid wasting space for unused auxv entries as well as not having to recalculate where the auxv vector is by walking back up over the argv and environment vectors. Reviewed by: brooks, emaste Tested on: amd64 (amd64 and i386 binaries), i386, mips, mips64 Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D22355	2019-11-15 18:42:13 +00:00
brooks	7f81c60b0a	Tidy syscall declarations. Pointer arguments should be of the form "<type> *..." and not "<type>* ...". No functional change. Reviewed by: kevans Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D22373	2019-11-14 17:11:52 +00:00
markj	88b25bcb9d	Fix handling of PIPE_EOF in the direct write path. Suppose a writing thread has pinned its pages and gone to sleep with pipe_map.cnt > 0. Suppose that the thread is woken up by a signal (so error != 0) and the other end of the pipe has simultaneously been closed. In this case, to satisfy the assertion about pipe_map.cnt in pipe_destroy_write_buffer(), we must mark the buffer as empty. Reported by: syzbot+5cce271bf2cb1b1e1876@syzkaller.appspotmail.com Reviewed by: kib Tested by: pho MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D22261	2019-11-11 20:44:30 +00:00
rmacklem	f9b312bf8a	Update copy_file_range(2) to be Linux5 compatible. The current linux man page and testing done on a fairly recent linux5.n kernel have identified two changes to the semantics of the linux copy_file_range system call. Since the copy_file_range(2) system call is intended to be linux compatible and is only currently in head/current and not used by any commands, it seems appropriate to update the system call to be compatible with the current linux one. The first of these semantic changes was changed to be compatible with linux5.n by r354564. For the second semantic change, the old linux man page stated that, if infd and outfd referred to the same file, EBADF should be returned. Now, the semantics is to allow infd and outfd to refer to the same file so long as the byte ranges defined by the input file offset, output file offset and len do not overlap. If the byte ranges do overlap, EINVAL should be returned. This patch modifies copy_file_range(2) to be linux5.n compatible for this semantic change.	2019-11-10 01:08:14 +00:00
rmacklem	fd4b12ce42	Update copy_file_range(2) to be Linux5 compatible. The current linux man page and testing done on a fairly recent linux5.n kernel have identified two changes to the semantics of the linux copy_file_range system call. Since the copy_file_range(2) system call is intended to be linux compatible and is only currently in head/current and not used by any commands, it seems appropriate to update the system call to be compatible with the current linux one. The old linux man page stated that, if the offset + len exceeded file_size for the input file, EINVAL should be returned. Now, the semantics is to copy up to at most file_size bytes and return that number of bytes copied. If the offset is at or beyond file_size, a return of 0 bytes is done. This patch modifies copy_file_range(2) to be linux compatible for this semantic change. A separate patch will change copy_file_range(2) for the other semantic change, which allows the infd and outfd to refer to the same file, so long as the byte ranges do not overlap.	2019-11-08 23:39:17 +00:00
glebius	62dc620e39	Mechanically convert INP_INFO_RLOCK() to NET_EPOCH_ENTER(). Remove few outdated comments and extraneous assertions. No functional change here.	2019-11-07 00:08:34 +00:00
glebius	9012fc643e	If vm_pager_get_pages_async() returns an error synchronously we leak wired and busy pages. Add code that carefully cleans up the state in case of synchronous error return. Cover a case when a first I/O went on asynchronously, but second or N-th returned error synchronously. In collaboration with: chs Reviewed by: jtl, kib	2019-11-06 23:45:43 +00:00
bz	6d77ad290d	m_pulldown(): Change an if () panic() into a KASSERT(). If we pass in a NULL mbuf to m_pulldown() we are in a bad situation already. There is no point in doing that check for production code. Change the if () panic() into a KASSERT. MFC after: 3 weeks Sponsored by: Netflix	2019-11-06 22:40:19 +00:00
brooks	325c38b94e	libstats: Improve ABI assertion. On platforms where pointers are larger than 64-bits, struct statsblob may be harmlessly padded out such that opaque[] always has some included space. Make the assertion more general by comparing to the offset of opaque rather than the size of struct statsblob. Discussed with: jhb, James Clarke Reviewed by: trasz, lstewart Obtained from: CheriBSD Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D22188	2019-11-06 19:44:44 +00:00
mav	ed6ee7405b	Some more taskqueue optimizations. - Optimize enqueue for two task priority values by adding new tq_hint field, pointing to the last task inserted into the middle of the list. In case of more than two priority values it should halve average search. - Move tq_active insert/remove out of the taskqueue_run_locked loop. Instead of dirtying few shared cache lines per task introduce different mechanism to drain active tasks, based on task sequence number counter, that uses only cache lines already present in cache. Since the new mechanism does not need ordering, switch tq_active from TAILQ to LIST. - Move static and dynamic struct taskqueue fields into different cache lines. Move lock into its own cache line, so that heavy lock spinning by multiple waiting threads would not affect the running thread. - While there, correct some TQ_SLEEP() wait messages. This change fixes certain ZFS write workloads, causing huge congestion on taskqueue lock. Those workloads combine some large block writes to saturate the pool and trigger allocation throttling, which uses higher priority tasks to requeue the delayed I/Os, with many small blocks to generate deep queue of small tasks for taskqueue to sort. MFC after: 1 week Sponsored by: iXsystems, Inc.	2019-11-01 22:49:44 +00:00
emaste	685a165c4c	avoid kernel stack data leak in core dump thrmisc note bzero the entire thrmisc struct, not just the padding. Other core dump notes are already done this way. Reported by: Ilja Van Sprundel <ivansprundel@ioactive.com> Reviewed by: markj MFC after: 3 days Sponsored by: The FreeBSD Foundation	2019-10-31 20:42:36 +00:00
jeff	bff69757f0	Replace OBJ_MIGHTBEDIRTY with a system using atomics. Remove the TMPFS_DIRTY flag and use the same system. This enables further fault locking improvements by allowing more faults to proceed with a shared lock. Reviewed by: kib Tested by: pho Differential Revision: https://reviews.freebsd.org/D22116	2019-10-29 21:06:34 +00:00
jeff	d122abaabb	Drop the object lock in vfs_bio and cluster where it is now safe to do so. Recent changes to busy/valid/dirty have enabled page based synchronization and the object lock is no longer required in many cases. Reviewed by: kib Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D21597	2019-10-29 20:37:59 +00:00
glebius	8bb52bf920	Merge td_epochnest with td_no_sleeping. Epoch itself doesn't rely on the counter and it is provided merely for sleeping subsystems to check it. - In functions that sleep use THREAD_CAN_SLEEP() to assert correctness. With EPOCH_TRACE compiled print epoch info. - _sleep() was a wrong place to put the assertion for epoch, right place is sleepq_add(), as there are ways to call the latter bypassing _sleep(). - Do not increase td_no_sleeping in non-preemptible epochs. The critical section would trigger all possible safeguards, no sleeping counter is extraneous. Reviewed by: kib	2019-10-29 17:28:25 +00:00
kib	b01d1a3a2f	amd64: move pcb out of kstack to struct thread. This saves 320 bytes of the precious stack space. The only negative aspect of the change I can think of is that the struct thread increased by 320 bytes obviously, and that 320 bytes are not swapped out anymore. I believe the freed stack space is much more important than that. Also, current struct thread size is 1392 bytes on amd64, so UMA will allocate two thread structures per (4KB) slab, which leaves a space for pcb without increasing zone memory use. Reviewed by: alc, markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D22138	2019-10-25 20:09:42 +00:00
glebius	74a423d9ac	Use THREAD_CAN_SLEEP() macro to check if thread can sleep. There is no functional change. Discussed with: kib	2019-10-24 21:55:19 +00:00
jhb	7622bc9ddb	Use a counter with a random base for explicit IVs in GCM. This permits constructing the entire TLS header in ktls_frame() rather than ktls_seq(). This also matches the approach used by OpenSSL which uses an incrementing nonce as the explicit IV rather than the sequence number. Reviewed by: gallatin Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D22117	2019-10-24 18:13:26 +00:00
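
The two copy_file_range(2) commits above (fd4b12ce42 and f9b312bf8a) describe two semantic rules: clamp the copy at the end of the input file, and reject overlapping byte ranges when infd and outfd refer to the same file. The following is a minimal user-space sketch of those two rules only, not the kernel code. The function names `cfr_clamp_len` and `cfr_check_same_file` are hypothetical, offsets are assumed non-negative, and arithmetic overflow is assumed not to occur.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of bytes that may be copied when the
 * request starts at in_off in an input file of in_size bytes.
 * At or beyond end of file the result is 0; otherwise the request
 * is shortened to the bytes that remain in the file.
 */
static int64_t
cfr_clamp_len(int64_t in_size, int64_t in_off, int64_t len)
{
	assert(in_size >= 0 && in_off >= 0 && len >= 0);
	if (in_off >= in_size)
		return (0);
	if (len > in_size - in_off)
		return (in_size - in_off);
	return (len);
}

/*
 * Hypothetical helper for the case where infd and outfd refer to the
 * same file. The half-open ranges [in_off, in_off + len) and
 * [out_off, out_off + len) overlap exactly when each one starts
 * before the other one ends.
 */
static int
cfr_check_same_file(int64_t in_off, int64_t out_off, int64_t len)
{
	assert(in_off >= 0 && out_off >= 0 && len >= 0);
	if (in_off < out_off + len && out_off < in_off + len)
		return (EINVAL);
	return (0);
}
```

With these definitions, `cfr_clamp_len(100, 90, 50)` yields 10 and `cfr_clamp_len(100, 100, 8)` yields 0, while `cfr_check_same_file(0, 50, 50)` is accepted and `cfr_check_same_file(0, 49, 50)` returns EINVAL because byte 49 belongs to both ranges.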

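The witness commit above (b5630a819f) says a lock instance counts as sleepable only when the lock carries LO_SLEEPABLE and the operation does not pass LOP_NOSLEEP. A small sketch of that rule follows. The `EX_`-prefixed flag values and the function name `ex_instance_flags` are made up for illustration and are not the kernel's definitions.

```c
/* Hypothetical flag values, for illustration only. */
#define	EX_LO_SLEEPABLE	0x1u	/* lock flag: lock class may sleep */
#define	EX_LOP_NOSLEEP	0x2u	/* operation flag: this acquire may not sleep */
#define	EX_LI_SLEEPABLE	0x4u	/* instance flag: this acquisition may sleep */

/*
 * Derive the per-instance flags for one acquisition. The instance is
 * sleepable only if the lock is sleepable AND the operation did not
 * override that with the no-sleep flag (e.g. an rm lock taken in
 * read mode, per the commit message).
 */
static unsigned int
ex_instance_flags(unsigned int lock_flags, unsigned int op_flags)
{
	unsigned int li_flags = 0;

	if ((lock_flags & EX_LO_SLEEPABLE) != 0 &&
	    (op_flags & EX_LOP_NOSLEEP) == 0)
		li_flags |= EX_LI_SLEEPABLE;
	return (li_flags);
}
```

In this sketch a write-mode acquire of a sleepable rm lock would be `ex_instance_flags(EX_LO_SLEEPABLE, 0)`, which sets the instance flag, and a read-mode acquire would be `ex_instance_flags(EX_LO_SLEEPABLE, EX_LOP_NOSLEEP)`, which leaves it clear.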

16978 Commits