freebsd-dev

Author	SHA1	Message	Date
Konstantin Belousov	fd94177c70	Add sysctl debug.kdb.stack_overflow to conveniently test kernel handling of the kstack overflow. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-01-13 11:59:49 +00:00
Mateusz Guzik	1b54ffc8d2	sx: retry hard shared unlock just like in r327905 for rwlocks	2018-01-13 09:26:24 +00:00
Mateusz Guzik	84f2a8a4b4	rwlock: try regular read unlock even in the hard path Saves on turnstile trips if the lock got more readers.	2018-01-13 00:05:31 +00:00
Jeff Roberson	ab3185d15e	Implement NUMA support in uma(9) and malloc(9). Allocations from specific domains can be done by the _domain() API variants. UMA also supports a first-touch policy via the NUMA zone flag. The slab layer is now segregated by VM domains and is precise. It handles iteration for round-robin directly. The per-cpu cache layer remains a mix of domains according to where memory is allocated and freed. Well behaved clients can achieve perfect locality with no performance penalty. The direct domain allocation functions have to visit the slab layer and so require per-zone locks which come at some expense. Reviewed by: Attilio (a slightly older version) Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon	2018-01-12 23:25:05 +00:00
Jeff Roberson	7a469c8ef3	Implement NUMA policy for kmem_*(9). This maintains compatibility with reservations by giving each memory domain its own KVA space in vmem that is naturally aligned on superpage boundaries. Reviewed by: alc, markj, kib (some objections) Sponsored by: Netflix, Dell/EMC Isilon Tested by: pho Differential Revision: https://reviews.freebsd.org/D13289	2018-01-12 23:13:55 +00:00
Jeff Roberson	af80820a57	Regenerate auto-generated files	2018-01-12 23:06:35 +00:00
Jeff Roberson	3f289c3fcf	Implement 'domainset', a cpuset based NUMA policy mechanism. This allows userspace to control NUMA policy administratively and programmatically. Implement domainset based iterators in the page layer. Remove the now legacy numa_* syscalls. Clean up some header pollution created by having seq.h in proc.h. Reviewed by: markj, kib Discussed with: alc Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D13403	2018-01-12 22:48:23 +00:00
Mateusz Guzik	310f24d72a	mtx: use fcmpset to cover setting MTX_CONTESTED	2018-01-12 13:40:50 +00:00
Mateusz Guzik	31c2c6e95e	vfs: tidy up vdrop Skip vfs_refcount_release_if_not_last if the interlock is held and just go straight to refcount_release. While here do cosmetic rearrangement of _vhold to better show it contains equivalent behaviour.	2018-01-12 13:39:02 +00:00
Michael Tuexen	ce076a1f58	Ensure that the vnet is set when calling pru_sockaddr() and pru_peeraddr(). This is already true when called via kern_getsockname() and kern_getpeername(). This patch sets it also, when they are called via soo_fill_kinfo(). This is necessary, since the corresponding functions for SCTP require the vnet to be set. Without this, if a process having a wildcard bound SCTP socket is terminated and a core is written, the kernel panics. Reviewed by: bz Differential Revision: https://reviews.freebsd.org/D13652	2018-01-11 20:26:17 +00:00
Conrad Meyer	c02fc9607a	mallocarray(9): panic if the requested allocation would overflow Additionally, move the overflow check logic out to WOULD_OVERFLOW() for consumers to have a common means of testing for overflowing allocations. WOULD_OVERFLOW() should be a secondary check -- on 64-bit platforms, just because an allocation won't overflow size_t does not mean it is a sane size to request. Callers should be imposing reasonable allocation limits far, far, below overflow. Discussed with: emaste, jhb, kp Sponsored by: Dell EMC Isilon	2018-01-10 21:49:45 +00:00
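The overflow check described in the mallocarray(9) entry above can be illustrated with a minimal userland sketch. This is not the kernel's WOULD_OVERFLOW() macro or its mallocarray() implementation: the names `sketch_would_overflow`, `sketch_mallocarray`, and the cap `SKETCH_MAX_ALLOC` are hypothetical, and the sketch returns NULL where the kernel version is described as panicking.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical caller-imposed limit, far below the overflow point. */
#define SKETCH_MAX_ALLOC ((size_t)1 << 30)

/* True if nmemb * size cannot be represented in a size_t. */
static bool
sketch_would_overflow(size_t nmemb, size_t size)
{
	return (nmemb != 0 && size > SIZE_MAX / nmemb);
}

/*
 * Userland stand-in: refuse overflowing or unreasonably large requests.
 * Like the kernel function without M_ZERO, the memory is not zeroed.
 */
static void *
sketch_mallocarray(size_t nmemb, size_t size)
{
	if (sketch_would_overflow(nmemb, size))
		return (NULL);		/* secondary check: arithmetic overflow */
	if (nmemb * size > SKETCH_MAX_ALLOC)
		return (NULL);		/* primary check: sane size limit */
	return (malloc(nmemb * size));
}
```

The division form avoids performing the multiplication before it is known to be safe, and the separate size cap reflects the commit's point that not overflowing size_t does not make a request reasonable.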
John Baldwin	86bbef4379	Don't store shadow copies of per-process AIO limits. Previously the AIO subsystem would save a snapshot of the currently configured per-process limits the first time a process used AIO. The process would continue to use the snapshotted limits ignoring any changes to the global limits during the rest of its lifetime. This change removes the snapshotted values and changes the AIO code to always check the global values which can be toggled at runtime. This means an administrator can now change the effective limits of existing processes. This is more consistent with how other limits configured via sysctl work in FreeBSD. Reviewed by: asomers, kib MFC after: 2 weeks Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D13819	2018-01-10 21:18:46 +00:00
John Baldwin	f54c5606b3	Allow the fast-path for disk AIO requests to fail requests. - If aio_qphysio() returns a non-zero error code, fail the request rather than queueing it to the AIO kproc pool to be retried via the slow path. Currently this means that if vm_fault_quick_hold_pages() reports an error, EFAULT is returned from the fast-path rather than retrying the request in the slow path where it will still fail with EFAULT. - If aio_qphysio() wishes to use the fast path for a device that doesn't support unmapped I/O but there are already the maximum number of such requests in flight, fail with EAGAIN as we do for other AIO resource limits rather than queueing the request to the AIO kproc pool. - Move the opcode check for aio_qphysio() out of the caller and into aio_qphysio() to simplify some logic and remove two goto's while here. It also uses a whitelist (only supported for LIO_READ / LIO_WRITE) rather than a blacklist (skipped for LIO_SYNC). PR: 217261 Submitted by: jkim (an earlier version) MFC after: 2 weeks Sponsored by: Chelsio Communications	2018-01-10 00:18:47 +00:00
John Baldwin	7e40918452	Simplify some logic by merging an if test with a subsequent switch. Specifically, in aio_queue_file() the code was doing this: if (opcode == LIO_SYNC) { ... } switch (opcode) { ... case LIO_SYNC: ... } This moves the body of the if statement into the LIO_SYNC case of the switch statement. MFC after: 2 weeks Sponsored by: Chelsio Communications	2018-01-10 00:02:06 +00:00
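The restructuring described in the entry above (an `if` on one opcode followed by a `switch` on the same value) can be sketched as below. The enum and the arithmetic stand-ins are hypothetical and only show that both shapes behave the same; they are not the code from aio_queue_file().

```c
/* Hypothetical opcodes standing in for LIO_READ, LIO_WRITE, LIO_SYNC. */
enum sk_opcode { SK_READ, SK_WRITE, SK_SYNC };

/* Before: the sync-only work sits in a separate if ahead of the switch. */
static int
sk_queue_before(enum sk_opcode op)
{
	int work = 0;

	if (op == SK_SYNC)
		work += 10;	/* stand-in for the sync-only setup */
	switch (op) {
	case SK_READ:
		work += 1;
		break;
	case SK_WRITE:
		work += 2;
		break;
	case SK_SYNC:
		work += 3;
		break;
	}
	return (work);
}

/* After: the body of the if moves into the matching case. */
static int
sk_queue_after(enum sk_opcode op)
{
	int work = 0;

	switch (op) {
	case SK_READ:
		work += 1;
		break;
	case SK_WRITE:
		work += 2;
		break;
	case SK_SYNC:
		work += 10;	/* former if body */
		work += 3;
		break;
	}
	return (work);
}
```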
John Baldwin	8091e52b42	Add a counter to track in-flight AIO requests using unmapped I/O. MFC after: 2 weeks Sponsored by: Chelsio Communications	2018-01-09 23:57:29 +00:00
Mark Johnston	78f57a9cde	Generalize the gzio API. We currently use a set of subroutines in kern_gzio.c to perform compression of user and kernel core dumps. In the interest of adding support for other compression algorithms (zstd) in this role without complicating the API consumers, add a simple compressor API which can be used to select an algorithm. Also change the (non-default) GZIO kernel option to not enable compressed user cores by default. It's not clear that such a default would be desirable with support for multiple algorithms implemented, and it's inconsistent in that it isn't applied to kernel dumps. Reviewed by: cem Differential Revision: https://reviews.freebsd.org/D13632	2018-01-08 21:27:41 +00:00
Ian Lepore	ac579135b0	Use EVENTHANDLER_DIRECT_INVOKE for [un]mount events, for better performance.	2018-01-07 18:07:22 +00:00
Ian Lepore	f031a3b25f	Use EVENTHANDLER_DIRECT_INVOKE() for device events, for better performance.	2018-01-07 18:06:30 +00:00
Kristof Provost	fd91e076c1	Introduce mallocarray() in the kernel Similar to calloc() the mallocarray() function checks for integer overflows before allocating memory. It does not zero memory, unless the M_ZERO flag is set. Reviewed by: pfg, vangyzen (previous version), imp (previous version) Obtained from: OpenBSD Differential Revision: https://reviews.freebsd.org/D13766	2018-01-07 13:21:01 +00:00
Gleb Smirnoff	b4f55763ce	In sendfile_iodone() both pru_abort and sorele need to be executed with proper VNET context set. Reported by: sbruno MFC after: 2 weeks	2018-01-05 20:21:46 +00:00
John Baldwin	2da93c21ec	Always use atomic_fetchadd() when updating per-user accounting values. This avoids re-reading a variable after it has been updated via an atomic op. It is just a cosmetic cleanup as the read value was only used to control a diagnostic printf that should rarely occur (if ever). Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D13768	2018-01-04 22:07:58 +00:00
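The idea in the entry above, using the value returned by the atomic add instead of reading the variable again, can be sketched with C11 atomics. `atomic_fetch_add` is used here as a userland analogue of the kernel's atomic_fetchadd(); the helper name `sk_adjust` is hypothetical.

```c
#include <stdatomic.h>

/*
 * Add diff to *counter and return the resulting value. The result is
 * derived from the old value returned by the atomic operation, so the
 * counter is never re-read after the update.
 */
static long
sk_adjust(_Atomic long *counter, long diff)
{
	long old = atomic_fetch_add(counter, diff);

	return (old + diff);
}
```

A separate load after the add could observe a concurrent update from another thread; deriving the value from the returned old value always describes this caller's own update.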
John Baldwin	3160862437	Report offset relative to the backing object for kinfo_vmentry structures. For the pathname reported in kinfo_vmentry structures (kve_path), the sysctl handlers walk the object chain to find the bottom-most VM object. This permits a COW mapping of a file with dirty pages to report the pathname of the originally mapped file. Do the same for the object offset (kve_offset) computing a cumulative offset during the same object walk so that the reported offset is relative to the reported pathname. Note that ptrace(PT_VM_ENTRY) already returns a cumulative offset rather than the raw offset of the VM map entry. Note also that this does not affect procstat -v output (even structured output) since that output does not include the kve_offset field. Reviewed by: kib MFC after: 2 weeks Sponsored by: DARPA / AFRL Differential Revision: https://reviews.freebsd.org/D13767	2018-01-04 21:59:34 +00:00
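The cumulative offset computation described in the entry above can be sketched as a walk down a chain of backing objects. The `struct sk_object` type is a simplified, hypothetical stand-in for a VM object, not the kernel's structure, and locking is omitted.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a VM object that may shadow another object. */
struct sk_object {
	struct sk_object *backing;	/* NULL for the bottom-most object */
	uint64_t backing_offset;	/* offset of this object in backing */
};

/*
 * Walk to the bottom-most object, adding each level's offset so that
 * *offset ends up relative to the object that is returned.
 */
static const struct sk_object *
sk_bottom_object(const struct sk_object *obj, uint64_t *offset)
{
	while (obj->backing != NULL) {
		*offset += obj->backing_offset;
		obj = obj->backing;
	}
	return (obj);
}
```

Accumulating the offset in the same walk that finds the bottom object keeps the reported offset and the reported pathname referring to the same object.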
Mike Karels	d626b50b9d	make SW_WATCHDOG dynamic Enable the hardclock-based watchdog previously conditional on the SW_WATCHDOG option whenever hardware watchdogs are not found, and watchdogd attempts to enable the watchdog. The SW_WATCHDOG option still causes the software watchdog to be enabled even if there is a hardware watchdog. This does not change the other software-based watchdog enabled by the --softtimeout option to watchdogd. Note that the code to reprime the watchdog during kernel core dumps is no longer conditional on SW_WATCHDOG. I think this was previously a bug. Reviewed by: imp alfred bjk MFC after: 1 week Relnotes: yes Differential Revision: https://reviews.freebsd.org/D13713	2018-01-03 00:56:30 +00:00
Antoine Brodin	1b25176cbc	sysctl_kern_proc_args: do not take the fast path if p_args is NULL In this case it falls back to reading ps_strings	2018-01-01 21:25:01 +00:00
Colin Percival	d5d7606c0c	Use the TSLOG framework to record entry/exit timestamps for DELAY and _vprintf; these functions are called in many places and can contribute meaningfully to the total time spent booting.	2017-12-31 09:24:41 +00:00
Colin Percival	49a4e3b4b4	Instrument thread creations for the benefit of the TSLOG framework. This assists in tracking time spent while the boot is being "held" waiting for something to happen.	2017-12-31 09:24:11 +00:00
Colin Percival	8b8a7c43a9	Instrument "boot holds" for the benefit of the TSLOG framework. These are places where the "main thread" of the booting kernel (either the thread which later becomes swapper or the thread which later becomes init) has to stop and wait for action to take place in another thread before continuing. There are currently three such holds: 1. The intr_config_hooks SYSINIT waits for hooks registered via the config_intrhook_establish function; this allows (typically) devices which need interrupts enabled to complete their initialization to do so before root is mounted. 2. The g_waitidle function waits for the GEOM event queue to be empty; this ensures that all of the disks which have been attached have been tasted before we attempt to mount root. 3. The vfs_mountroot_wait function (in addition to calling g_waitidle) waits for holds registered via root_mount_hold; among other things, this is used by the USB subsystem to ensure that we don't fail to mount root if it's located on a USB disk which takes a while to probe.	2017-12-31 09:23:52 +00:00
Colin Percival	a21a2da599	Teach makeobjops.awk to accept PROLOG and EPILOG blocks before METHOD and STATICMETHOD declarations; that code will be inserted into the dispatch function before and after the method call. Use this functionality and the TSLOG framework to record DEVICE_ATTACH and DEVICE_PROBE entry/exit timestamps.	2017-12-31 09:23:19 +00:00
Colin Percival	6032e08810	Use the TSLOG framework to record entry/exit timestamps for machine independent functions with important roles in the early boot process: mi_startup (with the "exit" recorded when it becomes swapper), start_init (with the "exit" recorded when the thread is about to "return" into the newly created init process), vfs_mountroot, and vfs_mountroot_wait.	2017-12-31 09:22:31 +00:00
Colin Percival	e31e71991a	Code for recording timestamps of events, especially function entries/exits. This is a very primitive system, intended for use in measuring performance during the early system boot, before more sophisticated tools like DTrace or infrastructure like kernel memory allocation and mutexes are available. Because this code records pointers to strings rather than copying strings (in order to keep the memory usage more manageable), if a kernel module is unloaded after logging an event, Bad Things can happen. Users are advised to not do that. Since cycle counts from the early kernel boot are used as an initial entropy source, publishing this information to userland could result in inadequate entropy being kept private to the kernel RNG. Users are advised to not enable this on systems with untrusted users. Discussed on: freebsd-current	2017-12-31 09:21:01 +00:00
Pedro F. Giffuni	0879ca728a	sysv_{ipc|shm}: update the NetBSD VCS tags to match nearer our files. Both files originated in NetBSD: sysv_ipc.c CVS 1.9: Most of their changes don't apply to us as we already have similar changes. This is a better reference for future merges. sysv_shm.c CVS 1.39: Most of their changes don't apply to our code but interestingly this revision merged our changes and is a better point for reference. Move the VCS tags to the position recommended in our committers guide (section 8), No functional change.	2017-12-31 03:34:00 +00:00
Mateusz Guzik	efa9f177f5	locks: adjust loop limit check when waiting for readers The check was for the exact value, but since the counter started being incremented by the number of readers it could have jumped over.	2017-12-31 02:31:01 +00:00
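The problem described in the entry above, an exact-value limit check being jumped over once the counter advances by more than one per iteration, can be shown with a small sketch. Both helpers and their parameters are hypothetical and only model the arithmetic, not the lock code.

```c
#include <stdbool.h>

/* Does a loop that tests `i == limit` ever notice the limit? */
static bool
sk_hits_limit_exact(int limit, int step, int max_iters)
{
	int i = 0;

	for (int n = 0; n < max_iters; n++) {
		i += step;		/* e.g. advanced by the reader count */
		if (i == limit)
			return (true);
	}
	return (false);
}

/* Same loop with the adjusted check `i >= limit`. */
static bool
sk_hits_limit_ge(int limit, int step, int max_iters)
{
	int i = 0;

	for (int n = 0; n < max_iters; n++) {
		i += step;
		if (i >= limit)
			return (true);
	}
	return (false);
}
```

With a limit of 10 and a step of 3 the counter takes the values 3, 6, 9, 12, and so on, so the exact comparison never fires while the `>=` comparison fires at 12.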
Mateusz Guzik	cde25ed4cd	sx: fix up non-smp compilation after r327397	2017-12-31 01:59:56 +00:00
Mateusz Guzik	28f1a9e3ff	locks: re-check the reason to go to sleep after locking sleepq/turnstile In both rw and sx locks we always go to sleep if the lock owner is not running. We do spin for some time if the lock is read-locked. However, if we decide to go to sleep due to the lock owner being off cpu and after sleepq/turnstile gets acquired the lock is read-locked, we should fallback to the aforementioned wait.	2017-12-31 00:47:04 +00:00
Mateusz Guzik	fb10612355	sx: read the SX_NOADAPTIVE flag and Giant ownership only once These used to be read multiple times when waiting for the lock to become free, which had the potential to issue completely avoidable traffic.	2017-12-31 00:37:50 +00:00
Mateusz Guzik	15140a8ade	mtx: deduplicate indefinite wait check in spinlocks and thread lock	2017-12-31 00:34:29 +00:00
Mateusz Guzik	1f4d28c7ea	mtx: pre-read the lock value in thread_lock_flags_ Since this function is effectively slow path, if we get here the lock is most likely already taken in which case it is cheaper to not blindly attempt the atomic op. While here move hwpmc probe out of the loop to match other primitives.	2017-12-31 00:33:28 +00:00
Mateusz Guzik	80c39f6c37	rwlock: tidy up __rw_runlock_hard similarly to r325921	2017-12-31 00:31:14 +00:00
Konstantin Belousov	baaa79699a	Make kern_proc_vmmap_resident() externally accessible, and move the vmmap_skip_res_cnt control check inside it. Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D13595	2017-12-28 13:16:32 +00:00
Eitan Adler	caa7e52f3f	kernel: Fix several typos and minor errors - duplicate words - typos - references to old versions of FreeBSD Reviewed by: imp, benno	2017-12-27 03:23:21 +00:00
Alexander Kabaev	151ba7933a	Do pass removing some write-only variables from the kernel. This reduces noise when kernel is compiled by newer GCC versions, such as one used by external toolchain ports. Reviewed by: kib, andrew(sys/arm and sys/arm64), emaste(partial), erj(partial) Reviewed by: jhb (sys/dev/pci/* sys/kern/vfs_aio.c and sys/kern/kern_synch.c) Differential Revision: https://reviews.freebsd.org/D10385	2017-12-25 04:48:39 +00:00
Alexander Kabaev	6d41588b6b	Reverse the check to allocate the buffer if cached pointer is NULL. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D13596	2017-12-23 17:55:19 +00:00
Alexander Kabaev	4daa09f343	Remove dead store to local variable.	2017-12-23 16:49:57 +00:00
Bruce Evans	da9fba5447	Use resume_cpus() instead of restart_cpus() to resume from ACPI suspension. restart_cpus() worked well enough by accident. Before this set of fixes, resume_cpus() used the same cpuset (started_cpus, meaning CPUs directed to restart) as restart_cpus(). resume_cpus() waited for the wrong cpuset (stopped_cpus) to become empty, but since mixtures of stopped and suspended CPUs are not close to working, stopped_cpus must be empty when resuming so the wait is null -- restart_cpus just allows the other CPUs to restart and returns without waiting. Fix resume_cpus() to wait on a non-wrong cpuset for the ACPI case, and add further kludges to try to keep it working for the XEN case. It was only used for XEN. It waited on suspended_cpus. This works for XEN. However, for ACPI, resuming is a 2-step process. ACPI has already woken up the other CPUs and removed them from suspended_cpus. This fix records the move by putting them in a new cpuset resuming_cpus. Waiting on suspended_cpus would give the same null wait as waiting on stopped_cpus. Wait on resuming_cpus instead. Add a cpuset toresume_cpus to map the CPUs being told to resume to keep this separate from the cpuset started_cpus for mapping the CPUs being told to restart. Mixtures of stopped and suspended/resuming CPUs are still far from working. Describe new and some old cpusets in comments. Add further kludges to cpususpend_handler() to try to avoid breaking it for XEN. XEN doesn't use resumectx(), so it doesn't use the second return path for savectx(), and it goes from the suspended state directly to the restarted state, while ACPI resume goes through the resuming state. Enter the resuming state early for all cases so that resume_cpus can test for being in this state and not have to worry about the intermediate !suspended state for ACPI only. Reviewed by: kib	2017-12-21 09:17:48 +00:00
John Baldwin	b501cc5da6	Rework pathconf handling for FIFOs. On the one hand, FIFOs should respect other variables not supported by the fifofs vnode operation (such as _PC_NAME_MAX, _PC_LINK_MAX, etc.). These values are fs-specific and must come from a fs-specific method. On the other hand, filesystems that support FIFOs are required to support _PC_PIPE_BUF on directory vnodes that can contain FIFOs. Given this latter requirement, once the fs-specific VOP_PATHCONF method supports _PC_PIPE_BUF for directories, it is also suitable for FIFOs permitting a single VOP_PATHCONF method to be used for both FIFOs and non-FIFOs. To that end, retire all of the FIFO-specific pathconf methods from filesystems and change FIFO-specific vnode operation switches to use the existing fs-specific VOP_PATHCONF method. For fifofs, set its VOP_PATHCONF to VOP_PANIC since it should no longer be used. While here, move _PC_PIPE_BUF handling out of vop_stdpathconf() so that only filesystems supporting FIFOs will report a value. In addition, only report a valid _PC_PIPE_BUF for directories and FIFOs. Discussed with: bde Reviewed by: kib (part of a larger patch) MFC after: 1 month Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D12572	2017-12-19 22:39:05 +00:00
John Baldwin	599afe53a8	Move NAME_MAX, LINK_MAX, and CHOWN_RESTRICTED out of vop_stdpathconf(). Having all filesystems fall through to default values isn't always correct and these values can vary for different filesystem implementations. Most of these changes just use the existing default values with a few exceptions: - Don't report CHOWN_RESTRICTED for ZFS since it doesn't do the exact permissions check this claims for chown(). - Use NANDFS_NAME_LEN for NAME_MAX for nandfs. - Don't report a LINK_MAX of 0 on smbfs. Now fail with EINVAL to indicate hard links aren't supported. Requested by: bde (though perhaps not this exact implementation) Reviewed by: kib (earlier version) MFC after: 1 month Sponsored by: Chelsio Communications	2017-12-19 19:51:36 +00:00
John Baldwin	dd688800e1	Add a custom VOP_PATHCONF method for fdescfs. The method handles NAME_MAX and LINK_MAX explicitly. For all other pathconf variables, the method passes the request down to the underlying file descriptor. This requires splitting a kern_fpathconf() syscallsubr routine out of sys_fpathconf(). Also, to avoid lock order reversals with vnode locks, the fdescfs vnode is unlocked around the call to kern_fpathconf(), but with the usecount of the vnode bumped. MFC after: 1 month Sponsored by: Chelsio Communications	2017-12-19 18:20:38 +00:00
Konstantin Belousov	6f697994fd	Use atomic_load(9) to read ppsinfo sequence numbers. In this case volatile qualifiers ensure that a compiler does not optimize the accesses out. Reviewed by: alc, jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D13534	2017-12-19 10:05:45 +00:00
Pedro F. Giffuni	62cf53fdac	SPDX: some uses of the RSA-MD license.	2017-12-13 16:30:39 +00:00
Fedor Uporov	4ba058c0cf	Fix kernel build if MAC is not defined. Reported by: Ravi Pokala, Andrew Turner Approved by: pfg (mentor) MFC after: 1 week	2017-12-13 16:14:38 +00:00
Fedor Uporov	61b214f338	Move buffer size checks outside of the vnode locks. Reviewed by: kib, cem, pfg (mentor) Approved by: pfg (mentor) MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D13405	2017-12-12 20:15:57 +00:00
Bruce Evans	fb3cc1c37d	Move instantiation of msgbufp from 9 MD files to subr_prf.c. This variable should be pure MI except possibly for reading it in MD dump routines. Its initialization was pure MD in 4.4BSD, but FreeBSD changed this in r36441 in 1998. There were many imperfections in r36441. This commit fixes only a small one, to simplify fixing the others 1 arch at a time. (r47678 added support for special/early/multiple message buffer initialization which I want in a more general form, but this was too fragile to use because hacking on the msgbufp global corrupted it, and was only used for 5 hours in -current...)	2017-12-07 07:55:38 +00:00
Mark Johnston	e1703ef5ae	Plug a name cache lock leak. Reviewed by: mjg MFC after: 1 week Sponsored by: Dell EMC Isilon	2017-12-01 22:51:02 +00:00
Konstantin Belousov	36bce27be9	Destroy seltd st_mtx and st_wait in seltdfini(). A correct destruction is important for WITNESS(4) and LOCK_PROFILING(9). Submitted by: Sebastian Huber <sebastian.huber@embedded-brains.de> MFC after: 1 week	2017-12-01 11:18:19 +00:00
Pedro F. Giffuni	64de3fdd58	SPDX: use the Beerware identifier.	2017-11-30 20:33:45 +00:00
Hans Petter Selasky	1408b84a26	The sched_add() function is not only used when the thread is initially started, but also by the turnstiles to mark a thread as runnable for all locks, for instance sleepqueues do: setrunnable()->sched_wakeup()->sched_add() In r326218 code was added to allow booting from non-zero CPU numbers by setting the ts_cpu field inside the ULE scheduler's sched_add() function. This had an undesired side-effect that prior sched_pin() and sched_bind() calls got disregarded. This patch fixes the initialization of the ts_cpu field for the ULE scheduler to only happen once when the initial thread is constructed during system init. Forking will then later on ensure that a valid ts_cpu value gets copied to all children. Reviewed by: jhb, kib Discussed with: nwhitehorn MFC after: 1 month Differential revision: https://reviews.freebsd.org/D13298 Sponsored by: Mellanox Technologies	2017-11-29 23:28:40 +00:00
Alexey Dokuchaev	2c9ec07528	Fix several noticed style issues. Reviewed by: bde Approved by: bapt	2017-11-29 12:49:22 +00:00
Jeff Roberson	2e47807c21	Eliminate kmem_arena and kmem_object in preparation for further NUMA commits. The arena argument to kmem_*() is now only used in an assert. A follow-up commit will remove the argument altogether before we freeze the API for the next release. This replaces the hard limit on kmem size with a soft limit imposed by UMA. When the soft limit is exceeded we periodically wakeup the UMA reclaim thread to attempt to shrink KVA. On 32bit architectures this should behave much more gracefully as we exhaust KVA. On 64bit the limits are likely never hit. Reviewed by: markj, kib (some objections) Discussed with: alc Tested by: pho Sponsored by: Netflix / Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D13187	2017-11-28 23:40:54 +00:00
Brooks Davis	5cd667e65f	Disable vim syntax highlighting. Vim's default pick doesn't understand that ';' is a comment character and the result looks horrible. Reviewed by: emaste	2017-11-28 18:23:17 +00:00
Edward Tomasz Napierala	212ff84f4a	Make kdb_reenter() silent when explicitly called from db_error(). This removes the useless backtrace on various ddb(4) user errors. Reviewed by: jhb@ Obtained from: CheriBSD MFC after: 2 weeks Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D13212	2017-11-28 12:53:55 +00:00
Nathan Whitehorn	51de47e3f8	Remove assertion that a CPU be present before returning a PCPU for it. It is up to the caller to check for a NULL return value. The assert was meant to catch buggy code that did not check the return value. Some code, however, was smart and used the return value to see if a CPU existed, which this broke. Requested by: jhb@	2017-11-28 05:39:48 +00:00
Pedro F. Giffuni	8a36da99de	sys/kern: adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 2-Clause license, however the tool I was using misidentified many licenses so this was mostly a manual - error prone - task. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, supersede or replace the license texts.	2017-11-27 15:20:12 +00:00
Mateusz Guzik	e57b2b1830	rw: fix runlock_hard when new readers show up When waiters/writer spinner flags are set no new readers can show up unless they already have a different rw lock read locked. The change in r326195 failed to take that into account - in presence of new readers it would spin until they all drain, which would lead to trouble if e.g. they go off cpu and can get scheduled because of this thread. Reported by: pho	2017-11-26 21:10:47 +00:00
Nathan Whitehorn	efe67753cc	Remove some, but not all, assumptions that the BSP is CPU 0 and that CPUs are numbered densely from there to n_cpus. MFC after: 1 month	2017-11-25 23:41:05 +00:00
Mateusz Guzik	2c50bafef5	Add the missing lockstat check for thread lock.	2017-11-25 20:49:27 +00:00
Mateusz Guzik	5ba6facfcd	rwlock: fix up compilation of the previous change committed wrong version of the patch	2017-11-25 20:25:45 +00:00
Mateusz Guzik	c1e1a7ec30	rwlock: add __rw_try_{r,w}lock_int	2017-11-25 20:22:51 +00:00
Mateusz Guzik	cec1747322	sx: change sunlock to wake waiters up if it locked sleepq sleepq is only locked if the curthread is the last reader. By the time the lock gets acquired new ones could have arrived. The previous code would unlock and loop back. This results in spurious relocking of sleepq. This is a step towards xadd-based unlock routine.	2017-11-25 20:13:50 +00:00
Mateusz Guzik	93118b62f9	locks: retry turnstile/sleepq loops on failed cmpset In order to go to sleep threads set waiter flags, but that can spuriously fail e.g. when a new reader arrives. Instead of unlocking everything and looping back, re-evaluate the new state while still holding the lock necessary to go to sleep.	2017-11-25 20:10:33 +00:00
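The retry pattern described in the entry above can be sketched with C11 atomics, where a failed `atomic_compare_exchange_weak` writes the freshly observed value back into the expected variable, so the loop can re-evaluate the new state without a separate read. The lock word layout (bit 0 as a waiters flag, remaining bits as a holder count) and the name `sk_set_waiters` are hypothetical, and the sleepq/turnstile lock that the real code holds across the loop is omitted.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SK_WAITERS ((uintptr_t)0x1)	/* hypothetical waiters flag */

/*
 * Try to mark the lock as having waiters. Returns true if the flag is
 * set and the caller should go to sleep, false if the lock was found
 * free and the caller should retry the acquisition instead.
 */
static bool
sk_set_waiters(_Atomic uintptr_t *lock)
{
	uintptr_t v = atomic_load(lock);

	for (;;) {
		if ((v & ~SK_WAITERS) == 0)
			return (false);	/* no holders left */
		if ((v & SK_WAITERS) != 0)
			return (true);	/* someone already set the flag */
		if (atomic_compare_exchange_weak(lock, &v, v | SK_WAITERS))
			return (true);
		/* CAS failed: v now holds the new state, so re-evaluate it. */
	}
}
```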
Mateusz Guzik	2e106e0427	rwlock: stop re-reading the owner when going to sleep	2017-11-25 20:08:11 +00:00
John Baldwin	ffb6607984	Decode kevent structures logged via ktrace(2) in kdump. - Add a new KTR_STRUCT_ARRAY ktrace record type which dumps an array of structures. The structure name in the record payload is preceded by a size_t containing the size of the individual structures. Use this to replace the previous code that dumped the kevent arrays dumped for kevent(). kdump is now able to decode the kevent structures rather than dumping their contents via a hexdump. One change from before is that the 'changes' and 'events' arrays are not marked with separate 'read' and 'write' annotations in kdump output. Instead, the first array is the 'changes' array, and the second array (only present if kevent doesn't fail with an error) is the 'events' array. For kevent(), empty arrays are denoted by an entry with an array containing zero entries rather than no record. - Move kevent decoding tables from truss to libsysdecode. This adds three new functions to decode members of struct kevent: sysdecode_kevent_filter, sysdecode_kevent_flags, and sysdecode_kevent_fflags. kdump uses these helper functions to pretty-print kevent fields. - Move structure definitions for freebsd11 and freebsd32 kevent structures to <sys/event.h> so that they can be shared with userland. The 32-bit structures are only exposed if _WANT_KEVENT32 is defined. The freebsd11 structures are only exposed if _WANT_FREEBSD11_KEVENT is defined. The 32-bit freebsd11 structure requires both. - Decode freebsd11 kevent structures in truss for the compat11.kevent() system call. - Log 32-bit kevent structures via ktrace for 32-bit compat kevent() system calls. - While here, constify the 'void *data' argument to ktrstruct(). Reviewed by: kib (earlier version) MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D12470	2017-11-25 04:49:12 +00:00
Mark Johnston	dbe4541db2	Have lockstat:::sx-release fire only after the lock state has changed. MFC after: 1 week	2017-11-24 19:04:31 +00:00
Mark Johnston	26d94f99af	Add a missing lockstat:::sx-downgrade probe. We were returning without firing the probe when the lock had no shared waiters. MFC after: 1 week	2017-11-24 19:02:06 +00:00
Ed Schouten	814629dd64	Don't let cpu_set_syscall_retval() clobber exec_setregs(). Upon successful completion, the execve() system call invokes exec_setregs() to initialize the registers of the initial thread of the newly executed process. What is weird is that when execve() returns, it still goes through the normal system call return path, clobbering the registers with the system call's return value (td->td_retval). Though this doesn't seem to be problematic for x86 most of the time (as the value of eax/rax doesn't matter upon startup), this can be pretty frustrating for architectures where function argument and return registers overlap (e.g., ARM). On these systems, exec_setregs() also needs to initialize td_retval. Even worse are architectures where cpu_set_syscall_retval() sets registers to values not derived from td_retval. On these architectures, there is no way cpu_set_syscall_retval() can set registers to the way it wants them to be upon the start of execution. To get rid of this madness, let sys_execve() return EJUSTRETURN. This will cause cpu_set_syscall_retval() to leave registers intact. This makes process execution easier to understand. It also eliminates the difference between execution of the initial process and successive ones. The initial call to sys_execve() is not performed through a system call context. Reviewed by: kib, jhibbits Differential Revision: https://reviews.freebsd.org/D13180	2017-11-24 07:35:08 +00:00
Konstantin Belousov	ee50062cfb	Kill all descendants of the reaper, even if they are descendants of a subordinate reaper. Also, mark reapers when listing pids. Reported by: Michael Zuo <muh.muhten@gmail.com> PR: 223745 Reviewed by: bapt Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D13183	2017-11-23 11:25:11 +00:00
Mateusz Guzik	2d96bd8812	sx: unbreak debug after r326107 An assertion was modified to use the found value, but it was not updated to handle a race where blocked threads appear after the entrance to the func. Move the assertion down to the area protected with sleepq lock where the lock is read anyway. This does not affect coverage of the assertion and is consistent with what rw locks are doing. Reported by: Shawn Webb	2017-11-23 03:40:51 +00:00
Mateusz Guzik	62b0676cde	rwlock: unbreak WITNESS builds after r326110 Reported by: Shawn Webb	2017-11-23 03:20:12 +00:00
Mateusz Guzik	70502e39d3	rwlock: don't check for curthread's read lock count in the fast path	2017-11-22 23:52:05 +00:00
Mateusz Guzik	b584eb2e90	locks: pass the found lock value to unlock slow path This avoids an explicit read later. While here whack the cheaply obtainable 'tid' argument.	2017-11-22 22:04:04 +00:00
Mateusz Guzik	013c0b493f	locks: remove the file + line argument from internal primitives when not used The pair is of use only in debug or LOCKPROF kernels, but was passed (zeroed) for many locks even in production kernels. While here whack the tid argument from wlock hard and xlock hard. There is no kbi change of any sort - "external" primitives still accept the pair.	2017-11-22 21:51:17 +00:00
Mark Johnston	755230eb9f	Clean up the SYSINIT_FLAGS definitions for rwlock(9) and rmlock(9). Avoid duplication in their macro definitions, and document them. No functional change intended. MFC after: 1 week	2017-11-21 14:59:23 +00:00
Scott Long	cab229b2a6	Update a comment in brelse() to match reality.	2017-11-20 20:53:03 +00:00
Pedro F. Giffuni	51369649b0	sys: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 3-Clause license. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, supersede or replace the license texts. Special thanks to Wind River for providing access to "The Duke of Highlander" tool: an older (2014) run over FreeBSD tree was useful as a starting point.	2017-11-20 19:43:44 +00:00
Pedro F. Giffuni	df57947f08	spdx: initial adoption of licensing ID tags. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, supersede or replace the license texts. Special thanks to Wind River for providing access to "The Duke of Highlander" tool: an older (2014) run over FreeBSD tree was useful as a starting point. Initially, only tag files that use BSD 4-Clause "Original" license. RelNotes: yes Differential Revision: https://reviews.freebsd.org/D13133	2017-11-18 14:26:50 +00:00
Mateusz Guzik	284194f183	locks: fix compilation issues without SMP or KDTRACE_HOOKS	2017-11-17 23:27:06 +00:00
Mateusz Guzik	18f23540d8	lockmgr: remove the ADAPTIVE_LOCKMGRS option The code was never enabled and is very heavy weight. A revamped adaptive spinning may show up at a later time. Discussed with: kib	2017-11-17 20:41:17 +00:00
Conrad Meyer	38d84d683e	vfs_lookup: Allow PATH_MAX-1 symlinks Previously, symlinks in FreeBSD were artificially limited to PATH_MAX-2. Add a short test case to verify the change. Submitted by: Gaurav Gangalwar <ggangalwar AT isilon.com> Reviewed by: kib Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D12589	2017-11-17 19:25:39 +00:00
Mateusz Guzik	2ccee9cc52	mtx: add missing parts of the diff in r325920 Fixes build breakage.	2017-11-17 02:59:28 +00:00
Mateusz Guzik	32aef9ff05	sched: move panic handling code out of choosethread This avoids jumps in the common case of the kernel not being panicked.	2017-11-17 02:45:38 +00:00
Mateusz Guzik	997131646f	Check for PRS_NEW without locking the proc in sysctl_kern_proc	2017-11-17 02:29:06 +00:00
Mateusz Guzik	bc24577c25	sx: perform a minor cleanup of the unlock slowpath No functional changes.	2017-11-17 02:27:04 +00:00
Mateusz Guzik	8fef6b2c67	rwlock: unlock before traversing threads to wake up While here perform a minor cleanup of the unlock path.	2017-11-17 02:26:15 +00:00
Mateusz Guzik	8448e02081	mtx: unlock before traversing threads to wake up This shortens the lock hold time while not affecting correctness. All the woken up threads end up competing and can lose the race against a completely unrelated thread getting the lock anyway.	2017-11-17 02:25:04 +00:00
Mateusz Guzik	ae7d25a4d7	locks: pull up PMC_SOFT_CALLs out of slow path loops	2017-11-17 02:22:51 +00:00
Mateusz Guzik	3af300592c	rwlock: avoid branches in the slow path if lockstat is disabled	2017-11-17 02:21:24 +00:00
Mateusz Guzik	e41d616684	sx: avoid branches in the slow path if lockstat is disabled	2017-11-17 02:21:07 +00:00
Gordon Tetlow	edb01d11f8	Properly bzero kldstat structure to prevent kernel information leak. Submitted by: kib Reported by: TJ Corley Security: CVE-2017-1088	2017-11-15 22:30:21 +00:00
Ed Maste	81d606f52e	disallow clock_settime too far in the future to avoid panic clock_ts_to_ct has a KASSERT that the converted year fits into four digits. By default (sysctl debug.allow_insane_settime is 0) the kernel disallows a time too far in the future, using a value of 9999 366-day years. However, clock_settime is epoch-relative and the assertion will fail with a tv_sec corresponding to some 8030 years. Avoid trying to be too clever, and just use a limit of 8000 365-day years past the epoch. Submitted by: Heqing Yan <scottieyan@gmail.com> Reported by: Syzkaller (https://github.com/google/syzkaller) MFC after: 1 week Sponsored by: The FreeBSD Foundation	2017-11-14 18:18:18 +00:00
Warner Losh	48f1a4921e	Add two new tunables / sysctls to control reboot after panic: kern.poweroff_on_panic which, when enabled, instructs a system to power off on a panic instead of a reboot. kern.powercyle_on_panic which, when enabled, instructs a system to power cycle, if possible, on a panic instead of a reboot. Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D13042	2017-11-14 00:29:14 +00:00
John Baldwin	7e3e36068b	Move loop to clear TDB_SUSPEND into PT_DETACH case. The PT_DETACH case above the sendsig: label already looped over all threads clearing flags in td_dbgflags. Reuse this loop to clear TDB_SUSPEND and move the logic out of the sendsig: block.	2017-11-13 21:22:33 +00:00
John Baldwin	2a2b23cae2	Pull the PT_ATTACH case out of the 'sendsig:' block. Most of the conditionals in the 'sendsig:' block are now only different for PT_ATTACH vs other continue requests. Pull the PT_ATTACH-specific logic up into the PT_ATTACH case and simplify the 'sendsig:' block. This also permits moving the unlock of proctree_lock above the sendsig: label since PT_KILL doesn't hold the lock and the other cases all fall through to the label. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D13073	2017-11-13 21:09:08 +00:00
John Baldwin	feeaec18d4	Only clear a pending thread event if one is pending. This fixes a panic when attaching to an already-stopped process after r325028. While here, clean up a few other things in the control flow of the 'sendsig' section: - Only check for P_STOPPED_TRACE rather than either of P_STOPPED_SIG or P_STOPPED_TRACE for most ptrace requests. The signal handling code in kern_sig.c never sets just P_STOPPED_SIG for a traced process, so if P_STOPPED_SIG is stopped, P_STOPPED_TRACE should be set anyway. Remove a related debug printf. Assuming P_STOPPED_TRACE permits simplifications in the 'sendsig:' block. - Move the block to clear the pending thread state up into a new block conditional on P_STOPPED_TRACE and handle delivering pending signals to the reporting thread and clearing the reporting thread's state in this block. - Consolidate case to send a signal to the process in a single case for PT_ATTACH. The only case that could have been in the else before was a PT_ATTACH where P_STOPPED_SIG was not set, so both instances of kern_psignal() collapse down to just PT_ATTACH. Reported by: pho, mmel Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D12837	2017-11-13 19:58:58 +00:00
Xin LI	712dda7fb0	Be more careful when doing calculation with request from userland. MFC after: 2 weeks	2017-11-13 07:47:43 +00:00
Mateusz Guzik	fe7979a12c	Use passed thread pointer instead of curthread in sys_sched_yield No functional changes.	2017-11-12 02:34:33 +00:00
Mateusz Guzik	baaa6ec7ed	Avoid locking and refing in sysctl_kern_proc_args if possible. Turns out the sysctl is called a lot e.g. by pkg-static.	2017-11-11 22:39:33 +00:00
Mateusz Guzik	8b9817a443	sysctl: try to avoid malloc in name2oid name2oid is called all the time and passed names are almost always very short (< 16 characters).	2017-11-11 21:50:36 +00:00
Mateusz Guzik	537d0fb138	Use pfind_any in linux_rt_sigqueueinfo and kern_sigqueue	2017-11-11 18:10:09 +00:00
Mateusz Guzik	6e1619dae3	Add pfind_any It looks for both regular and zombie processes. This avoids allproc relocking previously seen with pfind -> zpfind calls.	2017-11-11 18:04:39 +00:00
Mateusz Guzik	272640b7fc	Avoid allproc lock in pfind if curproc->pid == pid	2017-11-11 18:03:26 +00:00
Mateusz Guzik	9b57bf75d0	Remove useless proc lookup from sysctl_out_proc	2017-11-11 18:02:23 +00:00
Mateusz Guzik	c7e4e92ecd	rwlock: use fcmpset for setting RW_LOCK_WRITE_SPINNER	2017-11-11 09:34:11 +00:00
Matt Joras	2ca45184dc	Introduce EVENTHANDLER_LIST and some users. This introduces a facility to EVENTHANDLER(9) for explicitly defining a reference to an event handler list. This is useful since previously all invokers of events had to do a locked traversal of the global list of event handler lists in order to find the appropriate event handler list. By keeping a pointer to the appropriate list an invoker can avoid this traversal completely. The pointer is initialized with SYSINIT(9) during the eventhandler stage. Users registering interest in events do not need to know if the event is backed by such a list, since the list is added to the global list of lists. As with lists that are not pre-defined it is safe to register for the events before the list has been created. This converts the process_* and thread_* events to using the new facility, as these are events whose locked traversals end up showing up significantly in ports build workflows (and presumably other workflows with many short lived threads/procs). It may be advantageous to convert other events to using the new facility. The el_flags field is now unused, but leave it be so that this revision can be MFC'd. Reviewed by: bdrewery, markj, mjg Approved by: rstone (mentor) In collaboration with: ian MFC after: 4 weeks Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D12814	2017-11-09 22:51:48 +00:00
Konstantin Belousov	9acf7b136d	Zero whole struct ptrace_lwpinfo to not leak kernel stack data. Reported by: Ilja Van Sprundel <ivansprundel@ioactive.com> Discussed with: secteam Reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 3 days Differential revision: https://reviews.freebsd.org/D12796	2017-11-08 23:32:56 +00:00
Jeff Roberson	8d6fbbb867	Replace many instances of VM_WAIT with blocking page allocation flags similar to the kernel memory allocator. This simplifies NUMA allocation because the domain will be known at wait time and races between failure and sleeping are eliminated. This also reduces boilerplate code and simplifies callers. A wait primitive is supplied for uma zones for similar reasons. This eliminates some non-specific VM_WAIT calls in favor of more explicit sleeps that may be satisfied without new pages. Reviewed by: alc, kib, markj Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon	2017-11-08 02:39:37 +00:00
Bartek Rutkowski	cee09850f7	Make sysctl_kern_proc_umask execute fast path when requested pid in curproc->p_pid or 0, avoiding unnecessary locking. Update libc consumer to skip calling getpid(). Submitted by: Pawel Biernacki <pawel.biernacki@gmail.com> Reviewed by: mjg, robak Approved by: mjg Sponsored by: Mysterious Code Ltd. Differential Revision: D12972	2017-11-07 15:13:32 +00:00
Mateusz Guzik	db520fdd46	rwlock: fix up compilation without KDTRACE_HOOKS after r324787	2017-11-06 05:14:05 +00:00
Mateusz Guzik	ce80021f4e	namecache: bump numcache after dropping all locks This makes no difference correctness-wise, but shortens total hold time.	2017-11-05 22:29:45 +00:00
Mateusz Guzik	119b826a62	namecache: wlock buckets in cache_lookup_nomakeentry Since the case of an empty chain was already covered, it is very likely that the existing entry is matching. Skipping readlocking saves on lock upgrade.	2017-11-05 22:28:39 +00:00
Mateusz Guzik	ba324b5946	namecache: skip locking in cache_lookup_nomakeentry if there is no entry	2017-11-05 21:59:39 +00:00
Ed Maste	80dc9f8888	ANSIfy sys/kern/md4c.c PR: 223453 Submitted by: ota@j.email.ne.jp MFC After: 2 weeks	2017-11-05 19:49:44 +00:00
Mateusz Guzik	a52058f013	namecache: skip locking in cache_purge_negative if there are no entries	2017-11-05 08:31:25 +00:00
Pedro F. Giffuni	7aa472731e	ANSI-fy exec_shell_imgact(). Fix a stray space while here. PR: 223317 MFC after: 3 days	2017-11-04 15:41:08 +00:00
Konstantin Belousov	30c438723d	Convert explicit panic() call to assert. Based on github pull request: #113 Submitted by: pmarillo@github MFC after: 1 week	2017-11-04 10:49:34 +00:00
Mateusz Guzik	a2c36a24b6	Special-case pget lookups where pid == curproc->pid Saves on allproc_lock acquires during buildworld, poudriere etc. Submitted by: Pawel Biernacki <pawel.biernacki@gmail.com> Sponsored by: Mysterious Code Ltd. Differential Revision: D12929	2017-11-03 19:21:36 +00:00
Mateusz Guzik	ac850e5a8d	namecache: fix .. check broken after r324378 wtf by: mjg Diagnosed by: avg	2017-11-01 08:40:04 +00:00
Mateusz Guzik	59e260f860	Fixup r325264, take #2 whack an unused variable	2017-11-01 06:46:58 +00:00
Mateusz Guzik	5644fffa25	namecache: ncnegfactor 16 -> 12 It is used on each new entry addition to decide whether to whack an existing negative entry in order to prevent a blow out in size, but the parameter was set years ago and never revisited. Building with poudriere results in about 400 evictions per second which unnecessarily grab entries from the hot list. With the new parameter there are next to no evictions of the sort.	2017-11-01 06:45:41 +00:00
Mateusz Guzik	5d03f1e11f	Fixup r325264 Accidentally committed an incomplete diff.	2017-11-01 06:38:46 +00:00
Mateusz Guzik	c0b5261b55	Save on loginclass list locking by checking if caller already uses the struct	2017-11-01 06:12:14 +00:00
Mateusz Guzik	5949c7e504	Save on uihash table locking by checking if the caller already uses the struct In particular with poudriere this saves about 90% of lookups.	2017-11-01 05:51:20 +00:00
John Baldwin	e012fe34cb	Discard the correct thread event reported for a ptrace stop. When multiple threads wish to report a tracing event to a debugger, both threads call ptracestop() and one thread will win the race to be the reporting thread (p->p_xthread). The debugger uses PT_LWPINFO with the process ID to determine which thread / LWP is reporting an event and the details of that event. This event is cleared as a side effect of the subsequent ptrace event that resumed the process (PT_CONTINUE, PT_STEP, etc.). However, ptrace() was clearing the event identified by the LWP ID passed to the resume request even if that wasn't the 'p_xthread'. This could result in clearing an event that had not yet been observed by the debugger and leaving the existing event for 'p_thread' pending so that it was reported a second time. Specifically, if the debugger stopped due to a software breakpoint in one thread, but then switched to another thread that was used to resume (e.g. if the user switched to a different thread and issued a step), the resume request (PT_STEP) cleared a pending event (if any) for the thread being stepped. However, the process immediately stopped and the first thread reported it's breakpoint event a second time. The debugger decremented the PC for "both" breakpoint events which resulted in the PC now pointing into the middle of an instruction (on x86) and a SIGILL fault when the process was resumed a second time. To fix, always clear the pending event for 'p_xthread' when resuming a process. ptrace() still honors the requested LWP ID when enabling single-stepping (PT_STEP) or setting a different PC (PT_CONTINUE). Reported by: GDB testsuite (gdb.threads/continue-pending-status.exp) Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D12794	2017-10-27 03:16:19 +00:00
Alan Somers	df485bdb3c	Fix aio_suspend in 32-bit emulation An off-by-one error has been present since the system call was first present in 185878. It additionally became a memory corruption bug after change 324941. The failure is actually revealed by our existing AIO tests. However, apparently nobody's been running those in 32-bit emulation mode. Reported by: Coverity, cem CID: 1382114 MFC after: 18 days X-MFC-With: 324941 Sponsored by: Spectra Logic Corp	2017-10-26 19:45:15 +00:00
Warner Losh	7d41b6f078	Handle RB_POWERCYCLE in the MI part of the kernel Signal init with SIGWINCH in shutdown_nice for RB_POWERCYCLE. Sponsored by: Netflix	2017-10-25 15:30:44 +00:00
Mark Johnston	64a16434d8	Add support for compressed kernel dumps. When using a kernel built with the GZIO config option, dumpon -z can be used to configure gzip compression using the in-kernel copy of zlib. This is useful on systems with large amounts of RAM, which require a correspondingly large dump device. Recovery of compressed dumps is also faster since fewer bytes need to be copied from the dump device. Because we have no way of knowing the final size of a compressed dump until it is written, the kernel will always attempt to dump when compression is configured, regardless of the dump device size. If the dump is aborted because we run out of space, an error is reported on the console. savecore(8) is modified to handle compressed dumps and save them to vmcore.<index>.gz, as it does when given the -z option. A new rc.conf variable, dumpon_flags, is added. Its value is added to the boot-time dumpon(8) invocation that occurs when a dump device is configured in rc.conf. Reviewed by: cem (earlier version) Discussed with: def, rgrimes Relnotes: yes Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D11723	2017-10-25 00:51:00 +00:00
Alan Somers	913b932900	Remove artificial restriction on lio_listio's operation count In r322258 I made p1003_1b.aio_listio_max a tunable. However, further investigation shows that there was never any good reason for that limit to exist in the first place. It's used in two completely different ways: * To size a UMA zone, which globally limits the number of concurrent aio_suspend calls. * To artificially limit the number of operations in a single lio_listio call. There doesn't seem to be any memory allocation associated with this limit. This change does two things: * Properly names aio_suspend's UMA zone, and sizes it based on a new constant. * Eliminates the artificial restriction on lio_listio. Instead, lio_listio calls will now be limited by the more generous max_aio_queue_per_proc. The old p1003_1b.aio_listio_max is now an alias for vfs.aio.max_aio_queue_per_proc, so sysconf(3) will still work with _SC_AIO_LISTIO_MAX. Reported by: bde Reviewed by: jhb MFC after: 3 weeks Sponsored by: Spectra Logic Corp Differential Revision: https://reviews.freebsd.org/D12120	2017-10-23 23:12:01 +00:00
Mateusz Guzik	5132933a08	Bump WITNESS_PENDLIST to accommodate sleepq chain bump. Reported by: ngie	2017-10-23 01:00:35 +00:00
Mateusz Guzik	9e68989764	Make the sleepq chain hash size configurable per-arch and bump on amd64. While here cache-align chains. This shortens longest found chain during poudriere -j 80 from 32 to 16. Pushing this higher up will probably require allocation on boot.	2017-10-22 20:43:50 +00:00
Mateusz Guzik	5a17c5524f	sdt: make all sdt probe sites test one variable This saves on cache misses at the expense of a slight grow of .text. Note this is a bandaid for lack of hotpatching. Discussed with: markj	2017-10-22 20:22:23 +00:00
Mateusz Guzik	614e1868d6	Change kdb_active type to u_char. Fixes warnings from gcc and keeps the small size. Perhaps nesting should be moved to another variable. Reported by: ngie	2017-10-22 13:42:56 +00:00
Enji Cooper	f2374e0cc5	Clean up trailing whitespace in kdb_thr_ctx(..) MFC after: 1 week	2017-10-22 12:12:52 +00:00
Konstantin Belousov	456a73ef01	Remove the support for mknod(S_IFMT), which created dummy vnodes with VBAD type. FFS ffs_write() VOP catches such vnodes and panics, other VOPs do not check for the type and their behaviour is really undefined. The comment claims that this support was done for 'badsect' to flag bad sectors, we do not have such facility in kernel anyway. Reported by: Dmitry Vyukov <dvyukov@google.com> Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-10-22 08:11:45 +00:00
Mateusz Guzik	be49509eea	mtx: implement thread lock fastpath MFC after: 1 week	2017-10-21 22:40:09 +00:00
Michal Meloun	904d8c492f	Add AT_HWCAP2 ELF auxiliary vector. - allocate value for new AT_HWCAP2 auxiliary vector on all platforms. - expand 'struct sysentvec' by new 'u_long *sv_hwcap2', in exactly same way as for AT_HWCAP. MFC after: 1 month Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D12699	2017-10-21 12:05:01 +00:00
Mark Johnston	a3e8a25a52	Avoid the nbp lookup in the final loop iteration in flushbuflist(). The end of the loop must re-lookup the next buf since the bufobj lock is dropped in the loop body. If the lookup fails, the loop is restarted. This mechanism non-obviously also terminates the loop when the end of the buf list is reached. Split up the two loop termination cases to make the code a bit less fragile. No functional change intended. Reviewed by: kib MFC after: 1 week Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D12730	2017-10-20 14:56:13 +00:00
Mateusz Guzik	62bf13cbf9	mtx: fix up UP build after r324778 Reported by: Michael Butler	2017-10-20 14:04:01 +00:00
Mateusz Guzik	c48a94251d	Mark kdb_active as __read_frequently and switch to bool to eat less space.	2017-10-20 04:02:53 +00:00
Mateusz Guzik	2567807c32	rwlock: reduce lockstat branches in the slowpath MFC after: 1 week	2017-10-20 03:32:42 +00:00
Mateusz Guzik	cbc2d7c218	mtx: stop testing SCHEDULER_STOPPED in kabi funcs for spin mutexes There is nothing panic-breaking to do in the unlock case and the lock case will fallback to the slow path doing the check already. MFC after: 1 week	2017-10-20 00:34:25 +00:00
Mateusz Guzik	0d74fe267b	mtx: clean up locking spin mutexes 1) shorten the fast path by pushing the lockstat probe to the slow path 2) test for kernel panic only after it turns out we will have to spin, in particular test only after we know we are not recursing MFC after: 1 week	2017-10-20 00:30:35 +00:00
Mateusz Guzik	9b8de76beb	sysctl: only take mem lock if oldlen is > 4 * PAGE_SIZE The previous limit of just one page is hit by ps. The entire mechanism should be reworked, if not whacked. It seems the intent is to reduce kernel dos-ability - some handlers wire the amount of memory passed here. Handlers should probably stop wiring in the first place or in the worst case indicate they are doing so so that the check is done only if necessary. It should also probably be a counter, not a lock. MFC after: 1 week	2017-10-19 01:38:31 +00:00
Mateusz Guzik	e6b645ae89	execve: avoid one proc lock/unlock trip unless PTRACE_EXEC is set MFC after: 1 week	2017-10-19 00:46:15 +00:00
Mateusz Guzik	80a2397a38	Tidy up pmc support at execve. The proc-specific check is inherently racy, so the code can just unlock beforehand. MFC after: 1 week	2017-10-19 00:38:14 +00:00
Mateusz Guzik	cb1c79008e	sysvsem: check if semu_list has anything on it before grabbing the lock This should get a process-specific support instead. MFC after: 1 week	2017-10-19 00:31:00 +00:00
Mateusz Guzik	c69a1a50cd	Don't take Giant for SMP status and cpu topology sysctls. Not only this lock doesn't play any role here, dirtying it slows down other things a little bit as giant-held checks (e.g. DROP_GIANT) are spread all over the kernel. MFC after: 1 week	2017-10-18 22:00:44 +00:00
Mark Johnston	46fcd1af63	Move kernel dump offset tracking into MI code. All of the kernel dump implementations keep track of the current offset ("dumplo") within the dump device. However, except for textdumps, they all write the dump sequentially, so we can reduce code duplication by having the MI code keep track of the current offset. The new dump_append() API can be used to write at the current offset. This is needed to implement support for kernel dump compression in the MI kernel dump code. Also simplify dump_encrypted_write() somewhat: use dump_write() instead of duplicating its bounds checks, and get rid of the redundant offset tracking. Reviewed by: cem Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D11722	2017-10-18 15:38:05 +00:00
Brooks Davis	39ed7f250a	Remove mbpool(9) now that it has no consumers. mbpool existed to support NICs with memory interfaces and all remaining consumers were removed earlier this year with NATM. Reviewed by: jhb Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D10513	2017-10-18 00:18:03 +00:00
Mark Johnston	fa00affd18	Fix a racy VI_DOOMED check in MNT_VNODE_FOREACH_ALL(). MNT_VNODE_FOREACH_ALL() is supposed to avoid returning doomed vnodes, but the VI_DOOMED check it used was done without the vnode interlock held, so it could race with a concurrent vgone(). Submitted by: Don Morris <don.morris@isilon.com> Reviewed by: kib, mckusick MFC after: 1 week Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D12704	2017-10-17 19:41:45 +00:00
Andriy Voskoboinyk	6623429867	mbuf(9): unbreak m_fragment() - Fix it by replacing m_cat() with m_prev->m_next = m_new (m_cat() will try to append data - as a result, there will be no fragmentation). - Move some constants out of the loop. Was previously tested with D4077. Differential Revision: https://reviews.freebsd.org/D4090	2017-10-16 21:46:11 +00:00
Konstantin Belousov	e9445808a8	Re-evaluate thread's signal mask after ptracestop(). The stop drops process lock, which allows the signal mask to be changed and our selected signal might become blocked, i.e. should be returned to the process queue instead of delivery. Also, for the existing check of the process no longer having an attached debugger, we should not lose the signal, but requeue it. Reported and tested by: bdrewery Reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-10-16 20:21:51 +00:00
Konstantin Belousov	cd735d8f5a	Improve assertion that an ignored or blocked signal is not delivered. Split two conditions into separate asserts. Print additional details, like the signal number and action value. Reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-10-16 20:15:19 +00:00
Konstantin Belousov	0167b33b81	Style. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-10-16 20:11:29 +00:00
Matt Joras	0d8e04054e	Properly reset the fields in clean_unrhdr. In r324542 I neglected to reset the first and last fields of struct unrhdr. This causes a tmpfs to fail the unr(9) consistency checks with DIAGNOSTIC on. Fix this by resetting the fields by calling init_unrhdr. While here, change a loop to use TAILQ_FOREACH_SAFE since it is more readable and equally fast. Reported by: David Wolfskill <david@catwhisker.org> Approved by: rstone (mentor) Sponsored by: Dell EMC Isilon	2017-10-16 16:14:50 +00:00
Tijl Coosemans	11ce4d9f39	When a Linux program tries to access a /path the kernel tries /compat/linux/path before /path. Stop following symbolic links when looking up /compat/linux/path so dead symbolic links aren't ignored. This allows syscalls like readlink(2) and lstat(2) to work on such links. And open(2) will return an error now instead of trying /path.	2017-10-15 18:53:21 +00:00
Mateusz Guzik	e280ce465b	mtx: fix up owner_mtx after r324609 Now that MTX_UNOWNED is 0 the test was always false.	2017-10-14 00:47:30 +00:00
Alan Cox	41bf90bb78	Address two problems with sendfile(..., SF_NOCACHE) and apply one "optimization". First, sendfile(..., SF_NOCACHE) frees pages without checking whether those pages are mapped. This can leave the system with mappings to free or repurposed pages. Second, a page can be busied between the time of the current busy test and acquiring the object lock. Essentially, the test performed before the object lock is acquired can only be regarded as an optimization to short-circuit further work on the page. It cannot, however, be relied upon to prove that it is safe to free the page. Third, when sendfile(..., SF_NOCACHE) was originally implemented, vm_page_deactivate_noreuse() did not yet exist. Use vm_page_deactivate_noreuse() instead of vm_page_deactivate(), because it comes closer to freeing the page. In collaboration with: glebius Discussed with: gallatin, kib, markj X-MFC after: r324448	2017-10-13 16:31:50 +00:00
Andriy Gapon	f92e3400bc	remove process and jail directory machinations from dounmount The manipulations done by mountcheckdirs() are not that useful during the unmount, they can bring about unexpected security consequences. This change effectively reverts the change in r73241. The change also allows to simplify the handling of rootvnode global variable. Discussed with: mckusick, mjg, kib Reviewed by: trasz MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D12366	2017-10-13 09:42:05 +00:00
Ed Maste	05e47051a2	regen init_sysent.c r324560	2017-10-12 15:48:37 +00:00
Ed Maste	5532aa9bb4	allow posix_fallocate in capability mode posix_fallocate is logically equivalent to writing zero blocks to the desired file size and there is no reason to prevent calling it in capability mode. posix_fallocate already checked for the CAP_WRITE right, so we merely need to list it in capabilities.conf. Reviewed by: allanjude MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D12640	2017-10-12 15:45:53 +00:00
Matt Joras	333dcaa498	Add clearing function for unr(9). Previously before you could call unrhdr_delete you needed to individually free every allocated unit. It is useful to be able to tear down the unr without having to go through this process, as it is significantly faster than freeing the individual units. Reviewed by: cem, lidl Approved by: rstone (mentor) Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D12591	2017-10-11 21:53:50 +00:00
Konstantin Belousov	70e3b262d1	The th_bintime, th_microtime and th_nanotime members of the timehand all cache the last system time (uptime + boottime). Only the format differs. Do not re-calculate the bintime and simply use the value used to calculate the microtime and nanotime. Group all the updates under the relevant comment. Remove obsoleted XXX part. Submitted by: Sebastian Huber <sebastian.huber@embedded-brains.de> MFC after: 1 week	2017-10-11 11:03:11 +00:00
Sean Bruno	1f9916ed08	match sendfile() error handling to send(). Sendfile() should match the error checking order of send() which is currently: SBS_CANTSENDMORE so_error SS_ISCONNECTED Submitted by: Jason Eggleston <jason@eggnet.com> Reviewed by: glebius MFC after: 2 weeks Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D12633	2017-10-10 22:21:05 +00:00
Sean Bruno	009ad5724d	Revert r324405 at the request of the submitter pending better solution. Submitted by: Jason Eggleston <jason@eggnet.com> Sponsored by: Limelight Networks	2017-10-10 00:32:21 +00:00
Gleb Smirnoff	9c82bec42d	Improvements to sendfile(2) mbuf free routine. o Fall back to default m_ext free mech, using function pointer in m_ext_free, and remove sf_ext_free() called directly from mbuf code. Testing on modern CPUs showed no regression. o Provide internally used flag EXT_FLAG_SYNC, to mark that I/O uses SF_SYNC flag. Lack of the flag allows us not to dereference ext_arg2, saving from a cache line miss. o Create function sendfile_free_page() that later will be used, for multi-page mbufs. For now compiler will inline it into sendfile_free_mext(). In collaboration with: gallatin Differential Revision: https://reviews.freebsd.org/D12615	2017-10-09 21:06:16 +00:00
Gleb Smirnoff	07e87a1d55	In mb_dupcl() don't copy full m_ext, to avoid cache miss. Respectively, in mb_free_ext() always use fields from the original refcount holding mbuf (see r296242). Cuts another cache miss from mb_free_ext(). However, treat EXT_EXTREF mbufs differently, since they are different - they don't have a refcount holding mbuf. Provide longer comments in m_ext declaration to explain this change and change from r296242. In collaboration with: gallatin Differential Revision: https://reviews.freebsd.org/D12615	2017-10-09 20:51:58 +00:00
Gleb Smirnoff	e8fd18f306	Shorten list of arguments to mbuf external storage freeing function. All of these arguments are stored in m_ext, so there is no reason to pass them in the argument list. Not all functions need the second argument, some don't even need the first one. The second argument lives in next cache line, so not dereferencing it is a performance gain. This was discovered in sendfile(2), which will be covered by next commits. The second goal of this commit is to bring even more flexibility to m_ext mbufs, allowing to create more fields in m_ext, opaque to the generic mbuf code, and potentially set and dereferenced by subsystems. Reviewed by: gallatin, kbowling Differential Revision: https://reviews.freebsd.org/D12615	2017-10-09 20:35:31 +00:00
Hans Petter Selasky	32b413d7f0	When showing the sleepqueues from the in-kernel debugger, properly dump all the sendqueues and not just the first one History: It appears that in the commit which introduced the code, r165272, the array indexes of "sq_blocked[0]" and "td_name[i]" were interchanged. In r180927 "td_name[i]" was corrected to "td_name[0]", but "sq_blocked[0]" was left unchanged. PR: 222624 Discussed with: kmacy @ MFC after: 1 week Sponsored by: Mellanox Technologies	2017-10-09 18:33:29 +00:00
Alan Cox	03ca213761	The recent change to initialization of blists (r324420) relied on '-1' appearing only where the code explicitly set it, but since much of the data was not initialized, '-1' appeared other places too, and led to panics. Clear the allocated data before initializing nonzero values by allocating with M_ZERO. Submitted by: Doug Moore <dougm@rice.edu> Reported by: Oleg V. Nauman <oleg@theweb.org.ua>, cy Tested by: Oleg V. Nauman <oleg@theweb.org.ua> MFC after: 1 week X-MFC with: r324420 Differential Revision: https://reviews.freebsd.org/D12627	2017-10-09 18:19:06 +00:00
Alan Cox	8eefcd407b	The blst_radix_init function has two purposes - to compute the number of nodes to allocate for the blist, and to initialize them. The computation can be done much more quickly by identifying the terminating node, if any, at every level of the tree and then summing the number of nodes at each level that precedes the topmost terminator. The initialization can also be done quickly, since settings at the root mark the tree as all-allocated, and only a few terminator nodes need to be marked in the rest of the tree. Eliminate blst_radix_init, and perform its two functions more simply in blist_create. The allocation of the blist takes place in two pieces, but there's no good reason to do so, when a single allocation is sufficient, and simpler. Allocate the blist struct, and the array of nodes associated with it, with a single allocation. Submitted by: Doug Moore <dougm@rice.edu> Reviewed by: markj (an earlier version) MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D11968	2017-10-08 22:17:39 +00:00
Ian Lepore	7f92689427	Add eventhandler notifications for newbus device attach/detach. The detach case is slightly complicated by the fact that some in-kernel consumers may want to know before a device detaches (so they can release related resources, stop using the device, etc), but the detach can fail. So there are pre- and post-detach notifications for those consumers who need to handle all cases. A couple salient comments from the review, they amount to some helpful documentation about these events, but there's currently no good place for such documentation... Note that in the current newbus locking model, DETACH_BEGIN and DETACH_COMPLETE/FAILED sequence of event handler invocation might interweave with other attach/detach events arbitrarily. The handlers should be prepared for such situations. Also should note that detach may be called after the parent bus knows the hardware has left the building. In-kernel consumers have to be prepared to cope with this race. Differential Revision: https://reviews.freebsd.org/D12557	2017-10-08 17:33:49 +00:00
Ian Lepore	fc09164658	Restore the ability to deregister an eventhandler from within the callback. When the EVENTHANDLER(9) subsystem was created, it was a documented feature that an eventhandler callback function could safely deregister itself. In r200652 that feature was inadvertently broken by adding drain-wait logic to eventhandler_deregister(), so that it would be safe to unload a module upon return from deregistering its event handlers. There are now 145 callers of EVENTHANDLER_DEREGISTER(), and it's likely many of them are depending on the drain-wait logic that has been in place for 8 years. So instead of creating a separate eventhandler_drain() and adding it to some or all of those 145 call sites, this creates a separate eventhandler_drain_nowait() function for the specific purpose of deregistering a callback from within the running callback. Differential Revision: https://reviews.freebsd.org/D12561	2017-10-08 17:21:16 +00:00
Sean Bruno	75c8dfb6ae	Check so_error early in sendfile() call. Prior to this patch, if a connection was reset by the remote end, sendfile() would just report ENOTCONN instead of ECONNRESET. Submitted by: Jason Eggleston <jason@eggnet.com> Reviewed by: glebius Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D12575	2017-10-07 23:30:57 +00:00
Mateusz Guzik	709939a7b7	namecache: factor out ~MAKEENTRY lookups from the common path Lookups of the sort are rare compared to regular ones and successful ones result in removing entries from the cache. In the current code buckets are rlocked and a trylock dance is performed, which can fail and cause a restart. Fixing it will require a little bit of surgery and in order to keep the code maintainable the 2 cases have to be split. MFC after: 1 week	2017-10-06 23:05:55 +00:00
Mark Johnston	f38c0c46c5	Let stack_create(9) take a malloc flags argument. Reviewed by: cem Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D12614	2017-10-06 21:52:28 +00:00
Emmanuel Vadot	b1d51685ea	vfs_export_lookup: Fix r324054 When using the default address list nam is still valid, the code in r324054 assumed that it was NULL. Reported by: Guy Yur <guyyur@gmail.com> Tested by: Guy Yur <guyyur@gmail.com>	2017-10-06 09:02:36 +00:00
Mateusz Guzik	d07e22cdd8	locks: take the number of readers into account when waiting Previous code would always spin once before checking the lock. But a lock with e.g. 6 readers is not going to become free in the duration of one spin even if they start draining immediately. Conservatively perform one for each reader. Note that the total number of allowed spins is still extremely small and is subject to change later. MFC after: 1 week	2017-10-05 19:18:02 +00:00
Stephen Hurd	1c0054d261	Fix "taskqgroup_attach: setaffinity failed: 3" with iflib drivers Improved logging added in r323879 exposed an error during attach. We need the irq, not the rid to work correctly. em uses shared irqs, so it will use the same irq for TX as RX. bnxt does not use shared irqs, or TX irqs at all, so there's no need to set the TX irq affinity. Reviewed by: sbruno Approved by: sbruno (mentor) Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D12496	2017-10-05 14:43:30 +00:00
Mateusz Guzik	20a15d1752	locks: partially tidy up waiting on readers spin first instead of instantly re-reading and don't re-read after spinning is finished - the state is already known. Note the code is subject to significant changes later. MFC after: 1 week	2017-10-05 13:01:18 +00:00
Andriy Gapon	693593b6f0	sysctl-s in a module should be accessible only when the module is initialized A sysctl can have a custom handler that may access data that is initialized via SYSINIT(9) or via a module event handler (also invoked via SYSINIT). Thus, it is not safe to allow access to the module's sysctl-s until the initialization is performed. Likewise, we should not allow access to the sysctl-s after the module is uninitialized. The latter is easy to achieve by properly ordering linker_file_unregister_sysctls and linker_file_sysuninit. The former is not as easy for two reasons: - the initialization may depend on tunables which get set when sysctl-s are registered, so we need to set the tunables before running sysinit-s - the initialization may try to dynamically add more sysctl-s under statically defined sysctl nodes So, this change splits the sysctl setup into two phases. In the first phase the sysctl-s are registered as before but they are disabled and hidden from consumers. In the second phase, done after sysinit-s, normal access to the sysctl-s is enabled. The change should affect only dynamic module loading and unloading after the system boot-up. Nothing changes for sysctl-s compiled into the kernel and sysctl-s in preloaded modules. Discussed with: hselasky, ian, jhb Reviewed by: julian, kib MFC after: 2 weeks Sponsored by: Panzura Differential Revision: https://reviews.freebsd.org/D12545	2017-10-05 12:32:14 +00:00
Gleb Smirnoff	0e229f343f	Hide struct socket and struct unpcb from the userland. Violators may define _WANT_SOCKET and _WANT_UNPCB respectively and are not guaranteed for stability of the structures. The violators list is the usual one: libprocstat(3) and netstat(1) internally and lsof in ports. In struct xunpcb remove the inclusion of kernel structure and add a bunch of spare fields. The xsocket already has socket not included, but add there spares as well. Embed xsockbuf into xsocket. Sort declarations in sys/socketvar.h to separate kernel only from userland available ones. PR: 221820 (exp-run)	2017-10-02 23:29:56 +00:00
Alan Cox	0c0e1e96c6	Use vm_page_active() rather than directly accessing the page's queue field. Reviewed by: kib, markj MFC after: 2 weeks X-MFC with: r324146	2017-10-02 07:30:21 +00:00
Andriy Gapon	550374efe6	revert r324166, it has an unrelated change in it	2017-10-01 16:37:54 +00:00
Andriy Gapon	7d5c6491f0	MFV r323531: 8521 nvlist memory leak in get_clones_stat() and spa_load_best() illumos/illumos-gate@7d3000f774 `7d3000f774` https://www.illumos.org/issues/8521 Yuri reported this to the mailing list: doing a `reboot -d` on current illumos-gate HEAD gives the following ":: findleaks -dv" output: findleaks: maximum buffers => 301061 findleaks: actual buffers => 297587 findleaks: findleaks: potential pointers => 29289774 findleaks: dismissals => 26242305 (89.5%) findleaks: misses => 331153 ( 1.1%) findleaks: dups => 2419681 ( 8.2%) findleaks: follows => 296635 ( 1.0%) findleaks: findleaks: peak memory usage => 7353 kB findleaks: elapsed CPU time => 1.5 seconds findleaks: elapsed wall time => 2.0 seconds findleaks: CACHE LEAKED BUFCTL CALLER ffffff03d222b008 120 ffffff03ef7ceb78 nv_alloc_sys+0x1f ffffff03d222a448 123 ffffff03f4150cc8 nv_alloc_sys+0x1f ffffff03d222b448 5 ffffff03f28bd598 nv_alloc_sys+0x1f ffffff03d222b888 87 ffffff03f28c10f0 nv_alloc_sys+0x1f ffffff03d222c008 21 ffffff03f4139310 nv_alloc_sys+0x1f ffffff03d222b888 43 ffffff040ef3f3e8 nv_alloc_sys+0x1f ffffff03d222c008 120 ffffff03f4591e58 nv_alloc_sys+0x1f ffffff03d222b008 121 ffffff03f352c068 nv_alloc_sys+0x1f ffffff03d222a448 112 ffffff03f414e5f8 nv_alloc_sys+0x1f ffffff03d222b008 119 ffffff03ee92fdc0 nv_alloc_sys+0x1f ffffff03d222b888 46 ffffff03f28c1378 nv_alloc_sys+0x1f ffffff03d222b448 4 ffffff03f28c7708 nv_alloc_sys+0x1f ffffff03d222c008 20 ffffff03f2a6e7e8 nv_alloc_sys+0x1f Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Yuri Pankov <yuripv@gmx.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Pavel Zakharov <pavel.zakharov@delphix.com> MFC after: 5 weeks X-MFC after: r324163	2017-10-01 16:34:16 +00:00
Mark Johnston	0ffc7ed7e3	Have uiomove_object_page() keep accessed pages in the active queue. Previously, uiomove_object_page() would maintain LRU by requeuing the accessed page. This involves acquiring one of the heavily contended page queue locks. Moreover, it is unnecessarily expensive for pages in the active queue. As of r254304 the page daemon continually performs a slow scan of the active queue, with the effect that unreferenced pages are gradually moved to the inactive queue, from which they can be reclaimed. Prior to that revision, the active queue was scanned only during shortages of free and inactive pages, meaning that unreferenced pages could get "stuck" in the queue. Thus, tmpfs was required to use the inactive queue and requeue pages in order to maintain LRU. Now that this is no longer the case, tmpfs I/O operations can use the active queue and avoid the page queue locks in most cases, instead setting PGA_REFERENCED on referenced pages to provide pseudo-LRU. Reviewed by: alc (previous version) MFC after: 2 weeks	2017-09-30 23:41:28 +00:00
Konstantin Belousov	d3c968bf84	Revert r323722. A better fix will be committed shortly, as well as some still useful bits of the reverted revision. The problem with the committed fix is that there are still issues with returning from NMI, when NMI interrupted kernel in a moment where the kernel segments selectors were still not loaded into registers. If this happens, the NMI return would lose the userspace selectors because r323722 does not reload segment registers on return to kernel mode. Fixing the problem is complicated. Since an alternative approach to handle the original bug exists, it makes sense to stop adding more complexity. Discussed with: bde Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-09-28 08:38:24 +00:00
John Baldwin	c2dc6d5db1	Use UMA_ALIGNOF() for name cache UMA zones. This fixes kernel crashes due to misaligned accesses to the 64-bit time_t embedded in struct namecache_ts in MIPS n32 kernels. MFC after: 1 week Sponsored by: DARPA / AFRL	2017-09-27 23:18:57 +00:00
Emmanuel Vadot	5e254379a8	vfs_export: Simplify vfs_export_lookup If the filesystem is not exported directly return NULL. If no address is given and filesystem is exported using some default one return it directly, if it doesn't have a default one directly return NULL. Reviewed by: kib, bapt MFC after: 1 week Sponsored by: Gandi.net Differential Revision: https://reviews.freebsd.org/D12505	2017-09-27 09:39:16 +00:00
Mateusz Guzik	a79d52d739	sysctl: remove target buffer read/write checks prior to calling the handler Said checks were inherently racy anyway as jokers could unmap target areas before the handler got around to accessing them. This saves time by avoiding locking the address space. MFC after: 1 week	2017-09-27 01:31:52 +00:00
Mateusz Guzik	956713cb74	Annotate sysctlmemlock with __exclusive_cache_line. MFC after: 1 week	2017-09-27 01:27:43 +00:00
Mateusz Guzik	2f1ddb89fc	mtx: drop the tid argument from _mtx_lock_sleep tid must be equal to curthread and the target routine was already reading it anyway, which is not a problem. Not passing it as a parameter allows for a little bit shorter code in callers. MFC after: 1 week	2017-09-27 00:57:05 +00:00
John Baldwin	09f3bb8756	Log signal number passed to PT_STEP requests in KTR_PTRACE traces. MFC after: 1 week	2017-09-25 20:38:55 +00:00
Conrad Meyer	f41b85a63c	ddb(4): Add 'show badstacks' command to show witness badstacks Add a DDB command that mirrors sysctl debug.witness.badstacks. Reapply r323935 after fixing trivial deficiency. I forgot to compile with WITNESS enabled. Thanks emaste@ for fixing the build while I was asleep. Reported by: rstone Reviewed by: rstone (previous version) Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D12468	2017-09-23 17:48:49 +00:00
Ed Maste	4c087f8a83	Revert r323935 as it broke the build subr_witness.c:2577:4: error: use of undeclared identifier 'req' req->oldidx = 0; ^	2017-09-23 12:35:46 +00:00
Stephen Hurd	d57a78580e	Make struct grouptask gt_name member a char array Previously, it was just a pointer which was copied, but some callers pass in a stack variable which will go out of scope. Add GROUPTASK_NAMELEN macro (32) and snprintf() the name into it, using "grouptask" if name is NULL. We can now safely include gtask->gt_name in console messages. Reviewed by: sbruno Approved by: sbruno (mentor) Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D12449	2017-09-23 01:39:16 +00:00
Conrad Meyer	6fec2d2cce	ddb(4): Add 'show badstacks' command to show witness badstacks Add a DDB command that mirrors sysctl debug.witness.badstacks. Reported by: rstone Reviewed by: rstone Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D12468	2017-09-22 20:01:12 +00:00
Kirk McKusick	75e3597abb	Continuing efforts to provide hardening of FFS, this change adds a check hash to cylinder groups. If a check hash fails when a cylinder group is read, no further allocations are attempted in that cylinder group until it has been fixed by fsck. This avoids a class of filesystem panics related to corrupted cylinder group maps. The hash is done using crc32c. Check hashes are added only to UFS2 and not to UFS1 as UFS1 is primarily used in embedded systems with small memories and low-powered processors which need as light-weight a filesystem as possible. Specifics of the changes: sys/sys/buf.h: Add BX_FSPRIV to reserve a set of eight b_xflags that may be used by individual filesystems for their own purpose. Their specific definitions are found in the header files for each filesystem that uses them. Also add fields to struct buf as noted below. sys/kern/vfs_bio.c: It is only necessary to compute a check hash for a cylinder group when it is actually read from disk. When calling bread, you do not know whether the buffer was found in the cache or read. So a new flag (GB_CKHASH) and a pointer to a function to perform the hash has been added to breadn_flags to say that the function should be called to calculate a hash if the data has been read. The check hash is placed in b_ckhash and the B_CKHASH flag is set to indicate that a read was done and a check hash calculated. Though a rather elaborate mechanism, it should also work for check hashing other metadata in the future. A kernel internal API change was to change breada into a static function and add flags and a function pointer to a check-hash function. sys/ufs/ffs/fs.h: Add flags for types of check hashes; stored in a new word in the superblock. Define corresponding BX_ flags for the different types of check hashes. Add a check hash word in the cylinder group. sys/ufs/ffs/ffs_alloc.c: In ffs_getcg do the dance with breadn_flags to get a check hash and if one is provided, check it.
sys/ufs/ffs/ffs_vfsops.c: Copy across the BX_FFSTYPES flags in background writes. Update the check hash when writing out buffers that need them. sys/ufs/ffs/ffs_snapshot.c: Recompute check hash when updating snapshot cylinder groups. sys/libkern/crc32.c: lib/libufs/Makefile: lib/libufs/libufs.h: lib/libufs/cgroup.c: Include libkern/crc32.c in libufs and use it to compute check hashes when updating cylinder groups. Four utilities are affected: sbin/newfs/mkfs.c: Add the check hashes when building the cylinder groups. sbin/fsck_ffs/fsck.h: sbin/fsck_ffs/fsutil.c: Verify and update check hashes when checking and writing cylinder groups. sbin/fsck_ffs/pass5.c: Offer to add check hashes to existing filesystems. Precompute check hashes when rebuilding cylinder group (although this will be done when it is written in fsutil.c it is necessary to do it early before comparing with the old cylinder group) sbin/dumpfs/dumpfs.c Print out the new check hash flag(s) sbin/fsdb/Makefile: Needs to add libufs now used by pass5.c imported from fsck_ffs. Reviewed by: kib Tested by: Peter Holm (pho)	2017-09-22 12:45:15 +00:00
Stephen Hurd	bf227542f3	Fix undeclared identifier error introduced in r323879 It doesn't appear to be safe to use gtask->gt_name. Reported by: Mark Johnston, Jenkins Reviewed by: sbruno Approved by: sbruno (mentor) Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D12448	2017-09-21 23:27:35 +00:00
John Baldwin	e1d15b892a	Only handle _PC_MAX_CANON, _PC_MAX_INPUT, and _PC_VDISABLE for TTY devices. Move handling of these three pathconf() variables out of vop_stdpathconf() and into devfs_pathconf() as TTY devices can only be devfs files. In addition, only return settings for these three variables for devfs devices whose device switch has the D_TTY flag set. Discussed with: bde, kib Sponsored by: Chelsio Communications	2017-09-21 23:05:32 +00:00
Stephen Hurd	326aacb0e3	Improved logging of gtaskqueue failures Check the return code of intr_setaffinity() and log any errors it returns. When a qid is not located, log an error before returning failure. Also, use __func__ rather than hardcoding the function name Reviewed by: sbruno Approved by: sbruno (mentor) Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D12436	2017-09-21 21:14:48 +00:00
Stephen Hurd	a0fcc37122	Fix M_GTASKQUEUE definition Previously had the same short and long description as taskqueues. This could cause problems with memguard(9) and vmstat -m which use the short description as a unique identifier. Reviewed by: sbruno Approved by: sbruno (mentor) MFC after: 3 days Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D12438	2017-09-21 20:34:33 +00:00
Konstantin Belousov	9770475ce7	Do not vrele() covered vnode under the mp mutex. If vrele() changes the hold count to zero, it needs to acquire the vnode lock. Sponsored by: The FreeBSD Foundation Discussed with: avg X-MFC with: r323578	2017-09-19 16:49:45 +00:00
Konstantin Belousov	5bf949377e	For unlinked files, do not msync(2) or sync on the vnode deactivation. One consequence of the patch is that msyncing unlinked file mappings no longer reduces the amount of the dirty memory in the system, but I do not think that there are users of msync(2) that utilize it for such side-effect. Reported and tested by: tjil PR: 222356 Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D12411	2017-09-19 16:46:37 +00:00
Konstantin Belousov	5efe338f3d	Fix handling of the segment registers on i386. Suppose that userspace is executing with the non-standard segment descriptors. Then, until exception or interrupt handler executed SET_KERNEL_SEGS, kernel is still executing with user %ds, %es and %fs. If an interrupt occurs in this window, the interrupt handler is executed unsafely, relying on usability of the usermode registers. If the interrupt results in the context switch on return, the contamination of the kernel state spreads to the thread we switched to. As a result, kernel data accesses might fault or, if only the base is changed, completely messed up. More, if the user segment was allocated in LDT, another thread might mark the descriptor as invalid before doreti code tried to reload them. In this case kernel panics. The issue exists for all exception entry points which use trap gate, and thus do not automatically disable interrupts on entry, and for lcall_handler. Fix is two-fold: first, we need to disable interrupts for all kernel entries, changing the IDT descriptor types from trap gate to interrupt gate. Interrupts are re-enabled not earlier than the kernel segments are loaded into the segment registers. Second, we only load the segment registers from the trap frame when returning to usermode. For the latter, all interrupt return paths must happen through the doreti common code. There is no way to disable interrupts on call gate, which is the supposed mode of servicing for lcall $7,$0 syscalls. Change the LDT descriptor 0 into a code segment type and point it to the userspace trampoline which redirects the syscall to int $0x80. All the measures make the segment register handling similar to that of amd64. We do not apply amd64 optimizations of not reloading segment registers on return from the syscall.
Reported by: Maxime Villard <max@m00nbsd.net> Tested by: pho (the non-lcall part) Reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D12402	2017-09-18 20:22:42 +00:00
Alan Cox	ec371b57e8	Modify blst_leaf_alloc to take only the cursor argument. Modify blst_leaf_alloc to find allocations that cross the boundary between one leaf node and the next when those two leaves descend from the same meta node. Update the hint field for leaves so that it represents a bound on how large an allocation can begin in that leaf, where it currently represents a bound on how large an allocation can be found within the boundaries of the leaf. The first phase of blst_leaf_alloc currently shrinks sequences of consecutive 1-bits in mask until each has been shrunken by count-1 bits, so that any bits remaining show where an allocation can begin, or until all the bits have disappeared, in which case the allocation fails. This change amends that so that the high-order bit is copied, as if, when the last block was free, it was followed by an endless stream of free blocks. It also amends the early stopping condition, so that the shrinking of 1-sequences stops early when there are none, or there is only one unbounded one remaining. The search for the first set bit is unchanged, and the code path thereafter is mostly unchanged unless the first set bit is in a position that makes some of those copied sign bits matter. In that case, we look for a next leaf, and at what blocks it can provide, to see if a cross-boundary allocation is possible. The hint is updated on a successful allocation that clears the last bit, but it is not updated on a failed allocation that leaves the last bit set. So, as long as the last block is free, the hint value for the leaf is large. As long as the last block is free, and there's a next leaf, a large allocation can begin here, perhaps. A stricter rule than this would mean that allocations and frees in one leaf could require hint updates to the preceding leaf, and this change seeks to leave the freeing code unmodified. Define BLIST_BMAP_MASK, and use it for bit masking in blst_leaf_free and blist_leaf_fill, as well as in blst_leaf_alloc.
Correct a panic message in blst_leaf_free. Submitted by: Doug Moore <dougm@rice.edu> Reviewed by: markj (an earlier version) MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D11819	2017-09-16 18:12:15 +00:00
Stephen Hurd	ab2e3f7958	Revert r323516 (iflib rollup) This was really too big of a commit even if everything worked, but there are multiple new issues introduced in the one huge commit, so it's not worth keeping this until it's fixed. I'll work on splitting this up into logical chunks and introduce them one at a time over the next week or two. Approved by: sbruno (mentor) Sponsored by: Limelight Networks	2017-09-16 02:41:38 +00:00
Gleb Smirnoff	584ab65a75	Fix locking in soisconnected(). When a newborn socket moves from incomplete queue to complete one, we need to obtain the listening socket lock after the child, which is a wrong order. The old code did that in potentially endless loop of mtx_trylock(). The new one does only one attempt of mtx_trylock(), and in case of failure references listening socket, unlocks child and locks everything in right order. In case if listening socket shuts down during that, just bail out. Reported & tested by: Jason Eggleston <jeggleston llnw.com> Reported & tested by: Jason Wolfe <jason llnw.com>	2017-09-14 18:05:54 +00:00
John Baldwin	c2f37b9245	Add AT_HWCAP and AT_EHDRFLAGS on all platforms. A new 'u_long *sv_hwcap' field is added to 'struct sysentvec'. A process ABI can set this field to point to a value holding a mask of architecture-specific CPU feature flags. If an ABI does not wish to supply AT_HWCAP to processes the field can be left as NULL. The support code for AT_EHDRFLAGS was already present on all systems, just the #define was not present. This is a step towards unifying the AT_ constants across platforms. Reviewed by: kib MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D12290	2017-09-14 14:26:55 +00:00
Andriy Gapon	cbc785c293	dounmount: do not release the mount point's reference on the covered vnode As long as mnt_ref is not zero there can be a consumer that might try to access mnt_vnodecovered. For this reason the covered vnode must not be freed until mnt_ref goes to zero. So, move the release of the covered vnode to vfs_mount_destroy. Reviewed by: kib MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D12329	2017-09-14 08:47:06 +00:00
Gleb Smirnoff	d37aa3ccce	Use soref() in sendfile(2) instead of fhold() to reference a socket. The problem is that fdrop() requires syscall context, as it may enter sleep in some cases. The reason to use it in the original non-blocking sendfile implementation, was to avoid use of global ACCEPT_LOCK() on every I/O completion. Now in head sorele() no longer requires this lock.	2017-09-13 22:11:05 +00:00
Gleb Smirnoff	100db364eb	Fix two issues with not ready data in sockets (read: sendfile) in UNIX sockets. o Check that socket is still connected in uipc_ready(). If not we are responsible to free mbufs. o In uipc_send() if socket appears to be disconnected, but we are sending data with pending I/Os, don't free mbufs. Reported by: Kevin Bowling <kbowling llnw.com> Tested by: Kevin Bowling <kbowling llnw.com> PR: 222259 Reported by: Mark Martinec <Mark.Martinec ijs.si> MFC after: 3 days	2017-09-13 16:47:23 +00:00
Stephen Hurd	d300df0182	Roll up iflib commits from github. This pulls in most of the work done by Matt Macy as well as other changes which he has accepted via pull request to his github repo at https://github.com/mattmacy/networking/ This should bring -CURRENT and the github repo into close enough sync to allow small feature branches rather than a large chain of interdependent patches being developed out of tree. The rest of the synchronization should be able to be completed on github by splitting the remaining changes that are not yet ready into short feature branches for later review as smaller commits. Here is a summary of changes included in this patch: 1) More checks when INVARIANTS are enabled for earlier problem detection 2) Group Task Queue cleanups - Fix use of duplicate shortdesc for gtaskqueue malloc type. Some interfaces such as memguard(9) use the short description to identify malloc types, so duplicates should be avoided. 3) Allow gtaskqueues to use ithreads in addition to taskqueues - In some cases, this can improve performance 4) Better logging when taskqgroup_attach*() fails to set interrupt affinity. 5) Do not start gtaskqueues until they're needed 6) Have mp_ring enqueue function enter the ABDICATED rather than BUSY state. This moves the TX to the gtaskq and allows processing to continue faster as well as make TX batching more likely. 7) Add an ift_txd_errata function to struct if_txrx. This allows drivers to inspect/modify mbufs before transmission. 8) Add a new IFLIB_NEED_ZERO_CSUM for drivers to indicate they need checksums zeroed for checksum offload to work. This avoids modifying packet data in the TX path when possible. 9) Use ithreads for iflib I/O instead of taskqueues 10) Clean up ioctl and support async ioctl functions 11) Prefetch two cache lines from each mbuf instead of one up to 128B. We often need to parse packet header info beyond 64B. 12) Fix potential memory corruption due to fence post error in bit_nclear() usage.
13) Improved hang detection and handling 14) If the packet is smaller than MTU, disable the TSO flags. This avoids extra packet parsing when not needed. 15) Move TCP header parsing inside the IS_TSO?() test. This avoids extra packet parsing when not needed. 16) Pass chains of mbufs that are not consumed by lro to if_input() rather call if_input() for each mbuf. 17) Re-arrange packet header loads to get as much work as possible done before a cache stall. 18) Lock the context when calling IFDI_ATTACH_PRE()/IFDI_ATTACH_POST()/ IFDI_DETACH(); 19) Attempt to distribute RX/TX tasks across cores more sensibly, especially when RX and TX share an interrupt. RX will attempt to take the first threads on a core, and TX will attempt to take successive threads. 20) Allow iflib_softirq_alloc_generic() to request affinity to the same cpus an interrupt has affinity with. This allows TX queues to ensure they are serviced by the socket the device is on. 21) Add new iflib sysctls to net.iflib: - timer_int - interval at which to run per-queue timers in ticks - force_busdma 22) Add new per-device iflib sysctls to dev.X.Y.iflib - rx_budget allows tuning the batch size on the RX path - watchdog_events Count of watchdog events seen since load 23) Fix error where netmap_rxq_init() could get called before IFDI_INIT() 24) e1000: Fixed version of r323008: post-cold sleep instead of DELAY when waiting for firmware - After interrupts are enabled, convert all waits to sleeps - Eliminates e1000 software/firmware synchronization busy waits after startup 25) e1000: Remove special case for budget=1 in em_txrx.c - Premature optimization which may actually be incorrect with multi-segment packets 26) e1000: Split out TX interrupt rather than share an interrupt for RX and TX. 
- Allows better performance by keeping RX and TX paths separate 27) e1000: Separate igb from em code where suitable Much easier to understand separate functions and "if (is_igb)" than previous tests like "if (reg_icr & (E1000_ICR_RXSEQ | E1000_ICR_LSC))" #blamebruno Reviewed by: sbruno Approved by: sbruno (mentor) Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D12235	2017-09-13 01:18:42 +00:00
Alan Cox	d027ed2e7a	To analyze the allocation of swap blocks by blist functions, add a method for analyzing the radix tree structures and reporting on the number, and sizes, of maximal intervals of free blocks. The report includes the number of maximal intervals, and also the number of them in each of several size ranges, from small (size 1, or 3 to 4) to large (28657 to 46367) with size boundaries defined by Fibonacci numbers. The report is written in the test tool with the 's' command, or in a running kernel by sysctl. The analysis of the radix tree frequently computes the position of the lone bit set in a u_daddr_t, a computation that also appears in leaf allocation. That computation has been moved into a function of its own, and optimized for cases where an inlined machine instruction can replace the usual binary search. Submitted by: Doug Moore <dougm@rice.edu> MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D11906	2017-09-10 17:46:03 +00:00
Dag-Erling Smørgrav	008a09355b	If the user tries to set kern.randompid to 1 (which is meaningless), set it to a random value between 100 and 1123, rather than 0 as before. Submitted by: Marie Helene Kvello-Aune <marieheleneka@gmail.com> MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D5336	2017-09-10 15:01:29 +00:00
Mateusz Guzik	0bbae6f364	namecache: clean up struct namecache_ts handling namecache_ts differs from mere namecache by few fields placed mid struct. The access to the last element (the name) is thus special-cased. The standard solution is to put new fields at the very beginning and embed the original struct. The pointer shuffled around points to the embedded part. If needed, access to new fields can be gained through __containerof. MFC after: 1 week	2017-09-10 11:17:32 +00:00
Mateusz Guzik	dad74ce924	namecache: fold the unlock label into the only consumer No functional changes. MFC after: 1 week	2017-09-08 06:57:11 +00:00
Mateusz Guzik	da8f32a7f1	namecache: factor out dot lookup into a dedicated function The intent is to move uncommon cases out of the way. MFC after: 1 week	2017-09-08 06:51:33 +00:00
Mateusz Guzik	6a569d3525	Annotate Giant with __exclusive_cache_line	2017-09-08 06:46:24 +00:00
Mateusz Guzik	3e72c8449b	Annotate global process locks with __exclusive_cache_line MFC after: 1 week	2017-09-08 06:46:02 +00:00
Mateusz Guzik	574adb65c8	Sprinkle __read_frequently on few obvious places. Note that some of annotated variables should probably change their types to something smaller, preferably bit-sized.	2017-09-06 20:33:33 +00:00
Mateusz Guzik	fe933c1d88	Start annotating global _padalign locks with __exclusive_cache_line While these locks are guaranteed to not share their respective cache lines, their current placement leaves unnecessary holes in lines which preceded them. For instance the annotation of vm_page_queue_free_mtx allows 2 neighbour cachelines (previously separate by the lock) to be collapsed into 1. The annotation is only effective on architectures which have it implemented in their linker script (currently only amd64). Thus locks are not converted to their not-padaligned variants as to not affect the rest. MFC after: 1 week	2017-09-06 20:28:18 +00:00
Edward Tomasz Napierala	b0618cda03	Make root_mount_rel(9) ignore NULL arguments, like it used to before r313351. It would be better to fix API consumers to not pass NULL there - most of them, such as gmirror, already contain the necessary checks - but this is easier and much less error-prone. One known user-visible result is that it fixes panic on a failed "graid label". PR: 221846 MFC after: 2 weeks Sponsored by: DARPA, AFRL	2017-09-05 14:32:56 +00:00
Warner Losh	519772814d	Add CAM/NVMe support for CAM_DATA_SG This adds support in pass(4) for data to be described with a scatter-gather list (sglist) to augment the existing (single) virtual address. Differential Revision: https://reviews.freebsd.org/D11361 Submitted by: Chuck Tuffli Reviewed by: imp@, scottl@, kenm@	2017-08-29 15:29:57 +00:00
Bryan Drewery	8359a6b7b3	Allow vdrop() of a vnode not yet on the per-mount list after r306512. The old code allowed calling vdrop() before insmntque() to place the vnode back onto the freelist for later recycling. Some downstream consumers may rely on this support. Normally insmntque() failing is fine since it uses vgone() and immediately frees the vnode rather than attempting to add it to the freelist if vdrop() were used instead. Also assert that vhold() cannot be used on such a vnode. Reviewed by: kib, cem, markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D12126	2017-08-28 19:29:51 +00:00
Conrad Meyer	4ae2ade114	Enhance debuggability of sysctl leaf re-use warnings Print the full conflicting oid path, and include the function name in the warning so it is clear that the warnings are sysctl-related. PR: 221853 Submitted by: Fabian Keil <fk AT fabiankeil.de> (earlier version) Sponsored by: Dell EMC Isilon	2017-08-27 17:12:30 +00:00
Conrad Meyer	eee87314d3	Improve scheduler performance Improve scheduler performance by flattening nonsensical topology layers (layers with only one child don't serve any purpose). This is especially relevant on non-AMD Zen systems after r322776. On my dual core Intel laptop, this brings the kern.sched.topology_spec table down from three levels to two. Submitted by: jeff Reviewed by: attilio Sponsored by: Dell EMC Isilon	2017-08-27 05:14:48 +00:00
John Baldwin	12fb14f36d	Don't grab SOCK_LOCK for soref() when queuing an AIO request. The AIO job holds a reference on the associated file descriptor, so the socket's count should already be > 0. This fixes a LOR with the socket buffer lock after recent socket locking changes in HEAD. Sponsored by: Chelsio Communications	2017-08-25 23:10:27 +00:00
Alan Cox	a0ae476b7e	Correct a regression in the previous change, r322459. Specifically, the removal of the "blk" parameter from blst_meta_alloc() had the unintended effect of generating an out-of-range allocation when the cursor reaches the end of the tree if the number of managed blocks in the tree equals the so-called "radix" (which in the blist code is not the standard notion of what a radix is but rather the maximum number of leaves in a tree of the current height.) In other words, only certain swap configurations were affected, which is why earlier testing did not reveal the problem. Submitted by: Doug Moore <dougm@rice.edu> Reported by: pho, kib Tested by: pho X-MFC with: r322459 Differential Revision: https://reviews.freebsd.org/D12106	2017-08-25 18:47:23 +00:00
Gleb Smirnoff	555b3e2f2c	Third take on the r319685 and r320480. Actually allow for call soisconnected() via soisdisconnected(), and in the earlier unlock earlier to avoid lock recursion. This fixes a situation when a socket on accept queue is reset before being accepted. Reported by: Jason Eggleston <jeggleston llnw.com>	2017-08-24 20:49:19 +00:00
Conrad Meyer	d2e155a4f0	Remove unused declaration and update ddb.4 A follow-up to r322836. Warnings for the unused declaration were breaking some second tier architectures, but did not show up in Clang on x86. Reported by: markj (ddb.4), emaste (declaration) Sponsored by: Dell EMC Isilon	2017-08-24 19:16:25 +00:00
Conrad Meyer	0c1d923efb	Merge print_lockchain and print_sleepchain When debugging a deadlock, it is useful to follow the full chain of locks as far as possible. Reviewed by: jhb Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D12115	2017-08-24 15:12:16 +00:00
Jung-uk Kim	a1d0659ca9	Fix size to copyout(9) for cpuset_getid(2). MFC after: 3 days	2017-08-22 20:46:29 +00:00
Conrad Meyer	bb14d5643b	subr_smp: Clean up topology analysis, add additional layers Rather than repeatedly nesting loops, separate concerns with a single loop per call stack level. Use a table to drive the recursive routine. Handle missing topology layers more gracefully (infer a single unit). Analyze some additional optional layers which may be present on e.g. AMD Zen systems (groups, aka dies, per package; and cachegroups, aka CCXes, per group). Display that additional information in the boot-time topology information, when it is relevant (non-one). Reviewed by: markj@, mjoras@ (earlier version) Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D12019	2017-08-22 00:10:15 +00:00
Konstantin Belousov	b59ea73029	Allow vinvalbuf() to operate with the shared vnode lock. This mode allows other clean buffers to arrive while we flush the buf lists for the vnode, which is fine for the targeted use. We only need that all buffers existed at the time of the function start were flushed. In fact, only one assert has to be relaxed. In collaboration with: pho Reviewed by: rmacklem Sponsored by: The FreeBSD Foundation MFC after: 2 weeks X-Differential revision: https://reviews.freebsd.org/D12083	2017-08-20 10:07:45 +00:00
Mark Johnston	e9666bf645	Remove some unneeded subroutines for padding writes to dump devices. Right now we only need to pad when writing kernel dump headers, so flatten three related subroutines into one. The encrypted kernel dump code already writes out its key in a dumper.blocksize-sized block. No functional change intended. Reviewed by: cem, def Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D11647	2017-08-18 04:07:25 +00:00
Mark Johnston	01938d3666	Rename mkdumpheader() and group EKCD functions in kern_shutdown.c. This helps simplify the code in kern_shutdown.c and reduces the number of globally visible functions. No functional change intended. Reviewed by: cem, def Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D11603	2017-08-18 04:04:09 +00:00
Mark Johnston	50ef60dabe	Factor out duplicated kernel dump code into dump_{start,finish}(). dump_start() and dump_finish() are responsible for writing kernel dump headers, optionally writing the key when encryption is enabled, and initializing the initial offset into the dump device. Also remove the unused dump_pad(), and make some functions static now that they're only called from kern_shutdown.c. No functional change intended. Reviewed by: cem, def Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D11584	2017-08-18 03:52:35 +00:00
Lawrence Stewart	9a61faf67d	An off-by-one error exists in sbuf_vprintf()'s use of SBUF_HASROOM() when an sbuf is filled to capacity by vsnprintf(), the loop exits without error, and the sbuf is not marked as auto-extendable. SBUF_HASROOM() evaluates true if there is room for one or more non-NULL characters, but in the case that the sbuf was filled exactly to capacity, SBUF_HASROOM() evaluates false. Consequently, sbuf_vprintf() incorrectly assigns an ENOMEM error to the sbuf when in fact everything is fine, in turn poisoning the buffer for all subsequent operations. Correct by moving the ENOMEM assignment into the loop where it can be made unambiguously. As a related safety net change, explicitly check for the zero bytes drained case in sbuf_drain() and set EDEADLK as the error. This avoids an infinite loop in sbuf_vprintf() if a drain function were to inadvertently return a value of zero to sbuf_drain(). Reviewed by: cem, jtl, gallatin MFC after: 2 weeks Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D8535	2017-08-18 02:06:28 +00:00
Lawrence Stewart	a8ec96af28	Implement simple record boundary tracking in sbuf(9) to avoid record splitting during drain operations. When an sbuf is configured to use this feature by way of the SBUF_DRAINTOEOR sbuf_new() flag, top-level sections started with sbuf_start_section() create a record boundary marker that is used to avoid flushing partial records. Reviewed by: cem,imp,wblock MFC after: 2 weeks Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D8536	2017-08-17 07:20:09 +00:00
Ian Lepore	ce44a73667	Fix compile error with option DEBUG. This is fallout from some long-ago INTRNG refactoring that didn't get caught at the time because code in a debugf() statement isn't compiled unless DEBUG is defined. PR: 221557	2017-08-16 16:51:55 +00:00
Conrad Meyer	f3fed04372	Fix a couple of comment typos No functional change. Submitted by: Anton Rang <anton.rang AT isilon.com> Sponsored by: Dell EMC Isilon	2017-08-15 02:21:02 +00:00
Ian Lepore	2db14f97de	Add config_intrhook_oneshot(): schedule an intrhook function and unregister it automatically after it runs. The config_intrhook mechanism allows a driver to stall the boot process until device(s) required for booting are available, by not allowing system inits to proceed until all intrhook functions have been unregistered. Virtually all existing code simply unregisters from within the hook function when it gets called. This new function makes that common usage more convenient. Instead of allocating and filling in a struct, passing it to a function that might (in theory) fail, and checking the return code, now a driver can simply call this cannot-fail routine, passing just the intrhook function and its arg. Differential Revision: https://reviews.freebsd.org/D11963	2017-08-13 18:10:24 +00:00


15996 Commits