freebsd-skq

Author	SHA1	Message	Date
pfg	50d0301ca5	kern: make some use of mallocarray(9). Focus on code where we are doing multiplications within malloc(9). None of these ire likely to overflow, however the change is still useful as some static checkers can benefit from the allocation attributes we use for mallocarray. This initial sweep only covers malloc(9) calls with M_NOWAIT. No good reason but I started doing the changes before r327796 and at that time it was convenient to make sure the sorrounding code could handle NULL values. X-Differential revision: https://reviews.freebsd.org/D13837	2018-01-15 21:18:04 +00:00
ian	9c7cac637a	Add RTC clock conversions for BCD values, with non-panic validation. RTC clock hardware frequently uses BCD numbers. Currently the low-level bcd2bin() and bin2bcd() functions will KASSERT if given out-of-range BCD values. Every RTC driver must implement its own code for validating the unreliable data coming from the hardware to avoid a potential kernel panic. This change introduces two new functions, clock_bcd_to_ts() and clock_ts_to_bcd(). The former validates its inputs and returns EINVAL if any values are out of range. The latter guarantees the returned data will be valid BCD in a known format (4-digit years, etc). A new bcd_clocktime structure is used with the new functions. It is similar to the original clocktime structure, but defines the fields holding BCD values as uint8_t (uint16_t for year), and adds a PM flag for handling hours using AM/PM mode. PR: 224813 Differential Revision: https://reviews.freebsd.org/D13730 (no reviewers)	2018-01-14 17:01:37 +00:00
bz	8d499a89c5	Remove trailing whitespace. No functional change.	2018-01-14 15:01:25 +00:00
kib	78904c70bb	Add sysctl debug.kdb.stack_overflow to conveniently test kernel handling of the kstack overflow. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-01-13 11:59:49 +00:00
mjg	00f02d857e	sx: retry hard shared unlock just like in r327905 for rwlocks	2018-01-13 09:26:24 +00:00
mjg	6f29c630d9	rwlock: try regular read unlock even in the hard path Saves on turnstile trips if the lock got more readers.	2018-01-13 00:05:31 +00:00
jeff	f375b4dd66	Implement NUMA support in uma(9) and malloc(9). Allocations from specific domains can be done by the _domain() API variants. UMA also supports a first-touch policy via the NUMA zone flag. The slab layer is now segregated by VM domains and is precise. It handles iteration for round-robin directly. The per-cpu cache layer remains a mix of domains according to where memory is allocated and freed. Well behaved clients can achieve perfect locality with no performance penalty. The direct domain allocation functions have to visit the slab layer and so require per-zone locks which come at some expense. Reviewed by: Attilio (a slightly older version) Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon	2018-01-12 23:25:05 +00:00
jeff	e7c9f84113	Implement NUMA policy for kmem_*(9). This maintains compatibility with reservations by giving each memory domain its own KVA space in vmem that is naturally aligned on superpage boundaries. Reviewed by: alc, markj, kib (some objections) Sponsored by: Netflix, Dell/EMC Isilon Tested by; pho Differential Revision: https://reviews.freebsd.org/D13289	2018-01-12 23:13:55 +00:00
jeff	058d03378a	Regenerate auto-generated files	2018-01-12 23:06:35 +00:00
jeff	94c7af8ca2	Implement 'domainset', a cpuset based NUMA policy mechanism. This allows userspace to control NUMA policy administratively and programmatically. Implement domainset based iterators in the page layer. Remove the now legacy numa_* syscalls. Cleanup some header polution created by having seq.h in proc.h. Reviewed by: markj, kib Discussed with: alc Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D13403	2018-01-12 22:48:23 +00:00
mjg	ddfd5797d3	mtx: use fcmpset to cover setting MTX_CONTESTED	2018-01-12 13:40:50 +00:00
mjg	749ff5c610	vfs: tidy up vdrop Skip vfs_refcount_release_if_not_last if the interlock is held and just go straight to refcount_release. While here do cosmetic rearrangement of _vhold to better show it contains equivalent behaviour.	2018-01-12 13:39:02 +00:00
tuexen	131d6b30b7	Ensure that the vnet is set when calling pru_sockaddr() and pru_peeraddr(). This is already true when called via kern_getsockname() and kern_getpeername(). This patch sets it also, when they arecalled via soo_fill_kinfo(). This is necessary, since the corresponding functions for SCTP require the vnet to be set. Without this, if a process having an wildcard bound SCTP socket is terminated and a core is written, the kernel panics. Reviewed by: bz Differential Revision: https://reviews.freebsd.org/D13652	2018-01-11 20:26:17 +00:00
cem	2c9ae2323b	mallocarray(9): panic if the requested allocation would overflow Additionally, move the overflow check logic out to WOULD_OVERFLOW() for consumers to have a common means of testing for overflowing allocations. WOULD_OVERFLOW() should be a secondary check -- on 64-bit platforms, just because an allocation won't overflow size_t does not mean it is a sane size to request. Callers should be imposing reasonable allocation limits far, far, below overflow. Discussed with: emaste, jhb, kp Sponsored by: Dell EMC Isilon	2018-01-10 21:49:45 +00:00
jhb	db85554676	Don't store shadow copies of per-process AIO limits. Previously the AIO subsystem would save a snapshot of the currently configured per-process limits the first time a process used AIO. The process would continue to use the snapshotted limits ignoring any changes to the global limits during the rest of its lifetime. This change removes the snapshotted values and changes the AIO code to always check the global values which can be toggled at runtime. This means an administrator can now change the effective limits of existing processes. This is more consistent with how other limits configured via sysctl work in FreeBSD. Reviewed by: asomers, kib MFC after: 2 weeks Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D13819	2018-01-10 21:18:46 +00:00
jhb	0e2f42609d	Allow the fast-path for disk AIO requests to fail requests. - If aio_qphysio() returns a non-zero error code, fail the request rather than queueing it to the AIO kproc pool to be retried via the slow path. Currently this means that if vm_fault_quick_hold_pages() reports an error, EFAULT is returned from the fast-path rather than retrying the request in the slow path where it will still fail with EFAULT. - If aio_qphysio() wishes to use the fast path for a device that doesn't support unmapped I/O but there are already the maximum number of such requests in flight, fail with EAGAIN as we do for other AIO resource limits rather than queueing the request to the AIO kproc pool. - Move the opcode check for aio_qphysio() out of the caller and into aio_qphysio() to simplify some logic and remove two goto's while here. It also uses a whitelist (only supported for LIO_READ / LIO_WRITE) rather than a blacklist (skipped for LIO_SYNC). PR: 217261 Submitted by: jkim (an earlier version) MFC after: 2 weeks Sponsored by: Chelsio Communications	2018-01-10 00:18:47 +00:00
jhb	0fe959494b	Simplify some logic by merging an if test with a subsequent switch. Specifically, in aio_queue_file() the code was doing this: if (opcode == LIO_SYNC) { ... } switch (opcode) { ... case LIO_SYNC: ... } This moves the body of the if statement into the LIO_SYNC case of the switch statement. MFC after: 2 weeks Sponsored by: Chelsio Communications	2018-01-10 00:02:06 +00:00
jhb	fb5e317b48	Add a counter to track in-flight AIO requests using unmapped I/O. MFC after: 2 weeks Sponsored by: Chelsio Communications	2018-01-09 23:57:29 +00:00
markj	8b210be68e	Generalize the gzio API. We currently use a set of subroutines in kern_gzio.c to perform compression of user and kernel core dumps. In the interest of adding support for other compression algorithms (zstd) in this role without complicating the API consumers, add a simple compressor API which can be used to select an algorithm. Also change the (non-default) GZIO kernel option to not enable compressed user cores by default. It's not clear that such a default would be desirable with support for multiple algorithms implemented, and it's inconsistent in that it isn't applied to kernel dumps. Reviewed by: cem Differential Revision: https://reviews.freebsd.org/D13632	2018-01-08 21:27:41 +00:00
ian	f47ac564b6	Use EVENTHANDLER_DIRECT_INVOKE for [un]mount events, for better performance.	2018-01-07 18:07:22 +00:00
ian	bdbecc7f3f	Use EVENTHANDLER_DIRECT_INVOKE() for device events, for better performance.	2018-01-07 18:06:30 +00:00
kp	69291826ac	Introduce mallocarray() in the kernel Similar to calloc() the mallocarray() function checks for integer overflows before allocating memory. It does not zero memory, unless the M_ZERO flag is set. Reviewed by: pfg, vangyzen (previous version), imp (previous version) Obtained from: OpenBSD Differential Revision: https://reviews.freebsd.org/D13766	2018-01-07 13:21:01 +00:00
glebius	6947b91466	In sendfile_iodone() both pru_abort and sorele need to be executed with proper VNET context set. Reported by: sbruno MFC after: 2 weeks	2018-01-05 20:21:46 +00:00
jhb	3cc4339e5b	Always use atomic_fetchadd() when updating per-user accounting values. This avoids re-reading a variable after it has been updated via an atomic op. It is just a cosmetic cleanup as the read value was only used to control a diagnostic printf that should rarely occur (if ever). Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D13768	2018-01-04 22:07:58 +00:00
jhb	a5135f50cf	Report offset relative to the backing object for kinfo_vmentry structures. For the pathname reported in kinfo_vmentry structures (kve_path), the sysctl handlers walk the object chain to find the bottom-most VM object. This permits a COW mapping of a file with dirty pages to report the pathname of the originally mapped file. Do the same for the object offset (kve_offset) computing a cumulative offset during the same object walk so that the reported offset is relative to the reported pathname. Note that ptrace(PT_VM_ENTRY) already returns a cumulative offset rather than the raw offset of the VM map entry. Note also that this does not affect procstat -v output (even structured output) since that output does not include the kve_offset field. Reviewed by: kib MFC after: 2 weeks Sponsored by: DARPA / AFRL Differential Revision: https://reviews.freebsd.org/D13767	2018-01-04 21:59:34 +00:00
karels	95cd76bc3b	make SW_WATCHDOG dynamic Enable the hardclock-based watchdog previously conditional on the SW_WATCHDOG option whenever hardware watchdogs are not found, and watchdogd attempts to enable the watchdog. The SW_WATCHDOG option still causes the sofware watchdog to be enabled even if there is a hardware watchdog. This does not change the other software-based watchdog enabled by the --softtimeout option to watchdogd. Note that the code to reprime the watchdog during kernel core dumps is no longer conditional on SW_WATCHDOG. I think this was previously a bug. Reviewed by: imp alfred bjk MFC after: 1 week Relnotes: yes Differential Revision: https://reviews.freebsd.org/D13713	2018-01-03 00:56:30 +00:00
antoine	2d68146a96	sysctl_kern_proc_args: do not take the fast path if p_args is NULL In this case it falls back to reading ps_strings	2018-01-01 21:25:01 +00:00
cperciva	55fe5887ff	Use the TSLOG framework to record entry/exit timestamps for DELAY and _vprintf; these functions are called in many places and can contribute meaningfully to the total time spent booting.	2017-12-31 09:24:41 +00:00
cperciva	d913278fc1	Instrument thread creations for the the benefit of the TSLOG framework. This assists in tracking time spent while the boot is being "held" waiting for something to happen.	2017-12-31 09:24:11 +00:00
cperciva	95c6ba719c	Instrument "boot holds" for the benefit of the TSLOG framework. These are places where the "main thread" of the booting kernel (either the thread which later becomes swapper or the thread which later becomes init) has to stop and wait for action to take place in another thread before continuing. There are currently three such holds: 1. The intr_config_hooks SYSINIT waits for hooks registered via the config_intrhook_establish function; this allows (typically) devices which need interrupts enabled to complete their initialization to do so before root is mounted. 2. The g_waitidle function waits for the GEOM event queue to be empty; this ensures that all of the disks which have been attached have been tasted before we attempt to mount root. 3. The vfs_mountroot_wait function (in addition to calling g_waitidle) waits for holds registered via root_mount_hold; among other things, this is used by the USB subsystem to ensure that we don't fail to mount root if it's located on a USB disk which takes a while to probe.	2017-12-31 09:23:52 +00:00
cperciva	a8789eddcf	Teach makeobjops.awk to accept PROLOG and EPILOG blocks before METHOD and STATICMETHOD declarations; that code will be inserted into the dispatch function before and after the method call. Use this functionality and the TSLOG framework to record DEVICE_ATTACH and DEVICE_PROBE entry/exit timestamps.	2017-12-31 09:23:19 +00:00
cperciva	c5a93994dc	Use the TSLOG framework to record entry/exit timestamps for machine independent functions with important roles in the early boot process: mi_startup (with the "exit" recorded when it becomes swapper), start_init (with the "exit" recorded when the thread is about to "return" into the newly created init process), vfs_mountroot, and vfs_mountroot_wait.	2017-12-31 09:22:31 +00:00
cperciva	e9be0de69c	Code for recording timestamps of events, especially function entries/exits. This is a very primitive system, intended for use in measuring performance during the early system boot, before more sophisticated tools like DTrace or infrastructure like kernel memory allocation and mutexes are available. Because this code records pointers to strings rather than copying strings (in order to keep the memory usage more manageable), if a kernel module is unloaded after logging an event, Bad Things can happen. Users are advised to not do that. Since cycle counts from the early kernel boot are used as an initial entropy source, publishing this information to userland could result in inadequate entropy being kept private to the kernel RNG. Users are advised to not enable this on systems with untrusted users. Discussed on: freebsd-current	2017-12-31 09:21:01 +00:00
pfg	81e471c032	sysv_{ipc\|shm}: update the NetBSD VCS tags to match nearer our files. Both files originated in NetBSD: sysv_ipc.c CVS 1.9: Most of their changes don't apply to us as we already have similar changes. This is a better reference for future merges. sysv_shm.c CVS 1.39: Most of their changes don't apply to our code but interestingly this revision merged our changes and is a better point for reference. Move the VCS tags to the position recommended in our committers guide (section 8), No functional change.	2017-12-31 03:34:00 +00:00
mjg	ea6bb8f729	locks: adjust loop limit check when waiting for readers The check was for the exact value, but since the counter started being incremented by the number of readers it could have jumped over.	2017-12-31 02:31:01 +00:00
mjg	38154309c8	sx: fix up non-smp compilation after r327397	2017-12-31 01:59:56 +00:00
mjg	1cb4f9d747	locks: re-check the reason to go to sleep after locking sleepq/turnstile In both rw and sx locks we always go to sleep if the lock owner is not running. We do spin for some time if the lock is read-locked. However, if we decide to go to sleep due to the lock owner being off cpu and after sleepq/turnstile gets acquired the lock is read-locked, we should fallback to the aforementioned wait.	2017-12-31 00:47:04 +00:00
mjg	095fbaddd3	sx: read the SX_NOADAPTIVE flag and Giant ownership only once These used to be read multiple times when waiting for the lock the become free, which had the potential to issue completely avoidable traffic.	2017-12-31 00:37:50 +00:00
mjg	c7b2a94e2a	mtx: deduplicate indefinite wait check in spinlocks and thread lock	2017-12-31 00:34:29 +00:00
mjg	a59af230d8	mtx: pre-read the lock value in thread_lock_flags_ Since this function is effectively slow path, if we get here the lock is most likely already taken in which case it is cheaper to not blindly attempt the atomic op. While here move hwpmc probe out of the loop to match other primitives.	2017-12-31 00:33:28 +00:00
mjg	c5a49cfe43	rwlock: tidy up __rw_runlock_hard similarly to r325921	2017-12-31 00:31:14 +00:00
kib	d1455c25af	Make kern_proc_vmmap_resident() externally accesible, and move the vmmap_skip_res_cnt control check inside it. Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D13595	2017-12-28 13:16:32 +00:00
eadler	421a929b1e	kernel: Fix several typos and minor errors - duplicate words - typos - references to old versions of FreeBSD Reviewed by: imp, benno	2017-12-27 03:23:21 +00:00
kan	c8da6fae2c	Do pass removing some write-only variables from the kernel. This reduces noise when kernel is compiled by newer GCC versions, such as one used by external toolchain ports. Reviewed by: kib, andrew(sys/arm and sys/arm64), emaste(partial), erj(partial) Reviewed by: jhb (sys/dev/pci/* sys/kern/vfs_aio.c and sys/kern/kern_synch.c) Differential Revision: https://reviews.freebsd.org/D10385	2017-12-25 04:48:39 +00:00
kan	37353d6326	Reverse the check to allocate the buffer if cached pointer is NULL. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D13596	2017-12-23 17:55:19 +00:00
kan	d0a5d6e832	Remove dead store to local variable.	2017-12-23 16:49:57 +00:00
bde	cf8a25e82e	Use resume_cpus() instead of restart_cpus() to resume from ACPI suspension. restart_cpus() worked well enough by accident. Before this set of fixes, resume_cpus() used the same cpuset (started_cpus, meaning CPUs directed to restart) as restart_cpus(). resume_cpus() waited for the wrong cpuset (stopped_cpus) to become empty, but since mixtures of stopped and suspended CPUs are not close to working, stopped_cpus must be empty when resuming so the wait is null -- restart_cpus just allows the other CPUs to restart and returns without waiting. Fix resume_cpus() to wait on a non-wrong cpuset for the ACPI case, and add further kludges to try to keep it working for the XEN case. It was only used for XEN. It waited on suspended_cpus. This works for XEN. However, for ACPI, resuming is a 2-step process. ACPI has already woken up the other CPUs and removed them from suspended_cpus. This fix records the move by putting them in a new cpuset resuming_cpus. Waiting on suspended_cpus would give the same null wait as waiting on stopped_cpus. Wait on resuming_cpus instead. Add a cpuset toresume_cpus to map the CPUs being told to resume to keep this separate from the cpuset started_cpus for mapping the CPUs being told to restart. Mixtures of stopped and suspended/resuming CPUs are still far from working. Describe new and some old cpusets in comments. Add further kludges to cpususpend_handler() to try to avoid breaking it for XEN. XEN doesn't use resumectx(), so it doesn't use the second return path for savectx(), and it goes from the suspended state directly to the restarted state, while ACPI resume goes through the resuming state. Enter the resuming state early for all cases so that resume_cpus can test for being in this state and not have to worry about the intermediate !suspended state for ACPI only. Reviewed by: kib	2017-12-21 09:17:48 +00:00
jhb	e09154bf75	Rework pathconf handling for FIFOs. On the one hand, FIFOs should respect other variables not supported by the fifofs vnode operation (such as _PC_NAME_MAX, _PC_LINK_MAX, etc.). These values are fs-specific and must come from a fs-specific method. On the other hand, filesystems that support FIFOs are required to support _PC_PIPE_BUF on directory vnodes that can contain FIFOs. Given this latter requirement, once the fs-specific VOP_PATHCONF method supports _PC_PIPE_BUF for directories, it is also suitable for FIFOs permitting a single VOP_PATHCONF method to be used for both FIFOs and non-FIFOs. To that end, retire all of the FIFO-specific pathconf methods from filesystems and change FIFO-specific vnode operation switches to use the existing fs-specific VOP_PATHCONF method. For fifofs, set it's VOP_PATHCONF to VOP_PANIC since it should no longer be used. While here, move _PC_PIPE_BUF handling out of vop_stdpathconf() so that only filesystems supporting FIFOs will report a value. In addition, only report a valid _PC_PIPE_BUF for directories and FIFOs. Discussed with: bde Reviewed by: kib (part of a larger patch) MFC after: 1 month Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D12572	2017-12-19 22:39:05 +00:00
jhb	3efec8ad25	Move NAME_MAX, LINK_MAX, and CHOWN_RESTRICTED out of vop_stdpathconf(). Having all filesystems fall through to default values isn't always correct and these values can vary for different filesystem implementations. Most of these changes just use the existing default values with a few exceptions: - Don't report CHOWN_RESTRICTED for ZFS since it doesn't do the exact permissions check this claims for chown(). - Use NANDFS_NAME_LEN for NAME_MAX for nandfs. - Don't report a LINK_MAX of 0 on smbfs. Now fail with EINVAL to indicate hard links aren't supported. Requested by: bde (though perhaps not this exact implementation) Reviewed by: kib (earlier version) MFC after: 1 month Sponsored by: Chelsio Communications	2017-12-19 19:51:36 +00:00
jhb	b731672881	Add a custom VOP_PATHCONF method for fdescfs. The method handles NAME_MAX and LINK_MAX explicitly. For all other pathconf variables, the method passes the request down to the underlying file descriptor. This requires splitting a kern_fpathconf() syscallsubr routine out of sys_fpathconf(). Also, to avoid lock order reversals with vnode locks, the fdescfs vnode is unlocked around the call to kern_fpathconf(), but with the usecount of the vnode bumped. MFC after: 1 month Sponsored by: Chelsio Communications	2017-12-19 18:20:38 +00:00

1 2 3 4 5 ...

15826 Commits