freebsd-skq

Author	SHA1	Message	Date
Wojciech Macek	4d249cdd4c	ULE: provide defaults to ts_cpu Fix a bug when the system has no CPU 0. When created, threads were implicitly assigned to CPU 0. This had no practical effect since a real CPU was chosen immediately by the scheduler. However, on systems without a CPU 0, sched_ule attempted to access the scheduler queue of the "old" CPU when assigned the initial choice of the old one. This caused an attempt to use illegal memory and a crash (or, more usually, a deadlock). Fix this by assigning new threads to the BSP explicitly and add some asserts to see that this problem does not recur. Authored by: Nathan Whitehorn <nwhitehorn@freebsd.org> Submitted by: Wojciech Macek <wma@semihalf.com> Obtained from: Semihalf Differential revision: https://reviews.freebsd.org/D13932	2018-01-24 07:54:05 +00:00
Pedro F. Giffuni	44c514b142	Forgot to sort here in r328238.	2018-01-22 02:26:10 +00:00
Pedro F. Giffuni	d821d36419	Unsign some values related to allocation. When allocating memory through malloc(9), we always expect the amount of memory requested to be unsigned as a negative value would either stand for an error or an overflow. Unsign some values, found when considering the use of mallocarray(9), to avoid unnecessary casting. Also consider that indexes should be of at least the same size/type as the upper limit they pretend to index. MFC after: 3 weeks	2018-01-22 02:08:10 +00:00
Pedro F. Giffuni	ac2fffa4b7	Revert r327828, r327949, r327953, r328016-r328026, r328041: Uses of mallocarray(9). The use of mallocarray(9) has rocketed the required swap to build FreeBSD. This is likely caused by the allocation size attributes which put extra pressure on the compiler. Given that most of these checks are superfluous we have to choose better where to use mallocarray(9). We still have more uses of mallocarray(9) but hopefully this is enough to bring swap usage to a reasonable level. Reported by: wosch PR: 225197	2018-01-21 15:42:36 +00:00
Nathan Whitehorn	9a8196ce19	Remove SFBUF_OPTIONAL_DIRECT_MAP and such hacks, replacing them across the kernel by PHYS_TO_DMAP() as previously present on amd64, arm64, riscv, and powerpc64. This introduces a new MI macro (PMAP_HAS_DMAP) that can be evaluated at runtime to determine if the architecture has a direct map; if it does not (or does) unconditionally and PMAP_HAS_DMAP is either 0 or 1, the compiler can remove the conditional logic. As part of this, implement PHYS_TO_DMAP() on sparc64 and mips64, which had similar things but spelled differently. 32-bit MIPS has a partial direct-map that maps poorly to this concept and is unchanged. Reviewed by: kib Suggestions from: marius, alc, kib Runtime tested on: amd64, powerpc64, powerpc, mips64	2018-01-19 17:46:31 +00:00
Andriy Gapon	3e4f610dad	correct read-ahead calculations in vfs_bio_getpages Previously the calculations were done as if the requested region ended at the start of the last requested page, not its end. The problem was actually quite minor as it affected only stats and page prefaulting, not the actual page data, and only with specific parameters. Reviewed by: kib (previous version) MFC after: 2 weeks	2018-01-18 12:59:04 +00:00
Wojciech Macek	5b3e8b0725	KDB: restart only CPUs stopped by KDB There is a case when not all CPUs went online. In that situation, restart only APs which were operational before entering KDB. Created by: Wojciech Macek <wma@semihalf.com> Obtained from: Semihalf Reviewed by: nwhitehorn Differential revision: https://reviews.freebsd.org/D13949 Sponsored by: QCM Technologies	2018-01-18 07:38:54 +00:00
John Baldwin	58c4aee0d7	Require the SHF_ALLOC flag for program sections from kernel object modules. ELF object files can contain program sections which are not supposed to be loaded into memory (e.g. .comment). Normally the static linker uses these flags to decide which sections are allocated to loadable program segments in ELF binaries and shared objects (including kernels on all architectures and kernel modules on architectures other than amd64). Mapping ELF object files (such as amd64 kernel modules) into memory directly is a bit of a grey area. ELF object files are intended to be used as inputs to the static linker. As a result, there is not a standardized definition for what the memory layout of an ELF object should be (none of the section headers have valid virtual memory addresses for example). The kernel and loader were not checking the SHF_ALLOC flag but loading any program sections with certain types such as SHT_PROGBITS. As a result, the kernel and loader would load into RAM some sections that weren't marked with SHF_ALLOC such as .comment that are not loaded into RAM for kernel modules on other architectures (which are implemented as ELF shared objects). Aside from possibly requiring slightly more RAM to hold a kernel module this does not affect runtime correctness as the kernel relocates symbols based on the layout it uses. Debuggers such as gdb and lldb do not extract symbol tables from a running process or kernel. Instead, they replicate the memory layout of ELF executables and shared objects and use that to construct their own symbol tables. For executables and shared objects this works fine. For ELF objects the current logic in kgdb (and probably lldb based on a simple reading) assumes that only sections with SHF_ALLOC are memory resident when constructing a memory layout. If the debugger constructs a different memory layout than the kernel, then it will compute different addresses for symbols causing symbols in the debugger to appear to have the wrong values (though the kernel itself is working fine). The current port of mdb does not check SHF_ALLOC as it replicates the kernel's logic in its existing kernel support. The bfd linker sorts the sections in ELF object files such that all of the allocated sections (sections with SHF_ALLOC) are placed first followed by unallocated sections. As a result, when kgdb composed a memory layout using only the allocated sections, this layout happened to match the layout used by the kernel and loader. The lld linker does not sort the sections in ELF object files and mixed allocated and unallocated sections. This resulted in kgdb composing a different memory layout than the kernel and loader. We could either patch kgdb (and possibly in the future lldb) to use custom handling when generating memory layouts for kernel modules that are ELF objects, or we could change the kernel and loader to check SHF_ALLOC. I chose the latter as I feel we shouldn't be loading things into RAM that the module won't use. This should mostly be a NOP when linking with bfd but will allow the existing kgdb to work with amd64 kernel modules linked with lld. Note that we only require SHF_ALLOC for "program" sections for types like SHT_PROGBITS and SHT_NOBITS. Other section types such as symbol tables, string tables, and relocations must also be loaded and are not marked with SHF_ALLOC. Reported by: np Reviewed by: kib, emaste MFC after: 1 month Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D13926	2018-01-17 22:51:59 +00:00
John Baldwin	b1288166e0	Use long for the last argument to VOP_PATHCONF rather than a register_t. pathconf(2) and fpathconf(2) both return a long. The kern_[f]pathconf() functions now accept a pointer to a long value rather than modifying td_retval directly. Instead, the system calls explicitly store the returned long value in td_retval[0]. Requested by: bde Reviewed by: kib Sponsored by: Chelsio Communications	2018-01-17 22:36:58 +00:00
Pedro F. Giffuni	a18a2290cd	kern: make some use of mallocarray(9). Focus on code where we are doing multiplications within malloc(9). None of these are likely to overflow, however the change is still useful as some static checkers can benefit from the allocation attributes we use for mallocarray. This initial sweep only covers malloc(9) calls with M_NOWAIT. No good reason but I started doing the changes before r327796 and at that time it was convenient to make sure the surrounding code could handle NULL values. X-Differential revision: https://reviews.freebsd.org/D13837	2018-01-15 21:18:04 +00:00
Ian Lepore	862993757a	Add RTC clock conversions for BCD values, with non-panic validation. RTC clock hardware frequently uses BCD numbers. Currently the low-level bcd2bin() and bin2bcd() functions will KASSERT if given out-of-range BCD values. Every RTC driver must implement its own code for validating the unreliable data coming from the hardware to avoid a potential kernel panic. This change introduces two new functions, clock_bcd_to_ts() and clock_ts_to_bcd(). The former validates its inputs and returns EINVAL if any values are out of range. The latter guarantees the returned data will be valid BCD in a known format (4-digit years, etc). A new bcd_clocktime structure is used with the new functions. It is similar to the original clocktime structure, but defines the fields holding BCD values as uint8_t (uint16_t for year), and adds a PM flag for handling hours using AM/PM mode. PR: 224813 Differential Revision: https://reviews.freebsd.org/D13730 (no reviewers)	2018-01-14 17:01:37 +00:00
Bjoern A. Zeeb	8e23158af7	Remove trailing whitespace. No functional change.	2018-01-14 15:01:25 +00:00
Konstantin Belousov	fd94177c70	Add sysctl debug.kdb.stack_overflow to conveniently test kernel handling of the kstack overflow. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-01-13 11:59:49 +00:00
Mateusz Guzik	1b54ffc8d2	sx: retry hard shared unlock just like in r327905 for rwlocks	2018-01-13 09:26:24 +00:00
Mateusz Guzik	84f2a8a4b4	rwlock: try regular read unlock even in the hard path Saves on turnstile trips if the lock got more readers.	2018-01-13 00:05:31 +00:00
Jeff Roberson	ab3185d15e	Implement NUMA support in uma(9) and malloc(9). Allocations from specific domains can be done by the _domain() API variants. UMA also supports a first-touch policy via the NUMA zone flag. The slab layer is now segregated by VM domains and is precise. It handles iteration for round-robin directly. The per-cpu cache layer remains a mix of domains according to where memory is allocated and freed. Well behaved clients can achieve perfect locality with no performance penalty. The direct domain allocation functions have to visit the slab layer and so require per-zone locks which come at some expense. Reviewed by: Attilio (a slightly older version) Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon	2018-01-12 23:25:05 +00:00
Jeff Roberson	7a469c8ef3	Implement NUMA policy for kmem_*(9). This maintains compatibility with reservations by giving each memory domain its own KVA space in vmem that is naturally aligned on superpage boundaries. Reviewed by: alc, markj, kib (some objections) Sponsored by: Netflix, Dell/EMC Isilon Tested by: pho Differential Revision: https://reviews.freebsd.org/D13289	2018-01-12 23:13:55 +00:00
Jeff Roberson	af80820a57	Regenerate auto-generated files	2018-01-12 23:06:35 +00:00
Jeff Roberson	3f289c3fcf	Implement 'domainset', a cpuset based NUMA policy mechanism. This allows userspace to control NUMA policy administratively and programmatically. Implement domainset based iterators in the page layer. Remove the now legacy numa_* syscalls. Cleanup some header pollution created by having seq.h in proc.h. Reviewed by: markj, kib Discussed with: alc Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D13403	2018-01-12 22:48:23 +00:00
Mateusz Guzik	310f24d72a	mtx: use fcmpset to cover setting MTX_CONTESTED	2018-01-12 13:40:50 +00:00
Mateusz Guzik	31c2c6e95e	vfs: tidy up vdrop Skip vfs_refcount_release_if_not_last if the interlock is held and just go straight to refcount_release. While here do cosmetic rearrangement of _vhold to better show it contains equivalent behaviour.	2018-01-12 13:39:02 +00:00
Michael Tuexen	ce076a1f58	Ensure that the vnet is set when calling pru_sockaddr() and pru_peeraddr(). This is already true when called via kern_getsockname() and kern_getpeername(). This patch sets it also when they are called via soo_fill_kinfo(). This is necessary, since the corresponding functions for SCTP require the vnet to be set. Without this, if a process having a wildcard bound SCTP socket is terminated and a core is written, the kernel panics. Reviewed by: bz Differential Revision: https://reviews.freebsd.org/D13652	2018-01-11 20:26:17 +00:00
Conrad Meyer	c02fc9607a	mallocarray(9): panic if the requested allocation would overflow Additionally, move the overflow check logic out to WOULD_OVERFLOW() for consumers to have a common means of testing for overflowing allocations. WOULD_OVERFLOW() should be a secondary check -- on 64-bit platforms, just because an allocation won't overflow size_t does not mean it is a sane size to request. Callers should be imposing reasonable allocation limits far, far, below overflow. Discussed with: emaste, jhb, kp Sponsored by: Dell EMC Isilon	2018-01-10 21:49:45 +00:00
John Baldwin	86bbef4379	Don't store shadow copies of per-process AIO limits. Previously the AIO subsystem would save a snapshot of the currently configured per-process limits the first time a process used AIO. The process would continue to use the snapshotted limits ignoring any changes to the global limits during the rest of its lifetime. This change removes the snapshotted values and changes the AIO code to always check the global values which can be toggled at runtime. This means an administrator can now change the effective limits of existing processes. This is more consistent with how other limits configured via sysctl work in FreeBSD. Reviewed by: asomers, kib MFC after: 2 weeks Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D13819	2018-01-10 21:18:46 +00:00
John Baldwin	f54c5606b3	Allow the fast-path for disk AIO requests to fail requests. - If aio_qphysio() returns a non-zero error code, fail the request rather than queueing it to the AIO kproc pool to be retried via the slow path. Currently this means that if vm_fault_quick_hold_pages() reports an error, EFAULT is returned from the fast-path rather than retrying the request in the slow path where it will still fail with EFAULT. - If aio_qphysio() wishes to use the fast path for a device that doesn't support unmapped I/O but there are already the maximum number of such requests in flight, fail with EAGAIN as we do for other AIO resource limits rather than queueing the request to the AIO kproc pool. - Move the opcode check for aio_qphysio() out of the caller and into aio_qphysio() to simplify some logic and remove two goto's while here. It also uses a whitelist (only supported for LIO_READ / LIO_WRITE) rather than a blacklist (skipped for LIO_SYNC). PR: 217261 Submitted by: jkim (an earlier version) MFC after: 2 weeks Sponsored by: Chelsio Communications	2018-01-10 00:18:47 +00:00
John Baldwin	7e40918452	Simplify some logic by merging an if test with a subsequent switch. Specifically, in aio_queue_file() the code was doing this: if (opcode == LIO_SYNC) { ... } switch (opcode) { ... case LIO_SYNC: ... } This moves the body of the if statement into the LIO_SYNC case of the switch statement. MFC after: 2 weeks Sponsored by: Chelsio Communications	2018-01-10 00:02:06 +00:00
John Baldwin	8091e52b42	Add a counter to track in-flight AIO requests using unmapped I/O. MFC after: 2 weeks Sponsored by: Chelsio Communications	2018-01-09 23:57:29 +00:00
Mark Johnston	78f57a9cde	Generalize the gzio API. We currently use a set of subroutines in kern_gzio.c to perform compression of user and kernel core dumps. In the interest of adding support for other compression algorithms (zstd) in this role without complicating the API consumers, add a simple compressor API which can be used to select an algorithm. Also change the (non-default) GZIO kernel option to not enable compressed user cores by default. It's not clear that such a default would be desirable with support for multiple algorithms implemented, and it's inconsistent in that it isn't applied to kernel dumps. Reviewed by: cem Differential Revision: https://reviews.freebsd.org/D13632	2018-01-08 21:27:41 +00:00
Ian Lepore	ac579135b0	Use EVENTHANDLER_DIRECT_INVOKE for [un]mount events, for better performance.	2018-01-07 18:07:22 +00:00
Ian Lepore	f031a3b25f	Use EVENTHANDLER_DIRECT_INVOKE() for device events, for better performance.	2018-01-07 18:06:30 +00:00
Kristof Provost	fd91e076c1	Introduce mallocarray() in the kernel Similar to calloc() the mallocarray() function checks for integer overflows before allocating memory. It does not zero memory, unless the M_ZERO flag is set. Reviewed by: pfg, vangyzen (previous version), imp (previous version) Obtained from: OpenBSD Differential Revision: https://reviews.freebsd.org/D13766	2018-01-07 13:21:01 +00:00
Gleb Smirnoff	b4f55763ce	In sendfile_iodone() both pru_abort and sorele need to be executed with proper VNET context set. Reported by: sbruno MFC after: 2 weeks	2018-01-05 20:21:46 +00:00
John Baldwin	2da93c21ec	Always use atomic_fetchadd() when updating per-user accounting values. This avoids re-reading a variable after it has been updated via an atomic op. It is just a cosmetic cleanup as the read value was only used to control a diagnostic printf that should rarely occur (if ever). Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D13768	2018-01-04 22:07:58 +00:00
John Baldwin	3160862437	Report offset relative to the backing object for kinfo_vmentry structures. For the pathname reported in kinfo_vmentry structures (kve_path), the sysctl handlers walk the object chain to find the bottom-most VM object. This permits a COW mapping of a file with dirty pages to report the pathname of the originally mapped file. Do the same for the object offset (kve_offset) computing a cumulative offset during the same object walk so that the reported offset is relative to the reported pathname. Note that ptrace(PT_VM_ENTRY) already returns a cumulative offset rather than the raw offset of the VM map entry. Note also that this does not affect procstat -v output (even structured output) since that output does not include the kve_offset field. Reviewed by: kib MFC after: 2 weeks Sponsored by: DARPA / AFRL Differential Revision: https://reviews.freebsd.org/D13767	2018-01-04 21:59:34 +00:00
Mike Karels	d626b50b9d	make SW_WATCHDOG dynamic Enable the hardclock-based watchdog previously conditional on the SW_WATCHDOG option whenever hardware watchdogs are not found, and watchdogd attempts to enable the watchdog. The SW_WATCHDOG option still causes the software watchdog to be enabled even if there is a hardware watchdog. This does not change the other software-based watchdog enabled by the --softtimeout option to watchdogd. Note that the code to reprime the watchdog during kernel core dumps is no longer conditional on SW_WATCHDOG. I think this was previously a bug. Reviewed by: imp alfred bjk MFC after: 1 week Relnotes: yes Differential Revision: https://reviews.freebsd.org/D13713	2018-01-03 00:56:30 +00:00
Antoine Brodin	1b25176cbc	sysctl_kern_proc_args: do not take the fast path if p_args is NULL In this case it falls back to reading ps_strings	2018-01-01 21:25:01 +00:00
Colin Percival	d5d7606c0c	Use the TSLOG framework to record entry/exit timestamps for DELAY and _vprintf; these functions are called in many places and can contribute meaningfully to the total time spent booting.	2017-12-31 09:24:41 +00:00
Colin Percival	49a4e3b4b4	Instrument thread creations for the benefit of the TSLOG framework. This assists in tracking time spent while the boot is being "held" waiting for something to happen.	2017-12-31 09:24:11 +00:00
Colin Percival	8b8a7c43a9	Instrument "boot holds" for the benefit of the TSLOG framework. These are places where the "main thread" of the booting kernel (either the thread which later becomes swapper or the thread which later becomes init) has to stop and wait for action to take place in another thread before continuing. There are currently three such holds: 1. The intr_config_hooks SYSINIT waits for hooks registered via the config_intrhook_establish function; this allows (typically) devices which need interrupts enabled to complete their initialization to do so before root is mounted. 2. The g_waitidle function waits for the GEOM event queue to be empty; this ensures that all of the disks which have been attached have been tasted before we attempt to mount root. 3. The vfs_mountroot_wait function (in addition to calling g_waitidle) waits for holds registered via root_mount_hold; among other things, this is used by the USB subsystem to ensure that we don't fail to mount root if it's located on a USB disk which takes a while to probe.	2017-12-31 09:23:52 +00:00
Colin Percival	a21a2da599	Teach makeobjops.awk to accept PROLOG and EPILOG blocks before METHOD and STATICMETHOD declarations; that code will be inserted into the dispatch function before and after the method call. Use this functionality and the TSLOG framework to record DEVICE_ATTACH and DEVICE_PROBE entry/exit timestamps.	2017-12-31 09:23:19 +00:00
Colin Percival	6032e08810	Use the TSLOG framework to record entry/exit timestamps for machine independent functions with important roles in the early boot process: mi_startup (with the "exit" recorded when it becomes swapper), start_init (with the "exit" recorded when the thread is about to "return" into the newly created init process), vfs_mountroot, and vfs_mountroot_wait.	2017-12-31 09:22:31 +00:00
Colin Percival	e31e71991a	Code for recording timestamps of events, especially function entries/exits. This is a very primitive system, intended for use in measuring performance during the early system boot, before more sophisticated tools like DTrace or infrastructure like kernel memory allocation and mutexes are available. Because this code records pointers to strings rather than copying strings (in order to keep the memory usage more manageable), if a kernel module is unloaded after logging an event, Bad Things can happen. Users are advised to not do that. Since cycle counts from the early kernel boot are used as an initial entropy source, publishing this information to userland could result in inadequate entropy being kept private to the kernel RNG. Users are advised to not enable this on systems with untrusted users. Discussed on: freebsd-current	2017-12-31 09:21:01 +00:00
Pedro F. Giffuni	0879ca728a	sysv_{ipc|shm}: update the NetBSD VCS tags to match nearer our files. Both files originated in NetBSD: sysv_ipc.c CVS 1.9: Most of their changes don't apply to us as we already have similar changes. This is a better reference for future merges. sysv_shm.c CVS 1.39: Most of their changes don't apply to our code but interestingly this revision merged our changes and is a better point for reference. Move the VCS tags to the position recommended in our committers guide (section 8). No functional change.	2017-12-31 03:34:00 +00:00
Mateusz Guzik	efa9f177f5	locks: adjust loop limit check when waiting for readers The check was for the exact value, but since the counter started being incremented by the number of readers it could have jumped over.	2017-12-31 02:31:01 +00:00
Mateusz Guzik	cde25ed4cd	sx: fix up non-smp compilation after r327397	2017-12-31 01:59:56 +00:00
Mateusz Guzik	28f1a9e3ff	locks: re-check the reason to go to sleep after locking sleepq/turnstile In both rw and sx locks we always go to sleep if the lock owner is not running. We do spin for some time if the lock is read-locked. However, if we decide to go to sleep due to the lock owner being off cpu and after sleepq/turnstile gets acquired the lock is read-locked, we should fall back to the aforementioned wait.	2017-12-31 00:47:04 +00:00
Mateusz Guzik	fb10612355	sx: read the SX_NOADAPTIVE flag and Giant ownership only once These used to be read multiple times when waiting for the lock to become free, which had the potential to issue completely avoidable traffic.	2017-12-31 00:37:50 +00:00
Mateusz Guzik	15140a8ade	mtx: deduplicate indefinite wait check in spinlocks and thread lock	2017-12-31 00:34:29 +00:00
Mateusz Guzik	1f4d28c7ea	mtx: pre-read the lock value in thread_lock_flags_ Since this function is effectively slow path, if we get here the lock is most likely already taken in which case it is cheaper to not blindly attempt the atomic op. While here move hwpmc probe out of the loop to match other primitives.	2017-12-31 00:33:28 +00:00
Mateusz Guzik	80c39f6c37	rwlock: tidy up __rw_runlock_hard similarly to r325921	2017-12-31 00:31:14 +00:00
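
The mallocarray(9) entries above (fd91e076c1, c02fc9607a) describe checking the element-count multiplication for overflow before allocating. A minimal userspace sketch of that idea follows. The names would_overflow_sketch() and mallocarray_sketch() are hypothetical and are not the kernel's mallocarray(9) or WOULD_OVERFLOW(); the sketch also returns NULL on overflow, whereas c02fc9607a says the kernel version panics.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: nonzero if nmemb * size cannot be represented
 * in a size_t.  Dividing SIZE_MAX by size avoids performing the
 * multiplication that might overflow.
 */
static int
would_overflow_sketch(size_t nmemb, size_t size)
{
	return (size != 0 && nmemb > SIZE_MAX / size);
}

/*
 * Hypothetical userspace stand-in for an overflow-checked array
 * allocation.  Like the mallocarray() described in the log, it does
 * not zero the memory.  Unlike the kernel version after c02fc9607a,
 * it reports overflow by returning NULL instead of panicking.
 */
static void *
mallocarray_sketch(size_t nmemb, size_t size)
{
	if (would_overflow_sketch(nmemb, size)) {
		errno = ENOMEM;
		return (NULL);
	}
	return (malloc(nmemb * size));
}
```

As c02fc9607a notes, an overflow check of this kind is only a secondary guard: a request can fit in size_t and still be far too large, so callers should impose their own, much lower, allocation limits.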

15808 Commits