freebsd-dev

Author	SHA1	Message	Date
Matt Macy	1f4beb6312	epoch(9): cleanups, additional debug checks, and add global_epoch - GC the _nopreempt routines - to really benefit we'd need a separate routine - they're not currently in use - they complicate the API for no benefit at this time - check that we're actually in a epoch section at exit - handle epoch_call() early in boot - Fix copyright declaration language Approved by: sbruno@	2018-05-13 23:24:48 +00:00
Konstantin Belousov	2ebc882927	Detect and optimize reads from the hole on UFS. - Create getblkx(9) variant of getblk(9) which can return error. - Add GB_NOSPARSE flag for getblk()/getblkx() which requests that BMAP was performed before the buffer is created, and EJUSTRETURN returned in case the requested block does not exist. - Make ffs_read() use GB_NOSPARSE to avoid instantiating buffer (and allocating the pages for it), copying from zero_region instead. The end result is less page allocations and buffer recycling when a hole is read, which is important for some benchmarks. Requested and reviewed by: jeff Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D14917	2018-05-13 09:47:28 +00:00
Matt Macy	f1401123c5	hwpmc/epoch - don't reference domain if NUMA is not set It appears that domain information is set correctly independent of whether or not NUMA is defined. However, there is no memory backing secondary domains leading to allocation failure. Reported by: pho@, np@ Approved by: sbruno@	2018-05-12 20:00:29 +00:00
Matt Macy	e6b475e0af	hwpmc(9): Make pmclog buffer pcpu and update constants On non-trivial SMP systems the contention on the pmc_owner mutex leads to a substantial number of samples captured being from the pmc process itself. This change a) makes buffers larger to avoid contention on the global list b) makes the working sample buffer per cpu. Run pmcstat in the background (default event rate of 64k): pmcstat -S UNHALTED_CORE_CYCLES -O /dev/null sleep 600 & Before: make -j96 buildkernel -s >&/dev/null 3336.68s user 24684.10s system 7442% cpu 6:16.50 total After: make -j96 buildkernel -s >&/dev/null 2697.82s user 1347.35s system 6058% cpu 1:06.77 total For more realistic overhead measurement set the sample rate for ~2khz on a 2.1Ghz processor: pmcstat -n 1050000 -S UNHALTED_CORE_CYCLES -O /dev/null sleep 6000 & Collecting 10 samples of `make -j96 buildkernel` from each: x before + after real time: N Min Max Median Avg Stddev x 10 76.4 127.62 84.845 88.577 15.100031 + 10 59.71 60.79 60.135 60.179 0.29957192 Difference at 95.0% confidence -28.398 +/- 10.0344 -32.0602% +/- 7.69825% (Student's t, pooled s = 10.6794) system time: N Min Max Median Avg Stddev x 10 2277.96 6948.53 2949.47 3341.492 1385.2677 + 10 1038.7 1081.06 1070.555 1064.017 15.85404 Difference at 95.0% confidence -2277.47 +/- 920.425 -68.1574% +/- 8.77623% (Student's t, pooled s = 979.596) x no pmc + pmc running real time: HEAD: N Min Max Median Avg Stddev x 10 58.38 59.15 58.86 58.847 0.22504567 + 10 76.4 127.62 84.845 88.577 15.100031 Difference at 95.0% confidence 29.73 +/- 10.0335 50.5208% +/- 17.0525% (Student's t, pooled s = 10.6785) patched: N Min Max Median Avg Stddev x 10 58.38 59.15 58.86 58.847 0.22504567 + 10 59.71 60.79 60.135 60.179 0.29957192 Difference at 95.0% confidence 1.332 +/- 0.248939 2.2635% +/- 0.426506% (Student's t, pooled s = 0.264942) system time: HEAD: N Min Max Median Avg Stddev x 10 1010.15 1073.31 1025.465 1031.524 18.135705 + 10 2277.96 6948.53 2949.47 3341.492 1385.2677 Difference at 95.0% confidence 2309.97 +/- 920.443 223.937% +/- 89.3039% (Student's t, pooled s = 979.616) patched: N Min Max Median Avg Stddev x 10 1010.15 1073.31 1025.465 1031.524 18.135705 + 10 1038.7 1081.06 1070.555 1064.017 15.85404 Difference at 95.0% confidence 32.493 +/- 16.0042 3.15% +/- 1.5794% (Student's t, pooled s = 17.0331) Reviewed by: jeff@ Approved by: sbruno@ Differential Revision: https://reviews.freebsd.org/D15155	2018-05-12 01:26:34 +00:00
Matt Macy	8dcbd0eae6	epoch(9): always set inited in epoch_init - set inited in the !usedomains case Reported by: jhibbits Approved by: sbruno	2018-05-11 18:37:14 +00:00
Matt Macy	4aa302dfc9	epoch(9): callback task fixes - initialize the pcpu STAILQ in the NUMA case - don't enqueue the callback task if there isn't sufficient work to be done Reported by: pho@ Approved by: sbruno@	2018-05-11 08:16:56 +00:00
Mateusz Guzik	85c1b3c1cb	rmlock: partially depessimize lock/unlock fastpath Previusly the slow path was folded in and partially jumped over in the common case.	2018-05-11 06:59:54 +00:00
Matt Macy	b2cb28963b	epoch(9): fix priority handling, make callback lists pcpu, and other fixes - Lend priority to preempted threads in epoch_wait to handle the case in which we've had priority lent to us. Previously we borrowed the priority of the lowest priority preempted thread. (pointed out by mjg@) - Don't attempt allocate memory per-domain on powerpc, we don't currently handle empty sockets (as is the case on jhibbits Talos' board). - Handle deferred callbacks as pcpu lists and poll the lists periodically. Currently the interval is 1/hz. - Drop the thread lock when adaptive spinning. Holding the lock starves other threads and can even lead to lockups. - Keep a generation count pcpu so that we don't keep spining if a thread has left and re-entered an epoch section. - Actually removed the callback from the callback list so that we don't double free. Sigh ... Approved by: sbruno@	2018-05-11 04:54:12 +00:00
Matt Macy	06bf2a6aef	Add simple preempt safe epoch API Read locking is over used in the kernel to guarantee liveness. This API makes it easy to provide livenes guarantees without atomics. Includes epoch_test kernel module to stress test the API. Documentation will follow initial use case. Test case and improvements to preemption handling in response to discussion with mjg@ Reviewed by: imp@, shurd@ Approved by: sbruno@	2018-05-10 17:55:24 +00:00
Andrew Gallatin	d5cdcc3a06	Fix the build after r333457 In r333457, the arguments to kern_pwritev() were accidentally re-ordered as part of ANSIfication, breaking the build.	2018-05-10 13:19:42 +00:00
Ed Maste	cc3c9df80f	ANSIfy sys_generic.c	2018-05-10 11:36:16 +00:00
Matt Macy	36688f706e	Add taskqgroup_config_gtask_deinit to support teardown after taskqgroup_config_gtask_init. Approved by: sbruno	2018-05-09 18:51:35 +00:00
Matt Macy	cbd92ce62e	Eliminate the overhead of gratuitous repeated reinitialization of cap_rights - Add macros to allow preinitialization of cap_rights_t. - Convert most commonly used code paths to use preinitialized cap_rights_t. A 3.6% speedup in fstat was measured with this change. Reported by: mjg Reviewed by: oshogbo Approved by: sbruno MFC after: 1 month	2018-05-09 18:47:24 +00:00
Konstantin Belousov	55c9d75e6b	Avoid calls to bzero() before ireloc. Evaluate cpu_stdext_feature early to have moved link_elf_ireloc() see correct flags, most important is SMAP. Tested by: mjg Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D15367	2018-05-09 14:39:24 +00:00
Matt Macy	ad738f3791	Reduce overhead of ktrace checks in the common case. KTRPOINT() checks both if we are tracing _and_ if we are recursing within ktrace. The second condition is only ever executed if ktrace is actually enabled. This change moves the check out of the hot path in to the functions themselves. Discussed with mjg@ Reported by: mjg@ Approved by: sbruno@	2018-05-09 00:00:47 +00:00
Mateusz Guzik	2824088536	Inlined sched_userret. The tested condition is rarely true and it induces a function call on each return to userspace. Bumps getuid rate by about 1% on Broadwell.	2018-05-07 23:36:16 +00:00
Mateusz Guzik	75e9b455a9	Change trap_enotcap to bool and annotate with __read_frequently It is read on each return to user space.	2018-05-07 23:10:12 +00:00
Mateusz Guzik	79ca7cbf09	Avoid calls to syscall_thread_enter/exit for statically defined syscalls The entire mechanism is rarely used and is quite not performant due to atomci ops on the syscall table. It also has added overhead for completely unrelated syscalls. Reduce it by avoiding the func calls if possible (which consistutes vast majority of cases). Provides about 3% syscall rate speed up for getuid on Broadwell.	2018-05-07 22:29:32 +00:00
Warner Losh	ad7142757b	Add device_quiet_children() and device_has_quiet_children() If you add a child to a device that has quiet children, we'll automatically set the quiet flag on the children, and its children. This is indended for things like CPU that have a large amount of repetition in booting that adds nothing.	2018-05-07 21:09:08 +00:00
Andrew Gallatin	e7bd0750af	Boost thread priority while changing CPU frequency Boost the priority of user-space threads when they set their affinity to a core to adjust its frequency. This avoids a situation where a CPU bound kernel thread with the same affinity is running on a down-clocked core, and will "block" powerd from up-clocking the core until the kernel thread yields. This can lead to poor perfomance, and to things potentially getting stuck on Giant. Reviewed by: kib (imp reviewed earlier version) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D15246	2018-05-07 15:24:03 +00:00
Mark Johnston	bd92e6b6f5	Refactor some of the MI kernel dump code in preparation for netdump. - Add clear_dumper() to complement set_dumper(). - Drain netdump's preallocated mbuf pool when clearing the dumper. - Don't do bounds checking for dumpers with mediasize 0. - Add dumper callbacks for initialization for writing out headers. Reviewed by: sbruno MFC after: 1 month Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D15252	2018-05-06 00:22:38 +00:00
Mark Johnston	5475ca5aca	Add an mbuf allocator for netdump. The aim is to permit mbuf allocations after a panic without calling into the page allocator, without imposing any runtime overhead during regular operation of the system, and without modifying driver code. The approach taken is to preallocate a number of mbufs and clusters, storing them in linked lists, and using the lists to back some UMA cache zones. At panic time, the mbuf and cluster zone pointers are overwritten with those of the cache zones so that the mbuf allocator returns preallocated items. Using this scheme, drivers which cache mbuf zone pointers from m_getzone() require special handling when implementing netdump support. Reviewed by: cem (earlier version), julian, sbruno MFC after: 1 month Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D15251	2018-05-06 00:19:48 +00:00
Mark Johnston	c2ba2d1b0e	Style. MFC after: 3 days	2018-05-06 00:11:30 +00:00
Andriy Gapon	bd3afae0ca	for bus suspend, detach and shutdown iterate children in reverse order For most buses all children are equal, so the order does not matter. Other buses, such as acpi, carefully order their child devices to express implicit dependencies between them. For such buses it is safer to bring down devices in the reverse order. I believe that this is the reason why hpet_suspend had to be disabled. Some drivers depend on a working event timer until they are suspended. But previously we would suspend hpet very early. I tested this change by makinbg hpet_suspend actually stop HPET timers and tested that too. Note that this change is not a complete solution as it does not take into account bus passes. A better approach would be to track the actual attach order of the devices and to use the reverse of that. Reviewed by: imp, mav MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D15291	2018-05-05 05:19:32 +00:00
Mateusz Guzik	5ec2c93667	tc: bcopy -> memcpy	2018-05-04 22:48:10 +00:00
Jamie Gritton	0e5c6bd436	Make it easier for filesystems to count themselves as jail-enabled, by doing most of the work in a new function prison_add_vfs in kern_jail.c Now a jail-enabled filesystem need only mark itself with VFCF_JAIL, and the rest is taken care of. This includes adding a jail parameter like allow.mount.foofs, and a sysctl like security.jail.mount_foofs_allowed. Both of these used to be a static list of known filesystems, with predefined permission bits. Reviewed by: kib Differential Revision: D14681	2018-05-04 20:54:27 +00:00
Mark Johnston	1b5c869d64	Fix some races introduced in r332974. With r332974, when performing a synchronized access of a page's "queue" field, one must first check whether the page is logically dequeued. If so, then the page lock does not prevent the page from being removed from its page queue. Intoduce vm_page_queue(), which returns the page's logical queue index. In some cases, direct access to the "queue" field is still required, but such accesses should be confined to sys/vm. Reported and tested by: pho Reviewed by: kib Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D15280	2018-05-04 17:17:30 +00:00
Matt Macy	748ff486b0	`dup1_processes -t 96 -s 5` on a dual 8160 x dup_before + dup_after +------------------------------------------------------------+ \| x + \| \|x x x x ++ ++\| \| \|____AM___\| \|AM\|\| +------------------------------------------------------------+ N Min Max Median Avg Stddev x 5 1.514954e+08 1.5230351e+08 1.5206157e+08 1.5199371e+08 341205.71 + 5 1.5494336e+08 1.5519569e+08 1.5511982e+08 1.5508323e+08 96232.829 Difference at 95.0% confidence 3.08952e+06 +/- 365604 2.03266% +/- 0.245071% (Student's t, pooled s = 250681) Reported by: mjg@ MFC after: 1 week	2018-05-04 06:51:01 +00:00
Konstantin Belousov	7035cf14ee	Implement support for ifuncs in the kernel linker. Required MD bits are only provided for x86. Reviewed by: jhb (previous version, as part of the larger patch) Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D13838	2018-05-03 21:37:46 +00:00
Stephen Hurd	f3e1324b41	Separate list manipulation locking from state change in multicast Multicast incorrectly calls in to drivers with a mutex held causing drivers to have to go through all manner of contortions to use a non sleepable lock. Serialize multicast updates instead. Submitted by: mmacy <mmacy@mattmacy.io> Reviewed by: shurd, sbruno Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D14969	2018-05-02 19:36:29 +00:00
Mark Johnston	20f85b1ddd	Print the dump progress indicator after calling dump_start(). Dumpers may wish to print messages from an initialization hook; this change ensures that such messages aren't mixed with output from the generic dump code. MFC after: 1 week	2018-05-01 17:32:43 +00:00
Nathan Whitehorn	ee900504cf	Report the kernel base address properly in kldstat when using PowerPC kernels loaded at addresses other than their link address.	2018-05-01 04:06:59 +00:00
Ed Maste	2216c6933c	Disable connectat/bindat with AT_FDCWD in capmode Previously it was possible to connect a socket (which had the CAP_CONNECT right) by calling "connectat(AT_FDCWD, ...)" even in capabilties mode. This combination should be treated the same as a call to connect (i.e. forbidden in capabilities mode). Similarly for bindat. Disable connectat/bindat with AT_FDCWD in capabilities mode, fix up the documentation and add tests. PR: 222632 Submitted by: Jan Kokemüller <jan.kokemueller@gmail.com> Reviewed by: Domagoj Stolfa MFC after: 1 week Relnotes: Yes Differential Revision: https://reviews.freebsd.org/D15221	2018-04-30 17:31:06 +00:00
Mateusz Guzik	9d68f7741f	systrace: track it like sdt probes While here predict false. Note the code is wrong (regardless of this change). Dereference of the pointer can race with module unload. A fix would set the probe to a nop stub instead of NULL.	2018-04-27 15:16:34 +00:00
Emmanuel Vadot	ee710ecf32	clk: Put the sysctls under hw.clock instead of clock This is more consistant with hw.regulator and other hardware related sysctls.	2018-04-27 00:12:00 +00:00
Mark Johnston	5cd29d0f3c	Improve VM page queue scalability. Currently both the page lock and a page queue lock must be held in order to enqueue, dequeue or requeue a page in a given page queue. The queue locks are a scalability bottleneck in many workloads. This change reduces page queue lock contention by batching queue operations. To detangle the page and page queue locks, per-CPU batch queues are used to reference pages with pending queue operations. The requested operation is encoded in the page's aflags field with the page lock held, after which the page is enqueued for a deferred batch operation. Page queue scans are similarly optimized to minimize the amount of work performed with a page queue lock held. Reviewed by: kib, jeff (previous versions) Tested by: pho Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D14893	2018-04-24 21:15:54 +00:00
Sean Bruno	7875017ca9	Revert r332894 at the request of the submitter. Submitted by: Johannes Lundberg <johalun0_gmail.com> Sponsored by: Limelight Networks	2018-04-24 19:55:12 +00:00
Conrad Meyer	65df124845	Do not totally silence suppressed secondary kasserts unless debug.kassert.do_log is disabled To totally silence and ignore secondary kassert violations after a primary panic, set debug.kassert.do_log=0 and debug.kassert.suppress_in_panic=1. Additional assertion warnings shouldn't block core dump and may alert the developer to another erroneous condition. Secondary stack traces may be printed, identically to the unsuppressed case where panic() is reentered -- controlled via debug.trace_all_panics. Sponsored by: Dell EMC Isilon	2018-04-24 19:10:51 +00:00
Conrad Meyer	07aa6ea677	Fix debug.kassert.do_log description text This has been an (incorrect) copy-paste duplicate of debug.kassert.warn_only since it was originally committed in r243980. Sponsored by: Dell EMC Isilon	2018-04-24 18:59:40 +00:00
Conrad Meyer	ad1fc31570	panic: Optionally, trace secondary panics To diagnose and fix secondary panics, it is useful to have a stack trace. When panic tracing is enabled, optionally trace secondary panics as well. The option is configured with the tunable/sysctl debug.trace_all_panics. (The original concern that inspired only tracing the primary panic was likely that the secondary trace may scroll the original panic message or trace off the screen. This is less of a concern for serial consoles with logging. Not everything has a serial console, though, so the behavior is optional.) Discussed with: jhb Sponsored by: Dell EMC Isilon	2018-04-24 18:54:20 +00:00
Jonathan T. Looney	18959b695d	Update r332860 by changing the default from suppressing post-panic assertions to not suppressing post-panic assertions. There are some post-panic assertions that are valuable and we shouldn't default to disabling them. However, when a user trips over them, the user can still adjust the tunable/sysctl to suppress them temporarily to get conduct troubleshooting (e.g. get a core dump). Reported by: cem, markj	2018-04-24 18:47:35 +00:00
Conrad Meyer	b543c98cab	lockmgr: Add missed neutering during panic r313683 introduced new lockmgr APIs that missed the panic-time neutering present in the rest of our locks. Correct that by adding the usual check. Additionally, move the __lockmgr_args neutering above the assertions at the top of the function. Drop the interlock unlock because we shouldn't have an unneutered interlock either. No point trying to unlock it. PR: 227749 Reported by: jtl Sponsored by: Dell EMC Isilon	2018-04-24 18:41:14 +00:00
Mateusz Guzik	d357c16adc	lockf: change the owner hash from pid to vnode-based This adds a bit missed due to the patch split, see r332882 Tested by: pho	2018-04-24 06:10:36 +00:00
Mateusz Guzik	7cd794214a	dtrace: depessimize dtmalloc when dtrace is active Each malloc/free was testing dtrace_malloc_enabled and forcing extra reads from the malloc type struct to see if perhaps a dtmalloc probe was on. Treat it like lockstat and sdt: have a global bolean.	2018-04-24 01:06:20 +00:00
Mateusz Guzik	4c5209cb21	lockstat: track lockstat just like sdt probes In particular flip the frequently tested var to bool.	2018-04-24 01:04:10 +00:00
Mateusz Guzik	c9e05ccd62	malloc: stop reading the subzone if MALLOC_DEBUG_MAXZONES == 1 (the default) malloc was showing at the top of profile during while running microbenchmarks. #define DTMALLOC_PROBE_MAX 2 struct malloc_type_internal { uint32_t mti_probes[DTMALLOC_PROBE_MAX]; u_char mti_zone; struct malloc_type_stats mti_stats[MAXCPU]; }; Reading mti_zone it wastes a cacheline to hold mti_probes + mti_zone (which we know is 0) + part of malloc stats of the first cpu which on top induces false-sharing. In particular will-it-scale lock1_processes -t 128 -s 10: before: average:45879692 after: average:51655596 Note the counters can be padded but the right fix is to move them to counter(9), leaving the struct read-only after creation (modulo dtrace probes).	2018-04-23 22:28:49 +00:00
Sean Bruno	7b7796eea5	Load balance sockets with new SO_REUSEPORT_LB option This patch adds a new socket option, SO_REUSEPORT_LB, which allow multiple programs or threads to bind to the same port and incoming connections will be load balanced using a hash function. Most of the code was copied from a similar patch for DragonflyBSD. However, in DragonflyBSD, load balancing is a global on/off setting and can not be set per socket. This patch allows for simultaneous use of both the current SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system. Required changes to structures Globally change so_options from 16 to 32 bit value to allow for more options. Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets. Limitations As DragonflyBSD, a load balance group is limited to 256 pcbs (256 programs or threads sharing the same socket). Submitted by: Johannes Lundberg <johanlun0@gmail.com> Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D11003	2018-04-23 19:51:00 +00:00
Mateusz Guzik	833dc05a6e	lockf: add per-chain locks to the owner hash This combined with previous changes significantly depessimizes the behaviour under contentnion. In particular the lock1_processes test (locking/unlocking separate files) from the will-it-scale suite was executed with 128 concurrency on a 4-socket Broadwell with 128 hardware threads. Operations/second (lock+unlock) go from ~750000 to ~45000000 (6000%) For reference single-process is ~1680000 (i.e. on stock kernel the resulting perf is less than half of the single-threaded run), Note this still does not really scale all that well as the locks were just bolted on top of the current implementation. Significant room for improvement is still here. In particular the top performance fluctuates depending on the extent of false sharing in given run (which extends beyond the file). Added chain+lock pairs were not padded w.r.t. cacheline size. One big ticket item is the hash used for spreading threads: it used to be the process pid (which basically serialized all threaded ops). Temporarily the vnode addr was slapped in instead. Tested by: pho	2018-04-23 08:23:10 +00:00
Mateusz Guzik	63286976b5	lockf: skip locking the graph if not necessary (common case) Tested by: pho	2018-04-23 07:54:02 +00:00
Mateusz Guzik	717df0b0e8	lockf: perform wakeup onlly when there is anybody waiting Tested by: pho	2018-04-23 07:52:56 +00:00
Mateusz Guzik	c72ead2815	lockf: skip the hard work in lf_purgelocks if possible Tested by: pho	2018-04-23 07:52:10 +00:00
Mateusz Guzik	0d3323f557	lockf: free state only when recycling the vnode This avoids malloc/free cycles when locking/unlocking the vnode when nobody is contending. Tested by: pho	2018-04-23 07:51:19 +00:00
Tijl Coosemans	7dfbbc613b	Make bufdaemon and bufspacedaemon use kthread_suspend_check instead of kproc_suspend_check. In r329612 bufspacedaemon was turned into a thread of the bufdaemon process causing both to call kproc_suspend_check with the same proc argument and that function contains the following while loop: while (SIGISMEMBER(p->p_siglist, SIGSTOP)) { wakeup(&p->p_siglist); msleep(&p->p_siglist, &p->p_mtx, PPAUSE, "kpsusp", 0); } So one thread wakes up the other and the other wakes up the first again, locking up UP machines on shutdown. Also register the shutdown handlers with SHUTDOWN_PRI_LAST + 100 so they run after the syncer has shutdown, because the syncer can cause a situation where bufdaemon help is needed to proceed. PR: 227404 Reviewed by: kib Tested by: cy, rmacklem	2018-04-22 16:05:29 +00:00
Mateusz Guzik	7d853f62bf	lockf: slightly depessimize 1. check if P_ADVLOCK is already set and if so, don't lock to set it (stolen from DragonFly) 2. when trying for fast path unlock, check that we are doing unlock first instead of taking the interlock for no reason (e.g. if we want to lock). whilere make it more likely that falling fast path will not take the interlock either by checking for state Note the code is severely pessimized both single- and multithreaded.	2018-04-22 09:30:07 +00:00
Jonathan T. Looney	44b71282b5	When running with INVARIANTS, the kernel contains extra checks. However, these assumptions may not hold true once we've panic'd. Therefore, the checks hold less value after a panic. Additionally, if one of the checks fails while we are already panic'd, this creates a double-panic which can interfere with debugging the original panic. Therefore, this commit allows an administrator to suppress a response to KASSERT checks after a panic by setting a tunable/sysctl. The tunable/sysctl (debug.kassert.suppress_in_panic) defaults to being enabled. Reviewed by: kib Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D12920	2018-04-21 17:05:00 +00:00
Konstantin Belousov	1302eea7bb	Rename PROC_PDEATHSIG_SET -> PROC_PDEATHSIG_CTL and PROC_PDEATHSIG_GET -> PROC_PDEATHSIG_STATUS for consistency with other procctl(2) operations names. Requested by: emaste Sponsored by: The FreeBSD Foundation MFC after: 13 days	2018-04-20 15:19:27 +00:00
Andriy Gapon	f87beb93e8	call racct_proc_ucred_changed() under the proc lock The lock is required to ensure that the switch to the new credentials and the transfer of the process's accounting data from the old credentials to the new ones is done atomically. Otherwise, some updates may be applied to the new credentials and then additionally transferred from the old credentials if the updates happen after proc_set_cred() and before racct_proc_ucred_changed(). The problem is especially pronounced for RACCT_RSS because - there is a strict accounting for this resource (it's reclaimable) - it's updated asynchronously by the vm daemon - it's updated by setting an absolute value instead of applying a delta I had to remove a call to rctl_proc_ucred_changed() from racct_proc_ucred_changed() and make all callers of latter call the former as well. The reason is that rctl_proc_ucred_changed, as it is implemented now, cannot be called while holding the proc lock, so the lock is dropped after calling racct_proc_ucred_changed. Additionally, I've added calls to crhold / crfree around the rctl call, because without the proc lock there is no gurantee that the new credentials, owned by the process, will stay stable. That does not eliminate a possibility that the credentials passed to the rctl will get stale. Ideally, rctl_proc_ucred_changed should be able to work under the proc lock. Many thanks to kib for pointing out the above problems. PR: 222027 Discussed with: kib No comment: trasz MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D15048	2018-04-20 13:08:04 +00:00
John Baldwin	73c8686e91	Simplify the code to allocate stack for auxv, argv[], and environment vectors. Remove auxarg_size as it was only used once right after a confusing assignment in each of the variants of exec_copyout_strings(). Reviewed by: emaste MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D15123	2018-04-19 16:00:34 +00:00
Konstantin Belousov	b940886338	Add PROC_PDEATHSIG_SET to procctl interface. Allow processes to request the delivery of a signal upon death of their parent process. Supposed consumer of the feature is PostgreSQL. Submitted by: Thomas Munro Reviewed by: jilles, mjg MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D15106	2018-04-18 21:31:13 +00:00
John Baldwin	8ce99bb405	Properly do a deep copy of the ioctls capability array for fget_cap(). fget_cap() tries to do a cheaper snapshot of a file descriptor without holding the file descriptor lock. This snapshot does not do a deep copy of the ioctls capability array, but instead uses a different return value to inform the caller to retry the copy with the lock held. However, filecaps_copy() was returning 1 to indicate that a retry was required, and fget_cap() was checking for 0 (actually '!filecaps_copy()'). As a result, fget_cap() did not do a deep copy of the ioctls array and just reused the original pointer. This cause multiple file descriptor entries to think they owned the same pointer and eventually resulted in duplicate frees. The only code path that I'm aware of that triggers this is to create a listen socket that has a restricted list of ioctls and then call accept() which calls fget_cap() with a valid filecaps structure from getsock_cap(). To fix, change the return value of filecaps_copy() to return true if it succeeds in copying the caps and false if it fails because the lock is required. I find this more intuitive than fixing the caller in this case. While here, change the return type from 'int' to 'bool'. Finally, make filecaps_copy() more robust in the failure case by not copying any of the source filecaps structure over. This avoids the possibility of leaking a pointer into a structure if a similar future caller doesn't properly handle the return value from filecaps_copy() at the expense of one more branch. I also added a test case that panics before this change and now passes. Reviewed by: kib Discussed with: mjg (not a fan of the extra branch) MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D15047	2018-04-17 18:07:40 +00:00
Brooks Davis	cee61c8cac	Stop using fuswintr() and suswintr() in the profiler. Always take the AST path rather than calling MD functions which are often implemented as always failing. The is the case on amd64, arm, i386, and powerpc. This optimization (inherited from 4.4 Lite) is a pessimization on those architectures and is the sole use of these functions. They will be removed in a seperate commit. Reviewed by: kib Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D15101	2018-04-17 16:36:53 +00:00
Alan Somers	52c0983128	lio_listio: return EAGAIN instead of EIO when out of resources This behavior is already documented by the man page, and suggested by POSIX. Reviewed by: jhb MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D15099	2018-04-16 18:12:15 +00:00
Konstantin Belousov	d86c1f0dc1	i386 4/4G split. The change makes the user and kernel address spaces on i386 independent, giving each almost the full 4G of usable virtual addresses except for one PDE at top used for trampoline and per-CPU trampoline stacks, and system structures that must be always mapped, namely IDT, GDT, common TSS and LDT, and process-private TSS and LDT if allocated. By using 1:1 mapping for the kernel text and data, it appeared possible to eliminate assembler part of the locore.S which bootstraps initial page table and KPTmap. The code is rewritten in C and moved into the pmap_cold(). The comment in vmparam.h explains the KVA layout. There is no PCID mechanism available in protected mode, so each kernel/user switch forth and back completely flushes the TLB, except for the trampoline PTD region. The TLB invalidations for userspace becomes trivial, because IPI handlers switch page tables. On the other hand, context switches no longer need to reload %cr3. copyout(9) was rewritten to use vm_fault_quick_hold(). An issue for new copyout(9) is compatibility with wiring user buffers around sysctl handlers. This explains two kind of locks for copyout ptes and accounting of the vslock() calls. The vm_fault_quick_hold() AKA slow path, is only tried after the 'fast path' failed, which temporary changes mapping to the userspace and copies the data to/from small per-cpu buffer in the trampoline. If a page fault occurs during the copy, it is short-circuit by exception.s to not even reach C code. The change was motivated by the need to implement the Meltdown mitigation, but instead of KPTI the full split is done. The i386 architecture already shows the sizing problems, in particular, it is impossible to link clang and lld with debugging. I expect that the issues due to the virtual address space limits would only exaggerate and the split gives more liveness to the platform. Tested by: pho Discussed with: bde Sponsored by: The FreeBSD Foundation MFC after: 1 month Differential revision: https://reviews.freebsd.org/D14633	2018-04-13 20:30:49 +00:00
Mateusz Guzik	e0e259a888	locks: extend speculative spin waiting for readers to drain Now that 10 years have passed since the original limit of 10000 was committed, bump it a little bit. Spinning waiting for writers is semi-informed in the sense that we always know if the owner is running and base the decision to spin on that. However, no such information is provided for read-locking. In particular this means that it is possible for a write-spinner to completely waste cpu time waiting for the lock to be released, while the reader holding it was preempted and is now waiting for the spinner to go off cpu. Nonetheless, in majority of cases it is an improvement to spin instead of instantly giving up and going to sleep. The current approach is pretty simple: snatch the number of current readers and performs that many pauses before checking again. The total number of pauses to execute is limited to 10k. If the lock is still not free by that time, go to sleep. Given the previously noted problem of not knowing whether spinning makes any sense to begin with the new limit has to remain rather conservative. But at the very least it should also be related to the machine. Waiting for writers uses parameters selected based on the number of activated hardware threads. The upper limit of pause instructions to be executed in-between re-reads of the lock is typically 16384 or 32678. It was selected as the limit of total spins. The lower bound is set to already present 10000 as to not change it for smaller machines. Bumping the limit reduces system time by few % during benchmarks like buildworld, buildkernel and others. Tested on 2 and 4 socket machines (Broadwell, Skylake). Figuring out how to make a more informed decision while not pessimizing the fast path is left as an exercise for the reader.	2018-04-11 01:43:29 +00:00
Ian Lepore	97603f1da2	Use explicit_bzero() when cleaning values out of the kernel environment. Sometimes the values contain geli passphrases being communicated from loader(8) to the kernel, and some day the compiler may decide to start eliding calls to memset() for a pointer which is not dereferenced again before being passed to free().	2018-04-10 22:57:56 +00:00
Mateusz Guzik	04457342a3	rw: whack avoidable re-reads in try_upgrade	2018-04-10 22:32:31 +00:00
Stephen Hurd	f422673e10	Make BPF global lock an SX This allows NIC drivers to sleep on polling config operations. Submitted by: Matthew Macy <mmacy@mattmacy.io> Reviewed by: shurd Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D14982	2018-04-10 19:42:50 +00:00
Mateusz Guzik	a045941bd2	locks: tweak backoff a little bit Previous limits were chosen when locking primitives had spurious lock accesses. Flipping the starting point to 1 (or rather 2 as the first call shifts it) provides a modest win when mild contention is seen while not hurting worse cases. Tested on a bunch of one, two and four socket old and new systems (Westmere, Skylake, Threadreaper and others) by doing concurrent page faults, buildkernel/buildworld and other stuff (although not all systems got all the tests). Another thing is the upper limit. It is semi-arbitrarily chosen as it was getting out of hand for slightly less small systems (e.g. a 128-thread one). Note that backoff is fundamentally a speculative bandaid and this change just makes it fit a little bit better. It remains completely oblivious to the hardware topology or the contention pattern. This is being experimented with.	2018-04-08 16:34:10 +00:00
Brooks Davis	6469bdcdb6	Move most of the contents of opt_compat.h to opt_global.h. opt_compat.h is mentioned in nearly 180 files. In-progress network driver compabibility improvements may add over 100 more so this is closer to "just about everywhere" than "only some files" per the guidance in sys/conf/options. Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of sys/compat/linux/*.c. A fake _COMPAT_LINUX option ensure opt_compat.h is created on all architectures. Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the set of compiled files. Reviewed by: kib, cem, jhb, jtl Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D14941	2018-04-06 17:35:35 +00:00
Brooks Davis	89ea4a30d6	Added SAL annotatations to system calls. Modify makesyscalls.sh to strip out SAL annotations. No functional change. This is based on work I started in CheriBSD and use to validate fat pointers at the syscall boundary. Tal Garfinkel reviewed the changes, added annotations to COMPAT* syscalls and is using them in a record and playback framework. One can envision other uses such as a WITNESS-like validator for copyin/out as speculated on in the review. As this time we are only annotating sys/kern/syscalls.master as that is sufficient for userspace work. If kernel use cases materialize, we can annotate other syscalls.master as needed. Submitted by: Tal Garfinkel <talg@cs.stanford.edu> Sponsored by: DARPA, AFRL (in part) Differential Revision: https://reviews.freebsd.org/D14285	2018-04-05 20:31:45 +00:00
Jeff Roberson	e5818a53db	Implement several enhancements to NUMA policies. Add a new "interleave" allocation policy which stripes pages across domains with a stride or width keeping contiguity within a multi-page region. Move the kernel to the dedicated numbered cpuset #2 making it possible to assign kernel threads and memory policy separately from user. This also eliminates the need for the complicated interrupt binding code. Add a sysctl API for viewing and manipulating domainsets. Refactor some of the cpuset_t manipulation code using the generic bitset type so that it can be used for both. This probably belongs in a dedicated subr file. Attempt to improve the include situation. Reviewed by: kib Discussed with: jhb (cpuset parts) Tested by: pho (before review feedback) Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14839	2018-03-29 02:54:50 +00:00
Jeff Roberson	27a3c9d710	Restore r331606 with a bugfix to setup cpuset_domain[] earlier on all platforms. Original commit message as follows: Only use CPUs in the domain the device is attached to for default assignment. Device drivers are able to override the default assignment if they bind directly. There are severe performance penalties for handling interrupts on remote CPUs and this should only be done in very controlled circumstances. Reviewed by: jhb, kib Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14838	2018-03-28 18:47:35 +00:00
Andriy Gapon	f4043145f2	ZFS vn_rele_async: catch up with the use of refcount(9) for the vnode use count It's not sufficient nor required to use the vnode interlock when checking if we are going to drop the last use count as the code in vputx() uses refcount (atomic) operations for both checking and decrementing the use code. Apply the same method to vn_rele_async(). While here, remove vn_rele_inactive(), a wrapper around vrele() that didn't add any value. Also, the change required making vfs_refcount_release_if_not_last() public. I've made vfs_refcount_acquire_if_not_zero() public as well. They are in sys/refcount.h now. While making the move I've dropped the vfs_ prefix. Reviewed by: mjg MFC after: 2 weeks Sponsored by: Panzura Differential Revision: https://reviews.freebsd.org/D14869	2018-03-28 08:55:31 +00:00
Mateusz Guzik	179da98f71	fd: tighten seq protected areas to not contain malloc/free	2018-03-28 03:07:02 +00:00
Konstantin Belousov	fb441a8829	Fix several leaks of kernel stack data through paddings. It is random collection of fixes for issues not yet corrected, reported at https://tsyrklevi.ch/clang_analyzer/freebsd_013017/. Many issues from that list were already corrected. Most of them are for compat32, old compat32 or affect both primary host ABI and compat32. The freebsd32_kldstat(), for instance, was already fixed by using malloc(M_ZERO). Patch includes correction to report the supplied version back, which is just pedantic. Reviewed by: brooks, emaste (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D14868	2018-03-27 18:05:51 +00:00
Brooks Davis	34a77b9741	Move uio enums to sys/_uio.h. Include _uio.h instead of uio.h in several headers to reduce header polution. Fix a few places that relied on header polution to get the uio.h header. I have not moved struct uio as many more things that use it rely on header polution to get other definitions from uio.h. Reviewed by: cem, kib, markj Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D14811	2018-03-27 15:20:03 +00:00
Andriy Gapon	31260bf042	vfs_donmount: in certain cases try r/o mount if r/w mount fails If the operation is not an update, if neither r/w nor r/o mode is explicitly requested, if the error code hints at the possibility of the media being read-only, and if the fallback is allowed, then we can try to automatically downgrade to the readonly mode. This is especially useful for auto-mounting of removable media that sometimes can happen to be write-protected. The fallback to r/o is not enabled by default. It can be requested on a per-mount basis with a new mount option, 'autoro'. Or it can be globally allowed by setting vfs.default_autoro. Reviewed by: cem, kib MFC after: 3 weeks Relnotes: yes Differential Revision: https://reviews.freebsd.org/D13361	2018-03-27 14:31:42 +00:00
Jeff Roberson	e8cbe51a04	Fix a bug introduced in r329612 that slowly invalidates all clean bufs. Reported by: bde Reviewed by: bde Sponsored by: Netflix, Dell/EMC Isilon	2018-03-26 18:36:17 +00:00
Mark Johnston	803c11a3a6	Use LIST_FOREACH_SAFE in sleepq_chains_remove_matching(). We may remove a sleepqueue from the hash table in sleepq_resume_thread(). Reviewed by: kib MFC after: 3 days Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D14847	2018-03-25 20:12:14 +00:00
Konstantin Belousov	ed9e8bc468	Account the size of the vslock-ed memory by the thread. Assert that all such memory is unwired on return to usermode. The count of the wired memory will be used to detect the copyout mode. Tested by: pho (as part of the larger patch) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-03-24 13:51:27 +00:00
Konstantin Belousov	161bf65f8a	In vn_io_fault1(), reduce the scope where pagefaults are disabled. Most important for the future use, do not call vm_fault_quick_hold_pages() with disabled pagefaults. Reported and tested by: pho (as part of the larger patch) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-03-24 13:13:52 +00:00
Konstantin Belousov	c398200721	Do not send signals to init directly from shutdown_nice(9), do it from the task context. shutdown_nice() is used from the fast interrupt handlers, mostly for console drivers, where we cannot lock blockable locks. Schedule the task in the fast queue to send the signal from the proper context. Reviewed by: imp Discussed with: bde Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-03-22 20:47:25 +00:00
Jeff Roberson	9a4b4cd3bc	Start witness much earlier in boot so that we can shrink the pend list and make it more immune to further change. Reviewed by: markj, imp (Part of D14707) Sponsored by: Netflix, Dell/EMC Isilon	2018-03-22 19:11:43 +00:00
Warner Losh	f0d847af61	Drop any recursed taking of Giant once and for all at the top of kern_reboot(). The shutdown path is now safe to run without Giant. Discussed with: kib@ Sponsored by: Netflix	2018-03-22 15:34:37 +00:00
Jonathan T. Looney	2529f56ed3	Add the "TCP Blackbox Recorder" which we discussed at the developer summits at BSDCan and BSDCam in 2017. The TCP Blackbox Recorder allows you to capture events on a TCP connection in a ring buffer. It stores metadata with the event. It optionally stores the TCP header associated with an event (if the event is associated with a packet) and also optionally stores information on the sockets. It supports setting a log ID on a TCP connection and using this to correlate multiple connections that share a common log ID. You can log connections in different modes. If you are doing a coordinated test with a particular connection, you may tell the system to put it in mode 4 (continuous dump). Or, if you just want to monitor for errors, you can put it in mode 1 (ring buffer) and dump all the ring buffers associated with the connection ID when we receive an error signal for that connection ID. You can set a default mode that will be applied to a particular ratio of incoming connections. You can also manually set a mode using a socket option. This commit includes only basic probes. rrs@ has added quite an abundance of probes in his TCP development work. He plans to commit those soon. There are user-space programs which we plan to commit as ports. These read the data from the log device and output pcapng files, and then let you analyze the data (and metadata) in the pcapng files. Reviewed by: gnn (previous version) Obtained from: Netflix, Inc. Relnotes: yes Differential Revision: https://reviews.freebsd.org/D11085	2018-03-22 09:40:08 +00:00
Gleb Smirnoff	27cd06b391	Redo r331328. We need to fix not only type but also format. While here again notice that we are fixing regression from r331106.	2018-03-22 05:26:27 +00:00
Gleb Smirnoff	5aab68f24a	Fix sysctl types broken in r329612.	2018-03-21 23:21:32 +00:00
Mark Johnston	a7defaea9a	Elide the object lock in the common case in vfs_vmio_unwire(). The object lock was only needed when attempting to free B_DIRECT buffer pages, and for testing for invalid pages (and freeing them if so). Handle the latter by instead moving invalid pages near the head of the inactive queue, where they will be reclaimed quickly. Reviewed by: alc, kib, jeff MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D14778	2018-03-21 21:15:43 +00:00
Warner Losh	3e867f24cb	bufshutdown is no longer called with Giant held, so there's no need to drop or pickup Giant anymore. Remove that code and adjust comments.	2018-03-21 14:46:59 +00:00
Warner Losh	d5292812f8	Remove Giant from init creation and vfs_mountroot. Sponsored by: Netflix Discussed with: kib@, mckusick@ Differential Review: https://reviews.freebsd.org/D14712	2018-03-21 14:46:54 +00:00
Conrad Meyer	c37125d9e5	Add missed sys/limits.h include Apparently header pollution on x86 hid its absense. Sorry, other arch users. Fix the missed header introduced in r331279. Reported by: tinderbox	2018-03-21 03:43:40 +00:00
Conrad Meyer	4948f7bf11	Regenerate sysent files after r331279.	2018-03-21 01:17:01 +00:00
Conrad Meyer	e9ac27430c	Implement getrandom(2) and getentropy(3) The general idea here is to provide userspace programs with well-defined sources of entropy, in a fashion that doesn't require opening a new file descriptor (ulimits) or accessing paths (/dev/urandom may be restricted by chroot or capsicum). getrandom(2) is the more general API, and comes from the Linux world. Since our urandom and random devices are identical, the GRND_RANDOM flag is ignored. getentropy(3) is added as a compatibility shim for the OpenBSD API. truss(1) support is included. Tests for both system calls are provided. Coverage is believed to be at least as comprehensive as LTP getrandom(2) test coverage. Additionally, instructions for running the LTP tests directly against FreeBSD are provided in the "Test Plan" section of the Differential revision linked below. (They pass, of course.) PR: 194204 Reported by: David CARLIER <david.carlier AT hardenedbsd.org> Discussed with: cperciva, delphij, jhb, markj Relnotes: maybe Differential Revision: https://reviews.freebsd.org/D14500	2018-03-21 01:15:45 +00:00
Jamie Gritton	672756aa9f	Represent boolean jail options as an array of structures containing the flag and both the regular and "no" names, instead of two different string arrays whose indices need to match the flag's bit position. This makes them similar to the say "jailsys" options are represented. Loop through either kind of option array with a structure pointer rather then an integer index.	2018-03-20 23:08:42 +00:00
Gleb Smirnoff	83fc34ea0d	At this point iwmesg isn't initialized yet, so print pointer to lock rather than panic before panicing.	2018-03-20 22:05:21 +00:00
Mark Johnston	8c7549da2b	Drop KTR_CONTENTION. It is incomplete, has not been adopted in the other locking primitives, and we have other means of measuring lock contention (lock_profiling, lockstat, KTR_LOCK). Drop it to slightly de-clutter the mutex code and free up a precious KTR class index. Reviewed by: jhb, mjg MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D14771	2018-03-20 15:51:05 +00:00
Justin Hibbits	2acde6a85a	Cast through uintptr_t to narrow the buf domain pointer on 32-bit archs arg2 is an intmax_t, which on 32-bit architectures is 64 bits, wider than a pointer. When &bdomain[i] is added to arg2 it widens from uintptr_t to intmax_t, then gcc whines when it gets cast to a pointer. Casting through uintptr_t silences this warning.	2018-03-20 02:01:30 +00:00
Matt Joras	d6160f6079	Fix initialization of eventhandler mutex. mtx_init does not do a copy of the name string it is passed. The eventhandler code incorrectly passed the parameter string directly to mtx_init instead of using the copy it makes. This was an existing problem with the code that I dutifully copied over in my changes in r325621. Reported by: Anton Rang <rang AT acm.org> Reviewed by: rstone, markj Approved by: rstone (mentor) MFC after: 1 week Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D14764	2018-03-19 22:43:27 +00:00
Mark Johnston	0eb50f9cd2	Have vm_page_{deactivate,launder}() requeue already-queued pages. In many cases the page is not enqueued so the change will have no effect. However, the change is needed to support an optimization in the fault handler and in some cases (sendfile, the buffer cache) it was being emulated by the caller anyway. Reviewed by: alc Tested by: pho MFC after: 2 weeks X-Differential Revision: https://reviews.freebsd.org/D14625	2018-03-18 16:40:56 +00:00
Mateusz Guzik	09bdec20a0	locks: slightly depessimize lockstat The slow path is always taken when lockstat is enabled. This induces rdtsc (or other) calls to get the cycle count even when there was no contention. Still go to the slow path to not mess with the fast path, but avoid the heavy lifting unless necessary. This reduces sys and real time during -j 80 buildkernel: before: 3651.84s user 1105.59s system 5394% cpu 1:28.18 total after: 3685.99s user 975.74s system 5450% cpu 1:25.53 total disabled: 3697.96s user 411.13s system 5261% cpu 1:18.10 total So note this is still a significant hit. LOCK_PROFILING results are not affected.	2018-03-17 19:26:33 +00:00
Jeff Roberson	3cec5c77d6	Move the dirty queues inside the per-domain structure. This resolves a bug where we had not hit global dirty limits but a single queue was starved for space by dirty buffers. A single buf_daemon is maintained for now. Add a bd_speedup() when we are low on bufspace. This can happen due to SUJ keeping many bufs locked until a cg block is written. Document this with a comment. Fix sysctls to work with per-domain variables. Add more ddb debugging. Reported by: pho Reviewed by: kib Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14705	2018-03-17 18:14:49 +00:00
Conrad Meyer	330b675f65	vfs_bio.c: Apply cleanups motivated by Coverity analysis It is believed that the conditions Coverity indicated were actually impossible to hit. So this patch just adds a cleanup to only compute v_mount once in brelse(), and in vfs_bio_getpages() always initializes error to zero to appease the static analyzer. No functional change intended. Submitted by: Darrick Lew <darrick.freebsd AT gmail.com> Reviewed by: kib Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D14613	2018-03-14 22:11:45 +00:00
Ed Maste	a95659f75f	Use C99 boolean type for translate_osrel Migrate to modern types before creating MD Linuxolator bits for new architectures. Reviewed by: cem Sponsored by: Turing Robotic Industries Inc. Differential Revision: https://reviews.freebsd.org/D14676	2018-03-13 16:40:29 +00:00
Ed Maste	b7feabf906	Use C99 designated initializers for struct execsw It it makes use slightly more clear and facilitates grepping.	2018-03-13 13:09:10 +00:00
Brooks Davis	467e627672	Use the stack for temporary storage in OTIOCCONS. The old code used the thread's pcb via the uap->data pointer. Reviewed by: ed Approved by: CheriBSD Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D14674	2018-03-12 23:04:42 +00:00
Brooks Davis	97519ff698	MIPS: Implement fueword and casueword* in assembly. Remove NO_FUEWORD so the 'e' variants are wrapped by the non-'e' variants. This is more correct and leaves sparc64 as the outlier. Reviewed by: jmallett, kib Obtained from: CheriBSD Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D14603	2018-03-12 22:10:06 +00:00
Ed Maste	5cc6d253f4	ANSIfy sys/kern/imgact_*	2018-03-12 15:45:50 +00:00
Ian Lepore	72721caf04	Make root mount timeout logic work for filesystems other than ufs. The vfs.mountroot.timeout tunable and .timeout directive in a mount.conf(5) file allow specifying a wait timeout for the device(s) hosting the root filesystem to become usable. The current mechanism for waiting for devices and detecting their availability can't be used for zfs-hosted filesystems. See the comment #20 in the PR for some expanded detail on these points. This change adds retry logic to the actual root filesystem mount. That is, insted of relying on device availability using device name lookups, it uses the kernel_mount() call itself to detect whether the filesystem can be mounted, and loops until it succeeds or the configured timeout is exceeded. These changes are based on the patch attached to the PR, but it's rewritten enough that all mistakes belong to me. PR: 208882 X-MFC after: sufficient testing, and hopefully in time for 11.1	2018-03-10 22:07:57 +00:00
Conrad Meyer	dec5441a32	subr_gtaskqueue: Fix braino from r330715 Submitted by: markj Sponsored by: Dell EMC Isilon	2018-03-10 01:53:42 +00:00
Conrad Meyer	8e0e6abc1f	subr_gtaskqueue: Fix minor leak of tq_name in error case Reported by: cppcheck Sponsored by: Dell EMC Isilon	2018-03-10 01:01:01 +00:00
Brooks Davis	dd51fec3b9	Copyout a whole int to cpuset_domain's policy pointer. The previous code only copied 16-bits and corrupted the target int. Reviewed by: kib, markj Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D14611	2018-03-09 00:50:40 +00:00
Mark Johnston	bde3b1e1a5	Return E2BIG if we run out of space writing a compressed kernel dump. ENOSPC causes the MD kernel dump code to retry the dump, but this is undesirable in the case where we legitimately ran out of space.	2018-03-08 17:04:36 +00:00
Brooks Davis	91a743004c	Use umtx_copyin_umtx_time32() in __umtx_op_lock_umutex_compat32(). Non-NULL timeouts where copied in improperly and could produce failures due to incompatible data structures. Reviewed by: kib MFC after: 3 days Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D14587	2018-03-06 01:52:04 +00:00
Brooks Davis	aec37bad99	Regen after r330517.	2018-03-05 17:02:50 +00:00
Brooks Davis	1c1b4c66b6	Remove remenants of 1990s efforts to let us run Net/OpenBSD binaries. No functional change (comments change in some generated files.) Reviewed by: kib Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D14571	2018-03-05 17:02:16 +00:00
Mateusz Guzik	0ad122a966	lockmgr: save on sleepq when cmpset fails	2018-03-05 00:30:07 +00:00
Mateusz Guzik	93d41967da	lockmgr: whack unused lockmgr_note_exclusive_upgrade	2018-03-04 22:14:20 +00:00
Mateusz Guzik	9f4e008d4a	mtx: tidy up recursion handling in thread lock Normally after grabbing the lock it has to be verified we got the right one to begin with. However, if we are recursing, it must not change thus the check can be avoided. In particular this avoids a lock read for non-recursing case which found out the lock was changed. While here avoid an irq trip of this happens. Tested by: pho (previous version)	2018-03-04 22:01:23 +00:00
Mateusz Guzik	a8e747c5e7	sx: don't do an atomic op in upgrade if it cananot succeed The code already pays the cost of reading the lock to obtain the waiters flag. Checking whether there is more than one reader is not a problem and avoids dirtying the line. This also fixes a small corner case: if waiters were to show up between reading the flag and upgrading the lock, the operation would fail even though it should not. No correctness change here though.	2018-03-04 21:41:05 +00:00
Mateusz Guzik	d94df98c5c	locks: fix a corner case in r327399 If there were exactly rowner_retries/asx_retries (by default: 10) transitions between read and write state and the waiters still did not get the lock, the next owner -> reader transition would result in the code correctly falling back to turnstile/sleepq where it would incorrectly think it was waiting for a writer and decide to leave turnstile/sleepq to loop back. From this point it would take ts/sq trips until the lock gets released. The bug sometimes manifested itself in stalls during -j 128 package builds. Refactor the code to fix the bug, while here remove some of the gratituous differences between rw and sx locks.	2018-03-04 21:38:30 +00:00
Mateusz Guzik	1c6987ebc5	lockmgr: start decomposing the main routine The main routine takes 8 args, 3 of which are almost the same for most uses. This in particular pushes it above the limit of 6 arguments passable through registers on amd64 making it impossible to tail call. This is a prerequisite for further cleanups. Tested by: pho	2018-03-04 19:12:54 +00:00
Hans Petter Selasky	2077229b56	Allow pause_sbt() to catch signals during sleep by passing C_CATCH flag. Define pause_sig() function macro helper similarly to other kernel functions which catch signals. Update outdated function description. Discussed with: kib@ MFC after: 1 week Sponsored by: Mellanox Technologies	2018-03-03 18:36:38 +00:00
Hans Petter Selasky	54fc03834a	Correct the return code from pause() during cold startup from zero to EWOULDBLOCK. This also matches the description in pause(9). Discussed with: kib@ MFC after: 1 week Sponsored by: Mellanox Technologies	2018-03-03 18:12:21 +00:00
Brooks Davis	93e48a303a	Rename kernel-only members of semid_ds and msgid_ds. This deliberately breaks the API in preperation for future syscall revisions which will remove these nonstandard members. In an exp-run a single port (devel/qemu-user-static) was found to use them which it did becuase it emulates system calls. This has been fixed in the ports tree. PR: 224443 (exp-run) Reviewed by: kib, jhb (previous version) Exp-run by: antoine Sponsored by: DARPA, AFRP Differential Revision: https://reviews.freebsd.org/D14490	2018-03-02 22:10:48 +00:00
Mateusz Guzik	c505b59961	sx: fix adaptive spinning broken in r327397 The condition was flipped. In particular heavy multithreaded kernel builds on zfs started suffering due to nested sx locks. For instance make -s -j 128 buildkernel: before: 3326.67s user 1269.62s system 6981% cpu 1:05.84 total after: 3365.55s user 911.27s system 6871% cpu 1:02.24 total ps. .-'---`-. .-'---`-. ,' `. ,' `. \| \ \| \ \| \ \| \ \ _ \ \ _ \ ,\ _ ,'-,/-)\ ,\ _ ,'-,/-)\ ( * \ \,' ,' ,'-) ( * \ \,' ,' ,'-) `._,) -',-') `._,) -',-') \/ ''/ \/ ''/ ) / / ) / / / ,'-' / ,'-'	2018-03-02 21:26:27 +00:00
Mateusz Guzik	9d4e369ae8	Don't generate data in sysctl_out_proc unless we intend to copy out. The first call is used to gauge how much spaces is needed. Just computing the size instead of generating the output allows to not take the proctree lock.	2018-02-25 15:16:58 +00:00
Jeff Roberson	1c2529ab32	Fix issues with sparse cpu allocation. Consistently use mp_maxid + 1. Reported by: pho Reviewed by: markj Sponsored by: Netflix, Dell/EMC Isilon	2018-02-25 00:35:21 +00:00
Conrad Meyer	63901c0171	kern/sys_generic.c: style(9) return(foo) -> return (foo) No functional change. Sponsored by: Dell EMC Isilon	2018-02-24 01:15:33 +00:00
Jeff Roberson	5f8cd1c0bf	Add a generic Proportional Integral Derivative (PID) controller algorithm and use it to regulate page daemon output. This provides much smoother and more responsive page daemon output, anticipating demand and avoiding pageout stalls by increasing the number of pages to match the workload. This is a reimplementation of work done by myself and mlaier at Isilon. Reviewed by: bsdimp Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14402	2018-02-23 22:51:51 +00:00
Kirk McKusick	16680b6af5	Include error number in the "fsync: giving up on dirty" message (in case it ever starts happening again in spite of 328444). Submitted by: Andreas Longwitz <longwitz at incore.de>	2018-02-23 21:57:10 +00:00
Konstantin Belousov	4c8a8cfcde	Restore UP build. Reviewed by: truckman Sponsored by: The FreeBSD Foundation	2018-02-23 18:26:31 +00:00
Ed Maste	315fbaeca2	Correct pseudo misspelling in sys/ comments contrib code and #define in intel_ata.h unchanged.	2018-02-23 18:15:50 +00:00
Don Lewis	97e9382d56	Decrease latency by not wrapping the idle loop's potentially lengthy search for a thread to steal inside a critical section. Since this allows the search to be preempted, restart the search if preemption happens since the search results found earlier may no longer be valid. Decrease the latency of starting a thread that may be assigned to this CPU during the search by polling for incoming threads during the search and switching to that thread instead of continuing the search. Test for stale search results and restart the search before going through the expense of calling tdq_lock_pair(). Retry some tests after grabbing the locks since things may have changed while waiting to get both locks. Eliminate special case handling for stealing from an SMT peer that uses 1 as the steal threshold. This can only succeed if a thread has been assigned but our SMT peer has not yet started executing it. This is quite rare and when it happens the other SMT thread is generally waiting for the same tdq lock that we hold. Basically both SMT threads are racing to grab the same spin lock. Add the kern.sched.always_steal knob from a ULE patch by jeff@. Incorporate another idea from Jeff's ULE patch. If the sched_switch() detects that the CPU is about to go idle, try to steal a thread before switching to the idle thread. Since the search for a thread to steal has to be done inside a critical section in this context, limit the impact on latency by adding the knob kern.sched.trysteal_limit to limit the topological distance of the search and don't restart the search if we detect stale results. If this search can't find an stealable thread, the idle loop can do a more complete search. Also poll for threads being assigned to this CPU during the search and switch to them instead of continuing the search. This change is responsibile for the majority of the improvement in parallel buildworld times. In sched_balance_group() change the minimum threshold from stealing a thread from 1 to 2. Poaching a newly assigned thread from a CPU that is waking up hasn't yet switched to that thread from idle is likely very rare and is likely to have the same lock race as is seen when stealing threads in the idle loop. Also use tdq_notify() to kick the destintation CPU instead of always sending an IPI. Update a stale comment, the number of transferable threads is not calculated. Reviewed by: kib (earlier version) Comments by: avg, jeff, mav MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D12130	2018-02-23 00:12:51 +00:00
Mateusz Guzik	a0c722bdbf	Fix up sysctl vfs.buffercache broken in r329612 Sample problem: top: sysctl(vfs.bufspace...) expected 8, got 4 Reported by: O. Hartmann <ohartmann walstatt.org>	2018-02-22 20:39:25 +00:00
Eric van Gyzen	0127914caa	sched_ule: update a comment to reflect reality MFC after: 3 days Sponsored by: Dell EMC	2018-02-22 17:09:26 +00:00
Jeff Roberson	683ca3a432	Fix the broken subqueue assignment for the cleanq. Reported by: pho Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon	2018-02-20 21:27:17 +00:00
Mateusz Guzik	500ca73d43	mtx: add debug assertions to mtx_spin_wait_unlocked	2018-02-20 20:39:34 +00:00
Mateusz Guzik	862db53fb5	Fix reaping on process fd close broken after r329449 The only consumer of proc_reap other than proc_to_reap was not updated to not PROC_SLOCK. Reported by: Juan Ramon Molina Menor <listjm club.fr>	2018-02-20 20:19:38 +00:00
Brooks Davis	b81e88d296	Reduce duplication in dynamic syscall registration code. Remove the unused syscall_(de)register() functions in favor of the better documented and easier to use syscall_helper_(un)register(9) functions. The default and freebsd32 versions differed in which array of struct sysents they used and a few missing updates to the 32-bit code as features were added to the main code. Reviewed by: cem Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D14337	2018-02-20 18:08:57 +00:00
Mateusz Guzik	681a1b752c	Make killpg1 perform process validity checks without proc lock held.	2018-02-20 10:52:07 +00:00
Mateusz Guzik	81d68271d7	Reduce contention on the proctree lock during heavy package build. There is a proctree -> allproc ordering established. Most of the time it is either xlock -> xlock or slock -> slock. On fork however there is a slock -> xlock pair which results in pathological wait times due to threads keeping proctree held for reading and all waiting on allproc. Switch this to xlock -> xlock. Longer term fix would get rid of proctree in this place to begin with. Right now it is necessary to walk the session/process group lists to determine which id is free. The walk can be avoided e.g. with bitmaps. The exit path used to have one place which dealt with allproc and then with proctree. Move the allproc acquire into the section protected by proctree. This reduces contention against threads waiting on proctree in the fork codepath - the fork proctree holder does not have to wait for allproc as often. Finally, move tidhash manipulation outside of the area protected by either of these locks. The removal from the hash was already unprotected. There is no legitimate reason to look up thread ids for a process still under construction. This results in about 50% wait time reduction during -j 128 package build.	2018-02-20 02:18:30 +00:00
Jeff Roberson	06220fa737	Further parallelize the buffer cache. Provide multiple clean queues partitioned into 'domains'. Each domain manages its own bufspace and has its own bufspace daemon. Each domain has a set of subqueues indexed by the current cpuid to reduce lock contention on the cleanq. Refine the sleep/wakeup around the bufspace daemon to use atomics as much as possible. Add a B_REUSE flag that is used to requeue bufs during the scan to approximate LRU rather than locking the queue on every use of a frequently accessed buf. Implement bufspace_reserve with only atomic_fetchadd to avoid loop restarts. Reviewed by: markj Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14274	2018-02-20 00:06:07 +00:00
Mateusz Guzik	2ca66c1ef5	Fix process exit vs reap race introduced in r329449 The race manifested itself mostly in terms of crashes with "spin lock held too long". Relevant parts of respective code paths: exit: reap: PROC_LOCK(p); PROC_SLOCK(p); p->p_state == PRS_ZOMBIE PROC_UNLOCK(p); PROC_LOCK(p); /* exit work / if (p->p_state == PRS_ZOMBIE) / true / proc_reap() free proc / more exit work */ PROC_SUNLOCK(p); Thus a still exiting process is reaped. Prior to the change the zombie check was followed by slock/sunlock trip which prevented the problem. Even code prior to this commit has a bug: the proc is still accessed for statistic collection purposes. However, the severity is rather small and the bug may be fixed in a future commit. Reported by: many Tested by: allanjude	2018-02-19 00:54:08 +00:00
Mateusz Guzik	d257698833	mtx: add mtx_spin_wait_unlocked The primitive can be used to wait for the lock to be released. Intended usage is for locks in structures which are about to be freed. The benefit is the avoided interrupt enable/disable trip + atomic op to grab the lock and shorter wait if the lock is held (since there is no worry someone will contend on the lock, re-reads can be more aggressive). Briefly discussed with: kib	2018-02-19 00:38:14 +00:00
Mateusz Guzik	7beb60820f	exit: get rid of PROC_SLOCK when checking a process to report, take #2 The suspension counter needs synchronisation through slock, but we don't need it to check if inspecting the counter is necessary to begin with. In the common case it is not, thus avoid the lock if possible. Reviewed by: kib Tested by: pho	2018-02-18 21:07:15 +00:00
Mariusz Zaborski	965cd21173	Fix broken assertion in r329520. Reported by: pho@ lwhsu@	2018-02-18 20:04:39 +00:00
Brooks Davis	7a095112b2	Correct/improve the descriptions if kern.ipc.(shmsegs,sema,msqids). The description of kern.ipc.shmsegs was wrong since 2005. I updated the others (which were more correct) to match. PR: 225933 Reviewed by: cem MFC after: 3 days Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D14391	2018-02-18 19:19:36 +00:00
Mariusz Zaborski	20641651ec	Use the fdeget_locked function instead of the fget_locked in the sys_capability. Reviewed by: pjd@ (earlier version) Discussed with: mjg@	2018-02-18 15:27:24 +00:00
Mateusz Guzik	8bf6ff2226	Revert r329448. Turns out is is actually racy, reproducible with stress2/misc/truss.sh Requested by: kib	2018-02-17 17:23:43 +00:00
Mateusz Guzik	e4ccf57fdc	Undo LOCK_PROFILING pessimisation after r313454 and r313455 With the option used to compile the kernel both sx and rw shared ops would always go to the slow path which added avoidable overhead even when the facility is disabled. Furthermore the increased time spent doing uncontested shared lock acquire would be bogusly added to total wait time, somewhat skewing the results. Restore old behaviour of going there only when profiling is enabled. This change is a no-op for kernels without LOCK_PROFILING (which is the default).	2018-02-17 12:07:09 +00:00
Mateusz Guzik	ad58e5e86c	exit: stop doing PROC_SLOCK just to call proc_reap It immediately does PROC_SUNLOCK anyway and the lock plays no role.	2018-02-17 09:03:11 +00:00
Mateusz Guzik	9c0e785c58	exit: get rid of PROC_SLOCK when checking a process to report All accessed fields are protected with already held process lock.	2018-02-17 08:48:45 +00:00
Mateusz Guzik	015cd8dc93	On process exit signal the parent after dropping the proctree lock.	2018-02-17 00:24:50 +00:00
Mateusz Guzik	7e588b9219	Unref the prison after proctree is dropped.	2018-02-17 00:23:56 +00:00
Mateusz Guzik	65f29b9caa	Postpone sx_sunlock(&proctree_lock) on fork until after allproc is dropped. There is a significant contention on the lock during -j 128 package build. This change drops total wait time on this lock by 60%.	2018-02-17 00:23:28 +00:00
Mateusz Guzik	6776bfeb8f	Tidy up kern_wait6 - don't relock curproc in msleep - don't relock proctree if P_STATCHILD is spotted - reformat the proc_to_reap call in the main loop	2018-02-17 00:21:50 +00:00
Brooks Davis	aff4f2d315	Reduce duplication in __acl_*_(file\|link). Add const to new kern_ functions and push down as required. Reviewed by: rwatson Obtained from: CheriBSD Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D14174	2018-02-15 21:24:43 +00:00
Mark Johnston	05f0f0e9ea	Fix the test for SET_FOREACH termination. Unlike the queue(3) _FOREACH macros, the iterator for a SET_FOREACH is not NULL after the end of the set is reached.	2018-02-15 17:35:40 +00:00
Mateusz Guzik	f795032b47	rwlock: diff-reduction of runlock compared to sx sunlock	2018-02-14 20:37:33 +00:00
Bryan Drewery	70c144dc78	nanosleep(2): Fix bogus incrementing of rmtp by tc_tick_sbt on [EINTR]. sbt is the time in the future that the tsleep_sbt() is expected to be completed at. sbtt is the current time. Depending on the precision with sysctl kern.timecounter.alloweddeviation the start time may be incremented by tc_tick_sbt. The same increment is needed for the current time of sbtt before calculating the difference. The impact of missing this increment is that rmtp may increase by one tc_tick_sbt on every early [EINTR] return. If the same struct is passed in for rqtp as rmtp this can result in rqtp effectively incrementing by tc_tick_sbt and sleeping longer than originally intended. This problem was introduced in r247797. Reviewed by: kib, markj, vangyzen (all on an older version of the test) MFC after: 2 weeks Sponsored by: Dell EMC Differential Revision: https://reviews.freebsd.org/D14362	2018-02-14 18:43:50 +00:00
Mark Johnston	6026dcd7ca	Add support for zstd-compressed user and kernel core dumps. This works similarly to the existing gzip compression support, but zstd is typically faster and gives better compression ratios. Support for this functionality must be configured by adding ZSTDIO to one's kernel configuration file. dumpon(8)'s new -Z option is used to configure zstd compression for kernel dumps. savecore(8) now recognizes and saves zstd-compressed kernel dumps with a .zst extension. Submitted by: cem (original version) Relnotes: yes Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D13101, https://reviews.freebsd.org/D13633	2018-02-13 19:28:02 +00:00
Ian Lepore	157f3d7649	Fix bad indentation. Whitespace only, no functional changes. Reported by: bde@	2018-02-13 17:38:08 +00:00
Jeff Roberson	e958ad4cf3	Make v_wire_count a per-cpu counter(9) counter. This eliminates a significant source of cache line contention from vm_page_alloc(). Use accessors and vm_page_unwire_noq() so that the mechanism can be easily changed in the future. Reviewed by: markj Discussed with: kib, glebius Tested by: pho (earlier version) Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14273	2018-02-12 22:53:00 +00:00
Ian Lepore	bd54c5acba	Add a new sysctl, debug.clock_do_io, to allow manully triggering a one-shot read or write of all registered realtime clocks. In the read case, the values read are simply discarded. For writes, there's no alternative but to actually write the current system time to the device.	2018-02-12 17:41:11 +00:00
Ian Lepore	45eee6db6f	Add a set of convenience routines for RTC drivers to use for debug output, and a debug.clock_show_io sysctl to control debugging output.	2018-02-12 17:33:14 +00:00
Ian Lepore	aedc51f11a	Replace the existing print_ct() private debugging function with a set of three public functions to format and print the three major data structures used by realtime clock drivers (clocktime, bcd_clocktime, and timespec).	2018-02-12 16:25:56 +00:00
Kirk McKusick	13a025d5d8	Merge biodone_finish() back into biodone(). The primary purpose is to make the order of operations clearer to avoid the race condition that was fixed in r328914. In particular, this commit corrects a similar race that existed in the soft updates callback. Doing some sleuthing through the SVN repository, it appears that bufdone_finish() was added to support XFS: ------------------------------------------------------------------------ r153192 \| rodrigc \| 2005-12-06 19:39:08 -0800 (Tue, 06 Dec 2005) \| 13 lines Changes imported from XFS for FreeBSD project: - add fields to struct buf (needed by XFS) - 3 private fields: b_fsprivate1, b_fsprivate2, b_fsprivate3 - b_pin_count, count of pinned buffer - add new B_MANAGED flag - add breada() function to initiate asynchronous I/O on read-ahead blocks. - add bufdone_finish(), bpin(), bunpin_wait() functions Patches provided by: kan Reviewed by: phk Silence on: arch@ ------------------------------------------------------------------------ It does not appear to ever have been used for anything else. XFS was disconnected in r241607: ------------------------------------------------------------------------ r241607 \| attilio \| 2012-10-16 03:04:00 -0700 (Tue, 16 Oct 2012) \| 5 lines Disconnect non-MPSAFE XFS from the build in preparation for dropping GIANT from VFS. This is not targeted for MFC. ------------------------------------------------------------------------ and removed entirely in r247631: ------------------------------------------------------------------------ r247631 \| attilio \| 2013-03-02 07:33:54 -0800 (Sat, 02 Mar 2013) \| 5 lines Garbage collect XFS bits which are now already completely disconnected from the tree since few months. This is not targeted for MFC. ------------------------------------------------------------------------ Since XFS support is gone, there is no reason to retain biodone_finish(). Suggested by: Warner Losh (imp) Discussed with: cem, kib Tested by: Peter Holm (pho)	2018-02-09 19:50:47 +00:00
Gleb Smirnoff	f7d3578564	Fix boot_pages exhaustion on machines with many domains and cores, where size of UMA zone allocation is greater than page size. In this case zone of zones can not use UMA_MD_SMALL_ALLOC, and we need to postpone switch off of this zone from startup_alloc() until full launch of VM. o Always supply number of VM zones to uma_startup_count(). On machines with UMA_MD_SMALL_ALLOC ignore it completely, unless zsize goes over a page. In the latter case account VM zones for number of allocations from the zone of zones. o Rewrite startup_alloc() so that it will immediately switch off from itself any zone that is already capable of running real alloc. In worst case scenario we may leak a single page here. See comment in uma_startup_count(). o Hardcode call to uma_startup2() into vm_mem_init(). Otherwise some extra SYSINITs, e.g. vm_page_init() may sneak in before. o While here, remove uma_boot_pages_mtx. With recent changes to boot pages calculation, we are guaranteed to use all of the boot_pages in the early single threaded stage. Reported & tested by: mav	2018-02-09 04:45:39 +00:00
Andriy Gapon	b2387aa652	exec_map_first_page: fix an inverse condition introduced in r254138 While the bug itself was serious, as we could either pass a non-busied page to vm_pager_get_pages() or leak a busy page, it could only be triggered under a very rare condition where the page is already inserted into the object, but it is not valid yet. Reviewed by: kib MFC after: 2 weeks	2018-02-07 21:51:59 +00:00
Gleb Smirnoff	5073a08328	Fix three miscalculations in amount of boot pages: o Most of startup zones have struct uma_slab embedded into the slab, so provide macro UMA_SLAB_SPACE and use it instead of UMA_SLAB_SIZE, when calculating how many pages would certain kind of allocations require. Some zones are offpage, so we might have a positive inaccuracy. o The keg for the zone of zones is allocated "dynamically", so we need +1 when calculating amount of pages for kegs. [1] o The zones of zones and zones of kegs have arbitrary alignment of 32, and this also needs to be accounted for. [2] While here, spread more comments and improve diagnostic messages. Reported by: pho [1], jtl [2]	2018-02-07 18:32:51 +00:00
Mark Johnston	1d3a1bcfac	Dequeue wired pages lazily. Previously, wiring a page would cause it to be removed from its page queue. In the common case, unwiring causes it to be enqueued at the tail of that page queue. This change modifies vm_page_wire() to not dequeue the page, thus avoiding the highly contended page queue locks. Instead, vm_page_unwire() takes care of requeuing the page as a single operation, and the page daemon dequeues wired pages as they are encountered during a queue scan to avoid needlessly revisiting them later. For pages in PQ_ACTIVE we do even better, since a requeue is unnecessary. The change improves scalability for some common workloads. For instance, threads wiring pages into the buffer cache no longer need to modify global page queues, and unwiring is usually done by the bufspace thread, so concurrency is not as much of an issue. As another example, many sysctl handlers wire the output buffer to avoid faults on copyout, and since the buffer is likely to be in PQ_ACTIVE, we now entirely avoid modifying the page queue in this case. The change also adds a block comment describing some properties of struct vm_page's reference counters, and the busy lock. Reviewed by: jeff Discussed with: alc, kib MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D11943	2018-02-07 16:57:10 +00:00
Ian Lepore	ce75945d3c	Use const pointers for input data not modified by clock utility functions.	2018-02-06 22:17:01 +00:00
Jeff Roberson	e2068d0bcd	Use per-domain locks for vm page queue free. Move paging control from global to per-domain state. Protect reservations with the free lock from the domain that they belong to. Refactor to make vm domains more of a first class object. Reviewed by: markj, kib, gallatin Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14000	2018-02-06 22:10:07 +00:00
Gleb Smirnoff	ae941b1b4e	Fix boot_pages calculation for machines that don't have UMA_MD_SMALL_ALLOC. o Call uma_startup1() after initializing kmem, vmem and domains. o Include 8 eight VM startup pages into uma_startup_count() calculation. o Account for vmem_startup() and vm_map_startup() preallocating pages. o Account for extra two allocations done by kmem_init() and vmem_create(). o Hardcode the place of execution of vm_radix_reserve_kva(). Using SYSINIT allowed several other SYSINITs to sneak in before it, thus bumping requirement for amount of boot pages.	2018-02-06 22:06:59 +00:00
Gleb Smirnoff	f4bef67c9c	Followup on r302393 by cperciva, improving calculation of boot pages required for UMA startup. o Introduce another stage of UMA startup, which is entered after vm_page_startup() finishes. After this stage we don't yet enable buckets, but we can ask VM for pages. Rename stages to meaningful names while here. New list of stages: BOOT_COLD, BOOT_STRAPPED, BOOT_PAGEALLOC, BOOT_BUCKETS, BOOT_RUNNING. Enabling page alloc earlier allows us to dramatically reduce number of boot pages required. What is more important number of zones becomes consistent across different machines, as no MD allocations are done before the BOOT_PAGEALLOC stage. Now only UMA internal zones actually need to use startup_alloc(), however that may change, so vm_page_startup() provides its need for early zones as argument. o Introduce uma_startup_count() function, to avoid code duplication. The functions calculates sizes of zones zone and kegs zone, and calculates how many pages UMA will need to bootstrap. It counts not only of zone structures, but also of kegs, slabs and hashes. o Hide uma_startup_foo() declarations from public file. o Provide several DIAGNOSTIC printfs on boot_pages usage. o Bugfix: when calculating zone of zones size use (mp_maxid + 1) instead of mp_ncpus. Use resulting number not only in the size argument to zone_ctor() but also as args.size. Reviewed by: imp, gallatin (earlier version) Differential Revision: https://reviews.freebsd.org/D14054	2018-02-06 04:16:00 +00:00
Kirk McKusick	47806d1b93	Occasional cylinder-group check-hash errors were being reported on systems running with a heavy filesystem load. Tracking down this bug was elusive because there were actually two problems. Sometimes the in-memory check hash was wrong and sometimes the check hash computed when doing the read was wrong. The occurrence of either error caused a check-hash mismatch to be reported. The first error was that the check hash in the in-memory cylinder group was incorrect. This error was caused by the following sequence of events: - We read a cylinder-group buffer and the check hash is valid. - We update its cg_time and cg_old_time which makes the in-memory check-hash value invalid but we do not mark the cylinder group dirty. - We do not make any other changes to the cylinder group, so we never mark it dirty, thus do not write it out, and hence never update the incorrect check hash for the in-memory buffer. - Later, the buffer gets freed, but the page with the old incorrect check hash is still in the VM cache. - Later, we read the cylinder group again, and the first page with the old check hash is still in the VM cache, but some other pages are not, so we have to do a read. - The read does not actually get the first page from disk, but rather from the VM cache, resulting in the old check hash in the buffer. - The value computed after doing the read does not match causing the error to be printed. The fix for this problem is to only set cg_time and cg_old_time as the cylinder group is being written to disk. This keeps the in-memory check-hash valid unless the cylinder group has had other modifications which will require it to be written with a new check hash calculated. It also requires that the check hash be recalculated in the in-memory cylinder group when it is marked clean after doing a background write. The second problem was that the check hash computed at the end of the read was incorrect because the calculation of the check hash on completion of the read was being done too soon. - When a read completes we had the following sequence: - bufdone() -- b_ckhashcalc (calculates check hash) -- bufdone_finish() --- vfs_vmio_iodone() (replaces bogus pages with the cached ones) - When we are reading a buffer where one or more pages are already in memory (but not all pages, or we wouldn't be doing the read), the I/O is done with bogus_page mapped in for the pages that exist in the VM cache. This mapping is done to avoid corrupting the cached pages if there is any I/O overrun. The vfs_vmio_iodone() function is responsible for replacing the bogus_page(s) with the cached ones. But we were calculating the check hash before the bogus_page(s) were replaced. Hence, when we were calculating the check hash, we were partly reading from bogus_page, which means we calculated a bad check hash (e.g., because multiple pages have been mapped to bogus_page, so its contents are indeterminate). The second fix is to move the check-hash calculation from bufdone() to bufdone_finish() after the call to vfs_vmio_iodone() so that it computes the check hash over the correct set of pages. With these two changes, the occasional cylinder-group check-hash errors are gone. Submitted by: David Pfitzner <dpfitzner@netflix.com> Reviewed by: kib Tested by: David Pfitzner	2018-02-06 00:19:46 +00:00
John Baldwin	15746ef43a	Ignore relocation tables for non-memory-resident sections. As a followup to r328101, ignore relocation tables for ELF object sections that are not memory resident. For modules loaded by the loader, ignore relocation tables whose associated section was not loaded by the loader (sh_addr is zero). For modules loaded at runtime via kldload(2), ignore relocation tables whose associated section is not marked with SHF_ALLOC. Reported by: Mori Hiroki <yamori813@yahoo.co.jp>, adrian Tested on: mips, mips64 MFC after: 1 month Sponsored by: DARPA / AFRL	2018-02-05 23:35:33 +00:00
John Baldwin	d722231bca	Always give ELF brands a chance to veto a match. If a brand provides a header_supported hook, check it when trying to find a brand based on a matching interpreter as well as in the final loop for the fallback brand. Previously a brand might reject a binary via a header_supported hook in one of the earlier loops, but still be chosen by one of these later loops. Reviewed by: kib Obtained from: CheriBSD MFC after: 2 weeks Sponsored by: DARPA / AFRL Differential Revision: https://reviews.freebsd.org/D13945	2018-02-05 23:27:42 +00:00
Brooks Davis	0a2c60c371	Reduce duplication in extattr_*_(file\|link) syscalls. Reviewed by: rwatson Obtained from: CheriBSD Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D14173	2018-02-05 19:06:34 +00:00
Brooks Davis	02bc058f79	ANSIfy syscall implementations. Reviewed by: rwatson Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D14172	2018-02-05 18:58:55 +00:00
Brooks Davis	0fd25723bc	Add kern.ipc.{msqids,semsegs,sema} sysctls for FreeBSD32. Stop leaking kernel pointers though theses sysctls and make sure that the padding in the structures is zeroed on allocation to avoid other leaks. Reviewed by: gordon, kib Obtained from: CheriBSD MFC after: 1 week Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D13459	2018-02-02 18:03:12 +00:00
Hans Petter Selasky	64282a1274	Slightly bump the maximum OID path for loading tunable SYSCTLs. Coming updates to the mlx5en(4) driver will require this. MFC after: 1 week Sponsored by: Mellanox Technologies	2018-02-02 12:42:46 +00:00
Kirk McKusick	5a35a04255	One of the vnode fields listed by vn_printf is the union of pointers whose type depends on the type of vnode. Correct vn_printf so that it correctly identifies the name of the pointer that it is printing. Submitted by: Andreas Longwitz <longwitz at incore.de> MFC after: 1 week	2018-01-31 22:49:50 +00:00
Ed Maste	37880089ac	makesyscalls: permit a range of syscall numbers for UNIMPL Some ABIs have large gaps in syscall numbers. Allow gaps to be filled as ranges of UNIMPL, with an entry like: 248-1023 AUE_NULL UNIMPL unimplemented Reviewed by: jhb, gnn Sponsored by: Turing Robotic Industries Inc. Differential Revision: https://reviews.freebsd.org/D14122	2018-01-30 18:29:38 +00:00
Bryan Drewery	595109196a	Don't use an .OBJDIR for 'make sysent'. Reported by: emaste, jhb Sponsored by: Dell EMC	2018-01-29 19:14:15 +00:00
Li-Wen Hsu	99178c2f09	Fix LINT build after r328508, add forgotten part in format string Reviewed by: delphij Differential Revision: https://reviews.freebsd.org/D14089	2018-01-29 02:29:08 +00:00
Warner Losh	7faed6e3e9	Create deprecation management functions. gone_in(majar, msg); If we're running in FreeBSD major, tell the user this code may be deleted soon. If we're running in FreeBSD major - 1, the the user is deprecated and will be gone in major. Otherwise say nothing. gone_in_dev(dev, major, msg) Just like gone_in, except use device_printf. New tunable / sysctl debug.oboslete_panic: 0 - don't panic, 1 - panic in major or newer , 2 - panic in major - 1 or newer default: 0 if NO_OBSOLETE_CODE is defined, then both of these turn into compile time errors when building for major. Add options NO_OBSOLETE_CODE to kernel build system. This lets us tag code that's going away so users know it will be gone, as well as automatically manage things. Differential Review: https://reviews.freebsd.org/D13818	2018-01-29 00:14:39 +00:00
Warner Losh	55cf33a584	Add the DF_SUSPENDED flag to flags that are printed.	2018-01-28 05:13:17 +00:00
Kirk McKusick	4d93711d80	For many years the message "fsync: giving up on dirty" has occationally appeared on UFS/FFS filesystems. In some cases it was promptly followed by a panic of "softdep_deallocate_dependencies: dangling deps". This fix should eliminate both of these occurences. Submitted by: Andreas Longwitz <longwitz at incore.de> Reviewed by: kib Tested by: Peter Holm (pho) PR: 225423 MFC after: 1 week	2018-01-26 18:17:11 +00:00
Li-Wen Hsu	5a70796a71	Fix build for architectures where size_t is not unsigned long Reviewed by: cem Differential Revision: https://reviews.freebsd.org/D14045	2018-01-25 06:37:14 +00:00
Conrad Meyer	bd555da94b	malloc(9): Change nominal size to size_t to match standard C No functional change -- size_t matches unsigned long on all platforms. Reported by: bde Discussed with: jhb Sponsored by: Dell EMC Isilon	2018-01-24 19:37:18 +00:00
Wojciech Macek	072c8a3b39	Reverting r328320	2018-01-24 13:57:01 +00:00
Wojciech Macek	4d249cdd4c	ULE: provide defaults to ts_cpu Fix a bug when the system has no CPU 0. When created, threads were implicitly assigned to CPU 0. This had no practical effect since a real CPU was chosen immediately by the scheduler. However, on systems without a CPU 0, sched_ule attempted to access the scheduler queue of the "old" CPU when assigned the initial choice of the old one. This caused an attempt to use illegal memory and a crash (or, more usually, a deadlock). Fix this by assigned new threads to the BSP explicitly and add some asserts to see that this problem does not recur. Authored by: Nathan Whitehorn <nwhitehorn@freebsd.org> Submitted by: Wojciech Macek <wma@semihalf.com> Obtained from: Semihalf Differential revision: https://reviews.freebsd.org/D13932	2018-01-24 07:54:05 +00:00
Pedro F. Giffuni	44c514b142	Forgot to sort here in r328238.	2018-01-22 02:26:10 +00:00
Pedro F. Giffuni	d821d36419	Unsign some values related to allocation. When allocating memory through malloc(9), we always expect the amount of memory requested to be unsigned as a negative value would either stand for an error or an overflow. Unsign some values, found when considering the use of mallocarray(9), to avoid unnecessary casting. Also consider that indexes should be of at least the same size/type as the upper limit they pretend to index. MFC after: 3 weeks	2018-01-22 02:08:10 +00:00
Pedro F. Giffuni	ac2fffa4b7	Revert r327828, r327949, r327953, r328016-r328026, r328041: Uses of mallocarray(9). The use of mallocarray(9) has rocketed the required swap to build FreeBSD. This is likely caused by the allocation size attributes which put extra pressure on the compiler. Given that most of these checks are superfluous we have to choose better where to use mallocarray(9). We still have more uses of mallocarray(9) but hopefully this is enough to bring swap usage to a reasonable level. Reported by: wosch PR: 225197	2018-01-21 15:42:36 +00:00
Nathan Whitehorn	9a8196ce19	Remove SFBUF_OPTIONAL_DIRECT_MAP and such hacks, replacing them across the kernel by PHYS_TO_DMAP() as previously present on amd64, arm64, riscv, and powerpc64. This introduces a new MI macro (PMAP_HAS_DMAP) that can be evaluated at runtime to determine if the architecture has a direct map; if it does not (or does) unconditionally and PMAP_HAS_DMAP is either 0 or 1, the compiler can remove the conditional logic. As part of this, implement PHYS_TO_DMAP() on sparc64 and mips64, which had similar things but spelled differently. 32-bit MIPS has a partial direct-map that maps poorly to this concept and is unchanged. Reviewed by: kib Suggestions from: marius, alc, kib Runtime tested on: amd64, powerpc64, powerpc, mips64	2018-01-19 17:46:31 +00:00
Andriy Gapon	3e4f610dad	correct read-ahead calculations in vfs_bio_getpages Previously the calculations were done as if the requested region ended at the start of the last requested page, not its end. The problem as actually quite minor as it affected only stats and page prefaulting, not the actual page data, and only with specific parameters. Reviewed by: kib (previous version) MFC after: 2 weeks	2018-01-18 12:59:04 +00:00
Wojciech Macek	5b3e8b0725	KDB: restart only CPUs stopped by KDB There is a case when not all CPUs went online. In that situation, restart only APs which were operational before entering KDB. Created by: Wojciech Macek <wma@semihalf.com> Obtained from: Semihalf Reviewed by: nwhitehorn Differential revision: https://reviews.freebsd.org/D13949 Sponsored by: QCM Technologies	2018-01-18 07:38:54 +00:00
John Baldwin	58c4aee0d7	Require the SHF_ALLOC flag for program sections from kernel object modules. ELF object files can contain program sections which are not supposed to be loaded into memory (e.g. .comment). Normally the static linker uses these flags to decide which sections are allocated to loadable program segments in ELF binaries and shared objects (including kernels on all architectures and kernel modules on architectures other than amd64). Mapping ELF object files (such as amd64 kernel modules) into memory directly is a bit of a grey area. ELF object files are intended to be used as inputs to the static linker. As a result, there is not a standardized definition for what the memory layout of an ELF object should be (none of the section headers have valid virtual memory addresses for example). The kernel and loader were not checking the SHF_ALLOC flag but loading any program sections with certain types such as SHT_PROGBITS. As a result, the kernel and loader would load into RAM some sections that weren't marked with SHF_ALLOC such as .comment that are not loaded into RAM for kernel modules on other architectures (which are implemented as ELF shared objects). Aside from possibly requiring slightly more RAM to hold a kernel module this does not affect runtime correctness as the kernel relocates symbols based on the layout it uses. Debuggers such as gdb and lldb do not extract symbol tables from a running process or kernel. Instead, they replicate the memory layout of ELF executables and shared objects and use that to construct their own symbol tables. For executables and shared objects this works fine. For ELF objects the current logic in kgdb (and probably lldb based on a simple reading) assumes that only sections with SHF_ALLOC are memory resident when constructing a memory layout. If the debugger constructs a different memory layout than the kernel, then it will compute different addresses for symbols causing symbols in the debugger to appear to have the wrong values (though the kernel itself is working fine). The current port of mdb does not check SHF_ALLOC as it replicates the kernel's logic in its existing kernel support. The bfd linker sorts the sections in ELF object files such that all of the allocated sections (sections with SHF_ALLOCATED) are placed first followed by unallocated sections. As a result, when kgdb composed a memory layout using only the allocated sections, this layout happened to match the layout used by the kernel and loader. The lld linker does not sort the sections in ELF object files and mixed allocated and unallocated sections. This resulted in kgdb composing a different memory layout than the kernel and loader. We could either patch kgdb (and possibly in the future lldb) to use custom handling when generating memory layouts for kernel modules that are ELF objects, or we could change the kernel and loader to check SHF_ALLOCATED. I chose the latter as I feel we shouldn't be loading things into RAM that the module won't use. This should mostly be a NOP when linking with bfd but will allow the existing kgdb to work with amd64 kernel modules linked with lld. Note that we only require SHF_ALLOC for "program" sections for types like SHT_PROGBITS and SHT_NOBITS. Other section types such as symbol tables, string tables, and relocations must also be loaded and are not marked with SHF_ALLOC. Reported by: np Reviewed by: kib, emaste MFC after: 1 month Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D13926	2018-01-17 22:51:59 +00:00

... 2 3 4 5 6 ...

16150 Commits