freebsd-dev

Author	SHA1	Message	Date
Matt Macy	8dcbd0eae6	epoch(9): always set inited in epoch_init - set inited in the !usedomains case Reported by: jhibbits Approved by: sbruno	2018-05-11 18:37:14 +00:00
Matt Macy	4aa302dfc9	epoch(9): callback task fixes - initialize the pcpu STAILQ in the NUMA case - don't enqueue the callback task if there isn't sufficient work to be done Reported by: pho@ Approved by: sbruno@	2018-05-11 08:16:56 +00:00
Mateusz Guzik	85c1b3c1cb	rmlock: partially depessimize lock/unlock fastpath Previusly the slow path was folded in and partially jumped over in the common case.	2018-05-11 06:59:54 +00:00
Matt Macy	b2cb28963b	epoch(9): fix priority handling, make callback lists pcpu, and other fixes - Lend priority to preempted threads in epoch_wait to handle the case in which we've had priority lent to us. Previously we borrowed the priority of the lowest priority preempted thread. (pointed out by mjg@) - Don't attempt allocate memory per-domain on powerpc, we don't currently handle empty sockets (as is the case on jhibbits Talos' board). - Handle deferred callbacks as pcpu lists and poll the lists periodically. Currently the interval is 1/hz. - Drop the thread lock when adaptive spinning. Holding the lock starves other threads and can even lead to lockups. - Keep a generation count pcpu so that we don't keep spining if a thread has left and re-entered an epoch section. - Actually removed the callback from the callback list so that we don't double free. Sigh ... Approved by: sbruno@	2018-05-11 04:54:12 +00:00
Matt Macy	06bf2a6aef	Add simple preempt safe epoch API Read locking is over used in the kernel to guarantee liveness. This API makes it easy to provide livenes guarantees without atomics. Includes epoch_test kernel module to stress test the API. Documentation will follow initial use case. Test case and improvements to preemption handling in response to discussion with mjg@ Reviewed by: imp@, shurd@ Approved by: sbruno@	2018-05-10 17:55:24 +00:00
Andrew Gallatin	d5cdcc3a06	Fix the build after r333457 In r333457, the arguments to kern_pwritev() were accidentally re-ordered as part of ANSIfication, breaking the build.	2018-05-10 13:19:42 +00:00
Ed Maste	cc3c9df80f	ANSIfy sys_generic.c	2018-05-10 11:36:16 +00:00
Matt Macy	36688f706e	Add taskqgroup_config_gtask_deinit to support teardown after taskqgroup_config_gtask_init. Approved by: sbruno	2018-05-09 18:51:35 +00:00
Matt Macy	cbd92ce62e	Eliminate the overhead of gratuitous repeated reinitialization of cap_rights - Add macros to allow preinitialization of cap_rights_t. - Convert most commonly used code paths to use preinitialized cap_rights_t. A 3.6% speedup in fstat was measured with this change. Reported by: mjg Reviewed by: oshogbo Approved by: sbruno MFC after: 1 month	2018-05-09 18:47:24 +00:00
Konstantin Belousov	55c9d75e6b	Avoid calls to bzero() before ireloc. Evaluate cpu_stdext_feature early to have moved link_elf_ireloc() see correct flags, most important is SMAP. Tested by: mjg Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D15367	2018-05-09 14:39:24 +00:00
Matt Macy	ad738f3791	Reduce overhead of ktrace checks in the common case. KTRPOINT() checks both if we are tracing _and_ if we are recursing within ktrace. The second condition is only ever executed if ktrace is actually enabled. This change moves the check out of the hot path in to the functions themselves. Discussed with mjg@ Reported by: mjg@ Approved by: sbruno@	2018-05-09 00:00:47 +00:00
Mateusz Guzik	2824088536	Inlined sched_userret. The tested condition is rarely true and it induces a function call on each return to userspace. Bumps getuid rate by about 1% on Broadwell.	2018-05-07 23:36:16 +00:00
Mateusz Guzik	75e9b455a9	Change trap_enotcap to bool and annotate with __read_frequently It is read on each return to user space.	2018-05-07 23:10:12 +00:00
Mateusz Guzik	79ca7cbf09	Avoid calls to syscall_thread_enter/exit for statically defined syscalls The entire mechanism is rarely used and is quite not performant due to atomci ops on the syscall table. It also has added overhead for completely unrelated syscalls. Reduce it by avoiding the func calls if possible (which consistutes vast majority of cases). Provides about 3% syscall rate speed up for getuid on Broadwell.	2018-05-07 22:29:32 +00:00
Warner Losh	ad7142757b	Add device_quiet_children() and device_has_quiet_children() If you add a child to a device that has quiet children, we'll automatically set the quiet flag on the children, and its children. This is indended for things like CPU that have a large amount of repetition in booting that adds nothing.	2018-05-07 21:09:08 +00:00
Andrew Gallatin	e7bd0750af	Boost thread priority while changing CPU frequency Boost the priority of user-space threads when they set their affinity to a core to adjust its frequency. This avoids a situation where a CPU bound kernel thread with the same affinity is running on a down-clocked core, and will "block" powerd from up-clocking the core until the kernel thread yields. This can lead to poor perfomance, and to things potentially getting stuck on Giant. Reviewed by: kib (imp reviewed earlier version) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D15246	2018-05-07 15:24:03 +00:00
Mark Johnston	bd92e6b6f5	Refactor some of the MI kernel dump code in preparation for netdump. - Add clear_dumper() to complement set_dumper(). - Drain netdump's preallocated mbuf pool when clearing the dumper. - Don't do bounds checking for dumpers with mediasize 0. - Add dumper callbacks for initialization for writing out headers. Reviewed by: sbruno MFC after: 1 month Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D15252	2018-05-06 00:22:38 +00:00
Mark Johnston	5475ca5aca	Add an mbuf allocator for netdump. The aim is to permit mbuf allocations after a panic without calling into the page allocator, without imposing any runtime overhead during regular operation of the system, and without modifying driver code. The approach taken is to preallocate a number of mbufs and clusters, storing them in linked lists, and using the lists to back some UMA cache zones. At panic time, the mbuf and cluster zone pointers are overwritten with those of the cache zones so that the mbuf allocator returns preallocated items. Using this scheme, drivers which cache mbuf zone pointers from m_getzone() require special handling when implementing netdump support. Reviewed by: cem (earlier version), julian, sbruno MFC after: 1 month Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D15251	2018-05-06 00:19:48 +00:00
Mark Johnston	c2ba2d1b0e	Style. MFC after: 3 days	2018-05-06 00:11:30 +00:00
Andriy Gapon	bd3afae0ca	for bus suspend, detach and shutdown iterate children in reverse order For most buses all children are equal, so the order does not matter. Other buses, such as acpi, carefully order their child devices to express implicit dependencies between them. For such buses it is safer to bring down devices in the reverse order. I believe that this is the reason why hpet_suspend had to be disabled. Some drivers depend on a working event timer until they are suspended. But previously we would suspend hpet very early. I tested this change by makinbg hpet_suspend actually stop HPET timers and tested that too. Note that this change is not a complete solution as it does not take into account bus passes. A better approach would be to track the actual attach order of the devices and to use the reverse of that. Reviewed by: imp, mav MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D15291	2018-05-05 05:19:32 +00:00
Mateusz Guzik	5ec2c93667	tc: bcopy -> memcpy	2018-05-04 22:48:10 +00:00
Jamie Gritton	0e5c6bd436	Make it easier for filesystems to count themselves as jail-enabled, by doing most of the work in a new function prison_add_vfs in kern_jail.c Now a jail-enabled filesystem need only mark itself with VFCF_JAIL, and the rest is taken care of. This includes adding a jail parameter like allow.mount.foofs, and a sysctl like security.jail.mount_foofs_allowed. Both of these used to be a static list of known filesystems, with predefined permission bits. Reviewed by: kib Differential Revision: D14681	2018-05-04 20:54:27 +00:00
Mark Johnston	1b5c869d64	Fix some races introduced in r332974. With r332974, when performing a synchronized access of a page's "queue" field, one must first check whether the page is logically dequeued. If so, then the page lock does not prevent the page from being removed from its page queue. Intoduce vm_page_queue(), which returns the page's logical queue index. In some cases, direct access to the "queue" field is still required, but such accesses should be confined to sys/vm. Reported and tested by: pho Reviewed by: kib Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D15280	2018-05-04 17:17:30 +00:00
Matt Macy	748ff486b0	`dup1_processes -t 96 -s 5` on a dual 8160 x dup_before + dup_after +------------------------------------------------------------+ \| x + \| \|x x x x ++ ++\| \| \|____AM___\| \|AM\|\| +------------------------------------------------------------+ N Min Max Median Avg Stddev x 5 1.514954e+08 1.5230351e+08 1.5206157e+08 1.5199371e+08 341205.71 + 5 1.5494336e+08 1.5519569e+08 1.5511982e+08 1.5508323e+08 96232.829 Difference at 95.0% confidence 3.08952e+06 +/- 365604 2.03266% +/- 0.245071% (Student's t, pooled s = 250681) Reported by: mjg@ MFC after: 1 week	2018-05-04 06:51:01 +00:00
Konstantin Belousov	7035cf14ee	Implement support for ifuncs in the kernel linker. Required MD bits are only provided for x86. Reviewed by: jhb (previous version, as part of the larger patch) Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D13838	2018-05-03 21:37:46 +00:00
Stephen Hurd	f3e1324b41	Separate list manipulation locking from state change in multicast Multicast incorrectly calls in to drivers with a mutex held causing drivers to have to go through all manner of contortions to use a non sleepable lock. Serialize multicast updates instead. Submitted by: mmacy <mmacy@mattmacy.io> Reviewed by: shurd, sbruno Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D14969	2018-05-02 19:36:29 +00:00
Mark Johnston	20f85b1ddd	Print the dump progress indicator after calling dump_start(). Dumpers may wish to print messages from an initialization hook; this change ensures that such messages aren't mixed with output from the generic dump code. MFC after: 1 week	2018-05-01 17:32:43 +00:00
Nathan Whitehorn	ee900504cf	Report the kernel base address properly in kldstat when using PowerPC kernels loaded at addresses other than their link address.	2018-05-01 04:06:59 +00:00
Ed Maste	2216c6933c	Disable connectat/bindat with AT_FDCWD in capmode Previously it was possible to connect a socket (which had the CAP_CONNECT right) by calling "connectat(AT_FDCWD, ...)" even in capabilties mode. This combination should be treated the same as a call to connect (i.e. forbidden in capabilities mode). Similarly for bindat. Disable connectat/bindat with AT_FDCWD in capabilities mode, fix up the documentation and add tests. PR: 222632 Submitted by: Jan Kokemüller <jan.kokemueller@gmail.com> Reviewed by: Domagoj Stolfa MFC after: 1 week Relnotes: Yes Differential Revision: https://reviews.freebsd.org/D15221	2018-04-30 17:31:06 +00:00
Mateusz Guzik	9d68f7741f	systrace: track it like sdt probes While here predict false. Note the code is wrong (regardless of this change). Dereference of the pointer can race with module unload. A fix would set the probe to a nop stub instead of NULL.	2018-04-27 15:16:34 +00:00
Emmanuel Vadot	ee710ecf32	clk: Put the sysctls under hw.clock instead of clock This is more consistant with hw.regulator and other hardware related sysctls.	2018-04-27 00:12:00 +00:00
Mark Johnston	5cd29d0f3c	Improve VM page queue scalability. Currently both the page lock and a page queue lock must be held in order to enqueue, dequeue or requeue a page in a given page queue. The queue locks are a scalability bottleneck in many workloads. This change reduces page queue lock contention by batching queue operations. To detangle the page and page queue locks, per-CPU batch queues are used to reference pages with pending queue operations. The requested operation is encoded in the page's aflags field with the page lock held, after which the page is enqueued for a deferred batch operation. Page queue scans are similarly optimized to minimize the amount of work performed with a page queue lock held. Reviewed by: kib, jeff (previous versions) Tested by: pho Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D14893	2018-04-24 21:15:54 +00:00
Sean Bruno	7875017ca9	Revert r332894 at the request of the submitter. Submitted by: Johannes Lundberg <johalun0_gmail.com> Sponsored by: Limelight Networks	2018-04-24 19:55:12 +00:00
Conrad Meyer	65df124845	Do not totally silence suppressed secondary kasserts unless debug.kassert.do_log is disabled To totally silence and ignore secondary kassert violations after a primary panic, set debug.kassert.do_log=0 and debug.kassert.suppress_in_panic=1. Additional assertion warnings shouldn't block core dump and may alert the developer to another erroneous condition. Secondary stack traces may be printed, identically to the unsuppressed case where panic() is reentered -- controlled via debug.trace_all_panics. Sponsored by: Dell EMC Isilon	2018-04-24 19:10:51 +00:00
Conrad Meyer	07aa6ea677	Fix debug.kassert.do_log description text This has been an (incorrect) copy-paste duplicate of debug.kassert.warn_only since it was originally committed in r243980. Sponsored by: Dell EMC Isilon	2018-04-24 18:59:40 +00:00
Conrad Meyer	ad1fc31570	panic: Optionally, trace secondary panics To diagnose and fix secondary panics, it is useful to have a stack trace. When panic tracing is enabled, optionally trace secondary panics as well. The option is configured with the tunable/sysctl debug.trace_all_panics. (The original concern that inspired only tracing the primary panic was likely that the secondary trace may scroll the original panic message or trace off the screen. This is less of a concern for serial consoles with logging. Not everything has a serial console, though, so the behavior is optional.) Discussed with: jhb Sponsored by: Dell EMC Isilon	2018-04-24 18:54:20 +00:00
Jonathan T. Looney	18959b695d	Update r332860 by changing the default from suppressing post-panic assertions to not suppressing post-panic assertions. There are some post-panic assertions that are valuable and we shouldn't default to disabling them. However, when a user trips over them, the user can still adjust the tunable/sysctl to suppress them temporarily to get conduct troubleshooting (e.g. get a core dump). Reported by: cem, markj	2018-04-24 18:47:35 +00:00
Conrad Meyer	b543c98cab	lockmgr: Add missed neutering during panic r313683 introduced new lockmgr APIs that missed the panic-time neutering present in the rest of our locks. Correct that by adding the usual check. Additionally, move the __lockmgr_args neutering above the assertions at the top of the function. Drop the interlock unlock because we shouldn't have an unneutered interlock either. No point trying to unlock it. PR: 227749 Reported by: jtl Sponsored by: Dell EMC Isilon	2018-04-24 18:41:14 +00:00
Mateusz Guzik	d357c16adc	lockf: change the owner hash from pid to vnode-based This adds a bit missed due to the patch split, see r332882 Tested by: pho	2018-04-24 06:10:36 +00:00
Mateusz Guzik	7cd794214a	dtrace: depessimize dtmalloc when dtrace is active Each malloc/free was testing dtrace_malloc_enabled and forcing extra reads from the malloc type struct to see if perhaps a dtmalloc probe was on. Treat it like lockstat and sdt: have a global bolean.	2018-04-24 01:06:20 +00:00
Mateusz Guzik	4c5209cb21	lockstat: track lockstat just like sdt probes In particular flip the frequently tested var to bool.	2018-04-24 01:04:10 +00:00
Mateusz Guzik	c9e05ccd62	malloc: stop reading the subzone if MALLOC_DEBUG_MAXZONES == 1 (the default) malloc was showing at the top of profile during while running microbenchmarks. #define DTMALLOC_PROBE_MAX 2 struct malloc_type_internal { uint32_t mti_probes[DTMALLOC_PROBE_MAX]; u_char mti_zone; struct malloc_type_stats mti_stats[MAXCPU]; }; Reading mti_zone it wastes a cacheline to hold mti_probes + mti_zone (which we know is 0) + part of malloc stats of the first cpu which on top induces false-sharing. In particular will-it-scale lock1_processes -t 128 -s 10: before: average:45879692 after: average:51655596 Note the counters can be padded but the right fix is to move them to counter(9), leaving the struct read-only after creation (modulo dtrace probes).	2018-04-23 22:28:49 +00:00
Sean Bruno	7b7796eea5	Load balance sockets with new SO_REUSEPORT_LB option This patch adds a new socket option, SO_REUSEPORT_LB, which allow multiple programs or threads to bind to the same port and incoming connections will be load balanced using a hash function. Most of the code was copied from a similar patch for DragonflyBSD. However, in DragonflyBSD, load balancing is a global on/off setting and can not be set per socket. This patch allows for simultaneous use of both the current SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system. Required changes to structures Globally change so_options from 16 to 32 bit value to allow for more options. Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets. Limitations As DragonflyBSD, a load balance group is limited to 256 pcbs (256 programs or threads sharing the same socket). Submitted by: Johannes Lundberg <johanlun0@gmail.com> Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D11003	2018-04-23 19:51:00 +00:00
Mateusz Guzik	833dc05a6e	lockf: add per-chain locks to the owner hash This combined with previous changes significantly depessimizes the behaviour under contentnion. In particular the lock1_processes test (locking/unlocking separate files) from the will-it-scale suite was executed with 128 concurrency on a 4-socket Broadwell with 128 hardware threads. Operations/second (lock+unlock) go from ~750000 to ~45000000 (6000%) For reference single-process is ~1680000 (i.e. on stock kernel the resulting perf is less than half of the single-threaded run), Note this still does not really scale all that well as the locks were just bolted on top of the current implementation. Significant room for improvement is still here. In particular the top performance fluctuates depending on the extent of false sharing in given run (which extends beyond the file). Added chain+lock pairs were not padded w.r.t. cacheline size. One big ticket item is the hash used for spreading threads: it used to be the process pid (which basically serialized all threaded ops). Temporarily the vnode addr was slapped in instead. Tested by: pho	2018-04-23 08:23:10 +00:00
Mateusz Guzik	63286976b5	lockf: skip locking the graph if not necessary (common case) Tested by: pho	2018-04-23 07:54:02 +00:00
Mateusz Guzik	717df0b0e8	lockf: perform wakeup onlly when there is anybody waiting Tested by: pho	2018-04-23 07:52:56 +00:00
Mateusz Guzik	c72ead2815	lockf: skip the hard work in lf_purgelocks if possible Tested by: pho	2018-04-23 07:52:10 +00:00
Mateusz Guzik	0d3323f557	lockf: free state only when recycling the vnode This avoids malloc/free cycles when locking/unlocking the vnode when nobody is contending. Tested by: pho	2018-04-23 07:51:19 +00:00
Tijl Coosemans	7dfbbc613b	Make bufdaemon and bufspacedaemon use kthread_suspend_check instead of kproc_suspend_check. In r329612 bufspacedaemon was turned into a thread of the bufdaemon process causing both to call kproc_suspend_check with the same proc argument and that function contains the following while loop: while (SIGISMEMBER(p->p_siglist, SIGSTOP)) { wakeup(&p->p_siglist); msleep(&p->p_siglist, &p->p_mtx, PPAUSE, "kpsusp", 0); } So one thread wakes up the other and the other wakes up the first again, locking up UP machines on shutdown. Also register the shutdown handlers with SHUTDOWN_PRI_LAST + 100 so they run after the syncer has shutdown, because the syncer can cause a situation where bufdaemon help is needed to proceed. PR: 227404 Reviewed by: kib Tested by: cy, rmacklem	2018-04-22 16:05:29 +00:00
Mateusz Guzik	7d853f62bf	lockf: slightly depessimize 1. check if P_ADVLOCK is already set and if so, don't lock to set it (stolen from DragonFly) 2. when trying for fast path unlock, check that we are doing unlock first instead of taking the interlock for no reason (e.g. if we want to lock). whilere make it more likely that falling fast path will not take the interlock either by checking for state Note the code is severely pessimized both single- and multithreaded.	2018-04-22 09:30:07 +00:00

1 2 3 4 5 ...

15996 Commits