Commit Graph

15996 Commits

Author SHA1 Message Date
Matt Macy
8dcbd0eae6 epoch(9): always set inited in epoch_init
- set inited in the !usedomains case

Reported by:	jhibbits
Approved by:	sbruno
2018-05-11 18:37:14 +00:00
Matt Macy
4aa302dfc9 epoch(9): callback task fixes
- initialize the pcpu STAILQ in the NUMA case
- don't enqueue the callback task if there isn't sufficient work to be done

Reported by:	pho@
Approved by:	sbruno@
2018-05-11 08:16:56 +00:00
Mateusz Guzik
85c1b3c1cb rmlock: partially depessimize lock/unlock fastpath
Previusly the slow path was folded in and partially jumped over in the
common case.
2018-05-11 06:59:54 +00:00
Matt Macy
b2cb28963b epoch(9): fix priority handling, make callback lists pcpu, and other fixes
- Lend priority to preempted threads in epoch_wait to handle the case
  in which we've had priority lent to us. Previously we borrowed the
  priority of the lowest priority preempted thread. (pointed out by mjg@)

- Don't attempt allocate memory per-domain on powerpc, we don't currently
  handle empty sockets (as is the case on jhibbits Talos' board).

- Handle deferred callbacks as pcpu lists and poll the lists periodically.
  Currently the interval is 1/hz.

- Drop the thread lock when adaptive spinning. Holding the lock starves
  other threads and can even lead to lockups.

- Keep a generation count pcpu so that we don't keep spining if a thread
  has left and re-entered an epoch section.

- Actually removed the callback from the callback list so that we don't
  double free. Sigh ...

Approved by:	sbruno@
2018-05-11 04:54:12 +00:00
Matt Macy
06bf2a6aef Add simple preempt safe epoch API
Read locking is over used in the kernel to guarantee liveness. This API makes
it easy to provide livenes guarantees without atomics.

Includes epoch_test kernel module to stress test the API.

Documentation will follow initial use case.

Test case and improvements to preemption handling in response to discussion
with mjg@

Reviewed by:	imp@, shurd@
Approved by:	sbruno@
2018-05-10 17:55:24 +00:00
Andrew Gallatin
d5cdcc3a06 Fix the build after r333457
In r333457, the arguments to kern_pwritev() were accidentally
re-ordered as part of ANSIfication, breaking the build.
2018-05-10 13:19:42 +00:00
Ed Maste
cc3c9df80f ANSIfy sys_generic.c 2018-05-10 11:36:16 +00:00
Matt Macy
36688f706e Add taskqgroup_config_gtask_deinit to support teardown after
taskqgroup_config_gtask_init.

Approved by:	sbruno
2018-05-09 18:51:35 +00:00
Matt Macy
cbd92ce62e Eliminate the overhead of gratuitous repeated reinitialization of cap_rights
- Add macros to allow preinitialization of cap_rights_t.

- Convert most commonly used code paths to use preinitialized cap_rights_t.
  A 3.6% speedup in fstat was measured with this change.

Reported by:	mjg
Reviewed by:	oshogbo
Approved by:	sbruno
MFC after:	1 month
2018-05-09 18:47:24 +00:00
Konstantin Belousov
55c9d75e6b Avoid calls to bzero() before ireloc.
Evaluate cpu_stdext_feature early to have moved link_elf_ireloc() see
correct flags, most important is SMAP.

Tested by:	mjg
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D15367
2018-05-09 14:39:24 +00:00
Matt Macy
ad738f3791 Reduce overhead of ktrace checks in the common case.
KTRPOINT() checks both if we are tracing _and_ if we are recursing within
ktrace. The second condition is only ever executed if ktrace is actually
enabled. This change moves the check out of the hot path in to the functions
themselves.

Discussed with mjg@

Reported by:	mjg@
Approved by:	sbruno@
2018-05-09 00:00:47 +00:00
Mateusz Guzik
2824088536 Inlined sched_userret.
The tested condition is rarely true and it induces a function call
on each return to userspace.

Bumps getuid rate by about 1% on Broadwell.
2018-05-07 23:36:16 +00:00
Mateusz Guzik
75e9b455a9 Change trap_enotcap to bool and annotate with __read_frequently
It is read on each return to user space.
2018-05-07 23:10:12 +00:00
Mateusz Guzik
79ca7cbf09 Avoid calls to syscall_thread_enter/exit for statically defined syscalls
The entire mechanism is rarely used and is quite not performant due to
atomci ops on the syscall table. It also has added overhead for completely
unrelated syscalls.

Reduce it by avoiding the func calls if possible (which consistutes vast
majority of cases).

Provides about 3% syscall rate speed up for getuid on Broadwell.
2018-05-07 22:29:32 +00:00
Warner Losh
ad7142757b Add device_quiet_children() and device_has_quiet_children()
If you add a child to a device that has quiet children, we'll
automatically set the quiet flag on the children, and its
children.

This is indended for things like CPU that have a large amount of
repetition in booting that adds nothing.
2018-05-07 21:09:08 +00:00
Andrew Gallatin
e7bd0750af Boost thread priority while changing CPU frequency
Boost the priority of user-space threads when they set
their affinity to a core to adjust its frequency.   This avoids a situation
where a CPU bound kernel thread with the same affinity is running on a
down-clocked core, and will "block" powerd from up-clocking the core
until the kernel thread yields.   This can lead to poor perfomance,
and to things potentially getting stuck on Giant.

Reviewed by:	kib (imp reviewed earlier version)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D15246
2018-05-07 15:24:03 +00:00
Mark Johnston
bd92e6b6f5 Refactor some of the MI kernel dump code in preparation for netdump.
- Add clear_dumper() to complement set_dumper().
- Drain netdump's preallocated mbuf pool when clearing the dumper.
- Don't do bounds checking for dumpers with mediasize 0.
- Add dumper callbacks for initialization for writing out headers.

Reviewed by:	sbruno
MFC after:	1 month
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D15252
2018-05-06 00:22:38 +00:00
Mark Johnston
5475ca5aca Add an mbuf allocator for netdump.
The aim is to permit mbuf allocations after a panic without calling into
the page allocator, without imposing any runtime overhead during regular
operation of the system, and without modifying driver code. The approach
taken is to preallocate a number of mbufs and clusters, storing them
in linked lists, and using the lists to back some UMA cache zones. At
panic time, the mbuf and cluster zone pointers are overwritten with
those of the cache zones so that the mbuf allocator returns
preallocated items.

Using this scheme, drivers which cache mbuf zone pointers from
m_getzone() require special handling when implementing netdump support.

Reviewed by:	cem (earlier version), julian, sbruno
MFC after:	1 month
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D15251
2018-05-06 00:19:48 +00:00
Mark Johnston
c2ba2d1b0e Style.
MFC after:	3 days
2018-05-06 00:11:30 +00:00
Andriy Gapon
bd3afae0ca for bus suspend, detach and shutdown iterate children in reverse order
For most buses all children are equal, so the order does not matter.
Other buses, such as acpi, carefully order their child devices to
express implicit dependencies between them.  For such buses it is safer
to bring down devices in the reverse order.

I believe that this is the reason why hpet_suspend had to be disabled.
Some drivers depend on a working event timer until they are suspended.
But previously we would suspend hpet very early.

I tested this change by makinbg hpet_suspend actually stop HPET timers
and tested that too.

Note that this change is not a complete solution as it does not take
into account bus passes.
A better approach would be to track the actual attach order of the
devices and to use the reverse of that.

Reviewed by:	imp, mav
MFC after:	3 weeks
Differential Revision: https://reviews.freebsd.org/D15291
2018-05-05 05:19:32 +00:00
Mateusz Guzik
5ec2c93667 tc: bcopy -> memcpy 2018-05-04 22:48:10 +00:00
Jamie Gritton
0e5c6bd436 Make it easier for filesystems to count themselves as jail-enabled,
by doing most of the work in a new function prison_add_vfs in kern_jail.c
Now a jail-enabled filesystem need only mark itself with VFCF_JAIL, and
the rest is taken care of.  This includes adding a jail parameter like
allow.mount.foofs, and a sysctl like security.jail.mount_foofs_allowed.
Both of these used to be a static list of known filesystems, with
predefined permission bits.

Reviewed by:	kib
Differential Revision:	D14681
2018-05-04 20:54:27 +00:00
Mark Johnston
1b5c869d64 Fix some races introduced in r332974.
With r332974, when performing a synchronized access of a page's "queue"
field, one must first check whether the page is logically dequeued. If
so, then the page lock does not prevent the page from being removed
from its page queue. Intoduce vm_page_queue(), which returns the page's
logical queue index. In some cases, direct access to the "queue" field
is still required, but such accesses should be confined to sys/vm.

Reported and tested by:	pho
Reviewed by:	kib
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D15280
2018-05-04 17:17:30 +00:00
Matt Macy
748ff486b0 dup1_processes -t 96 -s 5 on a dual 8160
x dup_before
+ dup_after
+------------------------------------------------------------+
|             x                                            + |
|x    x   x   x                                         ++ ++|
|   |____AM___|                                          |AM||
+------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5  1.514954e+08 1.5230351e+08 1.5206157e+08 1.5199371e+08     341205.71
+   5 1.5494336e+08 1.5519569e+08 1.5511982e+08 1.5508323e+08     96232.829
Difference at 95.0% confidence
        3.08952e+06 +/- 365604
        2.03266% +/- 0.245071%
        (Student's t, pooled s = 250681)

Reported by:	mjg@
MFC after:	1 week
2018-05-04 06:51:01 +00:00
Konstantin Belousov
7035cf14ee Implement support for ifuncs in the kernel linker.
Required MD bits are only provided for x86.

Reviewed by:	jhb (previous version, as part of the larger patch)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D13838
2018-05-03 21:37:46 +00:00
Stephen Hurd
f3e1324b41 Separate list manipulation locking from state change in multicast
Multicast incorrectly calls in to drivers with a mutex held causing drivers
to have to go through all manner of contortions to use a non sleepable lock.
Serialize multicast updates instead.

Submitted by:	mmacy <mmacy@mattmacy.io>
Reviewed by:	shurd, sbruno
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D14969
2018-05-02 19:36:29 +00:00
Mark Johnston
20f85b1ddd Print the dump progress indicator after calling dump_start().
Dumpers may wish to print messages from an initialization hook; this
change ensures that such messages aren't mixed with output from the
generic dump code.

MFC after:	1 week
2018-05-01 17:32:43 +00:00
Nathan Whitehorn
ee900504cf Report the kernel base address properly in kldstat when using PowerPC kernels
loaded at addresses other than their link address.
2018-05-01 04:06:59 +00:00
Ed Maste
2216c6933c Disable connectat/bindat with AT_FDCWD in capmode
Previously it was possible to connect a socket (which had the
CAP_CONNECT right) by calling "connectat(AT_FDCWD, ...)" even in
capabilties mode.  This combination should be treated the same as a call
to connect (i.e. forbidden in capabilities mode).  Similarly for bindat.

Disable connectat/bindat with AT_FDCWD in capabilities mode, fix up the
documentation and add tests.

PR:		222632
Submitted by:	Jan Kokemüller <jan.kokemueller@gmail.com>
Reviewed by:	Domagoj Stolfa
MFC after:	1 week
Relnotes:	Yes
Differential Revision:	https://reviews.freebsd.org/D15221
2018-04-30 17:31:06 +00:00
Mateusz Guzik
9d68f7741f systrace: track it like sdt probes
While here predict false.

Note the code is wrong (regardless of this change). Dereference of the
pointer can race with module unload. A fix would set the probe to a
nop stub instead of NULL.
2018-04-27 15:16:34 +00:00
Emmanuel Vadot
ee710ecf32 clk: Put the sysctls under hw.clock instead of clock
This is more consistant with hw.regulator and other hardware related
sysctls.
2018-04-27 00:12:00 +00:00
Mark Johnston
5cd29d0f3c Improve VM page queue scalability.
Currently both the page lock and a page queue lock must be held in
order to enqueue, dequeue or requeue a page in a given page queue.
The queue locks are a scalability bottleneck in many workloads. This
change reduces page queue lock contention by batching queue operations.
To detangle the page and page queue locks, per-CPU batch queues are
used to reference pages with pending queue operations. The requested
operation is encoded in the page's aflags field with the page lock
held, after which the page is enqueued for a deferred batch operation.
Page queue scans are similarly optimized to minimize the amount of
work performed with a page queue lock held.

Reviewed by:	kib, jeff (previous versions)
Tested by:	pho
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14893
2018-04-24 21:15:54 +00:00
Sean Bruno
7875017ca9 Revert r332894 at the request of the submitter.
Submitted by:	Johannes Lundberg <johalun0_gmail.com>
Sponsored by:	Limelight Networks
2018-04-24 19:55:12 +00:00
Conrad Meyer
65df124845 Do not totally silence suppressed secondary kasserts unless debug.kassert.do_log is disabled
To totally silence and ignore secondary kassert violations after a primary
panic, set debug.kassert.do_log=0 and debug.kassert.suppress_in_panic=1.

Additional assertion warnings shouldn't block core dump and may alert the
developer to another erroneous condition.  Secondary stack traces may be
printed, identically to the unsuppressed case where panic() is reentered --
controlled via debug.trace_all_panics.

Sponsored by:	Dell EMC Isilon
2018-04-24 19:10:51 +00:00
Conrad Meyer
07aa6ea677 Fix debug.kassert.do_log description text
This has been an (incorrect) copy-paste duplicate of debug.kassert.warn_only
since it was originally committed in r243980.

Sponsored by:	Dell EMC Isilon
2018-04-24 18:59:40 +00:00
Conrad Meyer
ad1fc31570 panic: Optionally, trace secondary panics
To diagnose and fix secondary panics, it is useful to have a stack trace.
When panic tracing is enabled, optionally trace secondary panics as well.

The option is configured with the tunable/sysctl debug.trace_all_panics.

(The original concern that inspired only tracing the primary panic was
likely that the secondary trace may scroll the original panic message or trace
off the screen.  This is less of a concern for serial consoles with logging.
Not everything has a serial console, though, so the behavior is optional.)

Discussed with:	jhb
Sponsored by:	Dell EMC Isilon
2018-04-24 18:54:20 +00:00
Jonathan T. Looney
18959b695d Update r332860 by changing the default from suppressing post-panic
assertions to not suppressing post-panic assertions.

There are some post-panic assertions that are valuable and we shouldn't
default to disabling them.  However, when a user trips over them, the
user can still adjust the tunable/sysctl to suppress them temporarily to
get conduct troubleshooting (e.g. get a core dump).

Reported by:	cem, markj
2018-04-24 18:47:35 +00:00
Conrad Meyer
b543c98cab lockmgr: Add missed neutering during panic
r313683 introduced new lockmgr APIs that missed the panic-time neutering
present in the rest of our locks.  Correct that by adding the usual check.

Additionally, move the __lockmgr_args neutering above the assertions at the
top of the function.  Drop the interlock unlock because we shouldn't have
an unneutered interlock either.  No point trying to unlock it.

PR:		227749
Reported by:	jtl
Sponsored by:	Dell EMC Isilon
2018-04-24 18:41:14 +00:00
Mateusz Guzik
d357c16adc lockf: change the owner hash from pid to vnode-based
This adds a bit missed due to the patch split, see r332882

Tested by:	pho
2018-04-24 06:10:36 +00:00
Mateusz Guzik
7cd794214a dtrace: depessimize dtmalloc when dtrace is active
Each malloc/free was testing dtrace_malloc_enabled and forcing
extra reads from the malloc type struct to see if perhaps a
dtmalloc probe was on.

Treat it like lockstat and sdt: have a global bolean.
2018-04-24 01:06:20 +00:00
Mateusz Guzik
4c5209cb21 lockstat: track lockstat just like sdt probes
In particular flip the frequently tested var to bool.
2018-04-24 01:04:10 +00:00
Mateusz Guzik
c9e05ccd62 malloc: stop reading the subzone if MALLOC_DEBUG_MAXZONES == 1 (the default)
malloc was showing at the top of profile during while running microbenchmarks.

#define DTMALLOC_PROBE_MAX              2
struct malloc_type_internal {
        uint32_t        mti_probes[DTMALLOC_PROBE_MAX];
        u_char          mti_zone;
        struct malloc_type_stats        mti_stats[MAXCPU];
};

Reading mti_zone it wastes a cacheline to hold mti_probes + mti_zone
(which we know is 0) + part of malloc stats of the first cpu which on top
induces false-sharing.

In particular will-it-scale lock1_processes -t 128 -s 10:
before: average:45879692
after:  average:51655596

Note the counters can be padded but the right fix is to move them to
counter(9), leaving the struct read-only after creation (modulo dtrace
probes).
2018-04-23 22:28:49 +00:00
Sean Bruno
7b7796eea5 Load balance sockets with new SO_REUSEPORT_LB option
This patch adds a new socket option, SO_REUSEPORT_LB, which allow multiple
programs or threads to bind to the same port and incoming connections will be
load balanced using a hash function.

Most of the code was copied from a similar patch for DragonflyBSD.

However, in DragonflyBSD, load balancing is a global on/off setting and can not
be set per socket. This patch allows for simultaneous use of both the current
SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system.

Required changes to structures
Globally change so_options from 16 to 32 bit value to allow for more options.
Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets.

Limitations
As DragonflyBSD, a load balance group is limited to 256 pcbs
(256 programs or threads sharing the same socket).

Submitted by:	Johannes Lundberg <johanlun0@gmail.com>
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D11003
2018-04-23 19:51:00 +00:00
Mateusz Guzik
833dc05a6e lockf: add per-chain locks to the owner hash
This combined with previous changes significantly depessimizes the behaviour
under contentnion.

In particular the lock1_processes test (locking/unlocking separate files)
from the will-it-scale suite was executed with 128 concurrency on a
4-socket Broadwell with 128 hardware threads.

Operations/second (lock+unlock) go from ~750000 to ~45000000 (6000%)
For reference single-process is ~1680000 (i.e. on stock kernel the resulting
perf is less than *half* of the single-threaded run),

Note this still does not really scale all that well as the locks were just
bolted on top of the current implementation. Significant room for improvement
is still here. In particular the top performance fluctuates depending on the
extent of false sharing in given run (which extends beyond the file).
Added chain+lock pairs were not padded w.r.t. cacheline size.

One big ticket item is the hash used for spreading threads: it used to be the
process pid (which basically serialized all threaded ops). Temporarily the
vnode addr was slapped in instead.

Tested by:      pho
2018-04-23 08:23:10 +00:00
Mateusz Guzik
63286976b5 lockf: skip locking the graph if not necessary (common case)
Tested by:      pho
2018-04-23 07:54:02 +00:00
Mateusz Guzik
717df0b0e8 lockf: perform wakeup onlly when there is anybody waiting
Tested by:      pho
2018-04-23 07:52:56 +00:00
Mateusz Guzik
c72ead2815 lockf: skip the hard work in lf_purgelocks if possible
Tested by:      pho
2018-04-23 07:52:10 +00:00
Mateusz Guzik
0d3323f557 lockf: free state only when recycling the vnode
This avoids malloc/free cycles when locking/unlocking the vnode when
nobody is contending.

Tested by:	pho
2018-04-23 07:51:19 +00:00
Tijl Coosemans
7dfbbc613b Make bufdaemon and bufspacedaemon use kthread_suspend_check instead of
kproc_suspend_check.  In r329612 bufspacedaemon was turned into a thread
of the bufdaemon process causing both to call kproc_suspend_check with the
same proc argument and that function contains the following while loop:

while (SIGISMEMBER(p->p_siglist, SIGSTOP)) {
	wakeup(&p->p_siglist);
	msleep(&p->p_siglist, &p->p_mtx, PPAUSE, "kpsusp", 0);
}

So one thread wakes up the other and the other wakes up the first again,
locking up UP machines on shutdown.

Also register the shutdown handlers with SHUTDOWN_PRI_LAST + 100 so they
run after the syncer has shutdown, because the syncer can cause a
situation where bufdaemon help is needed to proceed.

PR:		227404
Reviewed by:	kib
Tested by:	cy, rmacklem
2018-04-22 16:05:29 +00:00
Mateusz Guzik
7d853f62bf lockf: slightly depessimize
1. check if P_ADVLOCK is already set and if so, don't lock to set it
(stolen from DragonFly)
2. when trying for fast path unlock, check that we are doing unlock
first instead of taking the interlock for no reason (e.g. if we want
to *lock*). whilere make it more likely that falling fast path will
not take the interlock either by checking for state

Note the code is severely pessimized both single- and multithreaded.
2018-04-22 09:30:07 +00:00