16150 Commits

Author SHA1 Message Date
Matt Macy
1f4beb6312 epoch(9): cleanups, additional debug checks, and add global_epoch
- GC the _nopreempt routines
    - to really benefit we'd need a separate routine
    - they're not currently in use
    - they complicate the API for no benefit at this time

- check that we're actually in an epoch section at exit

- handle epoch_call() early in boot

- Fix copyright declaration language

Approved by:	sbruno@
2018-05-13 23:24:48 +00:00
Konstantin Belousov
2ebc882927 Detect and optimize reads from the hole on UFS.
- Create getblkx(9) variant of getblk(9) which can return error.
- Add GB_NOSPARSE flag for getblk()/getblkx() which requests that BMAP
  be performed before the buffer is created, and that EJUSTRETURN be
  returned in case the requested block does not exist.
- Make ffs_read() use GB_NOSPARSE to avoid instantiating buffer (and
  allocating the pages for it), copying from zero_region instead.

The end result is fewer page allocations and less buffer recycling when a
hole is read, which is important for some benchmarks.

Requested and reviewed by:	jeff
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D14917
2018-05-13 09:47:28 +00:00
Matt Macy
f1401123c5 hwpmc/epoch - don't reference domain if NUMA is not set
It appears that domain information is set correctly independent
of whether or not NUMA is defined. However, there is no memory
backing secondary domains leading to allocation failure.

Reported by:	pho@, np@
Approved by:	sbruno@
2018-05-12 20:00:29 +00:00
Matt Macy
e6b475e0af hwpmc(9): Make pmclog buffer pcpu and update constants
On non-trivial SMP systems the contention on the pmc_owner mutex leads
to a substantial number of samples captured being from the pmc process
itself. This change a) makes buffers larger to avoid contention on the
global list b) makes the working sample buffer per cpu.

Run pmcstat in the background (default event rate of 64k):
pmcstat -S UNHALTED_CORE_CYCLES -O /dev/null sleep 600 &

Before:
make -j96 buildkernel -s >&/dev/null 3336.68s user 24684.10s system 7442% cpu 6:16.50 total

After:
make -j96 buildkernel -s >&/dev/null 2697.82s user 1347.35s system 6058% cpu 1:06.77 total

For a more realistic overhead measurement, set the sample rate to ~2 kHz
on a 2.1 GHz processor:
pmcstat -n 1050000 -S UNHALTED_CORE_CYCLES -O /dev/null sleep 6000 &

Collecting 10 samples of `make -j96 buildkernel` from each:

x before
+ after

real time:
    N           Min           Max        Median           Avg        Stddev
x  10          76.4        127.62        84.845        88.577     15.100031
+  10         59.71         60.79        60.135        60.179    0.29957192
Difference at 95.0% confidence
        -28.398 +/- 10.0344
        -32.0602% +/- 7.69825%
        (Student's t, pooled s = 10.6794)

system time:
    N           Min           Max        Median           Avg        Stddev
x  10       2277.96       6948.53       2949.47      3341.492     1385.2677
+  10        1038.7       1081.06      1070.555      1064.017      15.85404
Difference at 95.0% confidence
        -2277.47 +/- 920.425
        -68.1574% +/- 8.77623%
        (Student's t, pooled s = 979.596)

x no pmc
+ pmc running
real time:

HEAD:
    N           Min           Max        Median           Avg        Stddev
x  10         58.38         59.15         58.86        58.847    0.22504567
+  10          76.4        127.62        84.845        88.577     15.100031
Difference at 95.0% confidence
        29.73 +/- 10.0335
        50.5208% +/- 17.0525%
        (Student's t, pooled s = 10.6785)

patched:
    N           Min           Max        Median           Avg        Stddev
x  10         58.38         59.15         58.86        58.847    0.22504567
+  10         59.71         60.79        60.135        60.179    0.29957192
Difference at 95.0% confidence
        1.332 +/- 0.248939
        2.2635% +/- 0.426506%
        (Student's t, pooled s = 0.264942)

system time:

HEAD:
    N           Min           Max        Median           Avg        Stddev
x  10       1010.15       1073.31      1025.465      1031.524     18.135705
+  10       2277.96       6948.53       2949.47      3341.492     1385.2677
Difference at 95.0% confidence
        2309.97 +/- 920.443
        223.937% +/- 89.3039%
        (Student's t, pooled s = 979.616)

patched:
    N           Min           Max        Median           Avg        Stddev
x  10       1010.15       1073.31      1025.465      1031.524     18.135705
+  10        1038.7       1081.06      1070.555      1064.017      15.85404
Difference at 95.0% confidence
        32.493 +/- 16.0042
        3.15% +/- 1.5794%
        (Student's t, pooled s = 17.0331)

Reviewed by:	jeff@
Approved by:	sbruno@
Differential Revision:	https://reviews.freebsd.org/D15155
2018-05-12 01:26:34 +00:00
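
The "Difference at 95.0% confidence" lines in the ministat-style tables of e6b475e0af follow from the Avg and Stddev columns alone. The sketch below recomputes the pooled standard deviation and the confidence half-width. The struct and function names are hypothetical, the square root uses Newton's method only to keep the sketch free of libm, and the critical value (about 2.101 for 18 degrees of freedom) is taken from a standard t table rather than from the commit.

```c
/* Summary of one sample set, as printed in the tables. */
struct summary {
	int	n;	/* number of runs */
	double	avg;	/* mean */
	double	sd;	/* sample standard deviation */
};

/* Newton's method square root, so the sketch does not need libm. */
static double
sqrt_newton(double x)
{
	double r = x > 1.0 ? x : 1.0;
	int i;

	if (x <= 0.0)
		return (0.0);
	for (i = 0; i < 64; i++)
		r = 0.5 * (r + x / r);
	return (r);
}

/* Pooled standard deviation of two independent sample sets. */
double
pooled_sd(struct summary a, struct summary b)
{
	double num = (a.n - 1) * a.sd * a.sd + (b.n - 1) * b.sd * b.sd;

	return (sqrt_newton(num / (a.n + b.n - 2)));
}

/* Half-width of the confidence interval for avg(b) - avg(a). */
double
ci_halfwidth(struct summary a, struct summary b, double tcrit)
{
	return (tcrit * pooled_sd(a, b) *
	    sqrt_newton(1.0 / a.n + 1.0 / b.n));
}
```

Feeding in the real-time rows (before: N=10, Avg 88.577, Stddev 15.100031; after: N=10, Avg 60.179, Stddev 0.29957192) gives a pooled s of about 10.679 and a half-width of about 10.03, in line with the "pooled s = 10.6794" and "+/- 10.0344" figures reported in the table.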
Matt Macy
8dcbd0eae6 epoch(9): always set inited in epoch_init
- set inited in the !usedomains case

Reported by:	jhibbits
Approved by:	sbruno
2018-05-11 18:37:14 +00:00
Matt Macy
4aa302dfc9 epoch(9): callback task fixes
- initialize the pcpu STAILQ in the NUMA case
- don't enqueue the callback task if there isn't sufficient work to be done

Reported by:	pho@
Approved by:	sbruno@
2018-05-11 08:16:56 +00:00
Mateusz Guzik
85c1b3c1cb rmlock: partially depessimize lock/unlock fastpath
Previously the slow path was folded in and partially jumped over in the
common case.
2018-05-11 06:59:54 +00:00
Matt Macy
b2cb28963b epoch(9): fix priority handling, make callback lists pcpu, and other fixes
- Lend priority to preempted threads in epoch_wait to handle the case
  in which we've had priority lent to us. Previously we borrowed the
  priority of the lowest priority preempted thread. (pointed out by mjg@)

- Don't attempt to allocate memory per-domain on powerpc, we don't currently
  handle empty sockets (as is the case on jhibbits' Talos board).

- Handle deferred callbacks as pcpu lists and poll the lists periodically.
  Currently the interval is 1/hz.

- Drop the thread lock when adaptive spinning. Holding the lock starves
  other threads and can even lead to lockups.

- Keep a generation count pcpu so that we don't keep spinning if a thread
  has left and re-entered an epoch section.

- Actually remove the callback from the callback list so that we don't
  double free. Sigh ...

Approved by:	sbruno@
2018-05-11 04:54:12 +00:00
Matt Macy
06bf2a6aef Add simple preempt safe epoch API
Read locking is overused in the kernel to guarantee liveness. This API makes
it easy to provide liveness guarantees without atomics.

Includes epoch_test kernel module to stress test the API.

Documentation will follow initial use case.

Test case and improvements to preemption handling in response to discussion
with mjg@

Reviewed by:	imp@, shurd@
Approved by:	sbruno@
2018-05-10 17:55:24 +00:00
Andrew Gallatin
d5cdcc3a06 Fix the build after r333457
In r333457, the arguments to kern_pwritev() were accidentally
re-ordered as part of ANSIfication, breaking the build.
2018-05-10 13:19:42 +00:00
Ed Maste
cc3c9df80f ANSIfy sys_generic.c
2018-05-10 11:36:16 +00:00
Matt Macy
36688f706e Add taskqgroup_config_gtask_deinit to support teardown after
taskqgroup_config_gtask_init.

Approved by:	sbruno
2018-05-09 18:51:35 +00:00
Matt Macy
cbd92ce62e Eliminate the overhead of gratuitous repeated reinitialization of cap_rights
- Add macros to allow preinitialization of cap_rights_t.

- Convert most commonly used code paths to use preinitialized cap_rights_t.
  A 3.6% speedup in fstat was measured with this change.

Reported by:	mjg
Reviewed by:	oshogbo
Approved by:	sbruno
MFC after:	1 month
2018-05-09 18:47:24 +00:00
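
The idea behind cbd92ce62e, building a constant rights set once instead of on every call, can be illustrated with a small model. This is only a sketch: `struct rights`, `rights_init()`, `check_slow()` and `check_fast()` are hypothetical stand-ins, not the kernel's cap_rights_t API or the macros the commit adds.

```c
#include <stdint.h>

/* Hypothetical stand-in for cap_rights_t: a small bit-set. */
struct rights {
	uint64_t bits[2];
};

static int init_calls;	/* counts how often the set is rebuilt */

static void
rights_init(struct rights *r, uint64_t lo, uint64_t hi)
{
	r->bits[0] = lo;
	r->bits[1] = hi;
	init_calls++;
}

/* Before: every call rebuilds the same constant on its stack. */
int
check_slow(uint64_t have)
{
	struct rights need;

	rights_init(&need, 0x1, 0);
	return ((have & need.bits[0]) == need.bits[0]);
}

/* After: the constant is initialized once and only read afterwards. */
static const struct rights need_fstat = { { 0x1, 0 } };

int
check_fast(uint64_t have)
{
	return ((have & need_fstat.bits[0]) == need_fstat.bits[0]);
}
```

Both functions return the same answer; the second simply does no initialization work on the hot path.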
Konstantin Belousov
55c9d75e6b Avoid calls to bzero() before ireloc.
Evaluate cpu_stdext_feature early so that the moved link_elf_ireloc() sees
the correct flags, the most important being SMAP.

Tested by:	mjg
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D15367
2018-05-09 14:39:24 +00:00
Matt Macy
ad738f3791 Reduce overhead of ktrace checks in the common case.
KTRPOINT() checks both if we are tracing _and_ if we are recursing within
ktrace. The second condition is only ever executed if ktrace is actually
enabled. This change moves the check out of the hot path into the functions
themselves.

Discussed with mjg@

Reported by:	mjg@
Approved by:	sbruno@
2018-05-09 00:00:47 +00:00
Mateusz Guzik
2824088536 Inlined sched_userret.
The tested condition is rarely true and it induces a function call
on each return to userspace.

Bumps getuid rate by about 1% on Broadwell.
2018-05-07 23:36:16 +00:00
Mateusz Guzik
75e9b455a9 Change trap_enotcap to bool and annotate with __read_frequently
It is read on each return to user space.
2018-05-07 23:10:12 +00:00
Mateusz Guzik
79ca7cbf09 Avoid calls to syscall_thread_enter/exit for statically defined syscalls
The entire mechanism is rarely used and performs poorly due to
atomic ops on the syscall table. It also adds overhead for completely
unrelated syscalls.

Reduce it by avoiding the function calls where possible (which constitutes
the vast majority of cases).

Provides about 3% syscall rate speed up for getuid on Broadwell.
2018-05-07 22:29:32 +00:00
Warner Losh
ad7142757b Add device_quiet_children() and device_has_quiet_children()
If you add a child to a device that has quiet children, we'll
automatically set the quiet flag on the children, and its
children.

This is intended for things like CPU that have a large amount of
repetition in booting that adds nothing.
2018-05-07 21:09:08 +00:00
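
The inheritance rule described in ad7142757b can be modelled in a few lines. This is a sketch under assumptions: `struct dev`, `dev_quiet_children()` and `dev_add_child()` are hypothetical names for illustration, not the newbus implementation.

```c
#include <stdbool.h>
#include <stddef.h>

struct dev {
	struct dev *parent;
	bool quiet;		/* suppress this device's boot messages */
	bool quiet_children;	/* children added later start out quiet */
};

/* Mark a device so that children added from now on are quiet. */
void
dev_quiet_children(struct dev *d)
{
	d->quiet_children = true;
}

/* Attach a child; it inherits quietness and passes it further down. */
void
dev_add_child(struct dev *parent, struct dev *child)
{
	child->parent = parent;
	child->quiet = false;
	child->quiet_children = false;
	if (parent != NULL && parent->quiet_children) {
		child->quiet = true;
		child->quiet_children = true;	/* grandchildren inherit */
	}
}
```

In this model, marking a CPU device once keeps every core and per-core child added afterwards quiet, while devices elsewhere in the tree are unaffected.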
Andrew Gallatin
e7bd0750af Boost thread priority while changing CPU frequency
Boost the priority of user-space threads when they set
their affinity to a core to adjust its frequency.   This avoids a situation
where a CPU bound kernel thread with the same affinity is running on a
down-clocked core, and will "block" powerd from up-clocking the core
until the kernel thread yields.   This can lead to poor performance,
and to things potentially getting stuck on Giant.

Reviewed by:	kib (imp reviewed earlier version)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D15246
2018-05-07 15:24:03 +00:00
Mark Johnston
bd92e6b6f5 Refactor some of the MI kernel dump code in preparation for netdump.
- Add clear_dumper() to complement set_dumper().
- Drain netdump's preallocated mbuf pool when clearing the dumper.
- Don't do bounds checking for dumpers with mediasize 0.
- Add dumper callbacks for initialization and for writing out headers.

Reviewed by:	sbruno
MFC after:	1 month
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D15252
2018-05-06 00:22:38 +00:00
Mark Johnston
5475ca5aca Add an mbuf allocator for netdump.
The aim is to permit mbuf allocations after a panic without calling into
the page allocator, without imposing any runtime overhead during regular
operation of the system, and without modifying driver code. The approach
taken is to preallocate a number of mbufs and clusters, storing them
in linked lists, and using the lists to back some UMA cache zones. At
panic time, the mbuf and cluster zone pointers are overwritten with
those of the cache zones so that the mbuf allocator returns
preallocated items.

Using this scheme, drivers which cache mbuf zone pointers from
m_getzone() require special handling when implementing netdump support.

Reviewed by:	cem (earlier version), julian, sbruno
MFC after:	1 month
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D15251
2018-05-06 00:19:48 +00:00
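
The zone-pointer swap described in 5475ca5aca, and the reason cached zone pointers need special handling, can be shown with a toy model. Everything here is hypothetical (`struct zone`, `dump_zone_prime()`, `dump_zone_activate()`, `zone_alloc()`); in particular, an empty free list in the regular zone stands in for "would have to call the page allocator".

```c
#include <stddef.h>

struct item {
	struct item *next;
};

struct zone {
	struct item *freelist;	/* preallocated items, if any */
};

static struct zone regular_zone;	/* normally backed by the page allocator */
static struct zone dump_zone;		/* backed only by preallocated items */
struct zone *mbuf_zone = &regular_zone;	/* what the allocator consults */

/* At configuration time: stash preallocated items on a linked list. */
void
dump_zone_prime(struct item *items, int n)
{
	int i;

	for (i = 0; i < n; i++) {
		items[i].next = dump_zone.freelist;
		dump_zone.freelist = &items[i];
	}
}

/* At panic time: overwrite the zone pointer with the cache zone. */
void
dump_zone_activate(void)
{
	mbuf_zone = &dump_zone;
}

/* Pop one item, or NULL if this zone has nothing preallocated. */
struct item *
zone_alloc(struct zone *z)
{
	struct item *it = z->freelist;

	if (it != NULL)
		z->freelist = it->next;
	return (it);
}
```

Code that reads `mbuf_zone` at allocation time is redirected to the preallocated items after the swap. Code that saved the old pointer earlier still targets the regular zone, which is the situation the commit message flags for drivers that cache zone pointers.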
Mark Johnston
c2ba2d1b0e Style.
MFC after:	3 days
2018-05-06 00:11:30 +00:00
Andriy Gapon
bd3afae0ca for bus suspend, detach and shutdown iterate children in reverse order
For most buses all children are equal, so the order does not matter.
Other buses, such as acpi, carefully order their child devices to
express implicit dependencies between them.  For such buses it is safer
to bring down devices in the reverse order.

I believe that this is the reason why hpet_suspend had to be disabled.
Some drivers depend on a working event timer until they are suspended.
But previously we would suspend hpet very early.

I tested this change by making hpet_suspend actually stop HPET timers
and tested that too.

Note that this change is not a complete solution as it does not take
into account bus passes.
A better approach would be to track the actual attach order of the
devices and to use the reverse of that.

Reviewed by:	imp, mav
MFC after:	3 weeks
Differential Revision: https://reviews.freebsd.org/D15291
2018-05-05 05:19:32 +00:00
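
Why the order matters in bd3afae0ca can be seen in a small model where each child may depend on a child attached before it. The names `struct child` and `suspend_children()` are hypothetical; this is a sketch of the ordering argument, not the newbus code.

```c
struct child {
	int depends_on;	/* index of an earlier child, or -1 for none */
	int suspended;
};

/*
 * Suspend all children in forward or reverse list order.  Returns how
 * many children were suspended after the device they depend on had
 * already been suspended.
 */
int
suspend_children(struct child *c, int n, int reverse)
{
	int i, idx, broken = 0;

	for (i = 0; i < n; i++) {
		idx = reverse ? n - 1 - i : i;
		if (c[idx].depends_on >= 0 &&
		    c[c[idx].depends_on].suspended)
			broken++;
		c[idx].suspended = 1;
	}
	return (broken);
}
```

With a list such as an event timer followed by two drivers that each depend on the previous entry, forward order suspends the timer first and breaks both dependents, while reverse order breaks none.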
Mateusz Guzik
5ec2c93667 tc: bcopy -> memcpy
2018-05-04 22:48:10 +00:00
Jamie Gritton
0e5c6bd436 Make it easier for filesystems to count themselves as jail-enabled,
by doing most of the work in a new function prison_add_vfs in kern_jail.c
Now a jail-enabled filesystem need only mark itself with VFCF_JAIL, and
the rest is taken care of.  This includes adding a jail parameter like
allow.mount.foofs, and a sysctl like security.jail.mount_foofs_allowed.
Both of these used to be a static list of known filesystems, with
predefined permission bits.

Reviewed by:	kib
Differential Revision:	D14681
2018-05-04 20:54:27 +00:00
Mark Johnston
1b5c869d64 Fix some races introduced in r332974.
With r332974, when performing a synchronized access of a page's "queue"
field, one must first check whether the page is logically dequeued. If
so, then the page lock does not prevent the page from being removed
from its page queue. Introduce vm_page_queue(), which returns the page's
logical queue index. In some cases, direct access to the "queue" field
is still required, but such accesses should be confined to sys/vm.

Reported and tested by:	pho
Reviewed by:	kib
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D15280
2018-05-04 17:17:30 +00:00
Matt Macy
748ff486b0 dup1_processes -t 96 -s 5 on a dual 8160
x dup_before
+ dup_after
+------------------------------------------------------------+
|             x                                            + |
|x    x   x   x                                         ++ ++|
|   |____AM___|                                          |AM||
+------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5  1.514954e+08 1.5230351e+08 1.5206157e+08 1.5199371e+08     341205.71
+   5 1.5494336e+08 1.5519569e+08 1.5511982e+08 1.5508323e+08     96232.829
Difference at 95.0% confidence
        3.08952e+06 +/- 365604
        2.03266% +/- 0.245071%
        (Student's t, pooled s = 250681)

Reported by:	mjg@
MFC after:	1 week
2018-05-04 06:51:01 +00:00
Konstantin Belousov
7035cf14ee Implement support for ifuncs in the kernel linker.
Required MD bits are only provided for x86.

Reviewed by:	jhb (previous version, as part of the larger patch)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D13838
2018-05-03 21:37:46 +00:00
Stephen Hurd
f3e1324b41 Separate list manipulation locking from state change in multicast
Multicast incorrectly calls in to drivers with a mutex held causing drivers
to have to go through all manner of contortions to use a non sleepable lock.
Serialize multicast updates instead.

Submitted by:	mmacy <mmacy@mattmacy.io>
Reviewed by:	shurd, sbruno
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D14969
2018-05-02 19:36:29 +00:00
Mark Johnston
20f85b1ddd Print the dump progress indicator after calling dump_start().
Dumpers may wish to print messages from an initialization hook; this
change ensures that such messages aren't mixed with output from the
generic dump code.

MFC after:	1 week
2018-05-01 17:32:43 +00:00
Nathan Whitehorn
ee900504cf Report the kernel base address properly in kldstat when using PowerPC kernels
loaded at addresses other than their link address.
2018-05-01 04:06:59 +00:00
Ed Maste
2216c6933c Disable connectat/bindat with AT_FDCWD in capmode
Previously it was possible to connect a socket (which had the
CAP_CONNECT right) by calling "connectat(AT_FDCWD, ...)" even in
capabilities mode.  This combination should be treated the same as a call
to connect (i.e. forbidden in capabilities mode).  Similarly for bindat.

Disable connectat/bindat with AT_FDCWD in capabilities mode, fix up the
documentation and add tests.

PR:		222632
Submitted by:	Jan Kokemüller <jan.kokemueller@gmail.com>
Reviewed by:	Domagoj Stolfa
MFC after:	1 week
Relnotes:	Yes
Differential Revision:	https://reviews.freebsd.org/D15221
2018-04-30 17:31:06 +00:00
Mateusz Guzik
9d68f7741f systrace: track it like sdt probes
While here predict false.

Note the code is wrong (regardless of this change). Dereference of the
pointer can race with module unload. A fix would set the probe to a
nop stub instead of NULL.
2018-04-27 15:16:34 +00:00
Emmanuel Vadot
ee710ecf32 clk: Put the sysctls under hw.clock instead of clock
This is more consistent with hw.regulator and other hardware related
sysctls.
2018-04-27 00:12:00 +00:00
Mark Johnston
5cd29d0f3c Improve VM page queue scalability.
Currently both the page lock and a page queue lock must be held in
order to enqueue, dequeue or requeue a page in a given page queue.
The queue locks are a scalability bottleneck in many workloads. This
change reduces page queue lock contention by batching queue operations.
To detangle the page and page queue locks, per-CPU batch queues are
used to reference pages with pending queue operations. The requested
operation is encoded in the page's aflags field with the page lock
held, after which the page is enqueued for a deferred batch operation.
Page queue scans are similarly optimized to minimize the amount of
work performed with a page queue lock held.

Reviewed by:	kib, jeff (previous versions)
Tested by:	pho
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14893
2018-04-24 21:15:54 +00:00
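
The batching described in 5cd29d0f3c trades one queue-lock acquisition per page for one per batch. The sketch below models a single CPU's batch queue; `BATCH_SIZE`, the struct layouts and the function names are assumptions for illustration, and the lock is represented only by a counter.

```c
#define BATCH_SIZE	7	/* assumed capacity, not the kernel's value */

struct pagequeue {
	int lock_acquisitions;	/* how often the queue lock was taken */
	int length;		/* pages actually on the queue */
};

struct batchqueue {
	int pending[BATCH_SIZE];	/* pages awaiting a queue operation */
	int count;
};

/* Apply all pending operations under a single lock acquisition. */
void
batch_flush(struct pagequeue *pq, struct batchqueue *bq)
{
	if (bq->count == 0)
		return;
	pq->lock_acquisitions++;	/* lock(pq) */
	pq->length += bq->count;	/* enqueue each pending page */
	bq->count = 0;			/* unlock(pq) */
}

/* Defer an enqueue; the queue lock is only taken when the batch fills. */
void
batch_enqueue(struct pagequeue *pq, struct batchqueue *bq, int page)
{
	bq->pending[bq->count++] = page;
	if (bq->count == BATCH_SIZE)
		batch_flush(pq, bq);
}
```

Enqueueing 20 pages in this model takes the queue lock twice, plus once more for a final flush, instead of 20 times.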
Sean Bruno
7875017ca9 Revert r332894 at the request of the submitter.
Submitted by:	Johannes Lundberg <johalun0_gmail.com>
Sponsored by:	Limelight Networks
2018-04-24 19:55:12 +00:00
Conrad Meyer
65df124845 Do not totally silence suppressed secondary kasserts unless debug.kassert.do_log is disabled
To totally silence and ignore secondary kassert violations after a primary
panic, set debug.kassert.do_log=0 and debug.kassert.suppress_in_panic=1.

Additional assertion warnings shouldn't block core dump and may alert the
developer to another erroneous condition.  Secondary stack traces may be
printed, identically to the unsuppressed case where panic() is reentered --
controlled via debug.trace_all_panics.

Sponsored by:	Dell EMC Isilon
2018-04-24 19:10:51 +00:00
Conrad Meyer
07aa6ea677 Fix debug.kassert.do_log description text
This has been an (incorrect) copy-paste duplicate of debug.kassert.warn_only
since it was originally committed in r243980.

Sponsored by:	Dell EMC Isilon
2018-04-24 18:59:40 +00:00
Conrad Meyer
ad1fc31570 panic: Optionally, trace secondary panics
To diagnose and fix secondary panics, it is useful to have a stack trace.
When panic tracing is enabled, optionally trace secondary panics as well.

The option is configured with the tunable/sysctl debug.trace_all_panics.

(The original concern that inspired only tracing the primary panic was
likely that the secondary trace may scroll the original panic message or trace
off the screen.  This is less of a concern for serial consoles with logging.
Not everything has a serial console, though, so the behavior is optional.)

Discussed with:	jhb
Sponsored by:	Dell EMC Isilon
2018-04-24 18:54:20 +00:00
Jonathan T. Looney
18959b695d Update r332860 by changing the default from suppressing post-panic
assertions to not suppressing post-panic assertions.

There are some post-panic assertions that are valuable and we shouldn't
default to disabling them.  However, when a user trips over them, the
user can still adjust the tunable/sysctl to suppress them temporarily to
conduct troubleshooting (e.g. get a core dump).

Reported by:	cem, markj
2018-04-24 18:47:35 +00:00
Conrad Meyer
b543c98cab lockmgr: Add missed neutering during panic
r313683 introduced new lockmgr APIs that missed the panic-time neutering
present in the rest of our locks.  Correct that by adding the usual check.

Additionally, move the __lockmgr_args neutering above the assertions at the
top of the function.  Drop the interlock unlock because we shouldn't have
an unneutered interlock either.  No point trying to unlock it.

PR:		227749
Reported by:	jtl
Sponsored by:	Dell EMC Isilon
2018-04-24 18:41:14 +00:00
Mateusz Guzik
d357c16adc lockf: change the owner hash from pid to vnode-based
This adds a bit missed due to the patch split, see r332882

Tested by:	pho
2018-04-24 06:10:36 +00:00
Mateusz Guzik
7cd794214a dtrace: depessimize dtmalloc when dtrace is active
Each malloc/free was testing dtrace_malloc_enabled and forcing
extra reads from the malloc type struct to see if perhaps a
dtmalloc probe was on.

Treat it like lockstat and sdt: have a global boolean.
2018-04-24 01:06:20 +00:00
Mateusz Guzik
4c5209cb21 lockstat: track lockstat just like sdt probes
In particular flip the frequently tested var to bool.
2018-04-24 01:04:10 +00:00
Mateusz Guzik
c9e05ccd62 malloc: stop reading the subzone if MALLOC_DEBUG_MAXZONES == 1 (the default)
malloc was showing at the top of the profile while running microbenchmarks.

#define DTMALLOC_PROBE_MAX              2
struct malloc_type_internal {
        uint32_t        mti_probes[DTMALLOC_PROBE_MAX];
        u_char          mti_zone;
        struct malloc_type_stats        mti_stats[MAXCPU];
};

Reading mti_zone wastes a cacheline to hold mti_probes + mti_zone
(which we know is 0) + part of malloc stats of the first cpu, which on top
of that induces false sharing.

In particular will-it-scale lock1_processes -t 128 -s 10:
before: average:45879692
after:  average:51655596

Note the counters can be padded but the right fix is to move them to
counter(9), leaving the struct read-only after creation (modulo dtrace
probes).
2018-04-23 22:28:49 +00:00
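
The layout problem c9e05ccd62 describes can be checked with offsetof(). In this sketch the 64-byte cache line, the small MAXCPU and the two-field `struct malloc_type_stats` are assumptions made for illustration (the real stats struct has more fields); only the shape of `struct malloc_type_internal` comes from the commit message.

```c
#include <stddef.h>
#include <stdint.h>

#define DTMALLOC_PROBE_MAX	2
#define MAXCPU			4	/* small value for illustration */
#define CACHE_LINE		64	/* assumed cache line size */

/* Simplified per-CPU stats; written on every malloc/free. */
struct malloc_type_stats {
	uint64_t mts_memalloced;
	uint64_t mts_memfreed;
};

struct malloc_type_internal {
	uint32_t	mti_probes[DTMALLOC_PROBE_MAX];
	unsigned char	mti_zone;
	struct malloc_type_stats mti_stats[MAXCPU];
};

/* Nonzero if the byte at offset 'off' lies in the struct's first line. */
int
in_first_line(size_t off)
{
	return (off < CACHE_LINE);
}
```

Here `mti_zone` sits at offset 8 and the first CPU's counters start within the same 64 bytes, so a read of `mti_zone` pulls in a line that another CPU keeps dirtying.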
Sean Bruno
7b7796eea5 Load balance sockets with new SO_REUSEPORT_LB option
This patch adds a new socket option, SO_REUSEPORT_LB, which allows multiple
programs or threads to bind to the same port and incoming connections will be
load balanced using a hash function.

Most of the code was copied from a similar patch for DragonflyBSD.

However, in DragonflyBSD, load balancing is a global on/off setting and can not
be set per socket. This patch allows for simultaneous use of both the current
SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system.

Required changes to structures
Globally change so_options from 16 to 32 bit value to allow for more options.
Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets.

Limitations
As in DragonflyBSD, a load balance group is limited to 256 pcbs
(256 programs or threads sharing the same socket).

Submitted by:	Johannes Lundberg <johanlun0@gmail.com>
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D11003
2018-04-23 19:51:00 +00:00
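
The selection step implied by 7b7796eea5, hashing an incoming connection onto one member of a bounded group, might look like the following. This is a sketch, not the in-kernel data structure: `struct lbgroup`, `lbgroup_add()` and `lbgroup_pick()` are hypothetical, and the hash value is assumed to be computed elsewhere from the connection's addresses and ports.

```c
#include <stdint.h>

#define LB_GROUP_MAX	256	/* group size limit stated in the commit */

struct lbgroup {
	int	count;
	int	socks[LB_GROUP_MAX];	/* hypothetical socket identifiers */
};

/* Add a listener; returns -1 once the group is full. */
int
lbgroup_add(struct lbgroup *g, int sock)
{
	if (g->count >= LB_GROUP_MAX)
		return (-1);
	g->socks[g->count++] = sock;
	return (0);
}

/* Pick the listener for a connection from its hash value. */
int
lbgroup_pick(const struct lbgroup *g, uint32_t hash)
{
	if (g->count == 0)
		return (-1);
	return (g->socks[hash % (uint32_t)g->count]);
}
```

Because the choice depends only on the hash and the group size, the same connection hash maps to the same listener for as long as the group membership is unchanged.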
Mateusz Guzik
833dc05a6e lockf: add per-chain locks to the owner hash
This combined with previous changes significantly depessimizes the behaviour
under contention.

In particular the lock1_processes test (locking/unlocking separate files)
from the will-it-scale suite was executed with 128 concurrency on a
4-socket Broadwell with 128 hardware threads.

Operations/second (lock+unlock) go from ~750000 to ~45000000 (6000%)
For reference single-process is ~1680000 (i.e. on stock kernel the resulting
perf is less than *half* of the single-threaded run).

Note this still does not really scale all that well as the locks were just
bolted on top of the current implementation. Significant room for improvement
is still here. In particular the top performance fluctuates depending on the
extent of false sharing in given run (which extends beyond the file).
Added chain+lock pairs were not padded w.r.t. cacheline size.

One big ticket item is the hash used for spreading threads: it used to be the
process pid (which basically serialized all threaded ops). Temporarily the
vnode addr was slapped in instead.

Tested by:      pho
2018-04-23 08:23:10 +00:00
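
The "big ticket item" in 833dc05a6e is the choice of hash key for the owner hash. A sketch of the two schemes, with a hypothetical chain count and a hypothetical shift to discard alignment bits (neither taken from the commit):

```c
#include <stdint.h>

#define NCHAINS	128	/* assumed number of hash chains */

/* Old key: threads of one process share a pid, hence one chain. */
unsigned
chain_by_pid(int pid)
{
	return ((unsigned)pid % NCHAINS);
}

/* New key: hash the vnode address so separate files spread out. */
unsigned
chain_by_vnode(uintptr_t vp)
{
	/* Drop low bits that tend to be zero due to allocation alignment. */
	return ((unsigned)((vp >> 8) % NCHAINS));
}
```

With the pid as key, every thread of a process lands on the same chain and therefore the same per-chain lock, no matter which files it locks. Keying on the vnode address lets operations on different files use different chains.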
Mateusz Guzik
63286976b5 lockf: skip locking the graph if not necessary (common case)
Tested by:      pho
2018-04-23 07:54:02 +00:00
Mateusz Guzik
717df0b0e8 lockf: perform wakeup only when there is anybody waiting
Tested by:      pho
2018-04-23 07:52:56 +00:00
Mateusz Guzik
c72ead2815 lockf: skip the hard work in lf_purgelocks if possible
Tested by:      pho
2018-04-23 07:52:10 +00:00
Mateusz Guzik
0d3323f557 lockf: free state only when recycling the vnode
This avoids malloc/free cycles when locking/unlocking the vnode when
nobody is contending.

Tested by:	pho
2018-04-23 07:51:19 +00:00
Tijl Coosemans
7dfbbc613b Make bufdaemon and bufspacedaemon use kthread_suspend_check instead of
kproc_suspend_check.  In r329612 bufspacedaemon was turned into a thread
of the bufdaemon process causing both to call kproc_suspend_check with the
same proc argument and that function contains the following while loop:

while (SIGISMEMBER(p->p_siglist, SIGSTOP)) {
	wakeup(&p->p_siglist);
	msleep(&p->p_siglist, &p->p_mtx, PPAUSE, "kpsusp", 0);
}

So one thread wakes up the other and the other wakes up the first again,
locking up UP machines on shutdown.

Also register the shutdown handlers with SHUTDOWN_PRI_LAST + 100 so they
run after the syncer has shutdown, because the syncer can cause a
situation where bufdaemon help is needed to proceed.

PR:		227404
Reviewed by:	kib
Tested by:	cy, rmacklem
2018-04-22 16:05:29 +00:00
Mateusz Guzik
7d853f62bf lockf: slightly depessimize
1. check if P_ADVLOCK is already set and if so, don't lock to set it
(stolen from DragonFly)
2. when trying for fast path unlock, check that we are doing unlock
first instead of taking the interlock for no reason (e.g. if we want
to *lock*). While here, make it more likely that a failing fast path will
not take the interlock either, by checking for state

Note the code is severely pessimized both single- and multithreaded.
2018-04-22 09:30:07 +00:00
Jonathan T. Looney
44b71282b5 When running with INVARIANTS, the kernel contains extra checks. However,
these assumptions may not hold true once we've panic'd. Therefore, the
checks hold less value after a panic.  Additionally, if one of the checks
fails while we are already panic'd, this creates a double-panic which can
interfere with debugging the original panic.

Therefore, this commit allows an administrator to suppress a response to
KASSERT checks after a panic by setting a tunable/sysctl.  The
tunable/sysctl (debug.kassert.suppress_in_panic) defaults to being
enabled.

Reviewed by:	kib
Sponsored by:	Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D12920
2018-04-21 17:05:00 +00:00
Konstantin Belousov
1302eea7bb Rename PROC_PDEATHSIG_SET -> PROC_PDEATHSIG_CTL and PROC_PDEATHSIG_GET
-> PROC_PDEATHSIG_STATUS for consistency with other procctl(2)
operations names.

Requested by:	emaste
Sponsored by:	The FreeBSD Foundation
MFC after:	13 days
2018-04-20 15:19:27 +00:00
Andriy Gapon
f87beb93e8 call racct_proc_ucred_changed() under the proc lock
The lock is required to ensure that the switch to the new credentials
and the transfer of the process's accounting data from the old
credentials to the new ones is done atomically.  Otherwise, some updates
may be applied to the new credentials and then additionally transferred
from the old credentials if the updates happen after proc_set_cred() and
before racct_proc_ucred_changed().

The problem is especially pronounced for RACCT_RSS because
- there is a strict accounting for this resource (it's reclaimable)
- it's updated asynchronously by the vm daemon
- it's updated by setting an absolute value instead of applying a delta

I had to remove a call to rctl_proc_ucred_changed() from
racct_proc_ucred_changed() and make all callers of latter call the
former as well.  The reason is that rctl_proc_ucred_changed, as it is
implemented now, cannot be called while holding the proc lock, so the
lock is dropped after calling racct_proc_ucred_changed.  Additionally,
I've added calls to crhold / crfree around the rctl call, because
without the proc lock there is no guarantee that the new credentials,
owned by the process, will stay stable.  That does not eliminate a
possibility that the credentials passed to the rctl will get stale.
Ideally, rctl_proc_ucred_changed should be able to work under the proc
lock.

Many thanks to kib for pointing out the above problems.

PR:		222027
Discussed with:	kib
No comment:	trasz
MFC after:	2 weeks
Differential Revision: https://reviews.freebsd.org/D15048
2018-04-20 13:08:04 +00:00
John Baldwin
73c8686e91 Simplify the code to allocate stack for auxv, argv[], and environment vectors.
Remove auxarg_size as it was only used once right after a confusing
assignment in each of the variants of exec_copyout_strings().

Reviewed by:	emaste
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D15123
2018-04-19 16:00:34 +00:00
Konstantin Belousov
b940886338 Add PROC_PDEATHSIG_SET to procctl interface.
Allow processes to request the delivery of a signal upon death of
their parent process.  Supposed consumer of the feature is PostgreSQL.

Submitted by:	Thomas Munro
Reviewed by:	jilles, mjg
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D15106
2018-04-18 21:31:13 +00:00
John Baldwin
8ce99bb405 Properly do a deep copy of the ioctls capability array for fget_cap().
fget_cap() tries to do a cheaper snapshot of a file descriptor without
holding the file descriptor lock.  This snapshot does not do a deep
copy of the ioctls capability array, but instead uses a different
return value to inform the caller to retry the copy with the lock
held.  However, filecaps_copy() was returning 1 to indicate that a
retry was required, and fget_cap() was checking for 0 (actually
'!filecaps_copy()').  As a result, fget_cap() did not do a deep copy
of the ioctls array and just reused the original pointer.  This caused
multiple file descriptor entries to think they owned the same pointer
and eventually resulted in duplicate frees.

The only code path that I'm aware of that triggers this is to create a
listen socket that has a restricted list of ioctls and then call
accept() which calls fget_cap() with a valid filecaps structure from
getsock_cap().

To fix, change the return value of filecaps_copy() to return true if
it succeeds in copying the caps and false if it fails because the lock
is required.  I find this more intuitive than fixing the caller in
this case.  While here, change the return type from 'int' to 'bool'.

Finally, make filecaps_copy() more robust in the failure case by not
copying any of the source filecaps structure over.  This avoids the
possibility of leaking a pointer into a structure if a similar future
caller doesn't properly handle the return value from filecaps_copy()
at the expense of one more branch.

I also added a test case that panics before this change and now passes.

Reviewed by:	kib
Discussed with:	mjg (not a fan of the extra branch)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D15047
2018-04-17 18:07:40 +00:00
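
The contract that 8ce99bb405 settles on (return true when the lockless copy succeeded, false and touch nothing when the lock is required) can be sketched as follows. `struct caps`, `caps_copy_nolock()` and `caps_get()` are hypothetical simplifications of the filecaps code, and the descriptor lock is only indicated by a comment.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for struct filecaps. */
struct caps {
	unsigned long	*ioctls;	/* heap array owned by this entry */
	size_t		 nioctls;
};

/*
 * Lockless snapshot.  Returns true on success.  Returns false, leaving
 * dst untouched, when src carries an ioctl array that needs a deep copy.
 */
bool
caps_copy_nolock(struct caps *dst, const struct caps *src)
{
	if (src->ioctls != NULL)
		return (false);
	*dst = *src;
	return (true);
}

/* Caller: retry with a deep copy when the snapshot reports failure. */
int
caps_get(struct caps *dst, const struct caps *src)
{
	size_t len = src->nioctls * sizeof(*src->ioctls);

	if (caps_copy_nolock(dst, src))
		return (0);
	/* Slow path: the real code would hold the descriptor lock here. */
	dst->ioctls = malloc(len);
	if (dst->ioctls == NULL)
		return (-1);
	memcpy(dst->ioctls, src->ioctls, len);
	dst->nioctls = src->nioctls;
	return (0);
}
```

A caller that tested the result with the opposite sense would skip the slow path and end up sharing `src->ioctls`, which is the double-ownership situation the commit message describes. Leaving `dst` untouched on failure means even such a caller never sees the shared pointer.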
Brooks Davis
cee61c8cac Stop using fuswintr() and suswintr() in the profiler.
Always take the AST path rather than calling MD functions which are
often implemented as always failing. This is the case on amd64, arm,
i386, and powerpc. This optimization (inherited from 4.4 Lite) is a
pessimization on those architectures and is the sole use of these
functions. They will be removed in a separate commit.

Reviewed by:	kib
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D15101
2018-04-17 16:36:53 +00:00
Alan Somers
52c0983128 lio_listio: return EAGAIN instead of EIO when out of resources
This behavior is already documented by the man page, and suggested by POSIX.

Reviewed by:	jhb
MFC after:	3 weeks
Differential Revision:	https://reviews.freebsd.org/D15099
2018-04-16 18:12:15 +00:00
Konstantin Belousov
d86c1f0dc1 i386 4/4G split.
The change makes the user and kernel address spaces on i386
independent, giving each almost the full 4G of usable virtual addresses
except for one PDE at top used for trampoline and per-CPU trampoline
stacks, and system structures that must be always mapped, namely IDT,
GDT, common TSS and LDT, and process-private TSS and LDT if allocated.

By using 1:1 mapping for the kernel text and data, it appeared
possible to eliminate assembler part of the locore.S which bootstraps
initial page table and KPTmap.  The code is rewritten in C and moved
into the pmap_cold(). The comment in vmparam.h explains the KVA
layout.

There is no PCID mechanism available in protected mode, so each
kernel/user switch forth and back completely flushes the TLB, except
for the trampoline PTD region. The TLB invalidations for userspace
become trivial, because IPI handlers switch page tables. On the other
hand, context switches no longer need to reload %cr3.

copyout(9) was rewritten to use vm_fault_quick_hold().  An issue for
new copyout(9) is compatibility with wiring user buffers around sysctl
handlers. This explains two kinds of locks for copyout ptes and
accounting of the vslock() calls.  The vm_fault_quick_hold() AKA slow
path, is only tried after the 'fast path' failed, which temporarily
changes mapping to the userspace and copies the data to/from a small
per-cpu buffer in the trampoline.  If a page fault occurs during the
copy, it is short-circuited by exception.s to not even reach C code.

The change was motivated by the need to implement the Meltdown
mitigation, but instead of KPTI the full split is done.  The i386
architecture already shows the sizing problems, in particular, it is
impossible to link clang and lld with debugging.  I expect that the
issues due to the virtual address space limits would only worsen
and the split gives more liveness to the platform.

Tested by: pho
Discussed with:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
Differential revision:	https://reviews.freebsd.org/D14633
2018-04-13 20:30:49 +00:00
Mateusz Guzik
e0e259a888 locks: extend speculative spin waiting for readers to drain
Now that 10 years have passed since the original limit of 10000 was
committed, bump it a little bit.

Spinning waiting for writers is semi-informed in the sense that we always
know if the owner is running and base the decision to spin on that.
However, no such information is provided for read-locking. In particular
this means that it is possible for a write-spinner to completely waste cpu
time waiting for the lock to be released, while the reader holding it was
preempted and is now waiting for the spinner to go off cpu.

Nonetheless, in the majority of cases it is an improvement to spin instead of
instantly giving up and going to sleep.

The current approach is pretty simple: snatch the number of current readers
and perform that many pauses before checking again. The total number of
pauses to execute is limited to 10k. If the lock is still not free by
that time, go to sleep.

Given the previously noted problem of not knowing whether spinning makes
any sense to begin with, the new limit has to remain rather conservative.
But at the very least it should also be related to the machine. Waiting
for writers uses parameters selected based on the number of activated
hardware threads. The upper limit of pause instructions to be executed
in-between re-reads of the lock is typically 16384 or 32768. It was
selected as the limit of total spins. The lower bound is set to
the already present 10000 so as not to change it for smaller machines.

Bumping the limit reduces system time by a few % during benchmarks like
buildworld, buildkernel and others. Tested on 2 and 4 socket machines
(Broadwell, Skylake).

Figuring out how to make a more informed decision while not pessimizing
the fast path is left as an exercise for the reader.
2018-04-11 01:43:29 +00:00
Ian Lepore
97603f1da2 Use explicit_bzero() when cleaning values out of the kernel environment.
Sometimes the values contain geli passphrases being communicated from
loader(8) to the kernel, and some day the compiler may decide to start
eliding calls to memset() for a pointer which is not dereferenced again
before being passed to free().
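
For illustration, a portable userland sketch of the idea.  zero_secret
is a stand-in for explicit_bzero() and free_secret is a hypothetical
helper, neither is the kernel code.  Writing through a volatile pointer
keeps the compiler from discarding the stores as dead:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for explicit_bzero(): the volatile stores are not elided. */
static void
zero_secret(void *buf, size_t len)
{
	volatile unsigned char *p = buf;

	assert(buf != NULL || len == 0);
	while (len-- > 0)
		*p++ = 0;
}

/* Hypothetical helper: scrub a heap-allocated string, then release it. */
static void
free_secret(char *val)
{
	if (val == NULL)
		return;
	zero_secret(val, strlen(val));
	free(val);
}
```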
2018-04-10 22:57:56 +00:00
Mateusz Guzik
04457342a3 rw: whack avoidable re-reads in try_upgrade 2018-04-10 22:32:31 +00:00
Stephen Hurd
f422673e10 Make BPF global lock an SX
This allows NIC drivers to sleep on polling config operations.

Submitted by:	Matthew Macy <mmacy@mattmacy.io>
Reviewed by:	shurd
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D14982
2018-04-10 19:42:50 +00:00
Mateusz Guzik
a045941bd2 locks: tweak backoff a little bit
Previous limits were chosen when locking primitives had spurious lock
accesses.

Flipping the starting point to 1 (or rather 2 as the first call shifts it)
provides a modest win when mild contention is seen while not hurting worse
cases. Tested on a bunch of one, two and four socket old and new systems
(Westmere, Skylake, Threadripper and others) by doing concurrent page faults,
buildkernel/buildworld and other stuff (although not all systems got all the
tests).

Another thing is the upper limit. It is semi-arbitrarily chosen as it was
getting out of hand for slightly less small systems (e.g. a 128-thread one).

Note that backoff is fundamentally a speculative bandaid and this change just
makes it fit a little bit better. It remains completely oblivious to the
hardware topology or the contention pattern. This is being experimented with.
2018-04-08 16:34:10 +00:00
Brooks Davis
6469bdcdb6 Move most of the contents of opt_compat.h to opt_global.h.
opt_compat.h is mentioned in nearly 180 files. In-progress network
driver compatibility improvements may add over 100 more so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.

Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c.  A fake _COMPAT_LINUX option ensures opt_compat.h
is created on all architectures.

Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.

Reviewed by:	kib, cem, jhb, jtl
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14941
2018-04-06 17:35:35 +00:00
Brooks Davis
89ea4a30d6 Added SAL annotations to system calls.
Modify makesyscalls.sh to strip out SAL annotations.

No functional change.

This is based on work I started in CheriBSD and use to validate fat
pointers at the syscall boundary.  Tal Garfinkel reviewed the changes,
added annotations to COMPAT* syscalls and is using them in a record and
playback framework.  One can envision other uses such as a WITNESS-like
validator for copyin/out as speculated on in the review.

At this time we are only annotating sys/kern/syscalls.master as that is
sufficient for userspace work.  If kernel use cases materialize, we can
annotate other syscalls.master as needed.

Submitted by:	Tal Garfinkel <talg@cs.stanford.edu>
Sponsored by:	DARPA, AFRL (in part)
Differential Revision:	https://reviews.freebsd.org/D14285
2018-04-05 20:31:45 +00:00
Jeff Roberson
e5818a53db Implement several enhancements to NUMA policies.
Add a new "interleave" allocation policy which stripes pages across
domains with a stride or width keeping contiguity within a multi-page
region.
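
One way to picture striping with a stride, as a sketch only.  The
function name is hypothetical and the real policy may pick its starting
domain differently.  Runs of stride consecutive pages land in the same
domain, and successive runs rotate through the domains:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Map a page index to a domain: pages [0, stride) go to domain 0,
 * [stride, 2 * stride) to domain 1, and so on, wrapping around.
 */
static int
interleave_domain(size_t pindex, size_t stride, int ndomains)
{
	assert(stride > 0 && ndomains > 0);
	return ((int)((pindex / stride) % (size_t)ndomains));
}
```

With a stride of 4 and 2 domains, pages 0-3 map to domain 0, pages 4-7
to domain 1, and page 8 wraps back to domain 0.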

Move the kernel to the dedicated numbered cpuset #2 making it possible
to assign kernel threads and memory policy separately from user.  This
also eliminates the need for the complicated interrupt binding code.

Add a sysctl API for viewing and manipulating domainsets.  Refactor some
of the cpuset_t manipulation code using the generic bitset type so that
it can be used for both.  This probably belongs in a dedicated subr file.

Attempt to improve the include situation.

Reviewed by:	kib
Discussed with:	jhb (cpuset parts)
Tested by:	pho (before review feedback)
Sponsored by:	Netflix, Dell/EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14839
2018-03-29 02:54:50 +00:00
Jeff Roberson
27a3c9d710 Restore r331606 with a bugfix to setup cpuset_domain[] earlier on all
platforms.  Original commit message as follows:

Only use CPUs in the domain the device is attached to for default
assignment.  Device drivers are able to override the default assignment
if they bind directly.  There are severe performance penalties for
handling interrupts on remote CPUs and this should only be done in
very controlled circumstances.

Reviewed by:    jhb, kib
Tested by:      pho
Sponsored by:   Netflix, Dell/EMC Isilon
Differential Revision:  https://reviews.freebsd.org/D14838
2018-03-28 18:47:35 +00:00
Andriy Gapon
f4043145f2 ZFS vn_rele_async: catch up with the use of refcount(9) for the vnode use count
It's not sufficient nor required to use the vnode interlock when
checking if we are going to drop the last use count as the code in
vputx() uses refcount (atomic) operations for both checking and
decrementing the use count.  Apply the same method to vn_rele_async().
While here, remove vn_rele_inactive(), a wrapper around vrele() that
didn't add any value.

Also, the change required making vfs_refcount_release_if_not_last()
public.  I've made vfs_refcount_acquire_if_not_zero() public as well.
They are in sys/refcount.h now.  While making the move I've dropped the
vfs_ prefix.
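
A userland sketch of the two operations using C11 atomics.  This is an
assumption about their general shape, not the sys/refcount.h code, and
the function names here are illustrative:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Drop one reference unless it is the last one.  Returns true if the
 * count was decremented, false if the caller holds the final reference
 * and must take the slow path.
 */
static bool
release_if_not_last(atomic_uint *count)
{
	unsigned old = atomic_load(count);

	for (;;) {
		assert(old > 0);
		if (old == 1)
			return (false);
		if (atomic_compare_exchange_weak(count, &old, old - 1))
			return (true);
		/* old was reloaded by the failed exchange; retry */
	}
}

/* Take a reference only if the object is still referenced. */
static bool
acquire_if_not_zero(atomic_uint *count)
{
	unsigned old = atomic_load(count);

	for (;;) {
		if (old == 0)
			return (false);
		if (atomic_compare_exchange_weak(count, &old, old + 1))
			return (true);
	}
}
```

Because the check and the decrement happen in one compare-and-swap, no
interlock is needed to decide whether the last reference is being
dropped.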

Reviewed by:	mjg
MFC after:	2 weeks
Sponsored by:	Panzura
Differential Revision: https://reviews.freebsd.org/D14869
2018-03-28 08:55:31 +00:00
Mateusz Guzik
179da98f71 fd: tighten seq protected areas to not contain malloc/free 2018-03-28 03:07:02 +00:00
Konstantin Belousov
fb441a8829 Fix several leaks of kernel stack data through paddings.
This is a random collection of fixes for issues not yet corrected,
reported at https://tsyrklevi.ch/clang_analyzer/freebsd_013017/. Many
issues from that list were already corrected. Most of them are for
compat32, old compat32 or affect both primary host ABI and compat32.

The freebsd32_kldstat(), for instance, was already fixed by using
malloc(M_ZERO).  Patch includes correction to report the supplied
version back, which is just pedantic.
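
To illustrate the class of bug, a sketch with a hypothetical structure
(reply_sketch is not one of the fixed structures).  Compilers usually
insert padding after the one-byte member, and assigning only the named
fields leaves those bytes holding whatever was on the stack.  Zeroing
the whole object first is the usual remedy:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical reply structure with padding after 'kind'. */
struct reply_sketch {
	uint8_t		kind;
	uint64_t	value;
};

static void
fill_reply(struct reply_sketch *r, uint8_t kind, uint64_t value)
{
	assert(r != NULL);
	memset(r, 0, sizeof(*r));	/* zero padding as well as fields */
	r->kind = kind;
	r->value = value;
}
```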

Reviewed by:	brooks, emaste (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D14868
2018-03-27 18:05:51 +00:00
Brooks Davis
34a77b9741 Move uio enums to sys/_uio.h.
Include _uio.h instead of uio.h in several headers to reduce header
pollution.

Fix a few places that relied on header pollution to get the uio.h header.

I have not moved struct uio as many more things that use it rely on
header pollution to get other definitions from uio.h.

Reviewed by:	cem, kib, markj
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14811
2018-03-27 15:20:03 +00:00
Andriy Gapon
31260bf042 vfs_donmount: in certain cases try r/o mount if r/w mount fails
If the operation is not an update, if neither r/w nor r/o mode is
explicitly requested, if the error code hints at the possibility of the
media being read-only, and if the fallback is allowed, then we can try
to automatically downgrade to the readonly mode.

This is especially useful for auto-mounting of removable media that
sometimes can happen to be write-protected.

The fallback to r/o is not enabled by default.  It can be requested on a
per-mount basis with a new mount option, 'autoro'.  Or it can be
globally allowed by setting vfs.default_autoro.

Reviewed by:	cem, kib
MFC after:	3 weeks
Relnotes:	yes
Differential Revision: https://reviews.freebsd.org/D13361
2018-03-27 14:31:42 +00:00
Jeff Roberson
e8cbe51a04 Fix a bug introduced in r329612 that slowly invalidates all clean bufs.
Reported by:	bde
Reviewed by:	bde
Sponsored by:	Netflix, Dell/EMC Isilon
2018-03-26 18:36:17 +00:00
Mark Johnston
803c11a3a6 Use LIST_FOREACH_SAFE in sleepq_chains_remove_matching().
We may remove a sleepqueue from the hash table in
sleepq_resume_thread().
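
The hazard is the usual one of removing an element while iterating.  A
hand-rolled sketch of what a _SAFE traversal does (the list type and
function names are hypothetical, not the sleepqueue code): the successor
is saved before the current node can be unlinked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct node_sketch {
	int			 key;
	struct node_sketch	*next;
};

/* Push a new node at the head; returns false if allocation fails. */
static bool
list_push(struct node_sketch **head, int key)
{
	struct node_sketch *n = malloc(sizeof(*n));

	if (n == NULL)
		return (false);
	n->key = key;
	n->next = *head;
	*head = n;
	return (true);
}

/* Unlink and free every node whose key matches; returns the count. */
static int
remove_matching(struct node_sketch **head, int key)
{
	struct node_sketch **prevp = head, *n, *next;
	int removed = 0;

	assert(head != NULL);
	for (n = *head; n != NULL; n = next) {
		next = n->next;		/* saved before n may be freed */
		if (n->key == key) {
			*prevp = next;
			free(n);
			removed++;
		} else
			prevp = &n->next;
	}
	return (removed);
}
```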

Reviewed by:	kib
MFC after:	3 days
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14847
2018-03-25 20:12:14 +00:00
Konstantin Belousov
ed9e8bc468 Account the size of the vslock-ed memory by the thread.
Assert that all such memory is unwired on return to usermode.

The count of the wired memory will be used to detect the copyout mode.

Tested by:	pho (as part of the larger patch)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-03-24 13:51:27 +00:00
Konstantin Belousov
161bf65f8a In vn_io_fault1(), reduce the scope where pagefaults are disabled.
Most important for the future use, do not call
vm_fault_quick_hold_pages() with disabled pagefaults.

Reported and tested by:	pho (as part of the larger patch)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-03-24 13:13:52 +00:00
Konstantin Belousov
c398200721 Do not send signals to init directly from shutdown_nice(9), do it from
the task context.

shutdown_nice() is used from the fast interrupt handlers, mostly for
console drivers, where we cannot lock blockable locks.  Schedule the
task in the fast queue to send the signal from the proper context.

Reviewed by:	imp
Discussed with:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-03-22 20:47:25 +00:00
Jeff Roberson
9a4b4cd3bc Start witness much earlier in boot so that we can shrink the pend list and
make it more immune to further change.

Reviewed by:	markj, imp (Part of D14707)
Sponsored by:	Netflix, Dell/EMC Isilon
2018-03-22 19:11:43 +00:00
Warner Losh
f0d847af61 Drop any recursed taking of Giant once and for all at the top of
kern_reboot(). The shutdown path is now safe to run without Giant.

Discussed with: kib@
Sponsored by: Netflix
2018-03-22 15:34:37 +00:00
Jonathan T. Looney
2529f56ed3 Add the "TCP Blackbox Recorder" which we discussed at the developer
summits at BSDCan and BSDCam in 2017.

The TCP Blackbox Recorder allows you to capture events on a TCP connection
in a ring buffer. It stores metadata with the event. It optionally stores
the TCP header associated with an event (if the event is associated with a
packet) and also optionally stores information on the sockets.

It supports setting a log ID on a TCP connection and using this to correlate
multiple connections that share a common log ID.

You can log connections in different modes. If you are doing a coordinated
test with a particular connection, you may tell the system to put it in
mode 4 (continuous dump). Or, if you just want to monitor for errors, you
can put it in mode 1 (ring buffer) and dump all the ring buffers associated
with the connection ID when we receive an error signal for that connection
ID. You can set a default mode that will be applied to a particular ratio
of incoming connections. You can also manually set a mode using a socket
option.
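
As a rough picture of the ring buffer mode, here is a fixed-capacity
ring that overwrites its oldest event when full and can be dumped in
order on demand.  All names, the event layout and the capacity are
hypothetical and far simpler than the real recorder:

```c
#include <assert.h>
#include <stddef.h>

#define	RING_CAP	4

struct event_sketch {
	unsigned	seq;	/* hypothetical event sequence number */
	int		kind;	/* hypothetical event type */
};

struct ring_sketch {
	struct event_sketch	ev[RING_CAP];
	size_t			head;	/* index of the oldest event */
	size_t			count;	/* number of valid events */
};

/* Record an event, overwriting the oldest one when the ring is full. */
static void
ring_log(struct ring_sketch *r, int kind, unsigned seq)
{
	size_t slot = (r->head + r->count) % RING_CAP;

	assert(r->count <= RING_CAP);
	if (r->count == RING_CAP)
		r->head = (r->head + 1) % RING_CAP;	/* drop oldest */
	else
		r->count++;
	r->ev[slot].seq = seq;
	r->ev[slot].kind = kind;
}

/* Copy events oldest-first into out[]; returns how many were copied. */
static size_t
ring_dump(const struct ring_sketch *r, struct event_sketch *out)
{
	for (size_t i = 0; i < r->count; i++)
		out[i] = r->ev[(r->head + i) % RING_CAP];
	return (r->count);
}
```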

This commit includes only basic probes. rrs@ has added quite an abundance
of probes in his TCP development work. He plans to commit those soon.

There are user-space programs which we plan to commit as ports. These read
the data from the log device and output pcapng files, and then let you
analyze the data (and metadata) in the pcapng files.

Reviewed by:	gnn (previous version)
Obtained from:	Netflix, Inc.
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D11085
2018-03-22 09:40:08 +00:00
Gleb Smirnoff
27cd06b391 Redo r331328. We need to fix not only type but also format. While
here again notice that we are fixing regression from r331106.
2018-03-22 05:26:27 +00:00
Gleb Smirnoff
5aab68f24a Fix sysctl types broken in r329612. 2018-03-21 23:21:32 +00:00
Mark Johnston
a7defaea9a Elide the object lock in the common case in vfs_vmio_unwire().
The object lock was only needed when attempting to free B_DIRECT
buffer pages, and for testing for invalid pages (and freeing them
if so). Handle the latter by instead moving invalid pages near the head
of the inactive queue, where they will be reclaimed quickly.

Reviewed by:	alc, kib, jeff
MFC after:	3 weeks
Differential Revision:	https://reviews.freebsd.org/D14778
2018-03-21 21:15:43 +00:00
Warner Losh
3e867f24cb bufshutdown is no longer called with Giant held, so there's no need to
drop or pickup Giant anymore. Remove that code and adjust comments.
2018-03-21 14:46:59 +00:00
Warner Losh
d5292812f8 Remove Giant from init creation and vfs_mountroot.
Sponsored by: Netflix
Discussed with: kib@, mckusick@
Differential Review: https://reviews.freebsd.org/D14712
2018-03-21 14:46:54 +00:00
Conrad Meyer
c37125d9e5 Add missed sys/limits.h include
Apparently header pollution on x86 hid its absence.  Sorry, other arch
users.

Fix the missed header introduced in r331279.

Reported by:	tinderbox
2018-03-21 03:43:40 +00:00
Conrad Meyer
4948f7bf11 Regenerate sysent files after r331279. 2018-03-21 01:17:01 +00:00
Conrad Meyer
e9ac27430c Implement getrandom(2) and getentropy(3)
The general idea here is to provide userspace programs with well-defined
sources of entropy, in a fashion that doesn't require opening a new file
descriptor (ulimits) or accessing paths (/dev/urandom may be restricted
by chroot or capsicum).

getrandom(2) is the more general API, and comes from the Linux world.
Since our urandom and random devices are identical, the GRND_RANDOM flag
is ignored.
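
A short usage sketch, assuming a libc that declares getrandom() in
<sys/random.h>.  The wrapper name fill_random is hypothetical.  It
retries on EINTR and on short returns until the buffer is full:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/random.h>

/* Fill buf with len random bytes; returns 0 on success, -1 on error. */
static int
fill_random(void *buf, size_t len)
{
	unsigned char *p = buf;

	assert(buf != NULL || len == 0);
	while (len > 0) {
		ssize_t n = getrandom(p, len, 0);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return (-1);
		}
		p += n;
		len -= (size_t)n;
	}
	return (0);
}
```

No file descriptor is opened and no path is resolved, which is the
point made above about ulimits, chroot and capsicum.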

getentropy(3) is added as a compatibility shim for the OpenBSD API.

truss(1) support is included.

Tests for both system calls are provided.  Coverage is believed to be at
least as comprehensive as LTP getrandom(2) test coverage.  Additionally,
instructions for running the LTP tests directly against FreeBSD are provided
in the "Test Plan" section of the Differential revision linked below.  (They
pass, of course.)

PR:		194204
Reported by:	David CARLIER <david.carlier AT hardenedbsd.org>
Discussed with:	cperciva, delphij, jhb, markj
Relnotes:	maybe
Differential Revision:	https://reviews.freebsd.org/D14500
2018-03-21 01:15:45 +00:00
Jamie Gritton
672756aa9f Represent boolean jail options as an array of structures containing the
flag and both the regular and "no" names, instead of two different string
arrays whose indices need to match the flag's bit position.  This makes
them similar to the way "jailsys" options are represented.

Loop through either kind of option array with a structure pointer rather
than an integer index.
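
A sketch of the representation being described.  The struct name, the
flag values and the second option are hypothetical.  Each entry carries
its flag and both spellings, and the loop walks the array with a
structure pointer:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table entry: flag plus regular and "no" names. */
struct bool_opt_sketch {
	unsigned	 flag;
	const char	*name;
	const char	*noname;
};

static const struct bool_opt_sketch bool_opts[] = {
	{ 0x1, "persist", "nopersist" },
	{ 0x2, "example", "noexample" },	/* made-up option */
};
#define	N_BOOL_OPTS	(sizeof(bool_opts) / sizeof(bool_opts[0]))

/* Set or clear the flag named by opt; unknown names change nothing. */
static unsigned
apply_bool_opt(unsigned flags, const char *opt)
{
	const struct bool_opt_sketch *bo;

	assert(opt != NULL);
	for (bo = bool_opts; bo < bool_opts + N_BOOL_OPTS; bo++) {
		if (strcmp(opt, bo->name) == 0)
			return (flags | bo->flag);
		if (strcmp(opt, bo->noname) == 0)
			return (flags & ~bo->flag);
	}
	return (flags);
}
```

Since the flag lives in the entry itself, the order of the table no
longer has to match bit positions.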
2018-03-20 23:08:42 +00:00
Gleb Smirnoff
83fc34ea0d At this point iwmesg isn't initialized yet, so print pointer to lock
rather than panic before panicking.
2018-03-20 22:05:21 +00:00
Mark Johnston
8c7549da2b Drop KTR_CONTENTION.
It is incomplete, has not been adopted in the other locking primitives,
and we have other means of measuring lock contention (lock_profiling,
lockstat, KTR_LOCK). Drop it to slightly de-clutter the mutex code and
free up a precious KTR class index.

Reviewed by:	jhb, mjg
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D14771
2018-03-20 15:51:05 +00:00
Justin Hibbits
2acde6a85a Cast through uintptr_t to narrow the buf domain pointer on 32-bit archs
arg2 is an intmax_t, which on 32-bit architectures is 64 bits, wider than a
pointer.  When &bdomain[i] is added to arg2 it widens from uintptr_t to
intmax_t, then gcc whines when it gets cast to a pointer.  Casting through
uintptr_t silences this warning.
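
The pattern in isolation, as a sketch with hypothetical helper names.
The integer is narrowed to pointer width explicitly before the pointer
cast, so no integer-to-pointer conversion of a different size occurs:

```c
#include <assert.h>
#include <stdint.h>

/* Convert a wide integer argument back to the pointer it carries. */
static void *
arg_to_ptr(intmax_t arg2)
{
	return ((void *)(uintptr_t)arg2);
}

/* Carry a pointer in a wide integer argument. */
static intmax_t
ptr_to_arg(void *p)
{
	return ((intmax_t)(uintptr_t)p);
}
```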
2018-03-20 02:01:30 +00:00
Matt Joras
d6160f6079 Fix initialization of eventhandler mutex.
mtx_init does not do a copy of the name string it is passed. The
eventhandler code incorrectly passed the parameter string directly to
mtx_init instead of using the copy it makes. This was an existing
problem with the code that I dutifully copied over in my changes in r325621.

Reported by:	Anton Rang <rang AT acm.org>
Reviewed by:	rstone, markj
Approved by:	rstone (mentor)
MFC after:	1 week
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14764
2018-03-19 22:43:27 +00:00
Mark Johnston
0eb50f9cd2 Have vm_page_{deactivate,launder}() requeue already-queued pages.
In many cases the page is not enqueued so the change will have no
effect. However, the change is needed to support an optimization in
the fault handler and in some cases (sendfile, the buffer cache) it
was being emulated by the caller anyway.

Reviewed by:	alc
Tested by:	pho
MFC after:	2 weeks
X-Differential Revision: https://reviews.freebsd.org/D14625
2018-03-18 16:40:56 +00:00
Mateusz Guzik
09bdec20a0 locks: slightly depessimize lockstat
The slow path is always taken when lockstat is enabled. This induces
rdtsc (or other) calls to get the cycle count even when there was no
contention.

Still go to the slow path to not mess with the fast path, but avoid
the heavy lifting unless necessary.

This reduces sys and real time during -j 80 buildkernel:
before: 3651.84s user 1105.59s system 5394% cpu 1:28.18 total
after: 3685.99s user 975.74s system 5450% cpu 1:25.53 total
disabled: 3697.96s user 411.13s system 5261% cpu 1:18.10 total

So note this is still a significant hit.

LOCK_PROFILING results are not affected.
2018-03-17 19:26:33 +00:00
Jeff Roberson
3cec5c77d6 Move the dirty queues inside the per-domain structure. This resolves a bug
where we had not hit global dirty limits but a single queue was starved
for space by dirty buffers.  A single buf_daemon is maintained for now.

Add a bd_speedup() when we are low on bufspace.  This can happen due to SUJ
keeping many bufs locked until a cg block is written.  Document this with
a comment.

Fix sysctls to work with per-domain variables.  Add more ddb debugging.

Reported by:	pho
Reviewed by:	kib
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14705
2018-03-17 18:14:49 +00:00
Conrad Meyer
330b675f65 vfs_bio.c: Apply cleanups motivated by Coverity analysis
It is believed that the conditions Coverity indicated were actually
impossible to hit.  So this patch just adds a cleanup to only compute
v_mount once in brelse(), and in vfs_bio_getpages() always initializes error
to zero to appease the static analyzer.

No functional change intended.

Submitted by:	Darrick Lew <darrick.freebsd AT gmail.com>
Reviewed by:	kib
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14613
2018-03-14 22:11:45 +00:00
Ed Maste
a95659f75f Use C99 boolean type for translate_osrel
Migrate to modern types before creating MD Linuxolator bits for new
architectures.

Reviewed by:	cem
Sponsored by:	Turing Robotic Industries Inc.
Differential Revision:	https://reviews.freebsd.org/D14676
2018-03-13 16:40:29 +00:00
Ed Maste
b7feabf906 Use C99 designated initializers for struct execsw
It makes use slightly more clear and facilitates grepping.
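
For illustration only, with a hypothetical stand-in for struct execsw
(the member names here are made up).  Each member is named at the point
of initialization, so a search for ".imgact =" finds every activator:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct execsw. */
struct execsw_sketch {
	int		(*imgact)(void *);
	const char	*name;
};

/* Placeholder activator that always declines the image. */
static int
demo_imgact(void *image)
{
	(void)image;
	return (-1);
}

static const struct execsw_sketch demo_execsw = {
	.imgact = demo_imgact,
	.name = "demo",
};
```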
2018-03-13 13:09:10 +00:00
Brooks Davis
467e627672 Use the stack for temporary storage in OTIOCCONS.
The old code used the thread's pcb via the uap->data pointer.

Reviewed by:	ed
Approved by:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14674
2018-03-12 23:04:42 +00:00
Brooks Davis
97519ff698 MIPS: Implement fue*word* and casueword* in assembly.
Remove NO_FUEWORD so the 'e' variants are wrapped by the non-'e'
variants.  This is more correct and leaves sparc64 as the outlier.

Reviewed by:	jmallett, kib
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14603
2018-03-12 22:10:06 +00:00
Ed Maste
5cc6d253f4 ANSIfy sys/kern/imgact_* 2018-03-12 15:45:50 +00:00
Ian Lepore
72721caf04 Make root mount timeout logic work for filesystems other than ufs.
The vfs.mountroot.timeout tunable and .timeout directive in a mount.conf(5)
file allow specifying a wait timeout for the device(s) hosting the root
filesystem to become usable.  The current mechanism for waiting for devices
and detecting their availability can't be used for zfs-hosted filesystems.
See the comment #20 in the PR for some expanded detail on these points.

This change adds retry logic to the actual root filesystem mount.  That is,
instead of relying on device availability using device name lookups, it uses
the kernel_mount() call itself to detect whether the filesystem can be
mounted, and loops until it succeeds or the configured timeout is exceeded.

These changes are based on the patch attached to the PR, but it's rewritten
enough that all mistakes belong to me.

PR:		208882
X-MFC after:	sufficient testing, and hopefully in time for 11.1
2018-03-10 22:07:57 +00:00
Conrad Meyer
dec5441a32 subr_gtaskqueue: Fix braino from r330715
Submitted by:	markj
Sponsored by:	Dell EMC Isilon
2018-03-10 01:53:42 +00:00
Conrad Meyer
8e0e6abc1f subr_gtaskqueue: Fix minor leak of tq_name in error case
Reported by:	cppcheck
Sponsored by:	Dell EMC Isilon
2018-03-10 01:01:01 +00:00
Brooks Davis
dd51fec3b9 Copyout a whole int to cpuset_domain's policy pointer.
The previous code only copied 16-bits and corrupted the target int.
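
A userland illustration of why a short copy corrupts the target, with
memcpy() standing in for copyout() and hypothetical function names.
Copying only 16 bits leaves the remaining bytes of the caller's int
holding whatever they held before:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Buggy shape: only sizeof(uint16_t) bytes reach the target int. */
static void
copy_policy_short(int *target, int policy)
{
	uint16_t policy16 = (uint16_t)policy;

	memcpy(target, &policy16, sizeof(policy16));
}

/* Fixed shape: the whole int is written. */
static void
copy_policy_int(int *target, int policy)
{
	memcpy(target, &policy, sizeof(policy));
}
```

If the target starts out as -1, the short copy of the value 1 does not
yield 1 on either byte order, while the full copy does.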

Reviewed by:	kib, markj
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14611
2018-03-09 00:50:40 +00:00
Mark Johnston
bde3b1e1a5 Return E2BIG if we run out of space writing a compressed kernel dump.
ENOSPC causes the MD kernel dump code to retry the dump, but this is
undesirable in the case where we legitimately ran out of space.
2018-03-08 17:04:36 +00:00
Brooks Davis
91a743004c Use umtx_copyin_umtx_time32() in __umtx_op_lock_umutex_compat32().
Non-NULL timeouts were copied in improperly and could produce failures
due to incompatible data structures.

Reviewed by:	kib
MFC after:	3 days
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14587
2018-03-06 01:52:04 +00:00
Brooks Davis
aec37bad99 Regen after r330517. 2018-03-05 17:02:50 +00:00
Brooks Davis
1c1b4c66b6 Remove remnants of 1990s efforts to let us run Net/OpenBSD binaries.
No functional change (comments change in some generated files.)

Reviewed by:	kib
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14571
2018-03-05 17:02:16 +00:00
Mateusz Guzik
0ad122a966 lockmgr: save on sleepq when cmpset fails 2018-03-05 00:30:07 +00:00
Mateusz Guzik
93d41967da lockmgr: whack unused lockmgr_note_exclusive_upgrade 2018-03-04 22:14:20 +00:00
Mateusz Guzik
9f4e008d4a mtx: tidy up recursion handling in thread lock
Normally after grabbing the lock it has to be verified we got the right one
to begin with. However, if we are recursing, it must not change thus the
check can be avoided. In particular this avoids a lock read for non-recursing
case which found out the lock was changed.

While here avoid an irq trip if this happens.

Tested by:	pho (previous version)
2018-03-04 22:01:23 +00:00
Mateusz Guzik
a8e747c5e7 sx: don't do an atomic op in upgrade if it cannot succeed
The code already pays the cost of reading the lock to obtain the waiters
flag. Checking whether there is more than one reader is not a problem and
avoids dirtying the line.

This also fixes a small corner case: if waiters were to show up between
reading the flag and upgrading the lock, the operation would fail even
though it should not. No correctness change here though.
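
A sketch of the upgrade loop using C11 atomics and a made-up lock word
encoding (the LK_* macros are hypothetical, not the sx(9) layout).  The
reader count is checked on the value already read, so a doomed upgrade
never touches the cache line, and a failed exchange reloads the word so
a newly set waiters flag is carried over instead of failing the upgrade:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical encoding of the lock word. */
#define	LK_WAITERS	0x1u
#define	LK_WRITER	0x2u
#define	LK_ONE_READER	0x4u
#define	LK_READERS(x)	((x) >> 2)

/* Caller must hold the lock as a reader. */
static bool
try_upgrade(atomic_uint *lock)
{
	unsigned x = atomic_load(lock);

	for (;;) {
		assert((x & LK_WRITER) == 0 && LK_READERS(x) >= 1);
		if (LK_READERS(x) > 1)
			return (false);	/* cannot succeed; no atomic op */
		if (atomic_compare_exchange_weak(lock, &x,
		    LK_WRITER | (x & LK_WAITERS)))
			return (true);
		/* x now holds the fresh value; retry with it */
	}
}
```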
2018-03-04 21:41:05 +00:00
Mateusz Guzik
d94df98c5c locks: fix a corner case in r327399
If there were exactly rowner_retries/asx_retries (by default: 10) transitions
between read and write state and the waiters still did not get the lock, the
next owner -> reader transition would result in the code correctly falling
back to turnstile/sleepq where it would incorrectly think it was waiting
for a writer and decide to leave turnstile/sleepq to loop back. From this
point it would take ts/sq trips until the lock gets released.

The bug sometimes manifested itself in stalls during -j 128 package builds.

Refactor the code to fix the bug, while here remove some of the gratuitous
differences between rw and sx locks.
2018-03-04 21:38:30 +00:00
Mateusz Guzik
1c6987ebc5 lockmgr: start decomposing the main routine
The main routine takes 8 args, 3 of which are almost the same for most uses.
This in particular pushes it above the limit of 6 arguments passable through
registers on amd64 making it impossible to tail call.

This is a prerequisite for further cleanups.

Tested by:	pho
2018-03-04 19:12:54 +00:00
Hans Petter Selasky
2077229b56 Allow pause_sbt() to catch signals during sleep by passing C_CATCH flag.
Define pause_sig() function macro helper similarly to other kernel functions
which catch signals. Update outdated function description.

Discussed with:	kib@
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2018-03-03 18:36:38 +00:00
Hans Petter Selasky
54fc03834a Correct the return code from pause() during cold startup from zero to
EWOULDBLOCK. This also matches the description in pause(9).

Discussed with:	kib@
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2018-03-03 18:12:21 +00:00
Brooks Davis
93e48a303a Rename kernel-only members of semid_ds and msgid_ds.
This deliberately breaks the API in preparation for future syscall
revisions which will remove these nonstandard members.

In an exp-run a single port (devel/qemu-user-static) was found to
use them which it did because it emulates system calls.  This has
been fixed in the ports tree.

PR:		224443 (exp-run)
Reviewed by:	kib, jhb (previous version)
Exp-run by:	antoine
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14490
2018-03-02 22:10:48 +00:00
Mateusz Guzik
c505b59961 sx: fix adaptive spinning broken in r327397
The condition was flipped.

In particular heavy multithreaded kernel builds on zfs started suffering
due to nested sx locks.

For instance make -s -j 128 buildkernel:

before: 3326.67s user 1269.62s system 6981% cpu 1:05.84 total
after: 3365.55s user 911.27s system 6871% cpu 1:02.24 total

ps.
      .-'---`-.			      .-'---`-.
    ,'          `.		    ,'          `.
    |             \		    |             \
    |              \		    |              \
    \           _  \		    \           _  \
    ,\  _    ,'-,/-)\		    ,\  _    ,'-,/-)\
    ( * \ \,' ,' ,'-)		    ( * \ \,' ,' ,'-)
     `._,)     -',-')		     `._,)     -',-')
       \/         ''/		       \/         ''/
        )        / /		        )        / /
       /       ,'-'		       /       ,'-'
2018-03-02 21:26:27 +00:00
Mateusz Guzik
9d4e369ae8 Don't generate data in sysctl_out_proc unless we intend to copy out.
The first call is used to gauge how much space is needed. Just computing
the size instead of generating the output allows to not take the proctree
lock.
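
The two-call pattern in miniature, with hypothetical names and a
made-up record type.  A NULL output buffer turns the call into a pure
sizing pass that generates no data:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fixed-size export record. */
struct rec_sketch {
	int	pid;
};

/*
 * Returns the number of bytes the full output needs.  With out == NULL
 * nothing is generated.  Otherwise records are written only if outlen
 * is large enough to hold all of them.
 */
static size_t
export_pids(const int *pids, size_t n, struct rec_sketch *out,
    size_t outlen)
{
	size_t needed = n * sizeof(struct rec_sketch);

	assert(pids != NULL || n == 0);
	if (out == NULL || outlen < needed)
		return (needed);
	for (size_t i = 0; i < n; i++)
		out[i].pid = pids[i];
	return (needed);
}
```

A caller sizes its buffer from the first call and passes it to the
second, and only the second call needs to walk the data under a lock.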
2018-02-25 15:16:58 +00:00
Jeff Roberson
1c2529ab32 Fix issues with sparse cpu allocation. Consistently use mp_maxid + 1.
Reported by:	pho
Reviewed by:	markj
Sponsored by:	Netflix, Dell/EMC Isilon
2018-02-25 00:35:21 +00:00
Conrad Meyer
63901c0171 kern/sys_generic.c: style(9) return(foo) -> return (foo)
No functional change.

Sponsored by:	Dell EMC Isilon
2018-02-24 01:15:33 +00:00
Jeff Roberson
5f8cd1c0bf Add a generic Proportional Integral Derivative (PID) controller algorithm and
use it to regulate page daemon output.

This provides much smoother and more responsive page daemon output, anticipating
demand and avoiding pageout stalls by increasing the number of pages to match
the workload.  This is a reimplementation of work done by myself and mlaier at
Isilon.

Reviewed by:	bsdimp
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14402
2018-02-23 22:51:51 +00:00
Kirk McKusick
16680b6af5 Include error number in the "fsync: giving up on dirty" message
(in case it ever starts happening again in spite of 328444).

Submitted by: Andreas Longwitz <longwitz at incore.de>
2018-02-23 21:57:10 +00:00
Konstantin Belousov
4c8a8cfcde Restore UP build.
Reviewed by:	truckman
Sponsored by:	The FreeBSD Foundation
2018-02-23 18:26:31 +00:00
Ed Maste
315fbaeca2 Correct pseudo misspelling in sys/ comments
contrib code and #define in intel_ata.h unchanged.
2018-02-23 18:15:50 +00:00
Don Lewis
97e9382d56 Decrease latency by not wrapping the idle loop's potentially lengthy
search for a thread to steal inside a critical section.  Since this
allows the search to be preempted, restart the search if preemption
happens since the search results found earlier may no longer be
valid.

Decrease the latency of starting a thread that may be assigned to
this CPU during the search by polling for incoming threads during
the search and switching to that thread instead of continuing the
search.

Test for stale search results and restart the search before going
through the expense of calling tdq_lock_pair().  Retry some tests
after grabbing the locks since things may have changed while waiting
to get both locks.

Eliminate special case handling for stealing from an SMT peer that
uses 1 as the steal threshold.  This can only succeed if a thread
has been assigned but our SMT peer has not yet started executing
it.  This is quite rare and when it happens the other SMT thread
is generally waiting for the same tdq lock that we hold.  Basically
both SMT threads are racing to grab the same spin lock.

Add the kern.sched.always_steal knob from a ULE patch by jeff@.

Incorporate another idea from Jeff's ULE patch.  If the sched_switch()
detects that the CPU is about to go idle, try to steal a thread
before switching to the idle thread.  Since the search for a thread
to steal has to be done inside a critical section in this context,
limit the impact on latency by adding the knob kern.sched.trysteal_limit
to limit the topological distance of the search and don't restart
the search if we detect stale results.  If this search can't find
a stealable thread, the idle loop can do a more complete search.
Also poll for threads being assigned to this CPU during the search
and switch to them instead of continuing the search.  This change
is responsible for the majority of the improvement in parallel
buildworld times.

In sched_balance_group() change the minimum threshold for stealing
a thread from 1 to 2.  Poaching a newly assigned thread from a CPU
that is waking up but hasn't yet switched to that thread from idle is
likely very rare and is likely to have the same lock race as is
seen when stealing threads in the idle loop.  Also use tdq_notify()
to kick the destination CPU instead of always sending an IPI.
Update a stale comment: the number of transferable threads is not
calculated.

Reviewed by:	kib (earlier version)
Comments by:	avg, jeff, mav
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D12130
2018-02-23 00:12:51 +00:00
Mateusz Guzik
a0c722bdbf Fix up sysctl vfs.buffercache broken in r329612
Sample problem:
top: sysctl(vfs.bufspace...) expected 8, got 4

Reported by:	O. Hartmann <ohartmann walstatt.org>
2018-02-22 20:39:25 +00:00
Eric van Gyzen
0127914caa sched_ule: update a comment to reflect reality
MFC after:	3 days
Sponsored by:	Dell EMC
2018-02-22 17:09:26 +00:00
Jeff Roberson
683ca3a432 Fix the broken subqueue assignment for the cleanq.
Reported by:	pho
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
2018-02-20 21:27:17 +00:00
Mateusz Guzik
500ca73d43 mtx: add debug assertions to mtx_spin_wait_unlocked 2018-02-20 20:39:34 +00:00
Mateusz Guzik
862db53fb5 Fix reaping on process fd close broken after r329449
The only consumer of proc_reap other than proc_to_reap was not updated
to not PROC_SLOCK.

Reported by:    Juan Ramon Molina Menor <listjm club.fr>
2018-02-20 20:19:38 +00:00
Brooks Davis
b81e88d296 Reduce duplication in dynamic syscall registration code.
Remove the unused syscall_(de)register() functions in favor of the
better documented and easier to use syscall_helper_(un)register(9)
functions.

The default and freebsd32 versions differed in which array of struct
sysents they used and a few missing updates to the 32-bit code as
features were added to the main code.

Reviewed by:	cem
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14337
2018-02-20 18:08:57 +00:00
Mateusz Guzik
681a1b752c Make killpg1 perform process validity checks without proc lock held. 2018-02-20 10:52:07 +00:00
Mateusz Guzik
81d68271d7 Reduce contention on the proctree lock during heavy package build.
There is a proctree -> allproc ordering established.

Most of the time it is either xlock -> xlock or slock -> slock.

On fork however there is a slock -> xlock pair which results in
pathological wait times due to threads keeping proctree held for
reading and all waiting on allproc. Switch this to xlock -> xlock.
Longer term fix would get rid of proctree in this place to begin with.
Right now it is necessary to walk the session/process group lists to
determine which id is free. The walk can be avoided e.g. with bitmaps.

The exit path used to have one place which dealt with allproc and
then with proctree. Move the allproc acquire into the section protected
by proctree. This reduces contention against threads waiting on proctree
in the fork codepath - the fork proctree holder does not have to wait
for allproc as often.

Finally, move tidhash manipulation outside of the area protected by
either of these locks. The removal from the hash was already unprotected.
There is no legitimate reason to look up thread ids for a process still
under construction.

This results in about 50% wait time reduction during -j 128 package build.
2018-02-20 02:18:30 +00:00
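The remark in 81d68271d7 that the list walk "can be avoided e.g. with bitmaps" can be sketched as follows. This is a hypothetical Python illustration, not code from the tree: an ID is marked when any process, process group, or session uses it, and finding a free ID becomes a bit scan instead of a walk over the lists.

```python
class IdBitmap:
    """Bit i is set when ID i is in use (hypothetical helper)."""

    def __init__(self, nids):
        self.nids = nids
        self.bits = 0

    def mark(self, i):
        self.bits |= 1 << i

    def clear(self, i):
        self.bits &= ~(1 << i)

    def first_free(self, start=0):
        """Return the lowest free ID >= start, or None if all are taken."""
        for i in range(start, self.nids):
            if not (self.bits >> i) & 1:
                return i
        return None


ids = IdBitmap(8)
for used in (0, 1, 2, 4):
    ids.mark(used)
```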
Jeff Roberson
06220fa737 Further parallelize the buffer cache.
Provide multiple clean queues partitioned into 'domains'.  Each domain manages
its own bufspace and has its own bufspace daemon.  Each domain has a set of
subqueues indexed by the current cpuid to reduce lock contention on the cleanq.

Refine the sleep/wakeup around the bufspace daemon to use atomics as much as
possible.

Add a B_REUSE flag that is used to requeue bufs during the scan to approximate
LRU rather than locking the queue on every use of a frequently accessed buf.

Implement bufspace_reserve with only atomic_fetchadd to avoid loop restarts.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14274
2018-02-20 00:06:07 +00:00
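The B_REUSE idea in 06220fa737 resembles a second-chance scan. The Python sketch below is an assumption-laden illustration (the Buf class, touch(), and reclaim_one() are invented names): a buffer access only sets a flag, and the scan requeues flagged buffers at the tail instead of the queue being locked and reordered on every access.

```python
from collections import deque


class Buf:
    def __init__(self, name):
        self.name = name
        self.reuse = False  # stands in for the B_REUSE flag


def touch(buf):
    """Record a use without touching the queue."""
    buf.reuse = True


def reclaim_one(cleanq):
    """Pop from the head; requeue flagged bufs, return the first unflagged."""
    for _ in range(2 * len(cleanq)):
        buf = cleanq.popleft()
        if buf.reuse:
            buf.reuse = False
            cleanq.append(buf)  # second chance at the tail
        else:
            return buf
    return None  # queue was empty


a, b, c = Buf("a"), Buf("b"), Buf("c")
cleanq = deque([a, b, c])
touch(a)
victim = reclaim_one(cleanq)
```

Here the recently used buffer a survives and moves behind c, while b is reclaimed, which approximates LRU order at scan time.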
Mateusz Guzik
2ca66c1ef5 Fix process exit vs reap race introduced in r329449
The race manifested itself mostly in terms of crashes with "spin lock
held too long".

Relevant parts of respective code paths:

exit:				reap:
PROC_LOCK(p);
PROC_SLOCK(p);
p->p_state == PRS_ZOMBIE
PROC_UNLOCK(p);
				PROC_LOCK(p);
/* exit work */
				if (p->p_state == PRS_ZOMBIE) /* true */
					proc_reap()
					free proc
/* more exit work */
PROC_SUNLOCK(p);

Thus a still exiting process is reaped.

Prior to the change the zombie check was followed by slock/sunlock trip
which prevented the problem.

Even code prior to this commit has a bug: the proc is still accessed for
statistic collection purposes. However, the severity is rather small and
the bug may be fixed in a future commit.

Reported by:	many
Tested by:	allanjude
2018-02-19 00:54:08 +00:00
Mateusz Guzik
d257698833 mtx: add mtx_spin_wait_unlocked
The primitive can be used to wait for the lock to be released. Intended
usage is for locks in structures which are about to be freed.

The benefit is the avoided interrupt enable/disable trip + atomic op to
grab the lock and shorter wait if the lock is held (since there is no
worry someone will contend on the lock, re-reads can be more aggressive).

Briefly discussed with:	 kib
2018-02-19 00:38:14 +00:00
Mateusz Guzik
7beb60820f exit: get rid of PROC_SLOCK when checking a process to report, take #2
The suspension counter needs synchronisation through slock, but we don't
need it to check if inspecting the counter is necessary to begin with.
In the common case it is not, thus avoid the lock if possible.

Reviewed by:	kib
Tested by:	pho
2018-02-18 21:07:15 +00:00
Mariusz Zaborski
965cd21173 Fix broken assertion in r329520.
Reported by:	pho@ lwhsu@
2018-02-18 20:04:39 +00:00
Brooks Davis
7a095112b2 Correct/improve the descriptions of kern.ipc.(shmsegs,sema,msqids).
The description of kern.ipc.shmsegs was wrong since 2005.  I updated the
others (which were more correct) to match.

PR:		225933
Reviewed by:	cem
MFC after:	3 days
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14391
2018-02-18 19:19:36 +00:00
Mariusz Zaborski
20641651ec Use the fdeget_locked function instead of the fget_locked in the
sys_capability.

Reviewed by:	pjd@ (earlier version)
Discussed with:	mjg@
2018-02-18 15:27:24 +00:00
Mateusz Guzik
8bf6ff2226 Revert r329448.
Turns out it is actually racy, reproducible with stress2/misc/truss.sh

Requested by:	kib
2018-02-17 17:23:43 +00:00
Mateusz Guzik
e4ccf57fdc Undo LOCK_PROFILING pessimisation after r313454 and r313455
With the option used to compile the kernel both sx and rw shared ops would
always go to the slow path which added avoidable overhead even when the
facility is disabled.

Furthermore the increased time spent doing uncontested shared lock acquire
would be bogusly added to total wait time, somewhat skewing the results.

Restore old behaviour of going there only when profiling is enabled.

This change is a no-op for kernels without LOCK_PROFILING (which is the
default).
2018-02-17 12:07:09 +00:00
Mateusz Guzik
ad58e5e86c exit: stop doing PROC_SLOCK just to call proc_reap
It immediately does PROC_SUNLOCK anyway and the lock plays no role.
2018-02-17 09:03:11 +00:00
Mateusz Guzik
9c0e785c58 exit: get rid of PROC_SLOCK when checking a process to report
All accessed fields are protected with already held process lock.
2018-02-17 08:48:45 +00:00
Mateusz Guzik
015cd8dc93 On process exit signal the parent after dropping the proctree lock. 2018-02-17 00:24:50 +00:00
Mateusz Guzik
7e588b9219 Unref the prison after proctree is dropped. 2018-02-17 00:23:56 +00:00
Mateusz Guzik
65f29b9caa Postpone sx_sunlock(&proctree_lock) on fork until after allproc is dropped.
There is a significant contention on the lock during -j 128 package build.
This change drops total wait time on this lock by 60%.
2018-02-17 00:23:28 +00:00
Mateusz Guzik
6776bfeb8f Tidy up kern_wait6
- don't relock curproc in msleep
- don't relock proctree if P_STATCHILD is spotted
- reformat the proc_to_reap call in the main loop
2018-02-17 00:21:50 +00:00
Brooks Davis
aff4f2d315 Reduce duplication in __acl_*_(file|link).
Add const to new kern_ functions and push down as required.

Reviewed by:	rwatson
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14174
2018-02-15 21:24:43 +00:00
Mark Johnston
05f0f0e9ea Fix the test for SET_FOREACH termination.
Unlike the queue(3) _FOREACH macros, the iterator for a SET_FOREACH is
not NULL after the end of the set is reached.
2018-02-15 17:35:40 +00:00
Mateusz Guzik
f795032b47 rwlock: diff-reduction of runlock compared to sx sunlock 2018-02-14 20:37:33 +00:00
Bryan Drewery
70c144dc78 nanosleep(2): Fix bogus incrementing of rmtp by tc_tick_sbt on [EINTR].
sbt is the time in the future that the tsleep_sbt() is expected to be completed
at.  sbtt is the current time.  Depending on the precision with sysctl
kern.timecounter.alloweddeviation the start time may be incremented by
tc_tick_sbt.  The same increment is needed for the current time of sbtt before
calculating the difference.  The impact of missing this increment is that rmtp
may increase by one tc_tick_sbt on every early [EINTR] return.  If the same
struct is passed in for rqtp as rmtp this can result in rqtp effectively
incrementing by tc_tick_sbt and sleeping longer than originally intended.

This problem was introduced in r247797.

Reviewed by:	kib, markj, vangyzen (all on an older version of the test)
MFC after:	2 weeks
Sponsored by:	Dell EMC
Differential Revision:	https://reviews.freebsd.org/D14362
2018-02-14 18:43:50 +00:00
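The arithmetic behind 70c144dc78 can be worked through with a small Python model. This is a simplified sketch, not the kernel code: TC_TICK_SBT, deadline(), and remaining() are hypothetical names, and times are plain integers rather than sbintime_t values.

```python
TC_TICK_SBT = 1000  # hypothetical tick length in arbitrary time units


def deadline(start, requested, coarse):
    """Absolute wakeup time; a coarse clock read is bumped by one tick."""
    if coarse:
        start += TC_TICK_SBT
    return start + requested


def remaining(deadline_sbt, now, coarse, apply_fix):
    """Time left when the sleep is interrupted at 'now'."""
    if coarse and apply_fix:
        now += TC_TICK_SBT  # same adjustment the start time received
    return max(deadline_sbt - now, 0)


sbt = deadline(start=0, requested=5000, coarse=True)
buggy = remaining(sbt, now=0, coarse=True, apply_fix=False)  # 6000
fixed = remaining(sbt, now=0, coarse=True, apply_fix=True)   # 5000
```

Without the adjustment, an immediate interrupt reports more time remaining than was requested, so a caller that feeds rmtp back in as rqtp sleeps one tick longer on every early return.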
Mark Johnston
6026dcd7ca Add support for zstd-compressed user and kernel core dumps.
This works similarly to the existing gzip compression support, but
zstd is typically faster and gives better compression ratios.

Support for this functionality must be configured by adding ZSTDIO to
one's kernel configuration file. dumpon(8)'s new -Z option is used to
configure zstd compression for kernel dumps. savecore(8) now recognizes
and saves zstd-compressed kernel dumps with a .zst extension.

Submitted by:	cem (original version)
Relnotes:	yes
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D13101,
			https://reviews.freebsd.org/D13633
2018-02-13 19:28:02 +00:00
Ian Lepore
157f3d7649 Fix bad indentation. Whitespace only, no functional changes.
Reported by:	bde@
2018-02-13 17:38:08 +00:00
Jeff Roberson
e958ad4cf3 Make v_wire_count a per-cpu counter(9) counter. This eliminates a
significant source of cache line contention from vm_page_alloc().  Use
accessors and vm_page_unwire_noq() so that the mechanism can be easily
changed in the future.

Reviewed by:	markj
Discussed with:	kib, glebius
Tested by:	pho (earlier version)
Sponsored by:	Netflix, Dell/EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14273
2018-02-12 22:53:00 +00:00
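As a rough model of the per-CPU counter approach in e958ad4cf3, the Python sketch below keeps one slot per CPU and sums the slots on read. PerCpuCounter and its methods are invented for illustration and are not the counter(9) API.

```python
class PerCpuCounter:
    """One slot per CPU; updates touch only the local slot."""

    def __init__(self, ncpu):
        self.slots = [0] * ncpu

    def add(self, cpu, delta):
        self.slots[cpu] += delta  # no shared cache line is written

    def fetch(self):
        return sum(self.slots)    # readers pay the cost of summing


wired = PerCpuCounter(4)
wired.add(0, 1)    # page wired on CPU 0
wired.add(3, 1)    # page wired on CPU 3
wired.add(1, -1)   # page unwired on CPU 1
```

An individual slot can go negative when a page is wired on one CPU and unwired on another; only the sum is meaningful.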
Ian Lepore
bd54c5acba Add a new sysctl, debug.clock_do_io, to allow manually triggering a one-shot
read or write of all registered realtime clocks.  In the read case, the
values read are simply discarded.  For writes, there's no alternative but
to actually write the current system time to the device.
2018-02-12 17:41:11 +00:00
Ian Lepore
45eee6db6f Add a set of convenience routines for RTC drivers to use for debug output,
and a debug.clock_show_io sysctl to control debugging output.
2018-02-12 17:33:14 +00:00
Ian Lepore
aedc51f11a Replace the existing print_ct() private debugging function with a set of
three public functions to format and print the three major data structures
used by realtime clock drivers (clocktime, bcd_clocktime, and timespec).
2018-02-12 16:25:56 +00:00
Kirk McKusick
13a025d5d8 Merge bufdone_finish() back into bufdone(). The primary purpose is
to make the order of operations clearer to avoid the race condition
that was fixed in r328914. In particular, this commit corrects a
similar race that existed in the soft updates callback.

Doing some sleuthing through the SVN repository, it appears that
bufdone_finish() was added to support XFS:

------------------------------------------------------------------------
r153192 | rodrigc | 2005-12-06 19:39:08 -0800 (Tue, 06 Dec 2005) | 13 lines

Changes imported from XFS for FreeBSD project:
- add fields to struct buf (needed by XFS)
    - 3 private fields: b_fsprivate1, b_fsprivate2, b_fsprivate3
    - b_pin_count, count of pinned buffer

- add new B_MANAGED flag
- add breada() function to initiate asynchronous I/O on read-ahead blocks.
- add bufdone_finish(), bpin(), bunpin_wait() functions

Patches provided by:    kan
Reviewed by:            phk
Silence on:             arch@

------------------------------------------------------------------------

It does not appear to ever have been used for anything else.  XFS was
disconnected in r241607:

------------------------------------------------------------------------
r241607 | attilio | 2012-10-16 03:04:00 -0700 (Tue, 16 Oct 2012) | 5 lines

Disconnect non-MPSAFE XFS from the build in preparation for dropping
GIANT from VFS.

This is not targeted for MFC.

------------------------------------------------------------------------

and removed entirely in r247631:

------------------------------------------------------------------------
r247631 | attilio | 2013-03-02 07:33:54 -0800 (Sat, 02 Mar 2013) | 5 lines

Garbage collect XFS bits which are now already completely disconnected
from the tree since few months.

This is not targeted for MFC.

------------------------------------------------------------------------

Since XFS support is gone, there is no reason to retain bufdone_finish().

Suggested by: Warner Losh (imp)
Discussed with: cem, kib
Tested by: Peter Holm (pho)
2018-02-09 19:50:47 +00:00
Gleb Smirnoff
f7d3578564 Fix boot_pages exhaustion on machines with many domains and cores, where
size of UMA zone allocation is greater than page size. In this case zone
of zones cannot use UMA_MD_SMALL_ALLOC, and we need to postpone switch
off of this zone from startup_alloc() until full launch of VM.

o Always supply number of VM zones to uma_startup_count(). On machines
  with UMA_MD_SMALL_ALLOC ignore it completely, unless zsize goes over
  a page. In the latter case account VM zones for number of allocations
  from the zone of zones.
o Rewrite startup_alloc() so that it will immediately switch off from
  itself any zone that is already capable of running real alloc.
  In worst case scenario we may leak a single page here. See comment
  in uma_startup_count().
o Hardcode call to uma_startup2() into vm_mem_init(). Otherwise some
  extra SYSINITs, e.g. vm_page_init() may sneak in before.
o While here, remove uma_boot_pages_mtx. With recent changes to boot
  pages calculation, we are guaranteed to use all of the boot_pages
  in the early single threaded stage.

Reported & tested by:	mav
2018-02-09 04:45:39 +00:00
Andriy Gapon
b2387aa652 exec_map_first_page: fix an inverse condition introduced in r254138
While the bug itself was serious, as we could either pass a non-busied
page to vm_pager_get_pages() or leak a busy page, it could only be
triggered under a very rare condition where the page is already inserted
into the object, but it is not valid yet.

Reviewed by:	kib
MFC after:	2 weeks
2018-02-07 21:51:59 +00:00
Gleb Smirnoff
5073a08328 Fix three miscalculations in amount of boot pages:
o Most of startup zones have struct uma_slab embedded into the slab,
  so provide macro UMA_SLAB_SPACE and use it instead of UMA_SLAB_SIZE,
  when calculating how many pages would certain kind of allocations
  require. Some zones are offpage, so we might have a positive inaccuracy.
o The keg for the zone of zones is allocated "dynamically", so we
  need +1 when calculating amount of pages for kegs. [1]
o The zones of zones and zones of kegs have arbitrary alignment of 32,
  and this also needs to be accounted for. [2]

While here, spread more comments and improve diagnostic messages.

Reported by:	pho [1], jtl [2]
2018-02-07 18:32:51 +00:00
Mark Johnston
1d3a1bcfac Dequeue wired pages lazily.
Previously, wiring a page would cause it to be removed from its page
queue. In the common case, unwiring causes it to be enqueued at the tail
of that page queue. This change modifies vm_page_wire() to not dequeue
the page, thus avoiding the highly contended page queue locks. Instead,
vm_page_unwire() takes care of requeuing the page as a single operation,
and the page daemon dequeues wired pages as they are encountered during
a queue scan to avoid needlessly revisiting them later. For pages in
PQ_ACTIVE we do even better, since a requeue is unnecessary.

The change improves scalability for some common workloads. For instance,
threads wiring pages into the buffer cache no longer need to modify
global page queues, and unwiring is usually done by the bufspace thread,
so concurrency is not as much of an issue. As another example, many
sysctl handlers wire the output buffer to avoid faults on copyout, and
since the buffer is likely to be in PQ_ACTIVE, we now entirely avoid
modifying the page queue in this case.

The change also adds a block comment describing some properties of
struct vm_page's reference counters, and the busy lock.

Reviewed by:	jeff
Discussed with:	alc, kib
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D11943
2018-02-07 16:57:10 +00:00
Ian Lepore
ce75945d3c Use const pointers for input data not modified by clock utility functions. 2018-02-06 22:17:01 +00:00
Jeff Roberson
e2068d0bcd Use per-domain locks for vm page queue free. Move paging control from
global to per-domain state.  Protect reservations with the free lock
from the domain that they belong to.  Refactor to make vm domains more
of a first class object.

Reviewed by:    markj, kib, gallatin
Tested by:      pho
Sponsored by:   Netflix, Dell/EMC Isilon
Differential Revision:  https://reviews.freebsd.org/D14000
2018-02-06 22:10:07 +00:00
Gleb Smirnoff
ae941b1b4e Fix boot_pages calculation for machines that don't have UMA_MD_SMALL_ALLOC.
o Call uma_startup1() after initializing kmem, vmem and domains.
o Include eight VM startup pages in the uma_startup_count() calculation.
o Account for vmem_startup() and vm_map_startup() preallocating pages.
o Account for extra two allocations done by kmem_init() and vmem_create().
o Hardcode the place of execution of vm_radix_reserve_kva(). Using SYSINIT
  allowed several other SYSINITs to sneak in before it, thus bumping
  requirement for amount of boot pages.
2018-02-06 22:06:59 +00:00
Gleb Smirnoff
f4bef67c9c Followup on r302393 by cperciva, improving calculation of boot pages required
for UMA startup.

o Introduce another stage of UMA startup, which is entered after
  vm_page_startup() finishes. After this stage we don't yet enable buckets,
  but we can ask VM for pages. Rename stages to meaningful names while here.
  New list of stages: BOOT_COLD, BOOT_STRAPPED, BOOT_PAGEALLOC, BOOT_BUCKETS,
  BOOT_RUNNING.
  Enabling page alloc earlier allows us to dramatically reduce the number of
  boot pages required. More importantly, the number of zones becomes
  consistent across different machines, as no MD allocations are done before
  the BOOT_PAGEALLOC stage. Now only UMA internal zones actually need to use
  startup_alloc(), however that may change, so vm_page_startup() provides
  its need for early zones as argument.
o Introduce uma_startup_count() function, to avoid code duplication. The
  function calculates the sizes of the zones zone and the kegs zone, and
  calculates how many pages UMA will need to bootstrap.
  It counts not only zone structures, but also kegs, slabs and hashes.
o Hide uma_startup_foo() declarations from public file.
o Provide several DIAGNOSTIC printfs on boot_pages usage.
o Bugfix: when calculating zone of zones size use (mp_maxid + 1) instead of
  mp_ncpus. Use resulting number not only in the size argument to zone_ctor()
  but also as args.size.

Reviewed by:		imp, gallatin (earlier version)
Differential Revision:	https://reviews.freebsd.org/D14054
2018-02-06 04:16:00 +00:00
Kirk McKusick
47806d1b93 Occasional cylinder-group check-hash errors were being reported on
systems running with a heavy filesystem load. Tracking down this
bug was elusive because there were actually two problems. Sometimes
the in-memory check hash was wrong and sometimes the check hash
computed when doing the read was wrong. The occurrence of either
error caused a check-hash mismatch to be reported.

The first error was that the check hash in the in-memory cylinder
group was incorrect. This error was caused by the following
sequence of events:

- We read a cylinder-group buffer and the check hash is valid.
- We update its cg_time and cg_old_time which makes the in-memory
  check-hash value invalid but we do not mark the cylinder group dirty.
- We do not make any other changes to the cylinder group, so we
  never mark it dirty, thus do not write it out, and hence never
  update the incorrect check hash for the in-memory buffer.
- Later, the buffer gets freed, but the page with the old incorrect
  check hash is still in the VM cache.
- Later, we read the cylinder group again, and the first page with
  the old check hash is still in the VM cache, but some other pages
  are not, so we have to do a read.
- The read does not actually get the first page from disk, but rather
  from the VM cache, resulting in the old check hash in the buffer.
- The value computed after doing the read does not match causing the
  error to be printed.

The fix for this problem is to only set cg_time and cg_old_time as
the cylinder group is being written to disk. This keeps the in-memory
check-hash valid unless the cylinder group has had other modifications
which will require it to be written with a new check hash calculated.
It also requires that the check hash be recalculated in the in-memory
cylinder group when it is marked clean after doing a background write.

The second problem was that the check hash computed at the end of the
read was incorrect because the calculation of the check hash on
completion of the read was being done too soon.

- When a read completes we had the following sequence:

  - bufdone()
  -- b_ckhashcalc (calculates check hash)
  -- bufdone_finish()
  --- vfs_vmio_iodone() (replaces bogus pages with the cached ones)

- When we are reading a buffer where one or more pages are already
  in memory (but not all pages, or we wouldn't be doing the read),
  the I/O is done with bogus_page mapped in for the pages that exist
  in the VM cache. This mapping is done to avoid corrupting the
  cached pages if there is any I/O overrun. The vfs_vmio_iodone()
  function is responsible for replacing the bogus_page(s) with the
  cached ones. But we were calculating the check hash before the
  bogus_page(s) were replaced. Hence, when we were calculating the
  check hash, we were partly reading from bogus_page, which means
  we calculated a bad check hash (e.g., because multiple pages have
  been mapped to bogus_page, so its contents are indeterminate).

The second fix is to move the check-hash calculation from bufdone()
to bufdone_finish() after the call to vfs_vmio_iodone() so that it
computes the check hash over the correct set of pages.

With these two changes, the occasional cylinder-group check-hash
errors are gone.

Submitted by: David Pfitzner <dpfitzner@netflix.com>
Reviewed by: kib
Tested by: David Pfitzner
2018-02-06 00:19:46 +00:00
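The first problem described in 47806d1b93 can be reproduced in miniature. The Python sketch below is an illustration only: it uses zlib's CRC-32 as a stand-in for the kernel's check-hash function, and the block layout and the names cg_pack, ckhash, and write_cg are assumptions.

```python
import struct
import zlib


def cg_pack(cg_time, payload):
    """Hypothetical cylinder-group image: 8-byte timestamp + payload."""
    return struct.pack("<q", cg_time) + payload


def ckhash(block):
    return zlib.crc32(block)


def write_cg(payload, now):
    """The fix: stamp the time only when writing, then hash the result."""
    block = cg_pack(now, payload)
    return block, ckhash(block)


block, stored = write_cg(b"bitmap", now=100)

# The bug: the timestamp is touched in memory but the buffer is never
# marked dirty, so the stored hash no longer matches the cached bytes.
stale = cg_pack(200, b"bitmap")
```

Because the timestamp is part of the hashed bytes, changing it without recomputing the hash is enough to trigger a mismatch on a later read that is served from the cached page.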
John Baldwin
15746ef43a Ignore relocation tables for non-memory-resident sections.
As a followup to r328101, ignore relocation tables for ELF object
sections that are not memory resident.  For modules loaded by the
loader, ignore relocation tables whose associated section was not
loaded by the loader (sh_addr is zero).  For modules loaded at runtime
via kldload(2), ignore relocation tables whose associated section is
not marked with SHF_ALLOC.

Reported by:	Mori Hiroki <yamori813@yahoo.co.jp>, adrian
Tested on:	mips, mips64
MFC after:	1 month
Sponsored by:	DARPA / AFRL
2018-02-05 23:35:33 +00:00
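The rule in 15746ef43a can be sketched as a filter over section headers. The Python below is a simplified model, not the kernel linker: the Section tuple and reloc_tables_to_apply() are invented, while SHT_RELA, SHT_REL, and SHF_ALLOC carry their standard ELF values.

```python
from collections import namedtuple

SHT_RELA = 4     # relocation entries with addends
SHT_REL = 9      # relocation entries without addends
SHF_ALLOC = 0x2  # section occupies memory at run time

Section = namedtuple("Section", "name sh_type sh_flags sh_addr sh_info")


def reloc_tables_to_apply(sections, preloaded):
    """Names of relocation tables whose target section is memory resident."""
    keep = []
    for sec in sections:
        if sec.sh_type not in (SHT_REL, SHT_RELA):
            continue
        target = sections[sec.sh_info]  # sh_info indexes the target section
        if preloaded:
            resident = target.sh_addr != 0  # the loader gave it an address
        else:
            resident = bool(target.sh_flags & SHF_ALLOC)
        if resident:
            keep.append(sec.name)
    return keep


sections = [
    Section("", 0, 0, 0, 0),
    Section(".text", 1, SHF_ALLOC, 0x1000, 0),
    Section(".debug_info", 1, 0, 0, 0),
    Section(".rela.text", SHT_RELA, 0, 0, 1),
    Section(".rela.debug_info", SHT_RELA, 0, 0, 2),
]
```

In both modes only .rela.text survives, since .debug_info is neither loaded nor marked SHF_ALLOC in this made-up section table.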
John Baldwin
d722231bca Always give ELF brands a chance to veto a match.
If a brand provides a header_supported hook, check it when trying to
find a brand based on a matching interpreter as well as in the final
loop for the fallback brand. Previously a brand might reject a binary
via a header_supported hook in one of the earlier loops, but still be
chosen by one of these later loops.

Reviewed by:	kib
Obtained from:	CheriBSD
MFC after:	2 weeks
Sponsored by:	DARPA / AFRL
Differential Revision:	https://reviews.freebsd.org/D13945
2018-02-05 23:27:42 +00:00
Brooks Davis
0a2c60c371 Reduce duplication in extattr_*_(file|link) syscalls.
Reviewed by:	rwatson
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14173
2018-02-05 19:06:34 +00:00
Brooks Davis
02bc058f79 ANSIfy syscall implementations.
Reviewed by:	rwatson
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14172
2018-02-05 18:58:55 +00:00
Brooks Davis
0fd25723bc Add kern.ipc.{msqids,semsegs,sema} sysctls for FreeBSD32.
Stop leaking kernel pointers through these sysctls and make sure that the
padding in the structures is zeroed on allocation to avoid other leaks.

Reviewed by:	gordon, kib
Obtained from:	CheriBSD
MFC after:	1 week
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D13459
2018-02-02 18:03:12 +00:00
Hans Petter Selasky
64282a1274 Slightly bump the maximum OID path for loading tunable SYSCTLs.
Coming updates to the mlx5en(4) driver will require this.

MFC after:	1 week
Sponsored by:	Mellanox Technologies
2018-02-02 12:42:46 +00:00
Kirk McKusick
5a35a04255 One of the vnode fields listed by vn_printf is the union of pointers
whose type depends on the type of vnode. Correct vn_printf so that
it correctly identifies the name of the pointer that it is printing.

Submitted by: Andreas Longwitz <longwitz at incore.de>
MFC after: 1 week
2018-01-31 22:49:50 +00:00
Ed Maste
37880089ac makesyscalls: permit a range of syscall numbers for UNIMPL
Some ABIs have large gaps in syscall numbers.  Allow gaps to be filled
as ranges of UNIMPL, with an entry like:

248-1023	AUE_NULL	UNIMPL	unimplemented

Reviewed by:	jhb, gnn
Sponsored by:	Turing Robotic Industries Inc.
Differential Revision:	https://reviews.freebsd.org/D14122
2018-01-30 18:29:38 +00:00
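The range syntax introduced in 37880089ac can be illustrated with a short Python parser. This is a sketch of the idea, not the makesyscalls script itself; expand_unimpl() is a hypothetical helper, and it assumes whitespace-separated fields as in the example line above.

```python
def expand_unimpl(line):
    """Expand 'N' or 'N-M' in the first field into one entry per number."""
    number, audit, kind, rest = line.split(None, 3)
    if "-" in number:
        if kind != "UNIMPL":
            raise ValueError("ranges are only allowed for UNIMPL entries")
        first, last = (int(n) for n in number.split("-"))
    else:
        first = last = int(number)
    return [(n, audit, kind, rest) for n in range(first, last + 1)]


entries = expand_unimpl("248-1023\tAUE_NULL\tUNIMPL\tunimplemented")
```

The single line from the commit message stands in for 776 individual table slots, numbered 248 through 1023.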
Bryan Drewery
595109196a Don't use an .OBJDIR for 'make sysent'.
Reported by:	emaste, jhb
Sponsored by:	Dell EMC
2018-01-29 19:14:15 +00:00
Li-Wen Hsu
99178c2f09 Fix LINT build after r328508, add forgotten part in format string
Reviewed by:	delphij
Differential Revision:	https://reviews.freebsd.org/D14089
2018-01-29 02:29:08 +00:00
Warner Losh
7faed6e3e9 Create deprecation management functions.
gone_in(major, msg);	If we're running in FreeBSD major, tell
			the user this code may be deleted soon.
			If we're running in FreeBSD major - 1,
			tell the user this code is deprecated and will
			be gone in major.
			Otherwise say nothing.

gone_in_dev(dev, major, msg) Just like gone_in, except use device_printf.

New tunable / sysctl debug.obsolete_panic: 0 - don't panic,
	1 - panic in major or newer , 2 - panic in major - 1 or newer
	default: 0

If NO_OBSOLETE_CODE is defined, then both of these turn into compile
time errors when building for major. Add options NO_OBSOLETE_CODE to
kernel build system.

This lets us tag code that's going away so users know it will be gone,
as well as automatically manage things.

Differential Review: https://reviews.freebsd.org/D13818
2018-01-29 00:14:39 +00:00
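The decision table for gone_in() in 7faed6e3e9 can be modelled in a few lines of Python. This follows a literal reading of the commit message; the function signature with an explicit running_major argument and the message strings are assumptions, not the kernel's actual output.

```python
def gone_in(running_major, major, msg):
    """Return the notice to print, or None when nothing should be said."""
    if running_major == major:
        return "Obsolete code may be deleted soon: " + msg
    if running_major == major - 1:
        return "Deprecated code (gone in FreeBSD %d): %s" % (major, msg)
    return None


soon = gone_in(12, 12, "example driver")
deprecated = gone_in(11, 12, "example driver")
silent = gone_in(10, 12, "example driver")
```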
Warner Losh
55cf33a584 Add the DF_SUSPENDED flag to flags that are printed. 2018-01-28 05:13:17 +00:00
Kirk McKusick
4d93711d80 For many years the message "fsync: giving up on dirty" has occasionally
appeared on UFS/FFS filesystems. In some cases it was promptly followed
by a panic of "softdep_deallocate_dependencies: dangling deps". This fix
should eliminate both of these occurrences.

Submitted by: Andreas Longwitz <longwitz at incore.de>
Reviewed by:  kib
Tested by:    Peter Holm (pho)
PR:           225423
MFC after:    1 week
2018-01-26 18:17:11 +00:00
Li-Wen Hsu
5a70796a71 Fix build for architectures where size_t is not unsigned long
Reviewed by:	cem
Differential Revision:	https://reviews.freebsd.org/D14045
2018-01-25 06:37:14 +00:00
Conrad Meyer
bd555da94b malloc(9): Change nominal size to size_t to match standard C
No functional change -- size_t matches unsigned long on all platforms.

Reported by:	bde
Discussed with:	jhb
Sponsored by:	Dell EMC Isilon
2018-01-24 19:37:18 +00:00
Wojciech Macek
072c8a3b39 Reverting r328320 2018-01-24 13:57:01 +00:00
Wojciech Macek
4d249cdd4c ULE: provide defaults to ts_cpu
Fix a bug when the system has no CPU 0. When created, threads were implicitly assigned to CPU 0.
This had no practical effect since a real CPU was chosen immediately by the scheduler. However,
on systems without a CPU 0, sched_ule attempted to access the scheduler queue of the "old" CPU
when assigned the initial choice of the old one. This caused an attempt to use illegal memory
and a crash (or, more usually, a deadlock). Fix this by assigning new threads to the BSP
explicitly and add some asserts to see that this problem does not recur.

Authored by:           Nathan Whitehorn <nwhitehorn@freebsd.org>
Submitted by:          Wojciech Macek <wma@semihalf.com>
Obtained from:         Semihalf
Differential revision: https://reviews.freebsd.org/D13932
2018-01-24 07:54:05 +00:00
Pedro F. Giffuni
44c514b142 Forgot to sort here in r328238. 2018-01-22 02:26:10 +00:00
Pedro F. Giffuni
d821d36419 Unsign some values related to allocation.
When allocating memory through malloc(9), we always expect the amount of
memory requested to be unsigned as a negative value would either stand for
an error or an overflow.
Unsign some values, found when considering the use of mallocarray(9), to
avoid unnecessary casting. Also consider that indexes should be of
at least the same size/type as the upper limit they are intended to index.
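
As a rough illustration of why unsigned sizes fit allocation arithmetic, the
userland sketch below computes an element count times an element size and
rejects products that do not fit in size_t. The helper name
checked_array_size() is hypothetical; this is not the kernel's mallocarray(9)
implementation, only the kind of check such an interface implies.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not the kernel's mallocarray(9)): compute
 * nmemb * size and report whether the product fits in a size_t.
 */
static bool
checked_array_size(size_t nmemb, size_t size, size_t *total)
{
	/* Division-based check: no reliance on wraparound after the fact. */
	if (size != 0 && nmemb > SIZE_MAX / size)
		return (false);
	*total = nmemb * size;
	return (true);
}
```

With both operands unsigned there is no negative case to cast away; the only
failure mode left to test for is overflow of the product.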

MFC after:	3 weeks
2018-01-22 02:08:10 +00:00
Pedro F. Giffuni
ac2fffa4b7 Revert r327828, r327949, r327953, r328016-r328026, r328041:
Uses of mallocarray(9).

The use of mallocarray(9) has rocketed the required swap to build FreeBSD.
This is likely caused by the allocation size attributes which put extra pressure
on the compiler.

Given that most of these checks are superfluous we have to choose better
where to use mallocarray(9). We still have more uses of mallocarray(9) but
hopefully this is enough to bring swap usage to a reasonable level.

Reported by:	wosch
PR:		225197
2018-01-21 15:42:36 +00:00
Nathan Whitehorn
9a8196ce19 Remove SFBUF_OPTIONAL_DIRECT_MAP and such hacks, replacing them across the
kernel by PHYS_TO_DMAP() as previously present on amd64, arm64, riscv, and
powerpc64. This introduces a new MI macro (PMAP_HAS_DMAP) that can be
evaluated at runtime to determine if the architecture has a direct map;
if it unconditionally does not (or does), so that PMAP_HAS_DMAP is a constant
0 or 1, the compiler can remove the conditional logic.

As part of this, implement PHYS_TO_DMAP() on sparc64 and mips64, which had
similar things but spelled differently. 32-bit MIPS has a partial direct-map
that maps poorly to this concept and is unchanged.

Reviewed by:		kib
Suggestions from:	marius, alc, kib
Runtime tested on:	amd64, powerpc64, powerpc, mips64
2018-01-19 17:46:31 +00:00
Andriy Gapon
3e4f610dad correct read-ahead calculations in vfs_bio_getpages
Previously the calculations were done as if the requested region
ended at the start of the last requested page, not its end.
The problem was actually quite minor as it affected only stats and
page prefaulting, not the actual page data, and only with specific
parameters.
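
The off-by-one-page nature of the problem can be sketched as follows. This is
not the vfs_bio_getpages code; the function names, the page-index arguments,
and the 4096-byte page size are all assumptions made for illustration.

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size for the example */

/*
 * Old-style calculation as described above: the region is treated as
 * ending where the last requested page starts.
 */
static uint64_t
region_end_old(uint64_t first_page, unsigned int count)
{
	return ((first_page + count - 1) * SKETCH_PAGE_SIZE);
}

/* Corrected calculation: the region ends where the last page ends. */
static uint64_t
region_end_fixed(uint64_t first_page, unsigned int count)
{
	return ((first_page + count) * SKETCH_PAGE_SIZE);
}
```

For a request covering pages 2 through 4, the two functions differ by exactly
one page, which is the amount by which a read-ahead window derived from the
old value would be misjudged.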

Reviewed by:	kib (previous version)
MFC after:	2 weeks
2018-01-18 12:59:04 +00:00
Wojciech Macek
5b3e8b0725 KDB: restart only CPUs stopped by KDB
There is a case when not all CPUs went online. In that situation,
restart only APs which were operational before entering KDB.

Created by:            Wojciech Macek <wma@semihalf.com>
Obtained from:         Semihalf
Reviewed by:           nwhitehorn
Differential revision: https://reviews.freebsd.org/D13949
Sponsored by:          QCM Technologies
2018-01-18 07:38:54 +00:00
John Baldwin
58c4aee0d7 Require the SHF_ALLOC flag for program sections from kernel object modules.
ELF object files can contain program sections which are not supposed
to be loaded into memory (e.g. .comment).  Normally the static linker
uses these flags to decide which sections are allocated to loadable
program segments in ELF binaries and shared objects (including kernels
on all architectures and kernel modules on architectures other than
amd64).

Mapping ELF object files (such as amd64 kernel modules) into memory
directly is a bit of a grey area.  ELF object files are intended to be
used as inputs to the static linker.  As a result, there is not a
standardized definition for what the memory layout of an ELF object
should be (none of the section headers have valid virtual memory
addresses for example).

The kernel and loader were not checking the SHF_ALLOC flag but loading
any program sections with certain types such as SHT_PROGBITS.  As a
result, the kernel and loader would load into RAM some sections that
weren't marked with SHF_ALLOC such as .comment that are not loaded
into RAM for kernel modules on other architectures (which are
implemented as ELF shared objects).  Aside from possibly requiring
slightly more RAM to hold a kernel module this does not affect runtime
correctness as the kernel relocates symbols based on the layout it
uses.

Debuggers such as gdb and lldb do not extract symbol tables from a
running process or kernel.  Instead, they replicate the memory layout
of ELF executables and shared objects and use that to construct their
own symbol tables.  For executables and shared objects this works
fine.  For ELF objects the current logic in kgdb (and probably lldb
based on a simple reading) assumes that only sections with SHF_ALLOC
are memory resident when constructing a memory layout.  If the
debugger constructs a different memory layout than the kernel, then it
will compute different addresses for symbols causing symbols in the
debugger to appear to have the wrong values (though the kernel itself
is working fine).  The current port of mdb does not check SHF_ALLOC as
it replicates the kernel's logic in its existing kernel support.

The bfd linker sorts the sections in ELF object files such that all of
the allocated sections (sections with SHF_ALLOC) are placed first
followed by unallocated sections.  As a result, when kgdb composed a
memory layout using only the allocated sections, this layout happened
to match the layout used by the kernel and loader.  The lld linker
does not sort the sections in ELF object files and mixes allocated and
unallocated sections.  This resulted in kgdb composing a different
memory layout than the kernel and loader.

We could either patch kgdb (and possibly in the future lldb) to use
custom handling when generating memory layouts for kernel modules that
are ELF objects, or we could change the kernel and loader to check
SHF_ALLOC.  I chose the latter as I feel we shouldn't be loading
things into RAM that the module won't use.  This should mostly be a
NOP when linking with bfd but will allow the existing kgdb to work
with amd64 kernel modules linked with lld.

Note that we only require SHF_ALLOC for "program" sections for types
like SHT_PROGBITS and SHT_NOBITS.  Other section types such as symbol
tables, string tables, and relocations must also be loaded and are not
marked with SHF_ALLOC.
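
The rule described above can be sketched as a small predicate. This is not
the kernel or loader code: struct sketch_shdr and section_is_loaded() are
hypothetical names, and the exact set of non-program section types handled
here is an assumption based on the paragraph above. The constants carry their
standard ELF values and are repeated so the sketch stands alone.

```c
#include <stdbool.h>
#include <stdint.h>

/* Standard ELF values, defined here only if no ELF header supplied them. */
#ifndef SHT_PROGBITS
#define SHT_PROGBITS	1
#define SHT_SYMTAB	2
#define SHT_STRTAB	3
#define SHT_RELA	4
#define SHT_NOBITS	8
#define SHT_REL		9
#endif
#ifndef SHF_ALLOC
#define SHF_ALLOC	0x2
#endif

/* Hypothetical, trimmed-down section header for the example. */
struct sketch_shdr {
	uint32_t sh_type;
	uint64_t sh_flags;
};

static bool
section_is_loaded(const struct sketch_shdr *sh)
{
	switch (sh->sh_type) {
	case SHT_PROGBITS:
	case SHT_NOBITS:
		/* "Program" sections are loaded only when marked SHF_ALLOC. */
		return ((sh->sh_flags & SHF_ALLOC) != 0);
	case SHT_SYMTAB:
	case SHT_STRTAB:
	case SHT_REL:
	case SHT_RELA:
		/* Needed by the linker itself; never marked SHF_ALLOC. */
		return (true);
	default:
		return (false);
	}
}
```

Under this predicate a SHT_PROGBITS section such as .comment, which lacks
SHF_ALLOC, is skipped, while symbol tables are still loaded.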

Reported by:	np
Reviewed by:	kib, emaste
MFC after:	1 month
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D13926
2018-01-17 22:51:59 +00:00