Commit Graph

16353 Commits

Matt Macy
6573d7580b epoch(9): allow preemptible epochs to compose
- Add tracker argument to preemptible epochs
- Inline epoch read path in kernel and tied modules
- Change in_epoch to take an epoch as argument
- Simplify tfb_tcp_do_segment to not take a ti_locked argument,
  there's no longer any benefit to dropping the pcbinfo lock
  and trying to do so just adds an error prone branchfest to
  these functions
- Remove cases of same function recursion on the epoch as
  recursing is no longer free.
- Remove the TAILQ_ENTRY and epoch_section from struct
  thread as the tracker field is now stack or heap allocated
  as appropriate.

Tested by: pho and Limelight Networks
Reviewed by: kbowling at llnw dot com
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16066
2018-07-04 02:47:16 +00:00
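
A minimal sketch of the tracker-based preemptible epoch usage described in the commit above. foo_epoch and foo_read() are illustrative names, and the exact spellings of the epoch(9) calls at this point in the tree may differ slightly:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/epoch.h>

/* Illustrative epoch; allocated at boot (e.g. via epoch_alloc()), elided here. */
static epoch_t foo_epoch;

static void
foo_read(void)
{
    struct epoch_tracker et;    /* tracker now lives on the caller's stack */

    epoch_enter_preempt(foo_epoch, &et);
    /* ... lockless read of an epoch-protected structure ... */
    KASSERT(in_epoch(foo_epoch), ("not in foo_epoch"));  /* in_epoch() now takes the epoch */
    epoch_exit_preempt(foo_epoch, &et);
}
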
Matt Macy
8bedbb4d42 expose thread_lite definition to tied modules 2018-07-03 02:50:07 +00:00
Matt Macy
6443773dab make critical_{enter, exit} inline
Avoid pulling in all of the <sys/proc.h> dependencies by
automatically generating a stripped down thread_lite exporting
only the fields of interest. The field declarations are type checked
against the original and the offsets of the generated result are
automatically checked.

kib has expressed disagreement and would have preferred to simply
use genassym style offsets (which loses type check enforcement).
jhb has expressed dislike of it due to header pollution and a
duplicate structure. He would have preferred to simply define struct
thread in _thread.h. Nonetheless, he admits that this is the only
viable solution at the moment.

The impetus for this came from mjg's D15331:
"Inline critical_enter/exit for amd64"

Reviewed by: jeff
Differential Revision: https://reviews.freebsd.org/D16078
2018-07-03 01:55:09 +00:00
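
A rough sketch of the idea behind inlining critical_enter()/critical_exit(). The real inlines cast curthread to the generated thread_lite, whose field offsets are asserted to match struct thread; the sketch below operates on struct thread directly to stay self-contained, and the critical_exit_preempt() slow path name is an assumption based on the earlier "Move preemption handling out of critical_exit" change:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/proc.h>

static __inline void
critical_enter_sketch(void)
{
    /* The generated thread_lite exposes td_critnest (and little else). */
    curthread->td_critnest++;
    __compiler_membar();
}

static __inline void
critical_exit_sketch(void)
{
    __compiler_membar();
    curthread->td_critnest--;
    __compiler_membar();
    if (__predict_false(curthread->td_owepreempt &&
        curthread->td_critnest == 0))
        critical_exit_preempt();    /* assumed out-of-line slow path */
}
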
Mariusz Zaborski
0dea6e3c98 core(5): overwrite the oldest core dump
The '%I' format in the kern.corefile sysctl limits the number of
core files that a process can generate to the number stored in the
debug.ncores sysctl. The '%I' format is replaced by the single-digit
index. Previously, if all indexes were taken, the kernel would only
overwrite the core file with the highest index.
Now the system creates a new core file if a free index is available;
if all slots are taken, it overwrites the oldest one.

Reviewed by:	kib(code), bcr (updating)
Differential Revision:	https://reviews.freebsd.org/D15991
Differential Revision:	https://reviews.freebsd.org/D16084
2018-07-01 17:28:46 +00:00
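
A userland-style sketch of the slot-selection policy described above (use a free index if one exists, otherwise reuse the slot whose core file is oldest). The name pattern, ncores bound, and helper name are illustrative; the kernel implementation works on vnodes rather than stat(2):

#include <sys/stat.h>
#include <limits.h>
#include <stdio.h>
#include <time.h>

static int
pick_core_index(const char *base, int ncores)
{
    char path[PATH_MAX];
    struct stat sb;
    time_t oldest = 0;
    int i, victim = 0;

    for (i = 0; i < ncores; i++) {
        snprintf(path, sizeof(path), "%s.%d.core", base, i);
        if (stat(path, &sb) != 0)
            return (i);         /* free slot: use it */
        if (i == 0 || sb.st_mtime < oldest) {
            oldest = sb.st_mtime;
            victim = i;
        }
    }
    return (victim);            /* all slots taken: overwrite the oldest */
}
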
Gleb Smirnoff
95dce07dea Correct r335242. Use an unsigned cast instead of abs(). Using abs() gives
an incorrect result when ticks has already wrapped and is about to reach
the cr_ticks value (cr_ticks - ticks < hz).

Submitted by:	bde
2018-06-27 22:00:50 +00:00
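
A hedged sketch of the wrap-safe comparison this correction describes: the unsigned difference yields the elapsed ticks modulo 2^32, whereas abs() misclassifies the case where ticks has wrapped and is closing in on cr_ticks. interval_elapsed() is an illustrative stand-in, not the counter_ratecheck() code itself:

#include <sys/types.h>

static int
interval_elapsed(int ticks_now, int cr_ticks, int hz)
{
    /* Unsigned subtraction is well defined across the int wrap. */
    return (((u_int)ticks_now - (u_int)cr_ticks) >= (u_int)hz);
}
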
Warner Losh
bc6cb3f6b4 Remove devctl_safe_quote since it's now unused.
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D16026
2018-06-27 04:11:19 +00:00
Warner Losh
349fcda430 Fix devctl generation for core files.
We have a problem with vn_fullpath_global when the file exists. Work
around it by printing the full path if the core file name starts with /,
or current working directory followed by the filename if not.

Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D16026
2018-06-27 04:11:09 +00:00
Warner Losh
ab531b8825 Create new devctl_safe_quote_sb to copy a source string into a struct
sbuf to make it safe. Callers are expected to add the " " around it,
if needed.

Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D16026
2018-06-27 04:10:48 +00:00
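
A sketch of what a quoting helper along these lines might look like, assuming the escape set is the double quote and the backslash; per the commit, callers add the surrounding quotes themselves, and the real function's escape set may differ:

#include <sys/types.h>
#include <sys/sbuf.h>

static void
quote_into_sbuf(struct sbuf *sb, const char *src)
{
    for (; *src != '\0'; src++) {
        if (*src == '"' || *src == '\\')
            sbuf_putc(sb, '\\');
        sbuf_putc(sb, *src);
    }
}
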
Matt Macy
74333b3dee fix assert and conditionally allow mutexes to be held across epoch_wait_preempt 2018-06-24 18:57:06 +00:00
Matt Macy
0bcfb47363 epoch(9): Don't trigger taskq enqueue before the grouptaskqs are setup
If EARLY_AP_STARTUP is not defined, it is possible for an epoch to be
allocated before it is possible to call epoch_call without issue.

Based on patch by andrew@

PR:		229014
Reported by:	andrew
2018-06-23 07:14:08 +00:00
Colin Percival
7e8db78116 Improve the accuracy of the POSIX "process CPU-time" clocks by adding the
used portion of the current thread's time slice if the current thread
belongs to the process being queried (i.e., if clock_gettime is invoked
with a clock ID of CLOCK_PROCESS_CPUTIME_ID or the value provided by
passing getpid(2) to clock_getcpuclockid(3)).

The CLOCK_VIRTUAL and CLOCK_PROF timers already make this adjustment via
long-standing code in calcru(), but since those timers are not specified
by POSIX it seems useful to add it here so that the higher accuracy is
available to code which aims to be portable.

PR:		228669
Reported by:	Graham Percival
Reviewed by:	kib
MFC after:	1 week
2018-06-22 10:23:32 +00:00
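
A small userland example of the two equivalent ways to query the process CPU-time clock that this change makes more accurate; error handling is abbreviated:

#include <sys/types.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int
main(void)
{
    struct timespec ts;
    clockid_t cid;

    /* Directly via the POSIX clock ID... */
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
    printf("cputime: %jd.%09ld\n", (intmax_t)ts.tv_sec, ts.tv_nsec);

    /* ...or via clock_getcpuclockid(3) on our own pid. */
    if (clock_getcpuclockid(getpid(), &cid) == 0) {
        clock_gettime(cid, &ts);
        printf("cputime: %jd.%09ld\n", (intmax_t)ts.tv_sec, ts.tv_nsec);
    }
    return (0);
}
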
Matt Macy
ae25f40b72 epoch(9): make non-preemptible variant work early boot 2018-06-22 00:47:18 +00:00
Kyle Evans
03d7aee8a7 subr_hints: Fix acpi unit hinting (at the very least)
The refactoring in r335479 overlooked the fact that the dynamic kenv can
also be switched to if hintmode == 0. This is problematic because the
checkmethod bits are only ever run once, but it worked previously because
use_kenv was global state and the first lookup would enable it if it
occurred after the dynamic environment had been set up.

Extending our local definition of use_kenv to include all non-STATIC
hintmodes as long as the dynamic_kenv is set up fixes this. We still have
potential issues if the dynamic kenv comes up while we're doing an anchored
search through the environment, but this is not much of a concern right now
because:

1.) The dynamic environment comes up super early in boot, just after kmem

2.) This is going to get rewritten to provide a safer mechanism for the
anchored searches, ensuring that we continue using the same environment
chain (dynamic env or static fallback) for all anchored search invocations

Reported by:	mmacy
X-MFC-With: r335479
2018-06-21 21:50:00 +00:00
Konstantin Belousov
6e22bbf66e fork: avoid endless wait with PTRACE_FORK and RFSTOPPED.
An RFSTOPPED thread can't clean TDB_STOPATFORK, which is done in
fork_return() in its context, so the parent is stuck forever.  Triggered
when trying to ptrace a Linux process.  Instead of waiting for the new
thread to clear TDB_STOPATFORK, tag it as traced and reparent to the
debugger in do_fork(), and let it only notify the debugger when run.

Submitted by:	Yanko Yankulov <yanko.yankulov@gmail.com>
Reviewed by:	jhb
MFC after:	1 week
X-MFC-Note:	keep p_dbgwait placeholder intact
Differential revision:	https://reviews.freebsd.org/D15857
2018-06-21 21:12:49 +00:00
Konstantin Belousov
ac4bc0c171 Update proc->p_ptevents annotation to reflect the actual locking.
Submitted by:	Yanko Yankulov <yanko.yankulov@gmail.com>
Reviewed by:	jhb
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D15954
2018-06-21 21:07:25 +00:00
Justin Hibbits
22c1b4c0f1 Introduce PMCR-based cpufreq(4) driver, for IBM POWER8 and POWER9 systems
Summary: POWER8 and POWER9 use a single CPU register, per core, to change clock
speed.  Everything else is handled by the on-chip controller.  This change
necessitates a change to the cpufreq global kernel driver to bump supported
levels, as the device tree for these systems can theoretically have 256
different options.  On my POWER9 Talos, the list consists of 100 items.  At
16.67 MHz intervals, that allows for a span of roughly 1.67 GHz between the
lowest and highest settings.

This has only been tested on the POWER9.  However, since they're similar, this
should work on POWER8 as well.

Reviewed By: nwhitehorn
Differential Revision: https://reviews.freebsd.org/D15932
2018-06-21 14:26:43 +00:00
Kyle Evans
770488d202 subr_hints: simplify a little bit
Some complexity exists in these bits that isn't needed. The sysctl handler,
upon change to '2', runs through the current set of hints and sets them in
the kenv.

However, this isn't at all necessary if we're pulling hints from the kenv,
static or dynamic, as the former will get added to the latter in
init_dynamic_kenv (see: kern_environment.c). We can reduce this
configuration to just adding static_hints to the kenv if we were previously
using them.

The changes in res_find are minimal and based on the observation that once
use_kenv gets set to '1' it will never be reset to '0', and it gets set to
'1' as soon as we hit fallback mode. Later work will refactor res_find a
little bit and eliminate this now-local, because it's become clear that
there's some funkiness revolving around use_kenv=1 and it being used to
imply that we're certainly looking at the dynamic_kenv.

Reviewed by:	ray
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D15940
2018-06-21 14:04:02 +00:00
Hans Petter Selasky
ce70c57262 Permit the kernel environment to set an array of numeric values for a single
sysctl(9) node.

Reviewed by:		kib@, imp@, jhb@
Differential Revision:	https://reviews.freebsd.org/D15802
MFC after:		1 week
Sponsored by:		Mellanox Technologies
2018-06-20 20:04:20 +00:00
Kyle Evans
c7962400c9 Add debug.verbose_sysinit tunable for VERBOSE_SYSINIT
VERBOSE_SYSINIT is currently an all-or-nothing option. debug.verbose_sysinit
adds an option to have the code compiled in but quiet by default so that
getting this information from a device in the field doesn't necessarily
require distributing a recompiled kernel.

Its default is VERBOSE_SYSINIT's value as defined in the kernconf. As such,
the default behavior for simply omitting or including this option is
unchanged.

MFC after:	1 week
2018-06-20 19:23:56 +00:00
Emmanuel Vadot
78442297f5 Add pmap_mapdev_attr for arm64
This is needed for efifb.
The arm and riscv pmaps (the two other arches besides arm64 that use
subr_devmap) have very different implementations, so for now only add this
for arm64.

Tested with efifb on Pine64 with a few other patches.

Reviewed by:	cognet
Differential Revision:	https://reviews.freebsd.org/D15294
2018-06-20 16:07:35 +00:00
Bjoern A. Zeeb
7938a4425a Instead of using hand-rolled loops where they are not needed, switch
to FOREACH_PROC_IN_SYSTEM() so there is a single pattern to look for.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
Differential Revision:	https://reviews.freebsd.org/D15916
2018-06-20 11:42:06 +00:00
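
A sketch of the resulting pattern; the allproc_lock handling shown is the usual convention and the counting body is illustrative:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/lock.h>
#include <sys/sx.h>
#include <sys/proc.h>

static int
count_procs(void)
{
    struct proc *p;
    int np = 0;

    sx_slock(&allproc_lock);
    FOREACH_PROC_IN_SYSTEM(p)   /* canonical walk of the allproc list */
        np++;
    sx_sunlock(&allproc_lock);
    return (np);
}
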
Bjoern A. Zeeb
7ffbcfe281 Sometimes it is helpful to get the path for a vnode.
Implement a ddb function walking the namecache to do this.

Reviewed by:		jhb, mjg
Inspired by:		gdb macro from jhb (old version)
Sponsored by:		iXsystems, Inc.
Differential Revision:	https://reviews.freebsd.org/D14898
2018-06-20 08:34:29 +00:00
Matt Macy
9e58ff6ff9 convert inpcbinfo hash and info rwlocks to epoch + mutex
- Convert inpcbinfo info & hash locks to epoch for read and mutex for write
- Garbage collect code that handled INP_INFO_TRY_RLOCK failures as
  INP_INFO_RLOCK which can no longer fail

When running 64 netperfs sending minimal-sized packets on a 2x8x2 system, this
reduces unhalted-core-cycles samples in rwlock rlock/runlock in udp_send from
51% to 3%.

Overall packet throughput remains rate-limited by CPU affinity and NIC driver
design choices.

On the receiver, unhalted-core-cycles samples in in_pcblookup_hash went from
13% to 1.6%.

Tested by: LLNW and pho@

Reviewed by: jtl
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D15686
2018-06-19 01:54:00 +00:00
Andrey V. Elsukov
20efcfc602 Switch RIB and RADIX_NODE_HEAD lock from rwlock(9) to rmlock(9).
Using an rwlock with multiqueue NICs for IP forwarding at high pps
produces high lock contention and is inefficient. An rmlock fits such
workloads better.

Reviewed by:	melifaro, olivier
Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D15789
2018-06-16 08:26:23 +00:00
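
A minimal sketch of the rmlock(9) read-mostly pattern that replaces the rwlock read path; foo_rm and the callers are illustrative names:

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/rmlock.h>

static struct rmlock foo_rm;    /* illustrative lock, e.g. protecting a RIB */

static void
foo_init(void)
{
    rm_init(&foo_rm, "foo_rm");
}

static void
foo_lookup(void)
{
    struct rm_priotracker tracker;

    rm_rlock(&foo_rm, &tracker);    /* cheap, mostly per-CPU read lock */
    /* ... lookup ... */
    rm_runlock(&foo_rm, &tracker);
}
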
Gleb Smirnoff
61f63f47b3 Since 'ticks' is an int, it may wrap around and cr_ticks at a certain
counter_rate will be greater than ticks, resulting in counter_ratecheck()
failure. To fix this, take the absolute value of the difference between
ticks and cr_ticks.

Reported by:	jtl
Sponsored by:	Netflix
2018-06-15 21:36:16 +00:00
Bryan Drewery
03bd1b693e proc0_post: Fix some locking issues
- Filter out PRS_NEW procs as rufetch() tries taking the thread lock
  which may not yet be initialized.
- Hold PROC_LOCK to ensure stability of iterating the threads.
- p_rux fields are protected by the process statlock as well.

MFC after:	2 weeks
Reviewed by:	kib
Sponsored by:	Dell EMC
Differential Revision:	https://reviews.freebsd.org/D15809
2018-06-15 00:36:41 +00:00
Olivier Houchard
78bcf87e3e Use M_EXEC when calling malloc() to allocate the memory to store the module,
as it'll contain executable code.
2018-06-14 23:10:10 +00:00
Brooks Davis
7d87c005da Regen after 335177 (rename sys_obreak to sys_break). 2018-06-14 21:29:31 +00:00
Brooks Davis
9da5364ed9 Name the implementation of brk and sbrk sys_break().
The break() system call was renamed (several times) starting in v3
AT&T UNIX when C was invented and break was a language keyword. The
last vestige of a need for it to be called something else (e.g., obreak)
was removed in r225617, which consistently prefixed all syscall
implementations.

Reviewed by:	emaste, kib (older version)
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D15638
2018-06-14 21:27:25 +00:00
Jonathan T. Looney
0766f278d8 Make UMA and malloc(9) return non-executable memory in most cases.
Most kernel memory that is allocated after boot does not need to be
executable.  There are a few exceptions.  For example, kernel modules
do need executable memory, but they don't use UMA or malloc(9).  The
BPF JIT compiler also needs executable memory and did use malloc(9)
until r317072.

(Note that a side effect of r316767 was that the "small allocation"
path in UMA on amd64 already returned non-executable memory.  This
meant that some calls to malloc(9) or the UMA zone(9) allocator could
return executable memory, while others could return non-executable
memory.  This change makes the behavior consistent.)

This change makes malloc(9) return non-executable memory unless the new
M_EXEC flag is specified.  After this change, the UMA zone(9) allocator
will always return non-executable memory, and a KASSERT will catch
attempts to use the M_EXEC flag to allocate executable memory using
uma_zalloc() or its variants.

Allocations that do need executable memory have various choices.  They
may use the M_EXEC flag to malloc(9), or they may use a different VM
interface to obtain executable pages.

Now that malloc(9) again allows executable allocations, this change also
reverts most of r317072.

PR:		228927
Reviewed by:	alc, kib, markj, jhb (previous version)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D15691
2018-06-13 17:04:41 +00:00
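
A short sketch of an allocation that genuinely needs executable memory opting in with the new flag; the helper name is illustrative and M_TEMP is used only to keep the example self-contained:

#include <sys/param.h>
#include <sys/malloc.h>

static void *
alloc_code_buffer(size_t len)
{
    /* Without M_EXEC, malloc(9) now returns non-executable memory. */
    return (malloc(len, M_TEMP, M_WAITOK | M_EXEC));
}
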
Warner Losh
a971acbc25 Implement a 'car limit' for bioq.
Allow one to implement a 'car limit' for
bioq_disksort. debug.bioq_batchsize sets the size of the car limit. Every
time we queue that many requests, we start over so that we limit the
latency for requests when the software queue depths are large. A value
of '0', the default, means to revert to the old behavior.

Sponsored by: Netflix
2018-06-13 16:48:07 +00:00
Bruce Evans
ab35e1c71b Fix the encoding of major and minor numbers in 64-bit dev_t by restoring
the old encodings for the lower 16 and 32 bits and only using the
higher 32 bits for unusually large major and minor numbers.  This
change breaks compatibility with the previous encoding (which was only
used in -current).

Fix truncation to (essentially) 16-bit dev_t in newnfs v3.

Any encoding of device numbers gives an ABI, so it can't be changed
without translations for compatibility.  Extra bits give the much
larger complication that the translations need to compress into fewer
bits.  Fortunately, more than 32 bits are rarely needed, so
compression is rarely needed except for 16-bit linux dev_t where it
was always needed but never done.

The previous encoding moved the major number into the top 32 bits.
Almost no translation code handled this, so the major number was blindly
truncated away in most 32-bit encodings.  E.g., for ffs, mknod(8) with
major = 1 and minor = 2 gave dev_t = 0x10000002; ffs cannot represent
this and blindly truncated it to 2.  But if this mknod was run on any
released version of FreeBSD, it gives dev_t = 0x102.  ffs can represent
this, but in the previous encoding it was not decoded, giving major = 0,
minor = 0x102.

The presence of bugs was most obvious for exporting dev_t's from an
old system to -current, since bugs in newnfs augment them.  I fixed
oldnfs to support 32-bit dev_t in 1996 (r16634), but this regressed
to 16-bit dev_t in newnfs, first to the old 16-bit encoding and then
further in -current.  E.g., old ad0 with major = 234, minor = 0x10002
had the correct (major, minor) number on the wire, but newnfs truncated
this to (234, 2) and then the previous encoding shifted the major
number into oblivion as seen by ffs or old applications.

I first tried to fix this by translating on every ABI/API boundary, but
there are too many boundaries and too many sloppy translations by blind
truncation.  So use the old encoding for the low 32 bits so that sloppy
translations work no worse than before provided the high 32 bits are
not set.  Add some error checking for when bits are lost.  Keep not
doing any error checking for translations for almost everything in
compat/linux.

compat/freebsd32/freebsd32_misc.c:
Optionally check for losing bits after possibly-truncating assignments as
before.

compat/linux/linux_stats.c:
Depend on the representation being compatible with Linux's (or just with
itself for local use) and spell some of the translations as assignments in
a macro that hides the details.

fs/nfsclient/nfs_clcomsubs.c:
Essentially the same fix as in 1996, except there is now no possible
truncation in makedev() itself.  Also fix nearby style bugs.

kern/vfs_syscalls.c:
As for freebsd32.  Also update the sysctl description to include file
numbers, and change it to describe device ids as device numbers.

sys/types.h:
Use inline functions (wrapped by macros) since the expressions are now
a bit too complicated for plain macros.  Describe the encoding and
some of the reasons for it.  16-bit compatibility didn't leave many
reasonable choices for the 32-bit encoding, and 32-bit compatibility
doesn't leave many reasonable choices for the 64-bit encoding.  My
choice is to put the 8 new minor bits in the low 8 bits of the top 32
bits.  This minimizes discontiguities.

Reviewed by:	kib (except for rewrite of the comment in linux_stats.c)
2018-06-13 12:22:00 +00:00
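
A hedged reconstruction of the layout described above, with the old 32-bit encoding preserved in the low word, the extra major bits in the top 24 bits, and the 8 extra minor bits in the low 8 bits of the upper half. The masks are my reading of the description, not a verbatim copy of sys/types.h:

#include <stdint.h>

/*
 * dev bits  0..7   minor bits  0..7   (old encoding)
 * dev bits  8..15  major bits  0..7   (old encoding)
 * dev bits 16..31  minor bits 16..31  (old encoding)
 * dev bits 32..39  minor bits  8..15  (new)
 * dev bits 40..63  major bits  8..31  (new)
 */
static uint64_t
sketch_makedev(uint32_t maj, uint32_t min)
{
    return (((uint64_t)(maj & 0xffffff00) << 32) |
        ((uint64_t)(maj & 0xff) << 8) |
        (((uint64_t)min & 0xff00) << 24) |
        ((uint64_t)min & 0xffff00ff));
}
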
Bruce Evans
372639f944 Fix some bugs found while fixing the representation and translation
of 64-bit dev_t's (but not ones involving dev_t's).

st_size was supposed to be clamped in cvtstat() and linux's copy_stat(),
but the clamping code wasn't aware that st_size is signed, and also had
an obfuscated off-by-1 value for the unsigned limit, so its effect was
to produce a bizarre negative size instead of clamping.

Change freebsd32's copy_ostat() to be no worse than cvtstat().  It was
missing clamping and bzero()ing of padding.

Reviewed by:	kib (except a final fix of the clamp to the signed maximum)
2018-06-13 08:50:43 +00:00
Ed Maste
00ce0c6258 makesyscalls: simplify capenabled pipeline
Replace cat + 2x grep with one grep.

Sponsored by:	Turing Robotic Industries
2018-06-11 18:57:40 +00:00
Matt Macy
0ea9d9376e limit change to fixing controlp handling pending review 2018-06-11 17:10:19 +00:00
Matt Macy
c34bf30069 soreceive_stream: correctly handle edge cases
- a non-NULL controlp is not an error; returning EINVAL
  would cause X forwarding to fail

- MSG_PEEK and MSG_WAITALL are fairly exceptional, but we still
  want to handle them - punt to soreceive_generic
2018-06-11 16:31:42 +00:00
Mateusz Guzik
0001edb823 counter: add a bit missed in r334858
It happens to be a noop.
2018-06-08 22:06:32 +00:00
Matt Macy
a62b4665f4 AF_UNIX: bring uipc_ready in compliance with new locking protocol
PR:	228742
Submitted by: markj
Reviewed by:	markj
2018-06-08 20:31:59 +00:00
Jonathan T. Looney
1fbe13cf4b Add a socket destructor callback. This allows kernel providers to set
callbacks to perform additional cleanup actions at the time a socket is
closed.

Michio Honda presented a use for this at BSDCan 2018.
(See https://www.bsdcan.org/2018/schedule/events/965.en.html .)

Submitted by:	Michio Honda <micchie at sfc.wide.ad.jp> (previous version)
Reviewed by:	lstewart (previous version)
Differential Revision:	https://reviews.freebsd.org/D15706
2018-06-08 19:35:24 +00:00
Mateusz Guzik
b8af2820f6 uma: fix up r334824
Turns out there is code which ends up passing M_ZERO to counters.
Since counters zero unconditionally on their own, just ignore drop the
flag in that place.
2018-06-08 05:40:36 +00:00
Matt Macy
eb7c901995 hwpmc: simplify calling convention for hwpmc interrupt handling
pmc_process_interrupt takes 5 arguments when only 3 are needed.
cpu is always available in curcpu and inuserspace can always be
derived from the passed trapframe.

While facially a reasonable cleanup this change was motivated
by the need to work around a compiler bug.

core2_intr(cpu, tf) ->
  pmc_process_interrupt(cpu, ring, pmc, tf, inuserspace) ->
    pmc_add_sample(cpu, ring, pm, tf, inuserspace)

In the process of optimizing the tail call the tf pointer was getting
clobbered:

(kgdb) up
    at /storage/mmacy/devel/freebsd/sys/dev/hwpmc/hwpmc_mod.c:4709
4709                                pmc_save_kernel_callchain(ps->ps_pc,
(kgdb) up
1205                    error = pmc_process_interrupt(cpu, PMC_HR, pm, tf,

resulting in a crash in pmc_save_kernel_callchain.
2018-06-08 04:58:03 +00:00
Randall Stewart
89e560f441 This commit brings in a new refactored TCP stack called Rack.
Rack includes the following features:
 - A different SACK processing scheme (the old sack structures are not used).
 - RACK (Recent acknowledgment), where counting dup-acks is no longer done;
        instead time is used to know when to retransmit. (see the I-D)
 - TLP (Tail Loss Probe), where we will probe for tail losses to try
        to avoid taking a retransmit time-out. (see the I-D)
 - Burst mitigation using TCPHPTS
 - PRR (Proportional Rate Reduction), see the RFC.

Once built into your kernel, you can select this stack either via a
socket option giving the stack name "rack" or by setting the global
sysctl so that the default is rack.

Note that any connection that does not support SACK will be kicked
back to the base FreeBSD stack (currently known as "default").

To build this into your kernel you will need to enable in your
kernel:
   makeoptions WITH_EXTRA_TCP_STACKS=1
   options TCPHPTS

Sponsored by:	Netflix Inc.
Differential Revision:		https://reviews.freebsd.org/D15525
2018-06-07 18:18:13 +00:00
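
A hedged userland sketch of selecting the stack per socket through the TCP stack-switching framework; the TCP_FUNCTION_BLK option and struct tcp_function_set reflect my understanding of that framework and should be checked against tcp(4). Setting the sysctl net.inet.tcp.functions_default to rack would be the system-wide alternative mentioned above:

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>

static int
use_rack(int s)
{
    struct tcp_function_set tfs;

    memset(&tfs, 0, sizeof(tfs));
    strlcpy(tfs.function_set_name, "rack", sizeof(tfs.function_set_name));
    return (setsockopt(s, IPPROTO_TCP, TCP_FUNCTION_BLK, &tfs,
        sizeof(tfs)));
}
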
Alan Cox
5b274055d1 When pidctrl_daemon() is called multiple times within an interval, it
should use the cumulative error to calculate the output.
2018-06-07 07:48:50 +00:00
Matt Macy
fcabd54160 AF_UNIX: check for unp == unp2 on disconnect 2018-06-07 04:57:40 +00:00
Alan Cox
e768070ca9 pidctrl_daemon() implements a variation on the classical, discrete PID
controller that tries to handle early invocations of the controller,
in other words, invocations before the expected end of the interval.
However, there were some calculation errors in this early invocation
case.  Notably, if an early invocation occurred while the error was
negative, the derivative term was off by a large amount.  One visible
effect of this error was that processes were being killed by the
virtual memory system's OOM killer when in fact there was plentiful
free memory.

Correct a couple minor errors in the sysctl descriptions, and apply
some style fixes.

Reviewed by:	jeff, markj
2018-06-07 02:54:11 +00:00
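
For context on the two pidctrl fixes above, a generic discrete PID update in its textbook form, not the kernel's pidctrl(9) code; gains are expressed as integer divisors, a common kernel convention, and all names are illustrative:

struct pid_state {
    int integral;       /* accumulated error */
    int prev_error;     /* error from the previous interval */
};

static int
pid_update(struct pid_state *ps, int error, int kp, int ki, int kd)
{
    int derivative, output;

    ps->integral += error;                  /* I term input */
    derivative = error - ps->prev_error;    /* D term input */
    ps->prev_error = error;

    output = error / kp + ps->integral / ki + derivative / kd;
    return (output);
}
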
Sean Bruno
1a43cff92a Load balance sockets with new SO_REUSEPORT_LB option.
This patch adds a new socket option, SO_REUSEPORT_LB, which allows multiple
programs or threads to bind to the same port and incoming connections will be
load balanced using a hash function.

Most of the code was copied from a similar patch for DragonflyBSD.

However, in DragonflyBSD, load balancing is a global on/off setting and can not
be set per socket. This patch allows for simultaneous use of both the current
SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system.

Required changes to structures:
Globally change so_options from 16 to 32 bit value to allow for more options.
Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets.

Limitations:
As in DragonflyBSD, a load-balance group is limited to 256 pcbs (256 programs or
threads sharing the same socket).

This is a substantially different contribution as compared to its original
incarnation at svn r332894 and reverted at svn r332967.  Thanks to rwatson@
for the substantive feedback that is included in this commit.

Submitted by:	Johannes Lundberg <johalun0@gmail.com>
Obtained from:	DragonflyBSD
Relnotes:	Yes
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D11003
2018-06-06 15:45:57 +00:00
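
A hedged userland sketch of a worker opting into the load-balancing group; each worker runs this independently against the same port, and error handling is abbreviated:

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>

static int
lb_listener(uint16_t port)
{
    struct sockaddr_in sin;
    int s, one = 1;

    s = socket(AF_INET, SOCK_STREAM, 0);
    if (s < 0)
        return (-1);
    /* Must be set before bind(); the kernel hashes connections across the group. */
    setsockopt(s, SOL_SOCKET, SO_REUSEPORT_LB, &one, sizeof(one));

    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_len = sizeof(sin);
    sin.sin_port = htons(port);
    if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) != 0 ||
        listen(s, 128) != 0)
        return (-1);
    return (s);
}
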
Justin Hibbits
3f9e1fc8ee Revert r334708
This is the wrong place to put the barrier.
Requested by:	kib,mjg
2018-06-06 15:12:19 +00:00
Justin Hibbits
32c369f40c Add a memory barrier after taking a reference on the vnode holdcnt in _vhold
This is needed to avoid a race between the VNASSERT() below, and another
thread updating the VI_FREE flag, on weakly-ordered architectures.

On a 72-thread POWER9, without this barrier a 'make -j72 buildworld' would
panic on the assert regularly.

It may be possible to use a weaker barrier, and I'll investigate that once
all stability issues are worked out on POWER9.
2018-06-06 12:57:11 +00:00
Matt Macy
ebfaf69cc0 hwpmc: log name->pid, name->tid mappings
By logging all threads and processes 'pmc filter'
can now filter on process or thread name, relieving
the user of the burden of determining which tid or
pid was which when the sample was taken.

% pmc filter -T if_io_tqg -P nginx pmc.log pmc-iflib.log

% pmc filter -x -T idle pmc.log pmc-noidle.log
2018-06-05 04:26:40 +00:00
Mark Johnston
97bc9a9384 Regen after r334626. 2018-06-04 19:36:47 +00:00
Mark Johnston
9f9c9b22ec Reimplement brk() and sbrk() to avoid the use of _end.
Previously, libc.so would initialize its notion of the break address
using _end, a special symbol emitted by the static linker following
the bss section.  Compatibility issues between lld and ld.bfd could
cause the wrong definition of _end (libc.so's definition rather than
that of the executable) to be used, breaking the brk()/sbrk()
interface.

Avoid this problem and future interoperability issues by simply not
relying on _end.  Instead, modify the break() system call to return
the kernel's view of the current break address, and have libc
initialize its state using an extra syscall upon the first use of the
interface.  As a side effect, this appears to fix brk()/sbrk() usage
in executables run with rtld direct exec, since the kernel and libc.so
no longer maintain separate views of the process' break address.

PR:		228574
Reviewed by:	kib (previous version)
MFC after:	2 months
Differential Revision:	https://reviews.freebsd.org/D15663
2018-06-04 19:35:15 +00:00
Alan Cox
3e7cb27cdd Use a single, consistent approach to returning success versus failure in
vm_map_madvise().  Previously, vm_map_madvise() used a traditional Unix-
style "return (0);" to indicate success in the common case, but Mach-
style return values in the edge cases.  Since KERN_SUCCESS equals zero,
the only problem with this inconsistency was stylistic.  vm_map_madvise()
has exactly two callers in the entire source tree, and only one of them
cares about the return value.  That caller, kern_madvise(), can be
simplified if vm_map_madvise() consistently uses Unix-style return
values.

Since vm_map_madvise() uses the variable modify_map as a Boolean, make it
one.

Eliminate a redundant error check from kern_madvise().  Add a comment
explaining where the check is performed.

Explicitly note that exec_release_args_kva() doesn't care about
vm_map_madvise()'s return value.  Since MADV_FREE is passed as the
behavior, the return value will always be zero.

Reviewed by:	kib, markj
MFC after:	7 days
2018-06-04 16:28:06 +00:00
Matt Macy
5de96e33d6 hwpmc: support sampling both kernel and user stacks when interrupted in kernel
This adds the -U option to pmcstat, which will attribute in-kernel samples
back to the user stack that invoked the system call. It is not the default,
because when looking at kernel profiles it is generally more desirable to
merge all instances of a given system call together.

Although heavily revised, this change is directly derived from D7350 by
Jonathan T. Looney.

Obtained from: jtl
Sponsored by: Juniper Networks, Limelight Networks
2018-06-04 01:10:23 +00:00
Mateusz Guzik
d0a22279db Remove an unused argument to turnstile_unpend.
PR:	228694
Submitted by:	Julian Pszczołowski <julian.pszczolowski@gmail.com>
2018-06-02 22:37:53 +00:00
Mateusz Guzik
34c538c356 malloc: try to use builtins for zeroing at the callsite
Plenty of allocation sites pass M_ZERO and sizes which are small and known
at compilation time. Handling them internally in malloc loses this information
and results in avoidable calls to memset.

Instead, let the compiler take advantage of it whenever possible.

Discussed with:	jeff
2018-06-02 22:20:09 +00:00
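
A sketch of the kind of call site this targets: with a compile-time-constant size, moving the zeroing to the call site lets the compiler replace the memset with a few inline stores. struct small_cfg is illustrative and M_TEMP keeps the example self-contained:

#include <sys/param.h>
#include <sys/malloc.h>

struct small_cfg {
    int flags;
    int count;
};

static struct small_cfg *
new_cfg(void)
{
    /* sizeof() is known at compile time, so the M_ZERO work can be inlined. */
    return (malloc(sizeof(struct small_cfg), M_TEMP, M_WAITOK | M_ZERO));
}
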
Mark Johnston
3fb14f61e1 Avoid completing I/O when dumping core after a panic.
Filesystem or pager completion callbacks are generally non-functional
after a panic and may trigger deadlocks if invoked in this context
(e.g., by attempting to destroy a buffer mapping).  To avoid this
situation, short-circuit I/O completion in biodone().

Reviewed by:	imp
Discussed with:	mav
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D15592
2018-06-01 23:49:32 +00:00
Ed Maste
b8d908b71e ANSIfy sys/kern 2018-06-01 13:26:45 +00:00
Warner Losh
c580ca4cf4 Make the data returned by devinfo harder to overflow.
Rather than using fixed-length strings, pack them into a string table
to return. Also expand the buffer from ~300 characters to 3k. This should
be enough, even for USB.

This fixes a problem where USB pnp info is truncated on return to
userland.

Differential Revision: https://reviews.freebsd.org/D15629
2018-05-31 02:57:58 +00:00
Brooks Davis
64b378f1e1 Remove alternative names that are identical to the default.
Verified by make sysent producing no changes.
2018-05-30 22:22:58 +00:00
Ed Maste
f912a970e6 link_elf_obj: correct an error message
Previously we'd report that a file has "no valid symbol table" if it in
fact had two or more.  Change the message to report that there must be
exactly one.
2018-05-30 12:55:27 +00:00
Matt Macy
e445381f13 epoch(9): make epoch closer to style(9) 2018-05-30 03:39:57 +00:00
Stephen Hurd
3e0e6330b5 iflib: mark irq allocation name parameter as constant
The *name parameter passed to iflib_irq_alloc_generic and
iflib_softirq_alloc_generic is never modified. Many places in the code pass
string literals, which must not be modified.

Mark the *name parameter as a const char * instead, so that we enforce
that the name is not modified before passing to bus_describe_intr()

Submitted by:	Jacob Keller <jacob.e.keller@intel.com>
Reviewed by:	kmacy
Sponsored by:	Intel Corporation
Differential Revision:	https://reviews.freebsd.org/D15343
2018-05-29 21:56:39 +00:00
Gleb Smirnoff
147cd40fe7 Revert the second chunk of r333860. The warning from gcc is a false positive.
npages will never be used in the no-space case.
2018-05-29 21:45:15 +00:00
Matt Macy
b99aa0fbb2 hwpmc: don't enter epoch section across mmap hook 2018-05-29 18:03:48 +00:00
Brooks Davis
d8b2f0790b Correct pointer subtraction in KASSERT().
The assertion would never fire without truly spectacular future
programming errors.

Reported by:	Coverity
CID:		1391367, 1391368
Sponsored by:	DARPA, AFRL
2018-05-29 17:49:03 +00:00
Andriy Gapon
ec6faf94c4 add support for console resuming, implement it for uart, use on x86
This change adds a new optional console method cn_resume and a kernel
console interface cnresume.  Consoles that may need to re-initialize
their hardware after suspend (e.g., because firmware does not care to do
it) will implement cn_resume.  Note that it is called in rather early
environment not unlike early boot, so the same restrictions apply.
Platform specific code, for platforms that support hardware suspend,
should call cnresume early after resume, before any console output is
expected.

This change fixes a problem with a system of mine failing to resume when
a serial console is used.  I found that the serial port was in a strange
configuration and an attempt to write to it likely resulted in an
infinite loop.

To avoid adding cn_resume method to every console driver, CONSOLE_DRIVER
macro has been extended to support optional methods.

Reviewed by:	imp, mav
MFC after:	3 weeks
Differential Revision: https://reviews.freebsd.org/D15552
2018-05-29 16:16:24 +00:00
Matt Macy
552b3e1798 witness/hwpmc: fix locking order for pmc locks 2018-05-28 23:14:38 +00:00
Eric van Gyzen
70d66bcf28 kern_cpuset: fix small leak on error path
The "mask" was leaked on some error paths.

Reported by:	Coverity
CID:		1384683
Sponsored by:	Dell EMC
2018-05-26 14:23:11 +00:00
Eric van Gyzen
16b51429d2 kdb_trap: Fix use of uninitialized data
In some cases, other_cpus was used without being initialized.
Thankfully, it was harmless.

Reported by:	Coverity
CID:		1385265
Sponsored by:	Dell EMC
2018-05-26 14:01:44 +00:00
Brooks Davis
659a2e9243 Regen after r334223: make vadvise compat freebsd11. 2018-05-25 20:41:26 +00:00
Brooks Davis
7351a8bdb5 Make vadvise compat freebsd11.
The vadvise syscall (aka ovadvise) is undocumented and has always been
implemented as returning EINVAL.  Put the syscall under COMPAT11 and
provide a userspace implementation.

Reviewed by:	kib
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D15557
2018-05-25 20:40:23 +00:00
Matt Macy
acf9fd05d8 AF_UNIX: It is possible for UNIX datagram sockets to be connected
to themselves. The updated code assumed that that could not happen
and would try to lock the unp mutex twice.

There may be a lingering issue here but this fixes it for the
reporter.

PR:	228458
Reported by:	marieheleneka at gmail.com
2018-05-24 21:13:46 +00:00
Matt Macy
c684c14ce3 AF_UNIX: evidently Samba likes to connect a unix socket to itself, fix locking 2018-05-24 18:22:13 +00:00
Matt Macy
a3a734908b AF_UNIX in connectat unp and unp2 can be the same 2018-05-24 18:22:05 +00:00
Conrad Meyer
a0638b33f7 Yank crufty INTR_FILTER option
It was introduced to the tree in r169320 and r169321 in May 2007.

It never got much use and never became a kernel default.  The code
duplicates the default path quite a bit, with slight modifications.  Just
yank out the cruft.  Whatever goals were being aimed for can probably be met
within the existing framework, without a flag day option.

Mostly mechanical change: 'unifdef -m -UINTR_FILTER'.

Reviewed by:	mmacy
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D15546
2018-05-24 17:06:00 +00:00
Brooks Davis
5f77b8a88b Avoid two suword() calls per auxarg entry.
Instead, construct an auxargs array and copy it out all at once.

Use an array of Elf_Auxinfo rather than pairs of Elf_Addr * to represent
the array. This is the correct type where pairs of words just happened
to work. To reduce the size of the diff, AUXARGS_ENTRY is altered to act
on this array rather than introducing a new macro.

Return errors on copyout() and suword() failures and handle them in the
caller.

Incidentally fixes AT_RANDOM and AT_EXECFN in 32-bit linux on amd64
which incorrectly used AUXARG_ENTRY instead of AUXARGS_ENTRY_32
(now removed due to the use of proper types).

Reviewed by:	kib
Comments from:	emaste, jhb
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D15485
2018-05-24 16:25:18 +00:00
Bjoern A. Zeeb
bb8f162363 Try to be consistent and spell "vnet" lower case like all the
other options (and as we do on command line).

Sponsored by:	iXsystems, Inc.
2018-05-24 15:31:05 +00:00
Bjoern A. Zeeb
36b41cc336 Improve the KASSERT to also have the prison pointer.
Helpful when debugging from ddb.

Sponsored by:		iXsystems, Inc.
2018-05-24 15:28:21 +00:00
Matt Macy
16529dace8 AF_UNIX: assert that we're not acquiring the same lock 2018-05-24 15:28:16 +00:00
Mateusz Guzik
6fee84e35e Remove incorrect owepreempt assertion added in r334062
Yet another preemption request hitting between the counter being 0
and the check being reached will result in the flag no longer being
set.

Note the situation was already present prior to r334062 and is harmless.

Reported by:	pho
Reviewed by:	kib
2018-05-23 10:13:17 +00:00
Matt Macy
8a656309b3 kern_sendit: use pre-initialized rights 2018-05-23 01:48:09 +00:00
Mateusz Guzik
748b15fc02 Move preemption handling out of critical_exit.
In preparation for making the enter/exit pair inline.

Reviewed by:	kib
2018-05-22 19:24:57 +00:00
Fabien Thomas
f8e73c47d8 Add a SPD cache to speed up lookups.
When large SPDs are used, we face two problems:

- too many CPU cycles are spent during the linear searches in the SPD
  for each packet
- too much contention on multi socket systems, since we use a single
  shared lock.

Main changes:

- added the sysctl tree 'net.key.spdcache' to control the SPD cache
  (disabled by default).
- cache the sp indexes that are used to perform SP lookups.
- use a range of dedicated mutexes to protect the cache lines.

Submitted by: Emeric Poupon <emeric.poupon@stormshield.eu>
Reviewed by: ae
Sponsored by:	Stormshield
Differential Revision: https://reviews.freebsd.org/D15050
2018-05-22 15:54:25 +00:00
Mateusz Guzik
ee252fc995 sx: fixup a braino in r334024
If a thread waiting on sx dropped Giant it would not be properly
reacquired on exit from the routine, later resulting in panics
indicating Giant is not held (when it should be).

The bug was not present in the original patch sent to pho, I wittingly
added it just prior to the commit and only smoke-tested it.

Reported by:	pho
2018-05-22 15:13:25 +00:00
Mateusz Guzik
99ece3a9cd Reduce sdt-related branch-fest in mi_switch.
The code was evaluating flags before resorting to checking if dtrace is
enabled. This was inducing forward jumps in the common case.
2018-05-22 08:27:33 +00:00
Mateusz Guzik
2466d12b09 sx: port over writer starvation prevention measures from rwlock
A constant stream of readers could completely starve writers and this is not
a hypothetical scenario.

The 'poll2_threads' test from the will-it-scale suite reliably starves writers
even with concurrency < 10 threads.

The problem was run into and diagnosed by dillon@backplane.com

There was next to no change in lock contention profile during -j 128 pkg build,
despite an sx lock being at the top.

Tested by:	pho
2018-05-22 07:20:22 +00:00
Mateusz Guzik
9feec7ef69 rw: decrease writer starvation
Writers waiting on readers to finish can set the RW_LOCK_WRITE_SPINNER
bit. This prevents most new readers from getting in. However, the last
reader to unlock also clears the bit which means new readers can sneak
in and the cycle starts over.

Change the code to keep the bit after last unlock.

Note that starvation potential is still there: no matter how many write
spinners are there, there is one bit. After the writer unlocks, the lock
is free to get raided by readers again. It is good enough for the time
being.

The real fix would include counting writers.

This runs into a caveat: the writer which set the bit may now be preempted.
In order to get rid of the problem, all attempts to set the bit are preceded
with critical_enter.

The bit gets cleared when the thread which set it goes to sleep. This way
an invariant holds that if the bit is set, someone is actively spinning and
will grab the lock soon. In particular this means that readers which find
the lock in this transient state can safely spin until the lock finds itself
an owner (i.e., they don't need to block or guess how long to spin).

Tested by:	pho
2018-05-22 07:16:39 +00:00
Andriy Gapon
27dca831a6 stop and restart kernel event timers in the suspend / resume cycle
I have a system that is very unstable after resuming from suspend-to-RAM
but only if HPET is used as the event timer.  The theory is that SMM
code / firmware could be enabling HPET for its own uses and unexpected
interrupts cause trouble for it.  Originally I wanted to solve the
problem in hpet_suspend() method, but that was insufficient as the event
timer could get reprogrammed again.

So, it's better, for my case and in general, to stop the event timer(s)
before entering the hardware suspend.

MFC after:	4 weeks
Differential Revision: https://reviews.freebsd.org/D15413
2018-05-21 20:23:04 +00:00
Mark Johnston
13679ebac9 Don't pass a section cookie to CK for non-preemptible epoch sections.
They're only useful when multiple threads may share an epoch record,
and that can't happen with non-preemptible sections.

Reviewed by:	mmacy
Differential Revision:	https://reviews.freebsd.org/D15507
2018-05-21 16:03:51 +00:00
Matt Macy
9725c9cef7 AF_UNIX gc unused label
...sigh
2018-05-20 21:37:34 +00:00
Matt Macy
cb8f450b94 AF_UNIX: Don't unlock unp/unp2 if they're not locked
Reported by:	mjg
2018-05-20 21:20:26 +00:00
Matt Macy
e10ef65d23 AF_UNIX: fix LOR introduced by the locking rewrite 2018-05-20 05:50:53 +00:00
Matt Macy
7118990962 Add additional preinitialized cap_rights 2018-05-20 05:13:12 +00:00
Mateusz Guzik
2186ee6e72 vfs: simplify vop_stdlock/unlock
The interlock pointer is non-NULL by definition and the compiler sees through
that and eliminates the NULL checks. Just remove them from the code as they
play no role.

No difference in generated assembly.
2018-05-20 04:45:05 +00:00
Matt Macy
d95253403f AF_UNIX: make unpcb lock name line up with what's in witness 2018-05-20 04:32:48 +00:00
Warner Losh
f344fb0b4b Restore the all rights reserved language. Put it on each of the prior
two copyrights. The line originated with the Berkeely Regents, who
we have not approached about removing it (it's honestly too trivial
to be worth that fight). Restore it to rwatson's line as well. He
can decide if he wants it or not on his own. Matt clearly doesn't
want it, per project preference and his own statements on IRC.

Noticed by: rgrimes@
2018-05-19 17:29:57 +00:00
Ed Maste
1b30e10e48 Remove duplicate cap_no_rights from r333874
Archs using in-tree gcc were broken with `warning: redundant
redeclaration of 'cap_no_rights' [-Wredundant-decls]`.

Sponsored by:	The FreeBSD Foundation
2018-05-19 11:37:02 +00:00
Matt Macy
f6a1a10613 Unbreak BeagleBone Black boot by collapsing 29 SYSINITs into 1
Reported by:	ilya at bakulin.de
2018-05-19 07:31:35 +00:00
Matt Macy
fc2e87be2b intr unbreak KTR/LINT build 2018-05-19 07:04:43 +00:00
Matt Macy
4b06dee1e5 AF_UNIX: switch to annotations to avoid warnings 2018-05-19 05:37:58 +00:00
Matt Macy
acbde29858 capsicum: propagate const correctness 2018-05-19 05:14:05 +00:00
Matt Macy
ba3f7276c0 intr: eliminate / annotate unused stack locals 2018-05-19 05:12:18 +00:00
Matt Macy
7fd6841438 sendfile: annotate unused value and ensure that npages is actually initialized 2018-05-19 05:10:51 +00:00
Matt Macy
e1a92f058f umtx: don't call umtxq_getchain unless the value is needed 2018-05-19 05:09:10 +00:00
Matt Macy
a6c7423a92 cpuset: revert and annotate instead 2018-05-19 05:07:31 +00:00
Matt Macy
6fa5abfdda conf: revert last change and annotate unused var instead 2018-05-19 05:07:03 +00:00
Matt Macy
1c0336c1c1 kevent: annotate unused stack local 2018-05-19 05:06:18 +00:00
Matt Macy
788390df0a lockf: annotate LOCKF_DEBUG only var 2018-05-19 05:04:38 +00:00
Matt Macy
d1230b1159 capsicum: annotate variable only used by debug 2018-05-19 05:02:40 +00:00
Matt Macy
3adccf38e3 turnstile / sleepqueue: annotate variables only used by debug builds 2018-05-19 05:00:16 +00:00
Matt Macy
84482abd21 vfs: annotate variables only used by debug builds as __unused 2018-05-19 04:59:39 +00:00
Matt Macy
a2bb4e080e tty: use __unused annotation instead to silence warnings 2018-05-19 04:48:26 +00:00
Matt Macy
5072a5f465 malloc: avoid possibly returning stack garbage if MALLOC_DEBUG is defined 2018-05-19 04:43:49 +00:00
Matt Macy
39eef2f45a cpuset_thread0: avoid unused assignment on non debug build 2018-05-19 04:14:00 +00:00
Matt Macy
926cfdb8da make_dev: avoid unused assignments on non debug builds 2018-05-19 04:13:20 +00:00
Matt Macy
4949ad7264 mqueue: avoid unused variables 2018-05-19 04:10:53 +00:00
Matt Macy
cd6ba3f086 physio: avoid uninitialized variables 2018-05-19 04:09:58 +00:00
Matt Macy
e9b1074bc7 cache_lookup remove unused variable and initialize used 2018-05-19 04:08:11 +00:00
Matt Macy
ec8d23352b filt_timerdetach: only assign to old if we're going to check it in
a KASSERT
2018-05-19 04:07:00 +00:00
Matt Macy
5cc2d25a2b getnextevent: put variable only used by KTR under ifdef KTR 2018-05-19 04:05:36 +00:00
Matt Macy
bfd0eacb02 simplify control flow so that gcc knows we never pass save to curthread_pflags_restore
without initializing
2018-05-19 04:04:44 +00:00
Matt Macy
3ef78c9c96 tty: conditionally assign to ret value only used by MPASS statement 2018-05-19 04:02:29 +00:00
Matt Macy
02fe8a2409 remove unused locked variable in lockmgr_unlock_fast_path 2018-05-19 03:58:40 +00:00
Matt Macy
ddd4d15ecd signotify: don't create a stack local that isn't used on non-debug builds 2018-05-19 03:57:41 +00:00
Matt Macy
46117e1f0c sysv_msg initialize saved_msgsz 2018-05-19 03:56:39 +00:00
Matt Macy
11d4f748d7 remove unused variable 2018-05-19 03:55:42 +00:00
Matt Macy
1dce110f63 fix uninitialized variable warning in reader locks 2018-05-19 03:52:55 +00:00
Matt Macy
b203713694 fix uninitialized variable warning 2018-05-19 03:49:36 +00:00
Matt Macy
ac8b2d5cb1 sys_process.c fix set but not used warning 2018-05-19 03:48:35 +00:00
Matt Macy
e339e43685 subr_epoch.c fix unused variable warnings 2018-05-19 03:47:37 +00:00
Matt Macy
ae6be8e6f7 pidctrl Actually use the variables that we assign to as seatbelts to prevent divide
by zero

Reviewed by:	jeffr
2018-05-19 02:17:18 +00:00
Matt Macy
c0874c3468 fix gcc8 unused variable and set but not used variable in unix sockets
add copyright from lock rewrite while here
2018-05-19 02:15:40 +00:00
Mateusz Guzik
10391db530 lockmgr: avoid atomic on unlock in the slow path
The code is pretty much guaranteed not to be able to unlock.

This is a minor nit. The code still performs way too many reads.
The altered exclusive-locked condition is supposed to be always
true as well, to be cleaned up at a later date.
2018-05-18 22:57:52 +00:00
Matt Macy
d7c5a620e2 ifnet: Replace if_addr_lock rwlock with epoch + mutex
Run on LLNW canaries and tested by pho@

gallatin:
Using a 14-core, 28-HTT single socket E5-2697 v3 with a 40GbE MLX5
based ConnectX 4-LX NIC, I see an almost 12% improvement in received
packet rate, and a larger improvement in bytes delivered all the way
to userspace.

With the host receiving 64 streams of netperf -H $DUT -t UDP_STREAM -- -m 1,
I see, using nstat -I mce0 1, before the patch:

InMpps OMpps  InGbs  OGbs err TCP Est %CPU syscalls csw     irq GBfree
4.98   0.00   4.42   0.00 4235592     33   83.80 4720653 2149771   1235 247.32
4.73   0.00   4.20   0.00 4025260     33   82.99 4724900 2139833   1204 247.32
4.72   0.00   4.20   0.00 4035252     33   82.14 4719162 2132023   1264 247.32
4.71   0.00   4.21   0.00 4073206     33   83.68 4744973 2123317   1347 247.32
4.72   0.00   4.21   0.00 4061118     33   80.82 4713615 2188091   1490 247.32
4.72   0.00   4.21   0.00 4051675     33   85.29 4727399 2109011   1205 247.32
4.73   0.00   4.21   0.00 4039056     33   84.65 4724735 2102603   1053 247.32

After the patch

InMpps OMpps  InGbs  OGbs err TCP Est %CPU syscalls csw     irq GBfree
5.43   0.00   4.20   0.00 3313143     33   84.96 5434214 1900162   2656 245.51
5.43   0.00   4.20   0.00 3308527     33   85.24 5439695 1809382   2521 245.51
5.42   0.00   4.19   0.00 3316778     33   87.54 5416028 1805835   2256 245.51
5.42   0.00   4.19   0.00 3317673     33   90.44 5426044 1763056   2332 245.51
5.42   0.00   4.19   0.00 3314839     33   88.11 5435732 1792218   2499 245.52
5.44   0.00   4.19   0.00 3293228     33   91.84 5426301 1668597   2121 245.52

Similarly, netperf reports 230Mb/s before the patch, and 270Mb/s after the patch

Reviewed by:	gallatin
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D15366
2018-05-18 20:13:34 +00:00
Matt Macy
20ba6811e6 epoch(9): assert that epoch is allocated post-configure 2018-05-18 18:27:17 +00:00
Ed Maste
891cf3ed44 Use NULL for SYSINIT's last arg, which is a pointer type
Sponsored by:	The FreeBSD Foundation
2018-05-18 17:58:09 +00:00
Matt Macy
70398c2f86 epoch(9): Make epochs non-preemptible by default
There are risks associated with waiting on a preemptible epoch section.
Change the name to make them not be the default and document the issue
under CAVEATS.

Reported by:	markj
2018-05-18 17:29:43 +00:00
Matt Macy
60b7b90d65 epoch: actually allocate the counters we've assigned sysctls to
Approved by:	sbruno
2018-05-18 02:57:39 +00:00
Matt Macy
5e68a3dfe3 epoch: add non-preemptible "critical" variant
adds:
- epoch_enter_critical() - can be called inside a different epoch;
  starts a section that will not acquire any MTX_DEF mutexes or do
  anything that might sleep.
- epoch_exit_critical() - corresponding exit call
- epoch_wait_critical() - wait variant for which it is guaranteed that any
  threads in a section are running.
- epoch_global_critical - an epoch_wait_critical safe epoch instance

Requested by:   markj
Approved by:	sbruno
2018-05-18 01:52:51 +00:00
Brooks Davis
dedc82ae26 Use strsep() to parse init_path in start_init().
This simplifies the use of the path variable by making it NUL
terminated.  This is a prerequisite for further cleanups.

Reviewed by:	imp
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D15467
2018-05-17 23:07:51 +00:00
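
A userland-style sketch of the strsep() pattern on an init_path-like string; the path list shown is illustrative and the kernel code differs in details such as error handling:

#include <stdio.h>
#include <string.h>

static void
walk_init_path(void)
{
    char buf[] = "/sbin/init:/sbin/oinit:/sbin/init.bak";
    char *rest = buf, *path;

    /* Each call carves off the next NUL-terminated component. */
    while ((path = strsep(&rest, ":")) != NULL) {
        if (*path == '\0')
            continue;   /* skip empty components */
        printf("try %s\n", path);
    }
}
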
Matt Macy
a5f1042498 epoch: skip poll function call in hardclock unless there are callbacks pending
Reported by:	mjg
Approved by:	sbruno
2018-05-17 21:39:15 +00:00
Matt Macy
c4d901e9bd epoch(9): schedule pcpu callback task in hardclock if there are callbacks pending
Approved by:	sbruno
2018-05-17 19:57:07 +00:00
Matt Macy
2a45e8282a epoch(9): eliminate the need to wait when polling for callbacks to run
by using ck's own callback handling mechanism we can simply check which
callbacks have had a grace period elapse

Approved by:	sbruno
2018-05-17 19:50:55 +00:00
Matt Macy
d1bcb409f6 epoch(9): fix potential deadlock
Don't acquire a waiting thread's lock while holding our own

Approved by:	sbruno
2018-05-17 19:41:58 +00:00
Matt Macy
766d225326 epoch(9): restore thread priority on exit if it was changed by a waiter
Reported by:	markj
Approved by:	sbruno
2018-05-17 19:08:28 +00:00
Matt Macy
75a67bf3d0 AF_UNIX: make unix socket locking finer grained
This change moves to using a reference count across lock drop / reacquire
to guarantee liveness.

Currently sends on unix sockets contend heavily on read locking the list lock.
unix1_processes in will-it-scale peaks at 6 processes and then declines.

With this change I get a substantial improvement in number of operations per second
with 96 processes:

x before
+ after
    N           Min           Max        Median           Avg        Stddev
x  11       1688420       1696389       1693578     1692766.3     2971.1702
+  10      63417955      71030114      70662504      69576423     2374684.6
Difference at 95.0% confidence
        6.78837e+07 +/- 1.49463e+06
        4010.22% +/- 88.4246%
        (Student's t, pooled s = 1.63437e+06)

And even for 2 processes shows a ~18% improvement.
"Small" iron changes (1, 2, and 4 processes):

x before1
+ after1.2
+------------------------------------------------------------------------+
|                                                                  +     |
|                                                           x      +     |
|                                                           x      +     |
|                                                           x      +     |
|                                                           x     ++     |
|                                                          xx     ++     |
|x                                                       x xx     ++     |
|                                  |__________________A_____M_____AM____||
+------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  10       1131648       1197750     1197138.5     1190369.3     20651.839
+  10       1203840       1205056       1204919     1204827.9     353.27404
Difference at 95.0% confidence
        14458.6 +/- 13723
        1.21463% +/- 1.16683%
        (Student's t, pooled s = 14605.2)

x before2
+ after2.2
+------------------------------------------------------------------------+
|                                                                       +|
|                                                                       +|
|                                                                       +|
|                                                                       +|
|                                                                       +|
|                                                                       +|
|           x                                                           +|
|           x                                                           +|
|         x xx                                                          +|
|x        xxxx                                                          +|
|      |___AM_|                                                         A|
+------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  10       1972843       2045866     2038186.5     2030443.8     21367.694
+  10       2400853       2402196     2401043.5     2401172.7     385.40024
Difference at 95.0% confidence
        370729 +/- 14198.9
        18.2585% +/- 0.826943%
        (Student's t, pooled s = 15111.7)

x before4
+ after4.2
    N           Min           Max        Median           Avg        Stddev
x  10       3986994       3991728     3990137.5     3989985.2     1300.0164
+  10       4799990       4806664     4806116.5       4805194     1990.6625
Difference at 95.0% confidence
        815209 +/- 1579.64
        20.4314% +/- 0.0421713%
        (Student's t, pooled s = 1681.19)

Tested by: pho
Reported by:	mjg
Approved by:	sbruno
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D15430
2018-05-17 17:59:35 +00:00
Matt Macy
fdf71aeb54 epoch(9): make recursion lighter weight
There isn't any real work to do except bump td_epochnest when recursing.
Skip the additional work in this case.

Approved by:	sbruno
2018-05-17 01:13:40 +00:00
Matt Macy
b8205686b4 epoch(9): Guarantee forward progress on busy sections
Add epoch section to struct thread. We can use this to
enable the epoch counter to advance even if a section is
perpetually occupied by a thread.

Approved by:	sbruno
2018-05-17 00:45:35 +00:00
Matt Macy
6161b98c99 hwpmc: Implement per-thread counters for PMC sampling
This implements per-thread counters for PMC sampling. The thread
descriptors are stored in a list attached to the process descriptor.
These thread descriptors can store any per-thread information necessary
for current or future features. For the moment, they just store the counters
for sampling.

The thread descriptors are created when the process descriptor is created.
Additionally, thread descriptors are created or freed when threads
are started or stopped. Because the thread exit function is called in a
critical section, we can't directly free the thread descriptors. Hence,
they are freed to a cache, which is also used as a source of allocations
when needed for new threads.

Approved by:	sbruno
Obtained from:	jtl
Sponsored by:	Juniper Networks, Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D15335
2018-05-16 22:29:20 +00:00
Jean-Sébastien Pédron
547e74a8be teken, vt(4): New callbacks to lock the terminal once
... to process input, instead of inside each smaller operation such as
appending a character or moving the cursor forward.

In other words, before we were doing (oversimplified):

  teken_input()
    <for each input character>
      vtterm_putchar()
        VTBUF_LOCK()
        VTBUF_UNLOCK()
      vtterm_cursor_position()
        VTBUF_LOCK()
        VTBUF_UNLOCK()

Now, we are doing:

  vtterm_pre_input()
    VTBUF_LOCK()
  teken_input()
    <for each input character>
      vtterm_putchar()
      vtterm_cursor_position()
  vtterm_post_input()
    VTBUF_UNLOCK()

The situation was even worse when the vtterm_copy() and vtterm_fill()
callbacks were involved.

The new callbacks are:
  * struct terminal_class->tc_pre_input()
  * struct terminal_class->tc_post_input()

They are called in teken_input(), surrounding the while() loop.

The goal is to improve input processing speed of vt(4). As a benchmark,
here is the time taken to write a text file of 360 000 lines (26 MiB) on
`ttyv0`:

  * vt(4), unmodified:      1500 ms
  * vt(4), with this patch: 1200 ms
  * syscons(4):              700 ms

This is on a Haswell laptop with a GENERIC-NODEBUG kernel.

At the same time, the locking is changed in the vt_flush() function
which is responsible for drawing the text on screen. So instead of
(indirectly) using VTBUF_LOCK() just to read and reset the dirty area
of the internal buffer, the lock is held for almost the entire function,
including the drawing part.

The change is mostly visible while content is scrolling fast: before,
lines could appear garbled while scrolling because the internal buffer
was accessed without locks (once the scrolling was finished, the output
was correct). Now, the scrolling appears correct.

In the end, the locking model is closer to what syscons(4) does.

Differential Revision:	https://reviews.freebsd.org/D15302
2018-05-16 09:01:02 +00:00
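A minimal sketch of how a console driver could hook the new callbacks described
above; the vtterm_* bodies follow the commit's own pseudo-code and the
VTBUF_LOCK()/VTBUF_UNLOCK() usage is an assumption about the vt(4) internals,
not verbatim code.

static void
vtterm_pre_input(struct terminal *tm)
{
	struct vt_window *vw = tm->tm_softc;

	VTBUF_LOCK(&vw->vw_buf);	/* held for the whole teken_input() batch */
}

static void
vtterm_post_input(struct terminal *tm)
{
	struct vt_window *vw = tm->tm_softc;

	VTBUF_UNLOCK(&vw->vw_buf);	/* dropped once the input batch is done */
}

static struct terminal_class vt_termclass = {
	/* ... tc_putchar, tc_cursor, tc_copy, tc_fill, ... */
	.tc_pre_input	= vtterm_pre_input,
	.tc_post_input	= vtterm_post_input,
};
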
Ed Maste
ea0939f0af subr_pidctrl: use standard 2-Clause FreeBSD license and disclaimer
Approved by:	jeff
2018-05-15 00:50:09 +00:00
Matt Macy
0f00315cb3 hwpmc: fix load/unload race and vm map LOR
- fix load/unload race by allocating the per-domain list structure at boot

- fix long extant vm map LOR by replacing pmc_sx sx_slock with global_epoch
  to protect the liveness of elements of the pmc_ss_owners list

Reported by:	pho
Approved by:	sbruno
2018-05-14 00:21:04 +00:00
Matt Macy
0c58f85b8d epoch(9): allow sx locks to be held across epoch_wait()
The INVARIANTS checks in epoch_wait() were intended to
prevent the block handler from returning with locks held.
What it in fact did was prevent anything except Giant
from being held across it. Check that the number of locks
held has not changed instead.

Approved by:	sbruno@
2018-05-14 00:14:00 +00:00
Matt Macy
1f4beb6312 epoch(9): cleanups, additional debug checks, and add global_epoch
- GC the _nopreempt routines
    - to really benefit we'd need a separate routine
    - they're not currently in use
    - they complicate the API for no benefit at this time

- check that we're actually in an epoch section at exit

- handle epoch_call() early in boot

- Fix copyright declaration language

Approved by:	sbruno@
2018-05-13 23:24:48 +00:00
Konstantin Belousov
2ebc882927 Detect and optimize reads from the hole on UFS.
- Create getblkx(9) variant of getblk(9) which can return error.
- Add GB_NOSPARSE flag for getblk()/getblkx() which requests that BMAP
  be performed before the buffer is created, and EJUSTRETURN returned
  in case the requested block does not exist.
- Make ffs_read() use GB_NOSPARSE to avoid instantiating buffer (and
  allocating the pages for it), copying from zero_region instead.

The end result is less page allocations and buffer recycling when a
hole is read, which is important for some benchmarks.

Requested and reviewed by:	jeff
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D14917
2018-05-13 09:47:28 +00:00
Matt Macy
f1401123c5 hwpmc/epoch - don't reference domain if NUMA is not set
It appears that domain information is set correctly independent
of whether or not NUMA is defined. However, there is no memory
backing secondary domains, leading to allocation failure.

Reported by:	pho@, np@
Approved by:	sbruno@
2018-05-12 20:00:29 +00:00
Matt Macy
e6b475e0af hwpmc(9): Make pmclog buffer pcpu and update constants
On non-trivial SMP systems the contention on the pmc_owner mutex leads
to a substantial number of samples captured being from the pmc process
itself. This change a) makes buffers larger to avoid contention on the
global list b) makes the working sample buffer per cpu.

Run pmcstat in the background (default event rate of 64k):
pmcstat -S UNHALTED_CORE_CYCLES -O /dev/null sleep 600 &

Before:
make -j96 buildkernel -s >&/dev/null 3336.68s user 24684.10s system 7442% cpu 6:16.50 total

After:
make -j96 buildkernel -s >&/dev/null 2697.82s user 1347.35s system 6058% cpu 1:06.77 total

For more realistic overhead measurement set the sample rate for ~2khz
on a 2.1Ghz processor:
pmcstat -n 1050000 -S UNHALTED_CORE_CYCLES -O /dev/null sleep 6000 &

Collecting 10 samples of `make -j96 buildkernel` from each:

x before
+ after

real time:
    N           Min           Max        Median           Avg        Stddev
x  10          76.4        127.62        84.845        88.577     15.100031
+  10         59.71         60.79        60.135        60.179    0.29957192
Difference at 95.0% confidence
        -28.398 +/- 10.0344
        -32.0602% +/- 7.69825%
        (Student's t, pooled s = 10.6794)

system time:
    N           Min           Max        Median           Avg        Stddev
x  10       2277.96       6948.53       2949.47      3341.492     1385.2677
+  10        1038.7       1081.06      1070.555      1064.017      15.85404
Difference at 95.0% confidence
        -2277.47 +/- 920.425
        -68.1574% +/- 8.77623%
        (Student's t, pooled s = 979.596)

x no pmc
+ pmc running
real time:

HEAD:
    N           Min           Max        Median           Avg        Stddev
x  10         58.38         59.15         58.86        58.847    0.22504567
+  10          76.4        127.62        84.845        88.577     15.100031
Difference at 95.0% confidence
        29.73 +/- 10.0335
        50.5208% +/- 17.0525%
        (Student's t, pooled s = 10.6785)

patched:
    N           Min           Max        Median           Avg        Stddev
x  10         58.38         59.15         58.86        58.847    0.22504567
+  10         59.71         60.79        60.135        60.179    0.29957192
Difference at 95.0% confidence
        1.332 +/- 0.248939
        2.2635% +/- 0.426506%
        (Student's t, pooled s = 0.264942)

system time:

HEAD:
    N           Min           Max        Median           Avg        Stddev
x  10       1010.15       1073.31      1025.465      1031.524     18.135705
+  10       2277.96       6948.53       2949.47      3341.492     1385.2677
Difference at 95.0% confidence
        2309.97 +/- 920.443
        223.937% +/- 89.3039%
        (Student's t, pooled s = 979.616)

patched:
    N           Min           Max        Median           Avg        Stddev
x  10       1010.15       1073.31      1025.465      1031.524     18.135705
+  10        1038.7       1081.06      1070.555      1064.017      15.85404
Difference at 95.0% confidence
        32.493 +/- 16.0042
        3.15% +/- 1.5794%
        (Student's t, pooled s = 17.0331)

Reviewed by:	jeff@
Approved by:	sbruno@
Differential Revision:	https://reviews.freebsd.org/D15155
2018-05-12 01:26:34 +00:00
Matt Macy
8dcbd0eae6 epoch(9): always set inited in epoch_init
- set inited in the !usedomains case

Reported by:	jhibbits
Approved by:	sbruno
2018-05-11 18:37:14 +00:00
Matt Macy
4aa302dfc9 epoch(9): callback task fixes
- initialize the pcpu STAILQ in the NUMA case
- don't enqueue the callback task if there isn't sufficient work to be done

Reported by:	pho@
Approved by:	sbruno@
2018-05-11 08:16:56 +00:00
Mateusz Guzik
85c1b3c1cb rmlock: partially depessimize lock/unlock fastpath
Previously the slow path was folded in and partially jumped over in the
common case.
2018-05-11 06:59:54 +00:00
Matt Macy
b2cb28963b epoch(9): fix priority handling, make callback lists pcpu, and other fixes
- Lend priority to preempted threads in epoch_wait to handle the case
  in which we've had priority lent to us. Previously we borrowed the
  priority of the lowest priority preempted thread. (pointed out by mjg@)

- Don't attempt to allocate memory per-domain on powerpc; we don't currently
  handle empty sockets (as is the case on jhibbits Talos' board).

- Handle deferred callbacks as pcpu lists and poll the lists periodically.
  Currently the interval is 1/hz.

- Drop the thread lock when adaptive spinning. Holding the lock starves
  other threads and can even lead to lockups.

- Keep a generation count pcpu so that we don't keep spinning if a thread
  has left and re-entered an epoch section.

- Actually remove the callback from the callback list so that we don't
  double free. Sigh ...

Approved by:	sbruno@
2018-05-11 04:54:12 +00:00
Matt Macy
06bf2a6aef Add simple preempt safe epoch API
Read locking is overused in the kernel to guarantee liveness. This API makes
it easy to provide liveness guarantees without atomics.

Includes epoch_test kernel module to stress test the API.

Documentation will follow initial use case.

Test case and improvements to preemption handling in response to discussion
with mjg@

Reviewed by:	imp@, shurd@
Approved by:	sbruno@
2018-05-10 17:55:24 +00:00
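A rough usage sketch of the API added here, using the epoch_enter()/
epoch_exit()/epoch_wait() entry points named in epoch(9); the 'obj' structure,
its list and the M_TEMP malloc type are placeholders, and later revisions of
the API differ in detail.

epoch_t obj_epoch;				/* allocated once at init time */

struct obj {
	CK_LIST_ENTRY(obj)	o_link;
	int			o_data;
};
static CK_LIST_HEAD(, obj) obj_list;

static int
obj_sum(void)
{
	struct obj *o;
	int sum = 0;

	epoch_enter(obj_epoch);			/* cheap, preempt-safe read section */
	CK_LIST_FOREACH(o, &obj_list, o_link)
		sum += o->o_data;		/* 'o' stays live inside the section */
	epoch_exit(obj_epoch);
	return (sum);
}

static void
obj_destroy(struct obj *o)
{

	CK_LIST_REMOVE(o, o_link);		/* unlink under the writer's lock */
	epoch_wait(obj_epoch);			/* drain readers currently in a section */
	free(o, M_TEMP);			/* now no reader can still see 'o' */
}
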
Andrew Gallatin
d5cdcc3a06 Fix the build after r333457
In r333457, the arguments to kern_pwritev() were accidentally
re-ordered as part of ANSIfication, breaking the build.
2018-05-10 13:19:42 +00:00
Ed Maste
cc3c9df80f ANSIfy sys_generic.c 2018-05-10 11:36:16 +00:00
Matt Macy
36688f706e Add taskqgroup_config_gtask_deinit to support teardown after
taskqgroup_config_gtask_init.

Approved by:	sbruno
2018-05-09 18:51:35 +00:00
Matt Macy
cbd92ce62e Eliminate the overhead of gratuitous repeated reinitialization of cap_rights
- Add macros to allow preinitialization of cap_rights_t.

- Convert most commonly used code paths to use preinitialized cap_rights_t.
  A 3.6% speedup in fstat was measured with this change.

Reported by:	mjg
Reviewed by:	oshogbo
Approved by:	sbruno
MFC after:	1 month
2018-05-09 18:47:24 +00:00
Konstantin Belousov
55c9d75e6b Avoid calls to bzero() before ireloc.
Evaluate cpu_stdext_feature early to have the moved link_elf_ireloc() see
correct flags; most important is SMAP.

Tested by:	mjg
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D15367
2018-05-09 14:39:24 +00:00
Matt Macy
ad738f3791 Reduce overhead of ktrace checks in the common case.
KTRPOINT() checks both if we are tracing _and_ if we are recursing within
ktrace. The second condition is only ever executed if ktrace is actually
enabled. This change moves the check out of the hot path into the functions
themselves.

Discussed with mjg@

Reported by:	mjg@
Approved by:	sbruno@
2018-05-09 00:00:47 +00:00
Mateusz Guzik
2824088536 Inlined sched_userret.
The tested condition is rarely true and it induces a function call
on each return to userspace.

Bumps getuid rate by about 1% on Broadwell.
2018-05-07 23:36:16 +00:00
Mateusz Guzik
75e9b455a9 Change trap_enotcap to bool and annotate with __read_frequently
It is read on each return to user space.
2018-05-07 23:10:12 +00:00
Mateusz Guzik
79ca7cbf09 Avoid calls to syscall_thread_enter/exit for statically defined syscalls
The entire mechanism is rarely used and is not very performant due to
atomic ops on the syscall table. It also has added overhead for completely
unrelated syscalls.

Reduce it by avoiding the function calls if possible (which constitutes the
vast majority of cases).

Provides about 3% syscall rate speed up for getuid on Broadwell.
2018-05-07 22:29:32 +00:00
Warner Losh
ad7142757b Add device_quiet_children() and device_has_quiet_children()
If you add a child to a device that has quiet children, we'll
automatically set the quiet flag on the children, and their
children.

This is intended for things like CPUs that have a large amount of
repetition in booting that adds nothing.
2018-05-07 21:09:08 +00:00
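A hedged sketch of the intended use in a bus driver's attach routine; the
acme_* driver, the child count and the "cpu" device name are made up for
illustration.

static int
acme_cpubus_attach(device_t dev)
{
	int i, ncores = 4;	/* hypothetical core count */

	/* Suppress the repetitive probe/attach chatter for every core. */
	device_quiet_children(dev);

	for (i = 0; i < ncores; i++)
		device_add_child(dev, "cpu", -1);	/* children become quiet too */

	return (bus_generic_attach(dev));
}
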
Andrew Gallatin
e7bd0750af Boost thread priority while changing CPU frequency
Boost the priority of user-space threads when they set
their affinity to a core to adjust its frequency.   This avoids a situation
where a CPU bound kernel thread with the same affinity is running on a
down-clocked core, and will "block" powerd from up-clocking the core
until the kernel thread yields.   This can lead to poor perfomance,
and to things potentially getting stuck on Giant.

Reviewed by:	kib (imp reviewed earlier version)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D15246
2018-05-07 15:24:03 +00:00
Mark Johnston
bd92e6b6f5 Refactor some of the MI kernel dump code in preparation for netdump.
- Add clear_dumper() to complement set_dumper().
- Drain netdump's preallocated mbuf pool when clearing the dumper.
- Don't do bounds checking for dumpers with mediasize 0.
- Add dumper callbacks for initialization and for writing out headers.

Reviewed by:	sbruno
MFC after:	1 month
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D15252
2018-05-06 00:22:38 +00:00
Mark Johnston
5475ca5aca Add an mbuf allocator for netdump.
The aim is to permit mbuf allocations after a panic without calling into
the page allocator, without imposing any runtime overhead during regular
operation of the system, and without modifying driver code. The approach
taken is to preallocate a number of mbufs and clusters, storing them
in linked lists, and using the lists to back some UMA cache zones. At
panic time, the mbuf and cluster zone pointers are overwritten with
those of the cache zones so that the mbuf allocator returns
preallocated items.

Using this scheme, drivers which cache mbuf zone pointers from
m_getzone() require special handling when implementing netdump support.

Reviewed by:	cem (earlier version), julian, sbruno
MFC after:	1 month
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D15251
2018-05-06 00:19:48 +00:00
Mark Johnston
c2ba2d1b0e Style.
MFC after:	3 days
2018-05-06 00:11:30 +00:00
Andriy Gapon
bd3afae0ca for bus suspend, detach and shutdown iterate children in reverse order
For most buses all children are equal, so the order does not matter.
Other buses, such as acpi, carefully order their child devices to
express implicit dependencies between them.  For such buses it is safer
to bring down devices in the reverse order.

I believe that this is the reason why hpet_suspend had to be disabled.
Some drivers depend on a working event timer until they are suspended.
But previously we would suspend hpet very early.

I tested this change by making hpet_suspend actually stop HPET timers
and tested that too.

Note that this change is not a complete solution as it does not take
into account bus passes.
A better approach would be to track the actual attach order of the
devices and to use the reverse of that.

Reviewed by:	imp, mav
MFC after:	3 weeks
Differential Revision: https://reviews.freebsd.org/D15291
2018-05-05 05:19:32 +00:00
Mateusz Guzik
5ec2c93667 tc: bcopy -> memcpy 2018-05-04 22:48:10 +00:00
Jamie Gritton
0e5c6bd436 Make it easier for filesystems to count themselves as jail-enabled,
by doing most of the work in a new function, prison_add_vfs, in kern_jail.c.
Now a jail-enabled filesystem need only mark itself with VFCF_JAIL, and
the rest is taken care of.  This includes adding a jail parameter like
allow.mount.foofs, and a sysctl like security.jail.mount_foofs_allowed.
Both of these used to be a static list of known filesystems, with
predefined permission bits.

Reviewed by:	kib
Differential Revision:	D14681
2018-05-04 20:54:27 +00:00
Mark Johnston
1b5c869d64 Fix some races introduced in r332974.
With r332974, when performing a synchronized access of a page's "queue"
field, one must first check whether the page is logically dequeued. If
so, then the page lock does not prevent the page from being removed
from its page queue. Introduce vm_page_queue(), which returns the page's
logical queue index. In some cases, direct access to the "queue" field
is still required, but such accesses should be confined to sys/vm.

Reported and tested by:	pho
Reviewed by:	kib
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D15280
2018-05-04 17:17:30 +00:00
Matt Macy
748ff486b0 dup1_processes -t 96 -s 5 on a dual 8160
x dup_before
+ dup_after
+------------------------------------------------------------+
|             x                                            + |
|x    x   x   x                                         ++ ++|
|   |____AM___|                                          |AM||
+------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5  1.514954e+08 1.5230351e+08 1.5206157e+08 1.5199371e+08     341205.71
+   5 1.5494336e+08 1.5519569e+08 1.5511982e+08 1.5508323e+08     96232.829
Difference at 95.0% confidence
        3.08952e+06 +/- 365604
        2.03266% +/- 0.245071%
        (Student's t, pooled s = 250681)

Reported by:	mjg@
MFC after:	1 week
2018-05-04 06:51:01 +00:00
Konstantin Belousov
7035cf14ee Implement support for ifuncs in the kernel linker.
Required MD bits are only provided for x86.

Reviewed by:	jhb (previous version, as part of the larger patch)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D13838
2018-05-03 21:37:46 +00:00
Stephen Hurd
f3e1324b41 Separate list manipulation locking from state change in multicast
Multicast incorrectly calls into drivers with a mutex held, causing drivers
to go through all manner of contortions to use a non-sleepable lock.
Serialize multicast updates instead.

Submitted by:	mmacy <mmacy@mattmacy.io>
Reviewed by:	shurd, sbruno
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D14969
2018-05-02 19:36:29 +00:00
Mark Johnston
20f85b1ddd Print the dump progress indicator after calling dump_start().
Dumpers may wish to print messages from an initialization hook; this
change ensures that such messages aren't mixed with output from the
generic dump code.

MFC after:	1 week
2018-05-01 17:32:43 +00:00
Nathan Whitehorn
ee900504cf Report the kernel base address properly in kldstat when using PowerPC kernels
loaded at addresses other than their link address.
2018-05-01 04:06:59 +00:00
Ed Maste
2216c6933c Disable connectat/bindat with AT_FDCWD in capmode
Previously it was possible to connect a socket (which had the
CAP_CONNECT right) by calling "connectat(AT_FDCWD, ...)" even in
capabilities mode.  This combination should be treated the same as a call
to connect (i.e. forbidden in capabilities mode).  Similarly for bindat.

Disable connectat/bindat with AT_FDCWD in capabilities mode, fix up the
documentation and add tests.

PR:		222632
Submitted by:	Jan Kokemüller <jan.kokemueller@gmail.com>
Reviewed by:	Domagoj Stolfa
MFC after:	1 week
Relnotes:	Yes
Differential Revision:	https://reviews.freebsd.org/D15221
2018-04-30 17:31:06 +00:00
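A minimal sketch of the check this implies; the helper below is hypothetical
(the real change lives in the connectat/bindat syscall paths), while
IN_CAPABILITY_MODE(), AT_FDCWD and ECAPMODE are the existing primitives.

static int
capmode_dirfd_check(struct thread *td, int dirfd)
{

	/* connectat/bindat with AT_FDCWD is equivalent to connect/bind,
	 * so it must be refused in capability mode as well. */
	if (IN_CAPABILITY_MODE(td) && dirfd == AT_FDCWD)
		return (ECAPMODE);
	return (0);
}
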
Mateusz Guzik
9d68f7741f systrace: track it like sdt probes
While here predict false.

Note the code is wrong (regardless of this change). Dereference of the
pointer can race with module unload. A fix would set the probe to a
nop stub instead of NULL.
2018-04-27 15:16:34 +00:00
Emmanuel Vadot
ee710ecf32 clk: Put the sysctls under hw.clock instead of clock
This is more consistent with hw.regulator and other hardware-related
sysctls.
2018-04-27 00:12:00 +00:00
Mark Johnston
5cd29d0f3c Improve VM page queue scalability.
Currently both the page lock and a page queue lock must be held in
order to enqueue, dequeue or requeue a page in a given page queue.
The queue locks are a scalability bottleneck in many workloads. This
change reduces page queue lock contention by batching queue operations.
To detangle the page and page queue locks, per-CPU batch queues are
used to reference pages with pending queue operations. The requested
operation is encoded in the page's aflags field with the page lock
held, after which the page is enqueued for a deferred batch operation.
Page queue scans are similarly optimized to minimize the amount of
work performed with a page queue lock held.

Reviewed by:	kib, jeff (previous versions)
Tested by:	pho
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14893
2018-04-24 21:15:54 +00:00
Sean Bruno
7875017ca9 Revert r332894 at the request of the submitter.
Submitted by:	Johannes Lundberg <johalun0_gmail.com>
Sponsored by:	Limelight Networks
2018-04-24 19:55:12 +00:00
Conrad Meyer
65df124845 Do not totally silence suppressed secondary kasserts unless debug.kassert.do_log is disabled
To totally silence and ignore secondary kassert violations after a primary
panic, set debug.kassert.do_log=0 and debug.kassert.suppress_in_panic=1.

Additional assertion warnings shouldn't block core dump and may alert the
developer to another erroneous condition.  Secondary stack traces may be
printed, identically to the unsuppressed case where panic() is reentered --
controlled via debug.trace_all_panics.

Sponsored by:	Dell EMC Isilon
2018-04-24 19:10:51 +00:00
Conrad Meyer
07aa6ea677 Fix debug.kassert.do_log description text
This has been an (incorrect) copy-paste duplicate of debug.kassert.warn_only
since it was originally committed in r243980.

Sponsored by:	Dell EMC Isilon
2018-04-24 18:59:40 +00:00
Conrad Meyer
ad1fc31570 panic: Optionally, trace secondary panics
To diagnose and fix secondary panics, it is useful to have a stack trace.
When panic tracing is enabled, optionally trace secondary panics as well.

The option is configured with the tunable/sysctl debug.trace_all_panics.

(The original concern that inspired only tracing the primary panic was
likely that the secondary trace may scroll the original panic message or trace
off the screen.  This is less of a concern for serial consoles with logging.
Not everything has a serial console, though, so the behavior is optional.)

Discussed with:	jhb
Sponsored by:	Dell EMC Isilon
2018-04-24 18:54:20 +00:00
Jonathan T. Looney
18959b695d Update r332860 by changing the default from suppressing post-panic
assertions to not suppressing post-panic assertions.

There are some post-panic assertions that are valuable and we shouldn't
default to disabling them.  However, when a user trips over them, the
user can still adjust the tunable/sysctl to suppress them temporarily to
conduct troubleshooting (e.g. get a core dump).

Reported by:	cem, markj
2018-04-24 18:47:35 +00:00
Conrad Meyer
b543c98cab lockmgr: Add missed neutering during panic
r313683 introduced new lockmgr APIs that missed the panic-time neutering
present in the rest of our locks.  Correct that by adding the usual check.

Additionally, move the __lockmgr_args neutering above the assertions at the
top of the function.  Drop the interlock unlock because we shouldn't have
an unneutered interlock either.  No point trying to unlock it.

PR:		227749
Reported by:	jtl
Sponsored by:	Dell EMC Isilon
2018-04-24 18:41:14 +00:00
Mateusz Guzik
d357c16adc lockf: change the owner hash from pid to vnode-based
This adds a bit missed due to the patch split, see r332882

Tested by:	pho
2018-04-24 06:10:36 +00:00
Mateusz Guzik
7cd794214a dtrace: depessimize dtmalloc when dtrace is active
Each malloc/free was testing dtrace_malloc_enabled and forcing
extra reads from the malloc type struct to see if perhaps a
dtmalloc probe was on.

Treat it like lockstat and sdt: have a global boolean.
2018-04-24 01:06:20 +00:00
Mateusz Guzik
4c5209cb21 lockstat: track lockstat just like sdt probes
In particular flip the frequently tested var to bool.
2018-04-24 01:04:10 +00:00
Mateusz Guzik
c9e05ccd62 malloc: stop reading the subzone if MALLOC_DEBUG_MAXZONES == 1 (the default)
malloc was showing up at the top of the profile while running microbenchmarks.

#define DTMALLOC_PROBE_MAX              2
struct malloc_type_internal {
        uint32_t        mti_probes[DTMALLOC_PROBE_MAX];
        u_char          mti_zone;
        struct malloc_type_stats        mti_stats[MAXCPU];
};

Reading mti_zone wastes a cacheline to hold mti_probes + mti_zone
(which we know is 0) + part of the malloc stats of the first cpu, which on top
induces false sharing.

In particular will-it-scale lock1_processes -t 128 -s 10:
before: average:45879692
after:  average:51655596

Note the counters can be padded but the right fix is to move them to
counter(9), leaving the struct read-only after creation (modulo dtrace
probes).
2018-04-23 22:28:49 +00:00
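A hedged sketch of the shortcut: with the default MALLOC_DEBUG_MAXZONES of 1
there is only one subzone, so the mti_zone read (and the cacheline it drags in)
can be skipped. The field access and helper name are approximations, not the
exact code in kern_malloc.c.

static inline u_int
mtp_subzone_sketch(struct malloc_type *mtp)
{
#if MALLOC_DEBUG_MAXZONES > 1
	struct malloc_type_internal *mtip = mtp->ks_handle;

	/* Debug build: several subzones per type, so the read is needed. */
	return (mtip->mti_zone);
#else
	/* Default build: exactly one subzone, never touch the struct. */
	return (0);
#endif
}
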
Sean Bruno
7b7796eea5 Load balance sockets with new SO_REUSEPORT_LB option
This patch adds a new socket option, SO_REUSEPORT_LB, which allow multiple
programs or threads to bind to the same port and incoming connections will be
load balanced using a hash function.

Most of the code was copied from a similar patch for DragonflyBSD.

However, in DragonflyBSD, load balancing is a global on/off setting and can not
be set per socket. This patch allows for simultaneous use of both the current
SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system.

Required changes to structures:
Globally change so_options from a 16-bit to a 32-bit value to allow for more options.
Add a hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets.

Limitations:
As in DragonflyBSD, a load balance group is limited to 256 pcbs
(256 programs or threads sharing the same socket).

Submitted by:	Johannes Lundberg <johanlun0@gmail.com>
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D11003
2018-04-23 19:51:00 +00:00
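A small userspace sketch of the new option: each of N identical workers runs
the same few lines and the kernel spreads incoming connections across the
group by hash. The port handling and error strategy are illustrative only.

#include <sys/socket.h>
#include <netinet/in.h>
#include <err.h>
#include <stdint.h>
#include <string.h>

int
open_lb_listener(uint16_t port)
{
	struct sockaddr_in sin;
	int s, one = 1;

	if ((s = socket(AF_INET, SOCK_STREAM, 0)) == -1)
		err(1, "socket");
	/* Join (or create) the load-balance group for this address/port. */
	if (setsockopt(s, SOL_SOCKET, SO_REUSEPORT_LB, &one, sizeof(one)) == -1)
		err(1, "setsockopt(SO_REUSEPORT_LB)");

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_len = sizeof(sin);
	sin.sin_port = htons(port);
	if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) == -1 ||
	    listen(s, 128) == -1)
		err(1, "bind/listen");
	return (s);
}
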
Mateusz Guzik
833dc05a6e lockf: add per-chain locks to the owner hash
This, combined with previous changes, significantly depessimizes the behaviour
under contention.

In particular the lock1_processes test (locking/unlocking separate files)
from the will-it-scale suite was executed with 128 concurrency on a
4-socket Broadwell with 128 hardware threads.

Operations/second (lock+unlock) go from ~750000 to ~45000000 (6000%).
For reference, single-process is ~1680000 (i.e. on the stock kernel the resulting
perf is less than *half* of the single-threaded run).

Note this still does not really scale all that well as the locks were just
bolted on top of the current implementation. Significant room for improvement
is still here. In particular the top performance fluctuates depending on the
extent of false sharing in given run (which extends beyond the file).
Added chain+lock pairs were not padded w.r.t. cacheline size.

One big ticket item is the hash used for spreading threads: it used to be the
process pid (which basically serialized all threaded ops). Temporarily the
vnode addr was slapped in instead.

Tested by:      pho
2018-04-23 08:23:10 +00:00
Mateusz Guzik
63286976b5 lockf: skip locking the graph if not necessary (common case)
Tested by:      pho
2018-04-23 07:54:02 +00:00
Mateusz Guzik
717df0b0e8 lockf: perform wakeup only when there is anybody waiting
Tested by:      pho
2018-04-23 07:52:56 +00:00
Mateusz Guzik
c72ead2815 lockf: skip the hard work in lf_purgelocks if possible
Tested by:      pho
2018-04-23 07:52:10 +00:00
Mateusz Guzik
0d3323f557 lockf: free state only when recycling the vnode
This avoids malloc/free cycles when locking/unlocking the vnode when
nobody is contending.

Tested by:	pho
2018-04-23 07:51:19 +00:00
Tijl Coosemans
7dfbbc613b Make bufdaemon and bufspacedaemon use kthread_suspend_check instead of
kproc_suspend_check.  In r329612 bufspacedaemon was turned into a thread
of the bufdaemon process causing both to call kproc_suspend_check with the
same proc argument and that function contains the following while loop:

while (SIGISMEMBER(p->p_siglist, SIGSTOP)) {
	wakeup(&p->p_siglist);
	msleep(&p->p_siglist, &p->p_mtx, PPAUSE, "kpsusp", 0);
}

So one thread wakes up the other and the other wakes up the first again,
locking up UP machines on shutdown.

Also register the shutdown handlers with SHUTDOWN_PRI_LAST + 100 so they
run after the syncer has shutdown, because the syncer can cause a
situation where bufdaemon help is needed to proceed.

PR:		227404
Reviewed by:	kib
Tested by:	cy, rmacklem
2018-04-22 16:05:29 +00:00
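A hedged sketch of the shape of the fix, assuming the argument-less
kthread_suspend_check() form: each buffer daemon thread now checks suspension
on itself instead of on the shared process, so the two threads cannot wake
each other in a loop on shutdown. The loop body is elided.

static void
buf_daemon_sketch(void *arg)
{

	for (;;) {
		/* Per-thread check; kproc_suspend_check() on the shared proc
		 * let the two threads wake each other forever. */
		kthread_suspend_check();

		/* ... flush dirty buffers / replenish buffer space ... */

		tsleep(&buf_daemon_sketch, PRIBIO, "bufwait", hz);
	}
}
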
Mateusz Guzik
7d853f62bf lockf: slightly depessimize
1. check if P_ADVLOCK is already set and if so, don't lock to set it
(stolen from DragonFly)
2. when trying for the fast path unlock, check that we are doing an unlock
first instead of taking the interlock for no reason (e.g. if we want
to *lock*). While here, make it more likely that a failing fast path will
not take the interlock either, by checking the state

Note the code is severely pessimized both single- and multithreaded.
2018-04-22 09:30:07 +00:00
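A rough sketch of point 1 above; the helper is hypothetical, but the idea is
that the flag is sticky, so an unlocked read is enough to skip the proc lock
in the common case.

static void
advlock_mark_proc(struct proc *p)
{

	/* Fast path: P_ADVLOCK is never cleared once set. */
	if ((p->p_flag & P_ADVLOCK) != 0)
		return;
	PROC_LOCK(p);
	p->p_flag |= P_ADVLOCK;
	PROC_UNLOCK(p);
}
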
Jonathan T. Looney
44b71282b5 When running with INVARIANTS, the kernel contains extra checks. However,
these assumptions may not hold true once we've panic'd. Therefore, the
checks hold less value after a panic.  Additionally, if one of the checks
fails while we are already panic'd, this creates a double-panic which can
interfere with debugging the original panic.

Therefore, this commit allows an administrator to suppress a response to
KASSERT checks after a panic by setting a tunable/sysctl.  The
tunable/sysctl (debug.kassert.suppress_in_panic) defaults to being
enabled.

Reviewed by:	kib
Sponsored by:	Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D12920
2018-04-21 17:05:00 +00:00
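A hedged sketch of what the tunable boils down to inside the assertion-failure
path; the function and variable names are illustrative, not the exact ones in
kern_shutdown.c.

static int kassert_suppress_in_panic;	/* tunable/sysctl backed */

void
kassert_failed_sketch(const char *msg)
{

	/* Once we are already panicking, optionally swallow further
	 * assertion failures instead of turning them into a double panic. */
	if (panicstr != NULL && kassert_suppress_in_panic)
		return;
	panic("%s", msg);
}
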
Konstantin Belousov
1302eea7bb Rename PROC_PDEATHSIG_SET -> PROC_PDEATHSIG_CTL and PROC_PDEATHSIG_GET
-> PROC_PDEATHSIG_STATUS for consistency with other procctl(2)
operations names.

Requested by:	emaste
Sponsored by:	The FreeBSD Foundation
MFC after:	13 days
2018-04-20 15:19:27 +00:00
Andriy Gapon
f87beb93e8 call racct_proc_ucred_changed() under the proc lock
The lock is required to ensure that the switch to the new credentials
and the transfer of the process's accounting data from the old
credentials to the new ones is done atomically.  Otherwise, some updates
may be applied to the new credentials and then additionally transferred
from the old credentials if the updates happen after proc_set_cred() and
before racct_proc_ucred_changed().

The problem is especially pronounced for RACCT_RSS because
- there is a strict accounting for this resource (it's reclaimable)
- it's updated asynchronously by the vm daemon
- it's updated by setting an absolute value instead of applying a delta

I had to remove a call to rctl_proc_ucred_changed() from
racct_proc_ucred_changed() and make all callers of latter call the
former as well.  The reason is that rctl_proc_ucred_changed, as it is
implemented now, cannot be called while holding the proc lock, so the
lock is dropped after calling racct_proc_ucred_changed.  Additionally,
I've added calls to crhold / crfree around the rctl call, because
without the proc lock there is no guarantee that the new credentials,
owned by the process, will stay stable.  That does not eliminate a
possibility that the credentials passed to the rctl will get stale.
Ideally, rctl_proc_ucred_changed should be able to work under the proc
lock.

Many thanks to kib for pointing out the above problems.

PR:		222027
Discussed with:	kib
No comment:	trasz
MFC after:	2 weeks
Differential Revision: https://reviews.freebsd.org/D15048
2018-04-20 13:08:04 +00:00
John Baldwin
73c8686e91 Simplify the code to allocate stack for auxv, argv[], and environment vectors.
Remove auxarg_size as it was only used once right after a confusing
assignment in each of the variants of exec_copyout_strings().

Reviewed by:	emaste
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D15123
2018-04-19 16:00:34 +00:00
Konstantin Belousov
b940886338 Add PROC_PDEATHSIG_SET to procctl interface.
Allow processes to request the delivery of a signal upon death of
their parent process.  The intended consumer of the feature is PostgreSQL.

Submitted by:	Thomas Munro
Reviewed by:	jilles, mjg
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D15106
2018-04-18 21:31:13 +00:00
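A userspace usage sketch, written against the post-rename constant
PROC_PDEATHSIG_CTL from the rename commit earlier in this log; the choice of
SIGTERM is arbitrary.

#include <sys/types.h>
#include <sys/procctl.h>
#include <sys/wait.h>
#include <signal.h>
#include <unistd.h>
#include <err.h>

/* Sketch: ask to receive SIGTERM when our parent process exits. */
static void
request_parent_death_signal(void)
{
	int sig = SIGTERM;

	if (procctl(P_PID, getpid(), PROC_PDEATHSIG_CTL, &sig) == -1)
		err(1, "procctl(PROC_PDEATHSIG_CTL)");
}
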
John Baldwin
8ce99bb405 Properly do a deep copy of the ioctls capability array for fget_cap().
fget_cap() tries to do a cheaper snapshot of a file descriptor without
holding the file descriptor lock.  This snapshot does not do a deep
copy of the ioctls capability array, but instead uses a different
return value to inform the caller to retry the copy with the lock
held.  However, filecaps_copy() was returning 1 to indicate that a
retry was required, and fget_cap() was checking for 0 (actually
'!filecaps_copy()').  As a result, fget_cap() did not do a deep copy
of the ioctls array and just reused the original pointer.  This caused
multiple file descriptor entries to think they owned the same pointer
and eventually resulted in duplicate frees.

The only code path that I'm aware of that triggers this is to create a
listen socket that has a restricted list of ioctls and then call
accept() which calls fget_cap() with a valid filecaps structure from
getsock_cap().

To fix, change the return value of filecaps_copy() to return true if
it succeeds in copying the caps and false if it fails because the lock
is required.  I find this more intuitive than fixing the caller in
this case.  While here, change the return type from 'int' to 'bool'.

Finally, make filecaps_copy() more robust in the failure case by not
copying any of the source filecaps structure over.  This avoids the
possibility of leaking a pointer into a structure if a similar future
caller doesn't properly handle the return value from filecaps_copy()
at the expense of one more branch.

I also added a test case that panics before this change and now passes.

Reviewed by:	kib
Discussed with:	mjg (not a fan of the extra branch)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D15047
2018-04-17 18:07:40 +00:00
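A hedged sketch of the corrected calling pattern; the exact filecaps_copy()
signature (source, destination, 'locked' flag, bool return) is assumed from
the description above, and the revalidation a real caller must do after taking
the lock is reduced to a comment.

static void
caps_snapshot_sketch(struct filedesc *fdp, struct filedescent *fde,
    struct filecaps *caps)
{

	/* Lockless attempt; false means the ioctls array needs a deep copy,
	 * which is only safe with the descriptor table lock held. */
	if (filecaps_copy(&fde->fde_caps, caps, false))
		return;
	FILEDESC_SLOCK(fdp);
	/* (a real caller re-looks up 'fde' here before copying) */
	filecaps_copy(&fde->fde_caps, caps, true);
	FILEDESC_SUNLOCK(fdp);
}
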
Brooks Davis
cee61c8cac Stop using fuswintr() and suswintr() in the profiler.
Always take the AST path rather than calling MD functions which are
often implemented as always failing. The is the case on amd64, arm,
i386, and powerpc. This optimization (inherited from 4.4 Lite) is a
pessimization on those architectures and is the sole use of these
functions. They will be removed in a seperate commit.

Reviewed by:	kib
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D15101
2018-04-17 16:36:53 +00:00
Alan Somers
52c0983128 lio_listio: return EAGAIN instead of EIO when out of resources
This behavior is already documented by the man page, and suggested by POSIX.

Reviewed by:	jhb
MFC after:	3 weeks
Differential Revision:	https://reviews.freebsd.org/D15099
2018-04-16 18:12:15 +00:00
Konstantin Belousov
d86c1f0dc1 i386 4/4G split.
The change makes the user and kernel address spaces on i386
independent, giving each almost the full 4G of usable virtual addresses
except for one PDE at top used for trampoline and per-CPU trampoline
stacks, and system structures that must be always mapped, namely IDT,
GDT, common TSS and LDT, and process-private TSS and LDT if allocated.

By using 1:1 mapping for the kernel text and data, it appeared
possible to eliminate assembler part of the locore.S which bootstraps
initial page table and KPTmap.  The code is rewritten in C and moved
into the pmap_cold(). The comment in vmparam.h explains the KVA
layout.

There is no PCID mechanism available in protected mode, so each
kernel/user switch forth and back completely flushes the TLB, except
for the trampoline PTD region. The TLB invalidations for userspace
becomes trivial, because IPI handlers switch page tables. On the other
hand, context switches no longer need to reload %cr3.

copyout(9) was rewritten to use vm_fault_quick_hold().  An issue for
the new copyout(9) is compatibility with wiring user buffers around sysctl
handlers. This explains the two kinds of locks for copyout ptes and the
accounting of the vslock() calls.  The vm_fault_quick_hold() path, AKA the
slow path, is only tried after the 'fast path' has failed; the fast path
temporarily changes the mapping to the userspace and copies the data to/from
a small per-cpu buffer in the trampoline.  If a page fault occurs during the
copy, it is short-circuited by exception.s to not even reach C code.

The change was motivated by the need to implement the Meltdown
mitigation, but instead of KPTI the full split is done.  The i386
architecture already shows the sizing problems, in particular, it is
impossible to link clang and lld with debugging.  I expect that the
issues due to the virtual address space limits would only worsen,
and the split gives more liveness to the platform.

Tested by: pho
Discussed with:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
Differential revision:	https://reviews.freebsd.org/D14633
2018-04-13 20:30:49 +00:00
Mateusz Guzik
e0e259a888 locks: extend speculative spin waiting for readers to drain
Now that 10 years have passed since the original limit of 10000 was
committed, bump it a little bit.

Spinning waiting for writers is semi-informed in the sense that we always
know if the owner is running and base the decision to spin on that.
However, no such information is provided for read-locking. In particular
this means that it is possible for a write-spinner to completely waste cpu
time waiting for the lock to be released, while the reader holding it was
preempted and is now waiting for the spinner to go off cpu.

Nonetheless, in majority of cases it is an improvement to spin instead of
instantly giving up and going to sleep.

The current approach is pretty simple: snatch the number of current readers
and perform that many pauses before checking again. The total number of
pauses to execute is limited to 10k. If the lock is still not free by
that time, go to sleep.

Given the previously noted problem of not knowing whether spinning makes
any sense to begin with the new limit has to remain rather conservative.
But at the very least it should also be related to the machine. Waiting
for writers uses parameters selected based on the number of activated
hardware threads. The upper limit of pause instructions to be executed
in-between re-reads of the lock is typically 16384 or 32678. It was
selected as the limit of total spins. The lower bound is set to
already present 10000 as to not change it for smaller machines.

Bumping the limit reduces system time by few % during benchmarks like
buildworld, buildkernel and others. Tested on 2 and 4 socket machines
(Broadwell, Skylake).

Figuring out how to make a more informed decision while not pessimizing
the fast path is left as an exercise for the reader.
2018-04-11 01:43:29 +00:00
Ian Lepore
97603f1da2 Use explicit_bzero() when cleaning values out of the kernel environment.
Sometimes the values contain geli passphrases being communicated from
loader(8) to the kernel, and some day the compiler may decide to start
eliding calls to memset() for a pointer which is not dereferenced again
before being passed to free().
2018-04-10 22:57:56 +00:00
Mateusz Guzik
04457342a3 rw: whack avoidable re-reads in try_upgrade 2018-04-10 22:32:31 +00:00
Stephen Hurd
f422673e10 Make BPF global lock an SX
This allows NIC drivers to sleep on polling config operations.

Submitted by:	Matthew Macy <mmacy@mattmacy.io>
Reviewed by:	shurd
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D14982
2018-04-10 19:42:50 +00:00
Mateusz Guzik
a045941bd2 locks: tweak backoff a little bit
Previous limits were chosen when locking primitives had spurious lock
accesses.

Flipping the starting point to 1 (or rather 2 as the first call shifts it)
provides a modest win when mild contention is seen while not hurting worse
cases. Tested on a bunch of one, two and four socket old and new systems
(Westmere, Skylake, Threadripper and others) by doing concurrent page faults,
buildkernel/buildworld and other stuff (although not all systems got all the
tests).

Another thing is the upper limit. It is semi-arbitrarily chosen as it was
getting out of hand for slightly less small systems (e.g. a 128-thread one).

Note that backoff is fundamentally a speculative bandaid and this change just
makes it fit a little bit better. It remains completely oblivious to the
hardware topology or the contention pattern. This is being experimented with.
2018-04-08 16:34:10 +00:00
Brooks Davis
6469bdcdb6 Move most of the contents of opt_compat.h to opt_global.h.
opt_compat.h is mentioned in nearly 180 files. In-progress network
driver compatibility improvements may add over 100 more, so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.

Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c.  A fake _COMPAT_LINUX option ensures opt_compat.h
is created on all architectures.

Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.

Reviewed by:	kib, cem, jhb, jtl
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14941
2018-04-06 17:35:35 +00:00
Brooks Davis
89ea4a30d6 Added SAL annotations to system calls.
Modify makesyscalls.sh to strip out SAL annotations.

No functional change.

This is based on work I started in CheriBSD and use to validate fat
pointers at the syscall boundary.  Tal Garfinkel reviewed the changes,
added annotations to COMPAT* syscalls and is using them in a record and
playback framework.  One can envision other uses such as a WITNESS-like
validator for copyin/out as speculated on in the review.

At this time we are only annotating sys/kern/syscalls.master, as that is
sufficient for userspace work.  If kernel use cases materialize, we can
annotate the other syscalls.master files as needed.

Submitted by:	Tal Garfinkel <talg@cs.stanford.edu>
Sponsored by:	DARPA, AFRL (in part)
Differential Revision:	https://reviews.freebsd.org/D14285
2018-04-05 20:31:45 +00:00
Jeff Roberson
e5818a53db Implement several enhancements to NUMA policies.
Add a new "interleave" allocation policy which stripes pages across
domains with a stride or width keeping contiguity within a multi-page
region.

Move the kernel to the dedicated numbered cpuset #2 making it possible
to assign kernel threads and memory policy separately from user.  This
also eliminates the need for the complicated interrupt binding code.

Add a sysctl API for viewing and manipulating domainsets.  Refactor some
of the cpuset_t manipulation code using the generic bitset type so that
it can be used for both.  This probably belongs in a dedicated subr file.

Attempt to improve the include situation.

Reviewed by:	kib
Discussed with:	jhb (cpuset parts)
Tested by:	pho (before review feedback)
Sponsored by:	Netflix, Dell/EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14839
2018-03-29 02:54:50 +00:00
Jeff Roberson
27a3c9d710 Restore r331606 with a bugfix to setup cpuset_domain[] earlier on all
platforms.  Original commit message as follows:

Only use CPUs in the domain the device is attached to for default
assignment.  Device drivers are able to override the default assignment
if they bind directly.  There are severe performance penalties for
handling interrupts on remote CPUs and this should only be done in
very controlled circumstances.

Reviewed by:    jhb, kib
Tested by:      pho
Sponsored by:   Netflix, Dell/EMC Isilon
Differential Revision:  https://reviews.freebsd.org/D14838
2018-03-28 18:47:35 +00:00
Andriy Gapon
f4043145f2 ZFS vn_rele_async: catch up with the use of refcount(9) for the vnode use count
It's not sufficient nor required to use the vnode interlock when
checking if we are going to drop the last use count as the code in
vputx() uses refcount (atomic) operations for both checking and
decrementing the use code.  Apply the same method to vn_rele_async().
While here, remove vn_rele_inactive(), a wrapper around vrele() that
didn't add any value.

Also, the change required making vfs_refcount_release_if_not_last()
public.  I've made vfs_refcount_acquire_if_not_zero() public as well.
They are in sys/refcount.h now.  While making the move I've dropped the
vfs_ prefix.

Reviewed by:	mjg
MFC after:	2 weeks
Sponsored by:	Panzura
Differential Revision: https://reviews.freebsd.org/D14869
2018-03-28 08:55:31 +00:00
Mateusz Guzik
179da98f71 fd: tighten seq protected areas to not contain malloc/free 2018-03-28 03:07:02 +00:00
Konstantin Belousov
fb441a8829 Fix several leaks of kernel stack data through paddings.
It is random collection of fixes for issues not yet corrected,
reported at https://tsyrklevi.ch/clang_analyzer/freebsd_013017/. Many
issues from that list were already corrected. Most of them are for
compat32, old compat32 or affect both primary host ABI and compat32.

The freebsd32_kldstat(), for instance, was already fixed by using
malloc(M_ZERO).  Patch includes correction to report the supplied
version back, which is just pedantic.

Reviewed by:	brooks, emaste (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D14868
2018-03-27 18:05:51 +00:00
Brooks Davis
34a77b9741 Move uio enums to sys/_uio.h.
Include _uio.h instead of uio.h in several headers to reduce header
pollution.

Fix a few places that relied on header pollution to get the uio.h header.

I have not moved struct uio, as many more things that use it rely on
header pollution to get other definitions from uio.h.

Reviewed by:	cem, kib, markj
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14811
2018-03-27 15:20:03 +00:00
Andriy Gapon
31260bf042 vfs_donmount: in certain cases try r/o mount if r/w mount fails
If the operation is not an update, if neither r/w nor r/o mode is
explicitly requested, if the error code hints at the possibility of the
media being read-only, and if the fallback is allowed, then we can try
to automatically downgrade to the readonly mode.

This is especially useful for auto-mounting of removable media that
sometimes can happen to be write-protected.

The fallback to r/o is not enabled by default.  It can be requested on a
per-mount basis with a new mount option, 'autoro'.  Or it can be
globally allowed by setting vfs.default_autoro.

Reviewed by:	cem, kib
MFC after:	3 weeks
Relnotes:	yes
Differential Revision: https://reviews.freebsd.org/D13361
2018-03-27 14:31:42 +00:00
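A hedged sketch of the decision described above; the helper name and the exact
set of error codes treated as "probably write-protected media" are assumptions,
and the "r/w explicitly requested" case is simplified to flag checks.

static bool
mount_retry_ro_sketch(int error, uint64_t fsflags, bool autoro_allowed)
{

	/* Only a fresh mount with no explicit r/o request qualifies. */
	if (!autoro_allowed || (fsflags & (MNT_RDONLY | MNT_UPDATE)) != 0)
		return (false);
	/* Errors that hint at write-protected media. */
	return (error == EROFS || error == EACCES || error == EPERM);
}
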
Jeff Roberson
e8cbe51a04 Fix a bug introduced in r329612 that slowly invalidates all clean bufs.
Reported by:	bde
Reviewed by:	bde
Sponsored by:	Netflix, Dell/EMC Isilon
2018-03-26 18:36:17 +00:00
Mark Johnston
803c11a3a6 Use LIST_FOREACH_SAFE in sleepq_chains_remove_matching().
We may remove a sleepqueue from the hash table in
sleepq_resume_thread().

Reviewed by:	kib
MFC after:	3 days
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14847
2018-03-25 20:12:14 +00:00
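For reference, the queue(3) idiom the fix relies on, shown on a made-up item
list rather than the sleepqueue structures themselves: the extra temporary
keeps the walk valid even when the body unlinks the current element.

struct item {
	LIST_ENTRY(item) i_link;
	int		 i_done;
};
LIST_HEAD(itemlist, item);

static void
prune_items(struct itemlist *head)
{
	struct item *it, *tmp;

	LIST_FOREACH_SAFE(it, head, i_link, tmp) {
		if (it->i_done) {
			LIST_REMOVE(it, i_link);	/* safe: next pointer already saved in tmp */
			free(it, M_TEMP);
		}
	}
}
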
Konstantin Belousov
ed9e8bc468 Account the size of the vslock-ed memory by the thread.
Assert that all such memory is unwired on return to usermode.

The count of the wired memory will be used to detect the copyout mode.

Tested by:	pho (as part of the larger patch)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-03-24 13:51:27 +00:00
Konstantin Belousov
161bf65f8a In vn_io_fault1(), reduce the scope where pagefaults are disabled.
Most important for the future use, do not call
vm_fault_quick_hold_pages() with disabled pagefaults.

Reported and tested by:	pho (as part of the larger patch)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-03-24 13:13:52 +00:00
Konstantin Belousov
c398200721 Do not send signals to init directly from shutdown_nice(9), do it from
the task context.

shutdown_nice() is used from the fast interrupt handlers, mostly for
console drivers, where we cannot lock blockable locks.  Schedule the
task in the fast queue to send the signal from the proper context.

Reviewed by:	imp
Discussed with:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-03-22 20:47:25 +00:00
Jeff Roberson
9a4b4cd3bc Start witness much earlier in boot so that we can shrink the pend list and
make it more immune to further change.

Reviewed by:	markj, imp (Part of D14707)
Sponsored by:	Netflix, Dell/EMC Isilon
2018-03-22 19:11:43 +00:00
Warner Losh
f0d847af61 Drop any recursed taking of Giant once and for all at the top of
kern_reboot(). The shutdown path is now safe to run without Giant.

Discussed with: kib@
Sponsored by: Netflix
2018-03-22 15:34:37 +00:00
Jonathan T. Looney
2529f56ed3 Add the "TCP Blackbox Recorder" which we discussed at the developer
summits at BSDCan and BSDCam in 2017.

The TCP Blackbox Recorder allows you to capture events on a TCP connection
in a ring buffer. It stores metadata with the event. It optionally stores
the TCP header associated with an event (if the event is associated with a
packet) and also optionally stores information on the sockets.

It supports setting a log ID on a TCP connection and using this to correlate
multiple connections that share a common log ID.

You can log connections in different modes. If you are doing a coordinated
test with a particular connection, you may tell the system to put it in
mode 4 (continuous dump). Or, if you just want to monitor for errors, you
can put it in mode 1 (ring buffer) and dump all the ring buffers associated
with the connection ID when we receive an error signal for that connection
ID. You can set a default mode that will be applied to a particular ratio
of incoming connections. You can also manually set a mode using a socket
option.

This commit includes only basic probes. rrs@ has added quite an abundance
of probes in his TCP development work. He plans to commit those soon.

There are user-space programs which we plan to commit as ports. These read
the data from the log device and output pcapng files, and then let you
analyze the data (and metadata) in the pcapng files.

Reviewed by:	gnn (previous version)
Obtained from:	Netflix, Inc.
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D11085
2018-03-22 09:40:08 +00:00
Gleb Smirnoff
27cd06b391 Redo r331328. We need to fix not only the type but also the format. While
here, again note that we are fixing a regression from r331106.
2018-03-22 05:26:27 +00:00
Gleb Smirnoff
5aab68f24a Fix sysctl types broken in r329612. 2018-03-21 23:21:32 +00:00
Mark Johnston
a7defaea9a Elide the object lock in the common case in vfs_vmio_unwire().
The object lock was only needed when attempting to free B_DIRECT
buffer pages, and for testing for invalid pages (and freeing them
if so). Handle the latter by instead moving invalid pages near the head
of the inactive queue, where they will be reclaimed quickly.

Reviewed by:	alc, kib, jeff
MFC after:	3 weeks
Differential Revision:	https://reviews.freebsd.org/D14778
2018-03-21 21:15:43 +00:00
Warner Losh
3e867f24cb bufshutdown is no longer called with Giant held, so there's no need to
drop or pick up Giant anymore. Remove that code and adjust comments.
2018-03-21 14:46:59 +00:00
Warner Losh
d5292812f8 Remove Giant from init creation and vfs_mountroot.
Sponsored by: Netflix
Discussed with: kib@, mckusick@
Differential Review: https://reviews.freebsd.org/D14712
2018-03-21 14:46:54 +00:00
Conrad Meyer
c37125d9e5 Add missed sys/limits.h include
Apparently header pollution on x86 hid its absence.  Sorry, other arch
users.

Fix the missed header introduced in r331279.

Reported by:	tinderbox
2018-03-21 03:43:40 +00:00
Conrad Meyer
4948f7bf11 Regenerate sysent files after r331279. 2018-03-21 01:17:01 +00:00
Conrad Meyer
e9ac27430c Implement getrandom(2) and getentropy(3)
The general idea here is to provide userspace programs with well-defined
sources of entropy, in a fashion that doesn't require opening a new file
descriptor (ulimits) or accessing paths (/dev/urandom may be restricted
by chroot or capsicum).

getrandom(2) is the more general API, and comes from the Linux world.
Since our urandom and random devices are identical, the GRND_RANDOM flag
is ignored.

getentropy(3) is added as a compatibility shim for the OpenBSD API.

truss(1) support is included.

Tests for both system calls are provided.  Coverage is believed to be at
least as comprehensive as LTP getrandom(2) test coverage.  Additionally,
instructions for running the LTP tests directly against FreeBSD are provided
in the "Test Plan" section of the Differential revision linked below.  (They
pass, of course.)

PR:		194204
Reported by:	David CARLIER <david.carlier AT hardenedbsd.org>
Discussed with:	cperciva, delphij, jhb, markj
Relnotes:	maybe
Differential Revision:	https://reviews.freebsd.org/D14500
2018-03-21 01:15:45 +00:00
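A userspace sketch of both interfaces: getrandom(2) may return short reads and
should be looped, while getentropy(3) either fills the buffer (up to 256 bytes
per call) or fails.

#include <sys/random.h>		/* getrandom(2) */
#include <unistd.h>		/* getentropy(3) */
#include <err.h>
#include <stdint.h>

static void
fill_key(uint8_t *key, size_t len)
{
	ssize_t n;
	size_t off;

	/* getrandom() may return fewer bytes than requested; loop. */
	for (off = 0; off < len; off += (size_t)n) {
		n = getrandom(key + off, len - off, 0);
		if (n == -1)
			err(1, "getrandom");
	}
}

static void
fill_nonce(uint8_t *nonce, size_t len)	/* len must be <= 256 */
{

	if (getentropy(nonce, len) == -1)
		err(1, "getentropy");
}
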
Jamie Gritton
672756aa9f Represent boolean jail options as an array of structures containing the
flag and both the regular and "no" names, instead of two different string
arrays whose indices need to match the flag's bit position.  This makes
them similar to the way "jailsys" options are represented.

Loop through either kind of option array with a structure pointer rather
then an integer index.
2018-03-20 23:08:42 +00:00
Gleb Smirnoff
83fc34ea0d At this point iwmesg isn't initialized yet, so print a pointer to the lock
rather than panicking before the intended panic.
2018-03-20 22:05:21 +00:00
Mark Johnston
8c7549da2b Drop KTR_CONTENTION.
It is incomplete, has not been adopted in the other locking primitives,
and we have other means of measuring lock contention (lock_profiling,
lockstat, KTR_LOCK). Drop it to slightly de-clutter the mutex code and
free up a precious KTR class index.

Reviewed by:	jhb, mjg
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D14771
2018-03-20 15:51:05 +00:00
Justin Hibbits
2acde6a85a Cast through uintptr_t to narrow the buf domain pointer on 32-bit archs
arg2 is an intmax_t, which on 32-bit architectures is 64 bits, wider than a
pointer.  When &bdomain[i] is added to arg2 it widens from uintptr_t to
intmax_t, then gcc whines when it gets cast to a pointer.  Casting through
uintptr_t silences this warning.
2018-03-20 02:01:30 +00:00
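A tiny sketch of the pattern, with a hypothetical handler and structure name;
the only point is the intermediate uintptr_t cast when narrowing the intmax_t
arg2 back into a pointer on 32-bit platforms.

struct bufdomain_sketch;

static int
sysctl_bufdomain_sketch(SYSCTL_HANDLER_ARGS)
{
	struct bufdomain_sketch *bd;

	/* arg2 is intmax_t (64-bit even on 32-bit archs); cast through
	 * uintptr_t so gcc does not warn about narrowing to a pointer. */
	bd = (struct bufdomain_sketch *)(uintptr_t)arg2;
	(void)bd;	/* ... report per-domain statistics here ... */
	return (0);
}
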