16279 Commits

Author SHA1 Message Date
mjg
e3c2d3a906 Eliminate false sharing in malloc due to statistic collection
Currently stats are collected in a MAXCPU-sized array which is not
aligned and suffers enormous false-sharing. Fix the problem by
utilizing per-cpu allocation.

The counter(9) API is not used here as it is too incomplete and does
not provide a win over per-cpu zone sized for malloc stats struct. In
particular stats are being reported for each cpu separately by just
copying what is supposed to be an array element for given cpu.

This eliminates significant false-sharing during malloc-heavy tests
e.g. on Skylake. See the review for details.

Reviewed by:	markj
Approved by:	re (kib)
Differential Revision:	https://reviews.freebsd.org/D17289
2018-09-23 19:00:06 +00:00
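The false-sharing problem described in e3c2d3a906 can be illustrated in userland C. This is only a sketch: it swaps in plain cache-line padding rather than the kernel's per-cpu zone allocation that the commit actually uses, and `CACHE_LINE`, `NCPU`, and all struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

#define CACHE_LINE 64	/* assumed cache-line size */
#define NCPU 4		/* hypothetical CPU count for the sketch */

/* Packed layout: counters of neighbouring CPUs share one cache line. */
struct stats_packed {
	uint64_t allocs;
	uint64_t frees;
};

/* Padded layout: each CPU's counters occupy a cache line of their own. */
struct stats_padded {
	alignas(CACHE_LINE) uint64_t allocs;
	uint64_t frees;
};

static struct stats_padded percpu_stats[NCPU];

/* Each CPU only ever touches its own slot. */
static void
stats_record_alloc(int cpu)
{
	assert(cpu >= 0 && cpu < NCPU);
	percpu_stats[cpu].allocs++;
}

/* Reporting walks all slots and sums them. */
static uint64_t
stats_total_allocs(void)
{
	uint64_t total = 0;

	for (int i = 0; i < NCPU; i++)
		total += percpu_stats[i].allocs;
	return (total);
}
```

With the packed layout, four CPUs' counters fit in a single 64-byte line, so every update invalidates that line for the other three; the padded layout trades memory for independence.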
mjg
00ab82889c select: stop doing zero-sized memsets
Approved by:	re (kib)
2018-09-21 13:20:41 +00:00
markj
c030a808b9 Ensure that imports into per-domain kmem arenas are KVA_QUANTUM-aligned.
The old code appears to assume that vmem_alloc() would import
size-aligned KVA chunks from the parent kernel_arena, but vmem doesn't
provide this guarantee.

Also remove the unused global RWX arena and add comments explaining why
we have per-domain arenas.

Reported by:	alc
Reviewed by:	alc, kib (previous version)
Approved by:	re (gjb)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17249
2018-09-20 18:29:55 +00:00
mjg
846a8dd029 vfs: remove lookup_shared tunable
Reviewed by:	kib, jhb
Approved by:	re (gjb)
Differential Revision:	https://reviews.freebsd.org/D17253
2018-09-20 18:25:26 +00:00
mjg
4279599452 fd: prevent inlining of _fdrop throughout kern_descrip.c
fdrop is used in several places in the file and almost never has to call
_fdrop. Thus inlining it is a pure waste of space.

Approved by:	re (kib)
2018-09-20 13:32:40 +00:00
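The pattern behind 4279599452, an inlined fast path with an out-of-line slow path, can be sketched roughly as follows. The names `file_sketch`, `fast_drop`, and `slow_drop` are hypothetical stand-ins, not the kernel's `struct file`, `fdrop`, or `_fdrop`, and the GCC/Clang `noinline` attribute is an assumption about the compiler.

```c
#include <assert.h>
#include <stdatomic.h>

struct file_sketch {		/* hypothetical stand-in for a file object */
	atomic_int refcount;
	int closed;
};

/* Slow path: kept out of line so each caller does not carry a copy. */
__attribute__((noinline)) static int
slow_drop(struct file_sketch *fp)
{
	fp->closed = 1;
	return (0);
}

/* Fast path: small enough to inline; rarely reaches the slow path. */
static inline int
fast_drop(struct file_sketch *fp)
{
	assert(atomic_load(&fp->refcount) > 0);
	/* fetch_sub returns the previous value; 1 means the last reference. */
	if (atomic_fetch_sub(&fp->refcount, 1) == 1)
		return (slow_drop(fp));
	return (0);
}
```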
kib
610ca65f57 Fix state of dquot-less vnodes after failed quotaoff.
UFS quotaoff iterates over all mp vnodes, and dereferences and clears
the pointers to corresponding dquots. If SU work items transiently
reference some of the dquots, quotaoff() would eventually fail, but all
processed vnodes are already stripped from dquots.  The state is
problematic, since quotas are left enabled, but there are no dquots
where blocks and inodes can be accounted.  The result is assertion
failures and NULL pointer dereferences.

Fix it by suspending writes around the quotaoff() call.  Since the
filesystem is synced, no dangling references to dquots from SU
workitems can be left behind, which means that quotaoff succeeds.

The complication there is that quotaoff VFS op is performed with the
mount point busied, while to suspend, we need to start write on the
mp.  If vn_start_write() is called on busied mp, system might deadlock
against parallel unmount request.  Handle this by unbusy-ing mp before
starting write, which in turn requires changing the quotaoff()
interface to return with the mount point not busied, same as was done
for quotaon().

Reviewed by:	mckusick
Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
Approved by:	re (gjb)
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D17208
2018-09-19 14:36:57 +00:00
gordon
8f59931436 Correct ELF header parsing code to prevent invalid ELF sections from
disclosing memory.

Submitted by:	markj
Reported by:	Thomas Barabosch, Fraunhofer FKIE
Approved by:	re (implicit)
Approved by:	so
Security:	FreeBSD-SA-18:12.elf
Security:	CVE-2018-6924
Sponsored by:	The FreeBSD Foundation
2018-09-12 04:57:34 +00:00
markj
02b7bd84a2 Rename hardclock_cnt() to hardclock() and remove the old implementation.
Also remove some related and unused subroutines.  They have long been
replaced by variants that handle multiple coalesced events with a single
call.

No functional change intended.

Reviewed by:	cem, kib
Approved by:	re (gjb)
Differential Revision:	https://reviews.freebsd.org/D17029
2018-09-06 02:10:59 +00:00
markj
b1bab99c84 Correct the condition under which we allocate a terminator node.
We will have last_block < blocks if the block count is divisible
by BLIST_BMAP_RADIX, but a terminator node is still needed if the
tree isn't balanced.  In this case we were overrunning the blist
array by 16 bytes during initialization.

While here, add a check for the invalid blocks == 0 case.

PR:		231116
Reviewed by:	alc, kib (previous version), Doug Moore <dougm@rice.edu>
Approved by:	re (gjb)
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17020
2018-09-05 19:05:30 +00:00
kib
0a635fe513 Add amd64 mdthread fields needed for the upcoming EFI RT exception
handling.

This is split into a separate commit from the main change to make it
easier to handle possible revert after upcoming KBI freeze.

Reviewed by:	kevans
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:    re (rgrimes)
Differential revision:	https://reviews.freebsd.org/D16972
2018-09-02 21:16:43 +00:00
kib
c277253350 Improve error messages from clock_if.m method failures.
Print error message in verbose mode when CLOCK_SETTIME() clock_if.m
method failed.  For EFIRT RTC clock, add error code for the failure of
CLOCK_GETTIME() report.

Reviewed by:	kevans
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:    re (rgrimes)
Differential revision:	https://reviews.freebsd.org/D16972
2018-09-02 20:17:51 +00:00
markm
d8723e8b03 Remove the Yarrow PRNG algorithm option in accordance with due notice
given in random(4).

This includes updating of the relevant man pages, and no-longer-used
harvesting parameters.

Ensure that the pseudo-unit-test still does something useful, now also
with the "other" algorithm instead of Yarrow.

PR:		230870
Reviewed by:	cem
Approved by:	so(delphij,gtetlow)
Approved by:	re(marius)
Differential Revision:	https://reviews.freebsd.org/D16898
2018-08-26 12:51:46 +00:00
alc
3799d78beb Eliminate the arena parameter to kmem_free(). Implicitly this corrects an
error in the function hypercall_memfree(), where the wrong arena was being
passed to kmem_free().

Introduce a per-page flag, VPO_KMEM_EXEC, to mark physical pages that are
mapped in kmem with execute permissions.  Use this flag to determine which
arena the kmem virtual addresses are returned to.

Eliminate UMA_SLAB_KRWX.  The introduction of VPO_KMEM_EXEC makes it
redundant.

Update the nearby comment for UMA_SLAB_KERNEL.

Reviewed by:	kib, markj
Discussed with:	jeff
Approved by:	re (marius)
Differential Revision:	https://reviews.freebsd.org/D16845
2018-08-25 19:38:08 +00:00
imp
2608f6dbcf Add a new device flag: DF_ATTACHED_ONCE
This flag is set once the device has been successfully attached. When
set, it inhibits devmatch from trying to match the device. This in
turn allows kldunload to work as expected. Prior to the change, the
driver would immediately reload because devmatch had no notion that
the driver had once been attached, and therefore shouldn't participate
in further matching.

Differential Revision: https://reviews.freebsd.org/D16735
2018-08-23 05:06:16 +00:00
imp
485cde8df8 Create devctl freeze/thaw.
This adds it to devctl, libdevctl, defines the two IOCTLs and
implements the kernel bits. A freeze causes any new drivers that are
added via kldload to be deferred until a 'thaw' comes in. These do not
stack: it is an error to freeze while frozen, or thaw while thawed.

Differential Revision: https://reviews.freebsd.org/D16735
2018-08-23 05:05:47 +00:00
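The non-stacking freeze/thaw rule from 485cde8df8 amounts to a two-state machine. A minimal sketch, assuming `EBUSY` as the error code (the commit message does not say which error is returned) and using hypothetical function names:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

static bool frozen;	/* hypothetical global freeze state */

/* Freezing while already frozen is an error; the state does not nest. */
static int
sketch_freeze(void)
{
	if (frozen)
		return (EBUSY);	/* assumed error code */
	frozen = true;
	return (0);
}

/* Thawing while not frozen is likewise an error. */
static int
sketch_thaw(void)
{
	if (!frozen)
		return (EBUSY);	/* assumed error code */
	frozen = false;
	return (0);
}
```

Because the state is a flag rather than a counter, one thaw always undoes a freeze regardless of how many callers attempted to freeze.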
cem
c843990202 devstat(9): Constify function parameters that can be const
No functional change.

When attempting to document the changed argument types in devstat.9, I
discovered the 20 year old manual page severely mismatched reality even
prior to my simple change.  So I took a first cut pass cleaning that up to
match reality.  I'm sure I've missed some things; the goal was just to leave
it better than when I started.

Sponsored by:	Dell EMC Isilon
2018-08-23 01:42:45 +00:00
cem
ffa2c7d287 KASSERT: Make runtime optionality optional
Add an option, KASSERT_PANIC_OPTIONAL, that allows runtime KASSERT()
behavior changes.  When this option is not enabled, code that allows
KASSERTs to become optional is not enabled, and all violated assertions
cause termination.

The runtime KASSERT behavior was added in r243980.

One important distinction here is that panic has __dead2
("__attribute__((noreturn))"), while kassert_panic does not.  Static analyzers
like Coverity understand __dead2.  Without it, KASSERTs go misunderstood,
resulting in many false positives that result from violation of program
invariants.

Reviewed by:	jhb, jtl, np, vangyzen
Relnotes:	yes
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D16835
2018-08-22 22:19:42 +00:00
markj
1d715a5168 Prepare the kernel linker to handle PC-relative ifunc relocations.
The boot-time ifunc resolver assumes that it only needs to apply
IRELATIVE relocations to PLT entries.  With an upcoming optimization,
this assumption no longer holds, so add the support required to handle
PC-relative relocations targeting GNU_IFUNC symbols.
- Provide a custom symbol lookup routine that can be used in early boot.
  The default lookup routine uses kobj, which is not functional at that
  point.
- Apply all existing relocations during boot rather than filtering
  IRELATIVE relocations.
- Ensure that we continue to apply ifunc relocations in a second pass
  when loading a kernel module.

Reviewed by:	kib
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D16749
2018-08-22 20:44:30 +00:00
tuexen
3e0ad7d794 Add SOL_SOCKET level socket option with name SO_DOMAIN to get
the domain of a socket.

This is helpful when testing, and Solaris and Linux have the same
socket option using the same name.

Reviewed by:		bcr@, rrs@
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D16791
2018-08-21 14:04:30 +00:00
alc
71b5b012c4 Eliminate kmem_alloc_contig()'s unused arena parameter.
Reviewed by:	hselasky, kib, markj
Discussed with:	jeff
Differential Revision:	https://reviews.freebsd.org/D16799
2018-08-20 15:57:27 +00:00
kevans
ad8cc64e2b res_find: Fix fallback logic
The fallback logic was broken if hints were found in multiple environments.
If we found a hint in either the loader environment or the static
environment, fallback would be incremented excessively when we returned to
the environment-selection bits. These checks should have also been guarded
by the fbacklvl checks. As a result, fbacklvl could quickly get to a point
where we skip either the static environment and/or the static hints
depending on which environments contained valid hints.

The impact of this bug is minimal, mostly affecting mips boards that use
static hints and may have hints in either the loader environment or the
static environment.

There may be better ways to express the searchable environments and
describe their characteristics (immutable, already searched, etc.) but
this may be revisited after 12 branches.

Reported by:	Dan Nelson <dnelson_1901@yahoo.com>
Triaged by:	Dan Nelson <dnelson_1901@yahoo.com>
MFC after:	3 days
2018-08-18 19:45:56 +00:00
delphij
2acd1f2a25 Regen after r337998. 2018-08-18 06:33:51 +00:00
delphij
4f62d03ca0 getrandom(2) should not be restricted in capability mode. 2018-08-18 06:31:49 +00:00
markj
d1a00acf4d Typo.
X-MFC with:	r337974
2018-08-17 16:07:06 +00:00
markj
4e68a99c04 Add INVARIANTS-only fences around lockless vnode refcount updates.
Some internal KASSERTs access the v_iflag field without the vnode
interlock held after such a refcount update.  The fences are needed for
the assertions to be correct in the face of store reordering.

Reported and tested by:	jhibbits
Reviewed by:	kib, mjg
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D16756
2018-08-17 15:41:01 +00:00
oshogbo
1aa9b1400a capsicum: allow the setproctitle(3) function in capability mode
In the past, Capsicum allowed changing the process title.
This was broken with r335939.

PR:		230584
Submitted by:	Yuichiro NAITO <naito.yuichiro@gmail.com>
Reported by:	ian@niw.com.au
MFC after:	1 week
2018-08-17 14:35:10 +00:00
kevans
645507f98f subr_prf: Don't write kern.boot_tag if it's empty
This change allows one to set kern.boot_tag="" and not get a blank line
preceding other boot messages. While this isn't super critical- blank lines
are easy to filter out both mentally and in processing dmesg later- it
allows for a mode of operation that matches previous behavior.

I intend to MFC this whole series to stable/11 by the end of the month with
boot_tag empty by default to make this effectively a nop in the stable
branch.
2018-08-17 03:42:57 +00:00
jamie
6b9aac38ce Revert r337922, except for some documentation-only bits. This needs to wait
until user is changed to stop using jail(2).

Differential Revision:	D14791
2018-08-16 19:09:43 +00:00
jamie
94a36bb7c1 Put jail(2) under COMPAT_FREEBSD11. It has been the "old" way of creating
jails since FreeBSD 7.

Along with the system call, put the various security.jail.allow_foo and
security.jail.foo_allowed sysctls partly under COMPAT_FREEBSD11 (or
BURN_BRIDGES).  These sysctls had two disparate uses: on the system side,
they were global permissions for jails created via jail(2) which lacked
fine-grained permission controls; inside a jail, they're read-only
descriptions of what the current jail is allowed to do.  The first use
is obsolete along with jail(2), but keep them for the second, read-only use.

Differential Revision:	D14791
2018-08-16 18:40:16 +00:00
trasz
5d015ffd7a In the help message at the mountroot prompt, suggest something that
actually works and matches the bsdinstall(8) default.

MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
2018-08-15 12:12:21 +00:00
alc
d20c62f3cb Eliminate a redundant assignment.
MFC after:	1 week
2018-08-11 19:21:53 +00:00
kevans
d0635b16cc subr_prf: remove think-o that had returned to local patch
Reported by:	cognet
2018-08-10 15:35:02 +00:00
kevans
46cff726c1 boot tagging: minor fixes
msgbufinit may be called multiple times as we initialize the msgbuf into a
progressively larger buffer. This doesn't happen as of now on head, but it
may happen in the future and we generally support this. As such, only print
the boot tag if we've just initialized the buffer for the first time.

The boot tag also now has a newline appended to it for better visibility,
and has been switched to a normal printf, by request of bde, after we've
denoted that the msgbuf is mapped.
2018-08-10 15:29:06 +00:00
kevans
135e1f225d subr_prf: style(9) the sizeof
Reported by:	jkim, ian
2018-08-09 19:09:06 +00:00
kevans
a4d7516115 subr_prf: Use "sizeof current_boot_tag" instead 2018-08-09 17:53:18 +00:00
kevans
d2718a67f3 BOOT_TAG: Make a config(5) option, expose as sysctl and loader tunable
BOOT_TAG lived briefly in sys/msgbuf.h, but this wasn't necessarily great
for changing it or removing it. Move it into subr_prf.c and add options for
it to opt_printf.h.

One can specify both the BOOT_TAG and BOOT_TAG_SZ (really, size of the
buffer that holds the BOOT_TAG). We expose it as kern.boot_tag and also add
a loader tunable by the same name that we'll fetch upon initialization of
the msgbuf.

This allows for flexibility and also ensures that there's a consistent way
to figure out the boot tag of the running kernel, rather than relying on
headers to be in-sync.

Prodded super-super-lightly by:	imp
2018-08-09 17:47:47 +00:00
kevans
3aecd7a21e msgbuf: Light detailing (const'ify and bool'itize) 2018-08-09 17:42:27 +00:00
luporl
2f30606f2f [ppc] Fix kernel panic when using BOOTP_NFSROOT
On architectures that don't use EARLY_AP_STARTUP, such as PowerPC (and
possibly others), the config task queue may be used uninitialized.
This was observed while trying to mount the root fs from NFS, as
reported here: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=230168.

This patch has 2 main changes:
1- Perform a basic initialization of qgroup_config, similar to
what is done in taskqgroup_adjust, but simpler.
This makes qgroup_config ready to be used during NFS root mount.

2- When EARLY_AP_STARTUP is not used, call inm_init() and
in6m_init() right before SI_SUB_ROOT_CONF, because bootp needs
to send multicast packets to request an IP.

PR:		Bug 230168
Reported by:	sbruno
Reviewed by:	jhibbits, mmacy, sbruno
Approved by:	jhibbits
Differential Revision:	D16633
2018-08-09 14:04:51 +00:00
mmacy
33829b496c epoch_block_wait: don't check TD_RUNNING
struct epoch_thread is not type safe (stack allocated) and thus cannot be dereferenced from another CPU

Reported by: novel@
2018-08-09 05:18:27 +00:00
kevans
47b7c5f8ae kern: Add a BOOT_TAG marker at the beginning of boot dmesg
From the "newly licensed to drive" PR department, add a BOOT_TAG marker (by
default, --<<BOOT>>--) to the beginning of each boot's dmesg. This makes it
easier to do textproc magic to locate the start of each boot and, of
particular interest to some, the dmesg of the current boot.

The PR has a dmesg(8) component as well that I've opted not to include for
the moment- it was the more contentious part of this PR.

bde@ also made the statement that this boot tag should be written with an
ordinary printf, which I've- for the moment- declined to change about this
patch to keep it more transparent to observers of the boot process.

PR:		43434
Submitted by:	dak <aurelien.nephtali@wanadoo.fr> (basically rewritten)
MFC after:	maybe never
2018-08-09 01:32:09 +00:00
kib
4ef74e60b2 Followup to r337430: only call elf_reloc_ifunc on x86.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-08-07 20:43:50 +00:00
kib
df3033467d Add missed handling of local relocs against ifunc target in the obj modules.
Reported and tested by:	wulf
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-08-07 18:26:46 +00:00
markj
7a979485ab Improve handling of control message truncation.
If a recvmsg(2) or recvmmsg(2) caller doesn't provide sufficient space
for all control messages, the kernel sets MSG_CTRUNC in the message
flags to indicate truncation of the control messages.  In the case
of SCM_RIGHTS messages, however, we were failing to dispose of the
rights that had already been externalized into the recipient's file
descriptor table.  Add a new function and mbuf type to handle this
cleanup task, and use it any time we fail to copy control messages
out to the recipient.  To simplify cleanup, control message truncation
is now only performed at control message boundaries.

The change also fixes a few related bugs:
- Rights could be leaked to the recipient process if an error occurred
  while copying out a message's contents.
- We failed to set MSG_CTRUNC if the truncation occurred on a control
  message boundary, e.g., if the caller received two control messages
  and provided only the exact amount of buffer space needed for the
  first.

PR:		131876
Reviewed by:	ed (previous version)
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D16561
2018-08-07 16:36:48 +00:00
kib
4d78ed404a Swap in WKILLED processes.
Swapped-out process that is WKILLED must be swapped in as soon as
possible.  The reason is that such process can be killed by OOM and
its pages can be only freed if the process exits.  To exit, the kernel
stack of the process must be mapped.

When allocating pages for the stack of the WKILLED process on swap in,
use VM_ALLOC_SYSTEM requests to increase the chance of the allocation
to succeed.

Add counter of the swapped out processes to avoid unneeded iteration
over the allprocs list when there is no work to do, reducing the
allproc_lock ownership.

Reviewed by:	alc, markj (previous version)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D16489
2018-08-04 20:45:43 +00:00
markj
75b64fe9d9 Don't check rcv sockbuf limits when sending on a unix stream socket.
sosend_generic() performs an initial comparison of the amount of data
(including control messages) to be transmitted with the send buffer
size. When transmitting on a unix socket, we then compare the amount
of data being sent with the amount of space in the receive buffer;
if insufficient space is available, sbappendcontrol() returns an error
and the data is lost.  This is easily triggered by sending control
messages together with an amount of data roughly equal to the send
buffer size, since the control message size may change in uipc_send()
as file descriptors are internalized.

Fix the problem by removing the space check in sbappendcontrol(),
whose only consumer is the unix sockets code.  The stream sockets code
uses the SB_STOP mechanism to ensure that senders will block if the
receive buffer fills up.

PR:		181741
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D16515
2018-08-04 20:26:54 +00:00
markj
c0f926949d Style. 2018-08-04 20:16:36 +00:00
avg
8d0848ba39 safer wait-free iteration of shared interrupt handlers
The code that iterates a list of interrupt handlers for a (shared)
interrupt, whether in the ISR context or in the context of an interrupt
thread, does so in a lock-free fashion.   Thus, the routines that modify
the list need to take special steps to ensure that the iterating code
has a consistent view of the list.  Previously, those routines tried to
play nice only with the code running in the ithread context.  The
iteration in the ISR context was left to chance.

After commit r336635 atomic operations and memory fences are used to
ensure that ie_handlers list is always safe to navigate with respect to
insertion and removal of list elements.

There is still a question of when it is safe to actually free a removed
element.

The idea of this change is somewhat similar to the idea of the epoch
based reclamation.  There are some simplifications compared to the
general epoch based reclamation.  All writers are serialized using a
mutex, so we do not need to worry about concurrent modifications.  Also,
all read accesses from the open context are serialized too.

So, we can get away with just two epochs / phases.  When a thread removes
an element it switches the global phase from the current phase to the
other and then drains the previous phase.  Only after the draining does
the removed element actually get freed. The code that iterates the list
in the ISR context takes a snapshot of the global phase and then
increments the use count of that phase before iterating the list.  The
use count (in the same phase) is decremented after the iteration.  This
should ensure that there is no iteration over the removed element when
it gets freed.

This commit also simplifies the coordination with the interrupt thread
context.  Now we always schedule the interrupt thread when removing one
of handlers for its interrupt.  This makes the code both simpler and
safer as the interrupt thread masks the interrupt thus ensuring that
there is no interaction with the ISR context.

P.S.  This change matters only for shared interrupts and I realize that
those are becoming a thing of the past (and quickly).  I also understand
that the problem that I am trying to solve is extremely rare.

PR:		229106
Reviewed by:	cem
Discussed with:	Samy Al Bahra
MFC after:	5 weeks
Differential Revision: https://reviews.freebsd.org/D15905
2018-08-03 14:27:28 +00:00
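The two-phase bookkeeping described in 8d0848ba39 can be sketched with C11 atomics. This is a simplified illustration under stated assumptions, not the kernel code: all names are hypothetical, writers are assumed to be serialized by a mutex that is not shown, and the sketch uses default sequentially consistent atomics without reproducing the kernel's exact fences.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int phase;	/* global phase: 0 or 1 */
static atomic_int active[2];	/* readers currently inside each phase */

/* Reader: snapshot the phase and pin it before walking the list. */
static int
reader_enter(void)
{
	int p = atomic_load(&phase);

	atomic_fetch_add(&active[p], 1);
	return (p);
}

/* Reader: release the same phase that was pinned on entry. */
static void
reader_exit(int p)
{
	int prev = atomic_fetch_sub(&active[p], 1);

	assert(prev > 0);
}

/* Writer (assumed serialized): switch phases, return the old one. */
static int
writer_flip(void)
{
	int old = atomic_load(&phase);

	atomic_store(&phase, old ^ 1);
	return (old);
}

/* Writer: a removed element may be freed once the old phase is empty. */
static int
phase_drained(int old)
{
	return (atomic_load(&active[old]) == 0);
}
```

A writer would remove the element from the list, call `writer_flip()`, wait until `phase_drained()` reports true for the returned phase, and only then free the element.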
asomers
fabe732b5e Fix LOCAL_PEERCRED with socketpair(2)
Enable the LOCAL_PEERCRED socket option for unix domain stream sockets
created with socketpair(2). Previously, it only worked with unix domain
stream sockets created with socket(2)/listen(2)/connect(2)/accept(2).

PR:		176419
Reported by:	Nicholas Wilson <nicholas@nicholaswilson.me.uk>
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D16350
2018-08-03 01:37:00 +00:00
avg
d9bc6597cf fix a typo resulting in a wrong variable in kern_syscall_deregister
The difference is between sysent, a global, and sysents, a function
parameter.
2018-08-02 09:41:55 +00:00
markj
3a415f27d2 Remove a redundant check.
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
2018-07-30 17:58:41 +00:00