Commit Graph

16605 Commits

Author SHA1 Message Date
Brooks Davis
3a325dec32 Remove a needlessly clever hack to start init with sys_exec().
Construct a struct image_args with the help of new exec_args_*() helper
functions and call kern_execve().

The previous code mapped a page in userspace, copied arguments out
to it one at a time, and then constructed a struct execve_args all so
that sys_execve() can call exec_copyin_args() to copy the data back in
to a struct image_args.

Opencode the part of pre_execve()/post_execve() that releases a
reference to the initial vmspace. We don't need to stop threads like
they do.

Reviewed by:	kib, jhb (prior version)
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D15469
2018-12-04 00:15:47 +00:00
Mark Johnston
02164d3603 Add a missing definition for the !COMPAT_FREEBSD32 case.
Reported by:	jenkins
MFC with:	r341442
Sponsored by:	The FreeBSD Foundation
2018-12-03 21:07:10 +00:00
Mark Johnston
352aaa5122 Plug memory disclosures via ptrace(2).
On some architectures, the structures returned by PT_GET*REGS were not
fully populated and could contain uninitialized stack memory.  The same
issue existed with the register files in procfs.

Reported by:	Thomas Barabosch, Fraunhofer FKIE
Reviewed by:	kib
MFC after:	3 days
Security:	kernel stack memory disclosure
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18421
2018-12-03 20:54:17 +00:00
Konstantin Belousov
200bf72793 Correct accuracy of the barrier writes accounting.
Discussed with:	mckusick
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2018-12-02 12:53:39 +00:00
Eric van Gyzen
5e38e3f5eb Include path for tmpfs objects in vm.objects sysctl
This applies the fix in r283924 to the vm.objects sysctl
added by r283624 so the output will include the vnode
information (i.e. path) for tmpfs objects.

Reviewed by:	kib, dab
MFC after:	2 weeks
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D2724
2018-11-30 04:59:43 +00:00
Brooks Davis
f373437a01 Add helper functions to copy strings into struct image_args.
Given a zeroed struct image_args with an allocated buf member,
exec_args_add_fname() must be called to install a file name (or NULL).
Then zero or more calls to exec_args_add_env() followed by zero or
more calls to exec_args_add_env(). exec_args_adjust_args() may be
called after args and/or env to allow an interpreter to be prepended to
the argument list.

To allow code reuse when adding arg and env variables, begin_envv
should be accessed with the accessor exec_args_get_begin_envv()
which handles the case when no environment entries have been added.

Use these functions to simplify exec_copyin_args() and
freebsd32_exec_copyin_args().

Reviewed by:	kib
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D15468
2018-11-29 21:00:56 +00:00
Konstantin Belousov
7d2b0bd7d7 If BENEATH is specified, always latch the topping directory vnode.
It is possible that we started with a relative path but during the
lookup, found an absolute symlink.  In this case, BENEATH handling
code needs the latch, but it is too late to calculate it.

While there, somewhat improve the assertions.  Clear the NI_LCF_LATCH
flag when the latch vnode is released, so that asserts know the state.
Assert that there is a latch if we entered beneath+abs path mode,
after the starting point is processed.

Reported by:	wulf
With more input from:	pho
Sponsored by:	The FreeBSD Foundation
2018-11-29 19:13:10 +00:00
Mateusz Guzik
1f6ad48c76 vfs: fix i386 build after r341220 2018-11-29 09:54:27 +00:00
Mateusz Guzik
22443809ff cache: retire cache_enter compat schim
It was added over 6 years ago for binary compat. cache_enter macro remains
as it expands to cache_enter_time.

Sponsored by:	The FreeBSD Foundation
2018-11-29 09:32:59 +00:00
Mateusz Guzik
712775843f vfs: drop spurious memcpy in stat
Sponsored by:	The FreeBSD Foundation
2018-11-29 09:04:10 +00:00
Mateusz Guzik
d47f3fdb0a fd: unify fd range check across the routines
While here annotate out of range as unlikely.

Sponsored by:	The FreeBSD Foundation
2018-11-29 08:53:39 +00:00
Mateusz Guzik
eec8d0a378 Convert racct_enable to bool and annotate as __read_frequently
Sponsored by:	The FreeBSD Foundation
2018-11-29 05:17:16 +00:00
Mateusz Guzik
64cf6a62d4 Deinline racct throttling out of syscall exit path.
racct is not enabled by default and even when it is enabled processes are
typically not throttled. The order of checks is left unchanged since
racct_enable will be annotated as __read_frequently, while checking for the
flag in the processes would probably require an extra fetch.

Sponsored by:	The FreeBSD Foundation
2018-11-29 05:08:46 +00:00
Mateusz Guzik
e272bf479b Annotate td_cowgen check as unlikely.
Sponsored by:	The FreeBSD Foundation
2018-11-29 04:48:22 +00:00
Mateusz Guzik
3277792bde Tidy up hardclock.
- use fcmpset for updating ticks
- move (rarely used) itimer handling to a dedicated function

Sponsored by:	The FreeBSD Foundation
2018-11-29 03:44:02 +00:00
Mateusz Guzik
1e9a1bf589 proc: create a dedicated lock for zombproc to ligthen the load on allproc_lock
waitpid always takes proctree to evaluate the list, but only takes allproc
if it can reap. With this patch allproc is no longer taken, which helps during
poudriere -j 128.

Discussed with: kib
Sponsored by:	The FreeBSD Foundation
2018-11-29 02:52:08 +00:00
Konstantin Belousov
affd918514 Improve sigonstack().
Avoid relying on unsigned overflow for the test.
Simplify expressions to avoid duplicate check for the range.
Style.
Add herald comment.

Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D18361
2018-11-27 19:50:58 +00:00
Jamie Gritton
b307954481 In hardened systems, where the security.bsd.unprivileged_proc_debug sysctl
node is set, allow setting security.bsd.unprivileged_proc_debug per-jail.
In part, this is needed to create jails in which the Address Sanitizer
(ASAN) fully works as ASAN utilizes libkvm to inspect the virtual address
space. Instead of having to allow unprivileged process debugging for the
entire system, allow setting it on a per-jail basis.

The sysctl node is still security.bsd.unprivileged_proc_debug and the
jail(8) param is allow.unprivileged_proc_debug. The sysctl code is now a
sysctl proc rather than a sysctl int. This allows us to determine setting
the flag for the corresponding jail (or prison0).

As part of the change, the dynamic allow.* API needed to be modified to
take into account pr_allow flags which may now be disabled in prison0.
This prevents conflicts with new pr_allow flags (like that of vmm(4)) that
are added (and removed) dynamically.

Also teach the jail creation KPI to allow differences for certain pr_allow
flags between the parent and child jail. This can happen when unprivileged
process debugging is disabled in the parent prison, but enabled in the
child.

Submitted by:	Shawn Webb <lattera at gmail.com>
Obtained from:	HardenedBSD (45b3625edba0f73b3e3890b1ec3d0d1e95fd47e1, deba0b5078cef0faae43cbdafed3035b16587afc, ab21eeb3b4c72f2500987c96ff603ccf3b6e7de8)
Relnotes:	yes
Sponsored by:	HardenedBSD and G2, Inc
Differential Revision:	https://reviews.freebsd.org/D18319
2018-11-27 17:51:50 +00:00
Eric van Gyzen
607a0eb2f1 Remove superfluous bzero in getcontext/swapcontext/sendsig
We zero the whole structure; we don't need to zero the __spare__ field again.

Remove trailing whitespace.

MFC after:	2 weeks
Sponsored by:	Dell EMC Isilon
2018-11-26 20:56:05 +00:00
Alan Somers
72bce9fff6 vfs_aio.c: rename "physio" symbols to "bio".
aio has two paths: an asynchronous "physio" path and a synchronous path.
Confusingly, physio(9) isn't actually used by the "physio" path, and never
has been.  In fact, it may even be called by the synchronous path!  Rename
the "physio" path to the "bio" path to reflect what it actually does:
directly compose BIOs and send them to character devices.

MFC after:	2 weeks
2018-11-26 18:31:00 +00:00
Alan Cox
ee73fef96e blist_meta_alloc assumes that mask=scan->bm_bitmap is nonzero. But if the
cursor lies in the middle of the space that the meta node represents, then
blanking the low bits of mask may make it zero, and break later code that
expects a nonzero value.  Add a test that returns failure if the mask has
been cleared.

Submitted by:	Doug Moore <dougm@rice.edu>
Reported by:	pho
Tested by:	pho
X-MFC with:	r340402
Differential Revision:	https://reviews.freebsd.org/D18058
2018-11-24 21:52:10 +00:00
Mark Johnston
792843c38f Pass malloc flags directly through kevent(2) subroutines.
Some kevent functions have a boolean "waitok" parameter for use when
calling malloc(9).  Replace them with the corresponding malloc() flags:
the desired behaviour is known at compile-time, so this eliminates a
couple of conditional branches, and makes the code easier to read.

No functional change intended.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18318
2018-11-24 17:06:01 +00:00
Mark Johnston
36c4960ef8 Plug some kernel memory disclosures via kevent(2).
The kernel may register for events on behalf of a userspace process,
in which case it must be careful to zero the kevent struct that will be
copied out to userspace.

Reviewed by:	kib
MFC after:	3 days
Security:	kernel stack memory disclosure
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18317
2018-11-24 17:02:31 +00:00
Mark Johnston
a2afae524a Ensure that knotes do not get registered when KQ_CLOSING is set.
KQ_CLOSING is set before draining the knotes associated with a kqueue,
so we must ensure that new knotes are not added after that point.  In
particular, some kernel facilities may register for events on behalf
of a userspace process and race with a close of the kqueue.

PR:		228858
Reviewed by:	kib
Tested by:	pho
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18316
2018-11-24 16:58:34 +00:00
Mark Johnston
1eeab857a3 Lock the knlist before releasing the in-flux state in knote_fork().
Otherwise there is a window, before iteration is resumed, during which
the knote may be freed.  The in-flux state ensures that the knote will
not be removed from the knlist while locks are dropped.

PR:		228858
Reviewed by:	kib
Tested by:	pho
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18316
2018-11-24 16:41:29 +00:00
Konstantin Belousov
cefb93f253 Parse FreeBSD Feature Control note on the ELF image activation.
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2018-11-23 23:33:55 +00:00
Konstantin Belousov
92328a3251 Generalize ELF parse_notes().
Remove the knowledge of the ABI note type and brandnote from it,
instead provide it with a callback to do note-specific matching and
data fetching.  Implement callback to match against ELF brand.

Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2018-11-23 23:29:14 +00:00
Konstantin Belousov
eda8fe63c9 Trivial reduction of the code duplication, reuse the return FALSE code.
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2018-11-23 23:16:01 +00:00
Mark Johnston
96fdfb3649 Honour the waitok parameter in kevent_expand().
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18316
2018-11-23 23:10:03 +00:00
Konstantin Belousov
f5cf758998 Provide storage for the process feature control flags in struct proc.
The flags are cleared on exec, it is up to the image activator to set
them.

Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2018-11-23 23:07:57 +00:00
Mark Johnston
6d2e2df764 Ensure that directory entry padding bytes are zeroed.
Directory entries must be padded to maintain alignment; in many
filesystems the padding was not initialized, resulting in stack
memory being copied out to userspace.  With the ino64 work there
are also some explicit pad fields in struct dirent.  Add a subroutine
to clear these bytes and use it in the in-tree filesystems.  The
NFS client is omitted for now as it was fixed separately in r340787.

Reported by:	Thomas Barabosch, Fraunhofer FKIE
Reviewed by:	kib
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
2018-11-23 22:24:59 +00:00
Mateusz Guzik
e3d3e8289b Revert "fork: fix use-after-free with vfork"
This unreliably breaks libc handling of vfork where forking succeded,
but execve did not.

vfork code in libc performs waitpid with WNOHANG in case of failed exec.
With the fix exit codepath was waking up the parent before the child
fully transitioned to a zombie. Woken up parent would waitpid, which
could find a not-yet-zombie child and fail to reap it due to the WNOHANG
flag.

While removing the flag fixes the problem, it is not an option due to older
releases which would still suffer from the kernel change.

Revert the fix until a solution can be worked out.

Note that while use-after-free which gets back due to the revert is a real
bug, it's side-effects are limited due to the fact that struct proc memory
is never released by UMA.
2018-11-23 04:38:50 +00:00
Mateusz Guzik
adce241981 Annotate TDP_RFPPWAIT as unlikely.
The flag is only set on vfork, but is tested for *all* syscalls.
On amd64 this shortens common-case (not vfork) code.
2018-11-22 21:38:24 +00:00
Mateusz Guzik
a5ac8272c0 fork: remove avoidable proc lock/unlock pair
We don't have to access the process after making it runnable, so there
is no need to hold it either.

Sponsored by:	The FreeBSD Foundation
2018-11-22 21:29:36 +00:00
Mateusz Guzik
b00b27e925 fork: fix use-after-free with vfork
The pointer to the child is stored without any reference held. Then it is
blindly used to wait until P_PPWAIT is cleared. However, if the child is
autoreaped it could have exited and get freed before the parent started
waiting.

Use the existing hold mechanism to mitigate the problem. Most common case
of doing exec remains unchanged. The corner case of doing exit performs
wake up before waiting for holds to clear.

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18295
2018-11-22 21:08:37 +00:00
Mark Johnston
79db6fe7aa Plug some networking sysctl leaks.
Various network protocol sysctl handlers were not zero-filling their
output buffers and thus would export uninitialized stack memory to
userland.  Fix a number of such handlers.

Reported by:	Thomas Barabosch, Fraunhofer FKIE
Reviewed by:	tuexen
MFC after:	3 days
Security:	kernel memory disclosure
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18301
2018-11-22 20:49:41 +00:00
Mateusz Guzik
f218ac5087 uipc_usrreq: fix inode number assignment
The code was incrementing a global variable in an unsafe manner.
Two different threads stating two different sockets could have resulted
in the same inode numbers assigned to both.

Creation is protected with a global lock, move the assigment there.
Since inode numbers are 64-bit now drop the check for overflows.

Sponsored by:	The FreeBSD Foundation
2018-11-21 22:25:05 +00:00
Mateusz Guzik
a627b4629d proc: update list manipulation comment on process exit
Processes stay in the hash until they get reaped.

This code does not unlink the child from the parent, so remove
the claim that it does.

Sponsored by:	The FreeBSD Foundation
2018-11-21 22:16:10 +00:00
Mateusz Guzik
7883ce1f26 uipc_shm: use unr64 for inode numbers
Sponsored by:	The FreeBSD Foundation
2018-11-21 22:01:06 +00:00
Mateusz Guzik
53011553fa proc: convert pfind & friends to use pidhash locks and other cleanup
pfind_locked is retired as it relied on allproc which unnecessarily
restricts locking of the hash.

Sponsored by:	The FreeBSD Foundation
2018-11-21 20:15:56 +00:00
Mateusz Guzik
3d3e6793f6 proc: implement pid hash locks and an iterator
forks, exits and waits are frequently stalled during poudriere -j 128 runs
due to killpg and process list exports performed for each package.

Both uses take the allproc lock. The latter case can be modified to iterate
over the hash with finer grained locking instead.

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17817
2018-11-21 18:56:15 +00:00
Mark Johnston
d5e494fee4 Avoid unsynchronized updates to kn_status.
kn_status is protected by the kqueue's lock, but we were updating it
without the kqueue lock held.  For EVFILT_TIMER knotes, there is no
knlist lock, so the knote activation could occur during the kn_status
update and result in KN_QUEUED being lost, in which case we'd enqueue
an already-enqueued knote, corrupting the queue.

Fix the problem by setting or clearing KN_DISABLED before dropping the
kqueue lock to call into the filter.  KN_DISABLED is used only by the
core kevent code, so there is no side effect from setting it earlier.

Reported and tested by:	Sylvain GALLIANO <sg@efficientip.com>
Reviewed by:	kib
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18060
2018-11-21 17:32:09 +00:00
Mark Johnston
45aecd0422 Remove KN_HASKQLOCK.
It is a write-only flag whose last use was removed in r302235.

No functional change intended.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18059
2018-11-21 17:28:10 +00:00
Mark Johnston
bb58b5d670 Add a taskqueue_quiesce(9) KPI.
This is similar to taskqueue_drain_all(9) but will wait for the queue
to become idle before returning instead of only waiting for
already-enqueued tasks to finish.  This will be used in the opensolaris
compat layer.

PR:		227784
Reviewed by:	cem
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17975
2018-11-21 17:18:27 +00:00
Mark Johnston
c7dc361d6f Clear pad bytes in the struct exported by kern.ntp_pll.gettime.
Reported by:	Thomas Barabosch, Fraunhofer FKIE
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
2018-11-20 20:32:10 +00:00
Mateusz Guzik
737037f6c0 pipe: use unr64
Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18054
2018-11-20 14:59:27 +00:00
Mateusz Guzik
435bef7a2f Implement unr64
Important users of unr like tmpfs or pipes can get away with just
ever-increasing counters, making the overhead of managing the state
for 32 bit counters a pessimization.

Change it to an atomic variable. This can be further sped up by making
the counts variable "allocate" ranges and store them per-cpu.

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18054
2018-11-20 14:58:41 +00:00
Ben Widawsky
1a305bda15 acpi: fix acpi_ec_probe to only check EC devices
This patch utilizes the fixed_devclass attribute in order to make sure
other acpi devices with params don't get confused for an EC device.

The existing code assumes that acpi_ec_probe is only ever called with a
dereferencable acpi param. Aside from being incorrect because other
devices of ACPI_TYPE_DEVICE may be probed here which aren't ec devices,
(and they may have set acpi private data), it is even more nefarious if
another ACPI driver uses private data which is not dereferancable. This
will result in a pointer deref during boot and therefore boot failure.

On X86, as it stands today, no other devices actually do this (acpi_cpu
checks for PROCESSOR type devices) and so there is no issue. I ran into
this because I am adding such a device which gets probed before
acpi_ec_probe and sets private data. If ARM ever has an EC, I think
they'd run into this issue as well.

There have been several iterations of this patch. Earlier
iterations had ECDT enumerated ECs not call into the probe/attach
functions of this driver. This change was Suggested by: jhb@.

Reviewed by:    jhb
Approved by:	emaste (mentor)
Differential Revision:  https://reviews.freebsd.org/D16635
2018-11-19 18:29:03 +00:00
Hans Petter Selasky
90acd1d139 Minor code factoring. No functional change.
MFC after:		1 week
Sponsored by:		Mellanox Technologies
2018-11-19 09:36:09 +00:00
Hans Petter Selasky
2205f61a31 Be more verbose when a sysctl fails to unregister.
Print name of sysctl in question.

MFC after:		1 week
Sponsored by:		Mellanox Technologies
2018-11-19 09:35:16 +00:00
Kevin Bowling
2a24f4d911 Retire sbsndptr() KPI
As of r340465 all consumers use sbsndptr_adv and sbsndptr_noadv

Reviewed by:	gallatin
Approved by:	krion (mentor)
Differential Revision:	https://reviews.freebsd.org/D17998
2018-11-19 00:54:31 +00:00
Mateusz Guzik
2c054ce924 proc: always store parent pid in p_oppid
Doing so removes the dependency on proctree lock from sysctl process list
export which further reduces contention during poudriere -j 128 runs.

Reviewed by:	kib (previous version)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17825
2018-11-16 17:07:54 +00:00
Mark Johnston
aeb7a84ee1 Remove mostly-useless proc provider probes.
For some reason the proc UMA zone's ctor, dtor and init functions are
instrumented, but these functions are always available through FBT.
Moreover, the probes are not part of the original Solaris proc
provider, aren't documented, have no uses (e.g., in dwatch(8)) and
have no clear use to begin with.  Therefore, remove them.

Reviewed by:	rpaulo
Differential Revision:	https://reviews.freebsd.org/D2169
2018-11-15 23:02:59 +00:00
Warner Losh
36173f6976 Do proper conversion to/from sbt.
Doh! sbttoX and Xtosbt were backwards. While they ran, they produced
bogus results.

Pointy hat to: imp@
2018-11-15 16:02:24 +00:00
Gleb Smirnoff
905837ebe7 Initialize compatibility epoch tracker for thread0. Fixes
panics for drivers that call if_maddr_lock() during startup.

Reported by:	cy
2018-11-14 19:10:35 +00:00
Brooks Davis
5b1df30051 Use the main capabilities.conf for freebsd32.
Allow the location of capabilities.conf to be configured.

Also allow a per-abi syscall prefix to be configured with the
abi_func_prefix syscalls.conf variable and check syscalls against
entries in capabilities.conf with and without the prefix amended.

Take advantage of these two features to allow use shared capabilities.conf
between the default syscall vector and the freebsd32 compatability
layer.  We've been inconsistent about keeping the two in sync as
evidenced by the bugs fixed in r340294.  This eliminates that problem
going forward.

Reviewed by:	kib
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D17932
2018-11-14 00:46:02 +00:00
Gleb Smirnoff
6febf18036 Fix build on some architectures after r340413. On amd64 epoch.h
appeared to be included implicitly.
2018-11-14 00:33:03 +00:00
Matt Macy
91cf497515 epoch(9) revert r340097 - no longer a need for multiple sections per cpu
I spoke with Samy Bahra and recent changes to CK to make ck_epoch_call and
ck_epoch_poll not modify the record have eliminated the need for this.
2018-11-14 00:12:04 +00:00
Gleb Smirnoff
635c18840a style(9), mostly adjusting overly long lines. 2018-11-13 23:57:34 +00:00
Gleb Smirnoff
a760c50c9e With epoch not inlined, there is no point in using _lite KPI. While here,
remove some unnecessary casts.
2018-11-13 23:45:38 +00:00
Gleb Smirnoff
9f360eecf9 The dualism between epoch_tracker and epoch_thread is fragile and
unnecessary. So, expose CK types to kernel and use a single normal
structure for epoch_tracker.

Reviewed by:	jtl, gallatin
2018-11-13 23:20:55 +00:00
Gleb Smirnoff
b79aa45e0e For compatibility KPI functions like if_addr_rlock() that used to have
mutexes but now are converted to epoch(9) use thread-private epoch_tracker.
Embedding tracker into ifnet(9) or ifnet derived structures creates a non
reentrable function, that will fail miserably if called simultaneously from
two different contexts.
A thread private tracker will provide a single tracker that would allow to
call these functions safely. It doesn't allow nested call, but this is not
expected from compatibility KPIs.

Reviewed by:	markj
2018-11-13 22:58:38 +00:00
Mateusz Guzik
f183fb162c locks: plug warnings about unitialized variables
They only showed up after I redefined LOCKSTAT_ENABLED to 0.

doing_lockprof in mutex.c is a real (but harmless) bug. Should the
value be non-zero it will do checks for lock profiling which would
otherwise be skipped.

state in rwlock.c is a wart from the compiler, the value can't be
used if lock profiling is not enabled.

Sponsored by:	The FreeBSD Foundation
2018-11-13 21:29:56 +00:00
Eric van Gyzen
d54474e63b Make no assertions about lock state when the scheduler is stopped.
Change the assert paths in rm, rw, and sx locks to match the lock
and unlock paths.  I did this for mutexes in r306346.

Reported by:	Travis Lane <tlane@isilon.com>
MFC after:	2 weeks
Sponsored by:	Dell EMC Isilon
2018-11-13 20:48:05 +00:00
Gleb Smirnoff
a82296c2df Uninline epoch(9) entrance and exit. There is no proof that modern
processors would benefit from avoiding a function call, but bloating
code. In fact, clang created an uninlined real function for many
object files in the network stack.

- Move epoch_private.h into subr_epoch.c. Code copied exactly, avoiding
  any changes, including style(9).
- Remove private copies of critical_enter/exit.

Reviewed by:	kib, jtl
Differential Revision:	https://reviews.freebsd.org/D17879
2018-11-13 19:02:11 +00:00
Mark Johnston
bb4a27f927 Allow allocations across meta boundaries.
Remove restrictions that prevent allocation requests to cross the
boundary between two meta nodes.

Replace the bmu_avail field in meta nodes with a bitmap that identifies
which subtrees have some free memory, and iterate over the nonempty
subtrees only in blst_meta_alloc.  If free memory is scarce, this should
make searching for it faster.

Put the code for handling the next-leaf allocation in a separate
function.  When taking blocks from the next leaf empties the leaf, be
sure to clear the appropriate bit in its parent, and so on, up to the
least-common ancestor of this leaf and the next.

Eliminate special terminator nodes, and rely instead on the fact that
there is a 0-bit at the end of the bitmask at the root of the tree that
will stop a meta_alloc search, or a next-leaf search, before the search
falls off the end of the tree. Make sure that the tree is big enough to
have space for that 0-bit.

Eliminate special all-free indicators.  Lazy initialization of subtrees
stands in the way of having an allocation span a meta-node boundary, so
a subtree of all free blocks is not treated specially.  Subtrees of
all-allocated blocks are still recognized by looking at the bitmask at
the root and finding 0.

Don't print all-allocated subtrees.  Do print the bitmasks for meta
nodes, when tree-printing.

Submitted by:	Doug Moore <dougm@rice.edu>
Reviewed by:	alc
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D12635
2018-11-13 18:40:01 +00:00
Kyle Evans
75beb4d46a Add dynamic_kenv assertion to init_static_kenv
Both to formally document the requirement that this not be called after the
dynamic kenv is setup, and to perhaps help static analyzers figure out
what's going on. While calling init_static_kenv this late isn't fatal, there
are some caveats that the caller should be aware of:

- Late calls are effectively a no-op, as far as default FreeBSD is
concerned, as everything will switch to searching the dynamic kenv once it's
available.

- Each of the kern_getenv calls will leak memory, as it's assumed that
these are searching static environment and allocations will not be made.

As such, this usage is not sensible and should be detected.
2018-11-13 04:34:30 +00:00
Konstantin Belousov
389474c122 Allow set ether/vlan PCP operation from the VNET jails.
The vlan interfaces can be created from vnet jails, it seems, so it
sounds logical to allow pcp configuration as well.

Reviewed by:	bz, hselasky (previous version)
Sponsored by:	Mellanox Technologies
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D17777
2018-11-12 15:59:32 +00:00
Conrad Meyer
0d1467b199 netdump: Fix netdumping with INVARIANTS kernels
Correct boneheaded assertion I added in r339501.  Mea culpa.

The intent is to notice when an M_WAITOK zone allocation would fail during
netdump, not to prevent all use of mbufs during netdump.

Reviewed by:	markj
X-MFC-With:	r339501
Differential Revision:	https://reviews.freebsd.org/D17957
2018-11-12 05:24:20 +00:00
Konstantin Belousov
8782eef46f Remove one-use variable.
This also removes a lot of #ifdefs and cleans up a warning when the
AUDIT kernel option is defined, but neither KDTRACE_HOOKS nor MAC are.

Reported and tested by:	danger
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-11-11 00:21:28 +00:00
Konstantin Belousov
ade85c5eec Allow absolute paths for O_BENEATH.
The path must have a tail which does not escape starting/topping
directory.  The documentation will come shortly, see the man pages
commit message for the reason of separate commit.

Reviewed by:	jilles (previous version)
Discussed with:	emaste
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D17714
2018-11-11 00:04:36 +00:00
Brooks Davis
9a38df59e9 Fix freebsd32 mknod(at).
As dev_t is now a 64-bit integer, it requires special handling as a
system call argument.  64-bit arguments are split between two 64-bit
integers due to the way arguments are promoted to allow reuse of most
system call implementations.  They must be reassembled before use.
Further, 64-bit arguments at an odd offset (counting from zero) are
padded and slid to the next slot on powerpc and mips.  Fix the
non-COMPAT11 system call by adding a freebsd32_mknodat() and
appropriately padded declerations.

The COMPAT11 system calls are fully compatible with the 64-bit
implementations so remove the freebsd32_ versions.

Use uint32_t consistently as the type of the old dev_t.  This matches
the old definition.

Reviewed by:	kib
MFC after:	3 days
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D17928
2018-11-09 21:01:16 +00:00
Brooks Davis
b34f4419fb Make freebsd32_umtx_op follow the freebsd32_foo convention.
Sponsored by:	DARPA, AFRL
2018-11-09 00:46:10 +00:00
John Baldwin
4bf4b0f139 Enable non-executable stacks by default on RISC-V.
Reviewed by:	markj
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D17878
2018-11-07 18:32:02 +00:00
Brooks Davis
5577e44bf4 Regen after r340221: allow pointer return types.
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D17873
2018-11-07 16:56:07 +00:00
Brooks Davis
e56ec0e519 makesyscalls.sh: allow pointer return types.
The previous code required that the return type be a single word.  This
allows it to be a pointer without using a typedef.

Update the return types of break, mmap, and shmat to be void * as
declared.  This only effects systrace output in-tree, but can aid in
generating system call wrappers from syscalls.master.

Reviewed by:	kib
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D17873
2018-11-07 16:55:04 +00:00
Mark Johnston
f8a222010f Avoid fixing the tty_info() buffer size in tty.h.
Different compilation units may otherwise get a different view of the
layout of struct tty depending on whether they include opt_printf.h.
This caused a blowup in the number of types defined in the kernel's
CTF file after r339468; thanks to dim@ for bisecting down to that
revision.

PR:		232675
Reported by:	dim
Reviewed by:	cem (previous version)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17877
2018-11-06 23:41:44 +00:00
Mark Johnston
07702f72e5 Avoid specifying VM_PROT_EXECUTE in mappings from pipe_map and exec_map.
These submaps are used for mapping pipe buffers and execv() argument
strings respectively, so there's no need for such mappings to have
execute permissions.

Reported by:	jhb
Reviewed by:	alc, jhb, kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17827
2018-11-06 21:57:03 +00:00
Mark Johnston
6741ea083f We need opt_stack.h after r339605.
Reviewed by:	cem
Sponsored by:	The FreeBSD Foundation
2018-11-06 21:47:22 +00:00
Brooks Davis
dd4d2f216f Update some comments made obsolete by recent commits. 2018-11-06 20:45:15 +00:00
Brooks Davis
938e8dcf60 Regen after r340199: Use declared types for caddr_t arguments.
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D17852
2018-11-06 18:47:29 +00:00
Brooks Davis
318f0d7720 Use declared types for caddr_t arguments.
Leave ptrace(2) alone for the moment as it's defined to take a caddr_t.

Reviewed by:	kib
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D17852
2018-11-06 18:46:38 +00:00
Mariusz Zaborski
f4a035b8df Regenerate after r340129.
Pointed out by:	brooks
2018-11-06 18:03:04 +00:00
Mark Johnston
f71ef9b686 Use plain atomic_{add,subtract} when that's sufficient.
CID:		1386920
MFC after:	2 weeks
2018-11-06 17:32:25 +00:00
Andrew Turner
4ea56599e8 Port the NetBSD ubsan runtime to the FreeBSD kernel.
This allows us to build the ubsan code added in r340189 into the kernel
with the KUBSAN option. This will report when undefined behaviour is
detected in the currently running kernel.

As it can be large, the kernel is 65MB on arm64, loader may not be able to
load the kernel on all architectures so is disabled by default for now.

Sponsored by:	DARPA, AFRL
2018-11-06 17:32:07 +00:00
Andrew Turner
0645126fae Import the NetBSD micro ubsan code for the kernel.
This imports revision 1.3 of common/lib/libc/misc/ubsan.c from NetBSD, the
micro-ubsan code. It is an implementation of the Undefined Behavior
Sanitizer runtime for use with recent clang and gcc.

The uubsan code will be used in a later commit to implement kubsan to help
find undefined behavior in the kernel.

Sponsored by:	DARPA, AFRL
2018-11-06 16:56:49 +00:00
Brooks Davis
44cbc1c2b7 Fix a couple indentation errors in r339958. 2018-11-06 00:09:43 +00:00
John Baldwin
4cbbb74888 Add a KPI for the delay while spinning on a spin lock.
Replace a call to DELAY(1) with a new cpu_lock_delay() KPI.  Currently
cpu_lock_delay() is defined to DELAY(1) on all platforms.  However,
platforms with a DELAY() implementation that uses spin locks should
implement a custom cpu_lock_delay() doesn't use locks.

Reviewed by:	kib
MFC after:	3 days
2018-11-05 21:34:17 +00:00
Mariusz Zaborski
82560231d3 capsicum: allow ppoll(2) in capability mode
We already allow to use poll(2). There is no reason to disallow ppoll(2).

PR:		232495
Submitted by:	Stefan Grundmann <sg2342@googlemail.com>
Reviewed by:	cem, oshogbo
MFC after:	2 weeks
2018-11-04 17:12:53 +00:00
Matt Macy
10f42d244b Convert epoch to read / write records per cpu
In discussing D17503 "Run epoch calls sooner and more reliably" with
sbahra@ we came to the conclusion that epoch is currently misusing the
ck_epoch API. It isn't safe to do a "write side" operation (ck_epoch_call
or ck_epoch_poll) in the middle of a "read side" section. Since, by definition,
it's possible to be preempted during the middle of an EPOCH_PREEMPT
epoch the GC task might call ck_epoch_poll or another thread might call
ck_epoch_call on the same section. The right solution is ultimately to change
the way that ck_epoch works for this use case. However, as a stopgap for
12 we agreed to simply have separate records for each use case.

Tested by: pho@

MFC after:	3 days
2018-11-03 03:43:32 +00:00
Brooks Davis
4e8c73eb20 Regen after r340080: Add const to input-only char * arguments.
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D17812
2018-11-02 20:56:19 +00:00
Brooks Davis
12e69f96a2 Add const to input-only char * arguments.
These arguments are mostly paths handled by NAMEI*() macros which already
take const char * arguments.

This change improves the match between syscalls.master and the public
declerations of system calls.

Reviewed by:	kib (prior version)
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D17812
2018-11-02 20:50:22 +00:00
Warner Losh
003ffd57fe Add sysctl_usec_to_sbintime and sysctl_msec_to_sbintime.
These functions are used to present a sbintime_t as either a number of
microseconds or a number of milliseconds respectively.

Sponsored by: Netflix
2018-11-02 17:50:57 +00:00
Mark Johnston
2203c46d87 Initialize the eflags field of vm_map headers.
Initializing the eflags field of the map->header entry to a value with a
unique new bit set makes a few comparisons to &map->header unnecessary.

Submitted by:	Doug Moore <dougm@rice.edu>
Reviewed by:	alc, kib
Tested by:	pho
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D14005
2018-11-02 16:26:44 +00:00
Brooks Davis
1493c2ee62 Make vop_symlink take a const target path.
This will enable callers to take const paths as part of syscall
decleration improvements.

Where doing so is easy and non-distruptive carry the const through
implementations. In UFS the value is passed to an interface that must
take non-const values. In ZFS, const poisoning would touch code shared
with upstream and it's not worth adding diffs.

Bump __FreeBSD_version for external API consumers.

Reviewed by:	kib (prior version)
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D17805
2018-11-02 14:42:36 +00:00
Conrad Meyer
78c2a9806e kern_poll: Restore explanatory comment removed in r177374
The comment isn't stale.  The check is bogus in the sense that poll(2)
does not require pollfd entries to be unique in fd space, so there is no
reason there cannot be more pollfd entries than open or even allowed
fds.  The check is mostly a seatbelt against accidental misuse or
abuse.  FD_SETSIZE, while usually unrelated to poll, is used as an
arbitrary floor for systems with very low kern.maxfilesperproc.

Additionally, document this possible EINVAL condition in the poll.2
manual.

No functional change.

Reviewed by:	markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D17671
2018-11-01 23:46:23 +00:00
Brooks Davis
f7e5ce325f Regent after r340034: Use mode_t when the documented signature does.
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D17784
2018-11-01 23:10:53 +00:00
Brooks Davis
2105ac07d7 Use mode_t when the documented signature does.
This is more clear and produces better results when generating function
stubs from syscalls.master.

Reviewed by:	kib, emaste
Obtained from:	CheribSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D17784
2018-11-01 23:06:50 +00:00
John Baldwin
b317cfd4c0 Don't enter DDB for fatal traps before panic by default.
Add a new 'debugger_on_trap' knob separate from 'debugger_on_panic'
and make the calls to kdb_trap() in MD fatal trap handlers prior to
calling panic() conditional on this new knob instead of
'debugger_on_panic'.  Disable the new knob by default.  Developers who
wish to recover from a fatal fault by adjusting saved register state
and retrying the faulting instruction can still do so by enabling the
new knob.  However, for the more common case this makes the user
experience for panics due to a fatal fault match the user experience
for other panics, e.g. 'c' in DDB will generate a crash dump and
reboot the system rather than being stuck in an infinite loop of fatal
fault messages and DDB prompts.

Reviewed by:	kib, avg
MFC after:	2 months
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D17768
2018-11-01 21:34:17 +00:00
Brooks Davis
e3e5481326 Reformat syscalls.master for better readability.
This takes advantage of two recents changes to makesyscalls.sh:
r328598: Permit a range of syscall numbers for UNIMPL
r339624: Remove the need for backslashes in syscalls.master

Syscall declerations are now split across multiple lines with the
syscall name and variables each on seperate lines (with an exception for
syscalls taking no arguments.)

Reviewed by:	imp
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D17706
2018-10-31 16:17:45 +00:00
Bjoern A. Zeeb
9afc56849a Fix mips build after r339931.
I erroneously thought that it was two 64bit platforms which use link_elf_obj.c.

PR:		228854
Reported by:	ci.f.o.
MFC after:	3 days
X-MFC with:	r339931
Pointyhat to:	bz
2018-10-30 21:35:56 +00:00
Bjoern A. Zeeb
0f823b6497 As a follow-up to r339930 and various reports implement logging in case
we fail during module load because the pcpu or vnet module sections are
full.  We did return a proper error but not leaving any indication to
the user as to what the actual problem was.

Even worse, on 12/13 currently we are seeing an unrelated error (ENOSYS
instead of ENOSPC, which gets skipped over in kern_linker.c) to be
printed which made problem diagnostics even harder.

PR:		228854
MFC after:	3 days
2018-10-30 20:51:03 +00:00
Mark Johnston
9978bd996b Add malloc_domainset(9) and _domainset variants to other allocator KPIs.
Remove malloc_domain(9) and most other _domain KPIs added in r327900.
The new functions allow the caller to specify a general NUMA domain
selection policy, rather than specifically requesting an allocation from
a specific domain.  The latter policy tends to interact poorly with
M_WAITOK, resulting in situations where a caller is blocked indefinitely
because the specified domain is depleted.  Most existing consumers of
the _domain KPIs are converted to instead use a DOMAINSET_PREF() policy,
in which we fall back to other domains to satisfy the allocation
request.

This change also defines a set of DOMAINSET_FIXED() policies, which
only permit allocations from the specified domain.

Discussed with:	gallatin, jeff
Reported and tested by:	pho (previous version)
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17418
2018-10-30 18:26:34 +00:00
Mark Johnston
920239efde Fix some problems that manifest when NUMA domain 0 is empty.
- In uma_prealloc(), we need to check for an empty domain before the
  first allocation attempt, not after.  Fix this by switching
  uma_prealloc() to use a vm_domainset iterator, which addresses the
  secondary issue of using a signed domain identifier in round-robin
  iteration.
- Don't automatically create a page daemon for domain 0.
- In domainset_empty_vm(), recompute ds_cnt and ds_order after
  excluding empty domains; otherwise we may frequently specify an empty
  domain when calling in to the page allocator, wasting CPU time.
  Convert DOMAINSET_PREF() policies for empty domains to round-robin.
- When freeing bootstrap pages, don't count them towards the per-domain
  total page counts for now: some vm_phys segments are created before
  the SRAT is parsed and are thus always identified as being in domain 0
  even when they are not.  Then, when bootstrap pages are freed, they
  are added to a domain that we had previously thought was empty.  Until
  this is corrected, we simply exclude them from the per-domain page
  count.

Reported and tested by:	Rajesh Kumar <rajfbsd@gmail.com>
Reviewed by:	gallatin
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17704
2018-10-30 17:57:40 +00:00
Eric van Gyzen
fcbb889fdb Always stop the scheduler when entering kdb
Set curthread->td_stopsched when entering kdb via any vector.
Previously, it was only set when entering via panic, so when
entering kdb another way, mutexes and such were still "live",
and an attempt to lock an already locked mutex would panic.

Reviewed by:	kib, cem
Discussed with:	jhb
Tested by:	pho
MFC after:	2 months
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D17687
2018-10-30 14:54:15 +00:00
Stephen Hurd
5201e0f110 Drain grouptaskqueue of the gtask before detaching it.
taskqgroup_detach() would remove the task even if it was running or
enqueued, which could lead to panics (see D17404). With this change,
taskqgroup_detach() drains the task and sets a new flag which prevents the
task from being scheduled again.

I've added grouptask_block() and grouptask_unblock() to allow control
over the flag from other locations as well.

Reviewed by:	Jeffrey Pieper <jeffrey.e.pieper@intel.com>
MFC after:	1 week
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D17674
2018-10-29 14:36:03 +00:00
Mark Johnston
4aed5937db Use M_WAITOK in init_hwpmc().
No functional change intended.

MFC after:	2 weeks
2018-10-27 18:48:49 +00:00
Conrad Meyer
2384981b61 poll: Unify userspace pollfd pointer name
Some of the poll code used 'fds' and some used 'ufds' to refer to the
uap->fds userspace pointer that was passed around to subroutines.  Some of
the poll code used 'fds' to refer to the kernel memory pollfd arrays, which
seemed unnecessarily confusing.

Unify on 'ufds' to refer to the userspace pollfd array.

Additionally, 'bits' is not an accurate description of the kernel pollfd
array in kern_poll, so rename that to 'kfds'.  Finally, clean up some logic
with mallocarray() and nitems().

No functional change.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D17670
2018-10-26 20:07:46 +00:00
Brooks Davis
ed34a7fcf2 Move 32-bit compat support for FIODGNAME to the right place.
ioctl(2) commands only have meaning in the context of a file descriptor
so translating them in the syscall layer is incorrect.

The new handler users an accessor to retrieve/construct a pointer from
the last member of the passed structure and relies on type punning to
access the other member which requires no translation.

Unlike r339174 this change supports both places FIODGNAME is handled.

Reviewed by:	kib
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D17475
2018-10-26 17:59:25 +00:00
Konstantin Belousov
4f77f48884 Implement O_BENEATH and AT_BENEATH.
Flags prevent open(2) and *at(2) vfs syscalls name lookup from
escaping the starting directory.  Supposedly the interface is similar
to the same proposed Linux flags.

Reviewed by:	jilles (code, previous version of manpages), 0mp (manpages)
Discussed with:	allanjude, emaste, jonathan
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D17547
2018-10-25 22:16:34 +00:00
Mark Johnston
ad054101eb Remove a dead store.
CID:		1304878
MFC after:	1 week
2018-10-25 17:36:28 +00:00
Mark Johnston
970a174f3b Add FALLTHROUGH comments to appease Coverity.
CID:		1017862-1017864, 1017866-1017868
MFC after:	2 weeks
2018-10-25 15:43:21 +00:00
Mark Johnston
a0a18fd46b Remove a redundant check.
CID:		1042100
MFC after:	2 weeks
2018-10-25 15:40:59 +00:00
Konstantin Belousov
4fceda6206 Correct condition to detect mount(2) support by a filesystem.
Reported and tested by:	cy
Sponsored by:	The FreeBSD Foundation
Approved by:	re (rgrimes)
2018-10-24 19:40:09 +00:00
Konstantin Belousov
8ff7fad1d7 Only call sigdeferstop() for NFS.
Use bypass to catch any NFS VOP dispatch and route it through the
wrapper which does sigdeferstop() and then dispatches original
VOP. NFS does not need a bypass below it, which is not supported.

The vop offset in the vop_vector is added since otherwise it is
impossible to get vop_op_t from the internal table, and I did not
wanted to create the layered fs only to wrap NFS VOPs.

VFS_OP()s wrap is straightforward.

Requested and reviewed by:	mjg (previous version)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D17658
2018-10-23 21:43:41 +00:00
Eric Joyner
46fa0c2552 Revert r339634.
That commit is causing kernel panics in em(4), so this will be reverted
until those are fixed.

Reported by:	ae@, pho@, et al
Sponsored by:	Intel Corporation
2018-10-23 17:06:36 +00:00
Eric Joyner
940f62d616 iflib: drain enqueued tasks before detaching from taskqgroup
The taskqgroup_detach function does not check if task is already enqueued when
detaching it. This may lead to kernel panic if enqueued task starts after
context state lock is destroyed. Ensure that the already enqueued admin tasks
are executed before detaching them.

The issue was discovered during validation of D16429. Unloading of if_ixlv
followed by immediate removal of VFs with iovctl -D may lead to panic on
NODEBUG kernel.

As well, check if iflib is in detach before enqueueing new admin or iov
tasks, to prevent new tasks from executing while the taskqgroup tasks
are being drained.

Submitted by:	Krzysztof Galazka <krzysztof.galazka@intel.com>
Reviewed by:	shurd@, erj@
Sponsored by:	Intel Corporation
Differential Revision:	https://reviews.freebsd.org/D17404
2018-10-23 04:37:29 +00:00
Brooks Davis
9d7051d920 Remove the need for backslashes in syscalls.master.
Join non-special lines together until we hit a line containing a '}'
character. This allows the function declaration body to be split
across multiple lines without backslash continuation characters.

Continue to join lines ending with backslashes to allow gradual
migration and to support out-of-tree syscall vectors

Reviewed by:	emaste, kib
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D17488
2018-10-22 22:13:00 +00:00
Brooks Davis
22c0c9a481 Remove __restrict qualifiers from syscalls.master.
The restruct qualifier is intended to aid code generation in the
compiler, but the only access to storage through these pointers is via
structs using copyin/copyout and the like which can not be written in C
or C++ and thus the compiler gains nothing from the qualifiers.

As such, the qualifiers add no value in current usage.

Reviewed by:	kib
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D17574
2018-10-22 21:50:43 +00:00
Mark Johnston
b61f314290 Make it possible to disable NUMA support with a tunable.
This provides a chicken switch for anyone negatively impacted by
enabling NUMA in the amd64 GENERIC kernel configuration.  With
NUMA disabled at boot-time, information about the NUMA topology
is not exposed to the rest of the kernel, and all of physical
memory is viewed as coming from a single domain.

This method still has some performance overhead relative to disabling
NUMA support at compile time.

PR:		231460
Reviewed by:	alc, gallatin, kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17439
2018-10-22 20:13:51 +00:00
Conrad Meyer
93940996fd Conditionalize kern.tty_info_kstacks feature on STACKS option
Fix tinderbox (mips XLPN32) after r339471.

Reported by:	tinderbox
X-MFC-With:	r339471
Sponsored by:	Dell EMC Isilon
2018-10-22 17:42:57 +00:00
Mark Johnston
21744c825f Don't import 0 into vmem quantum caches.
vmem uses UMA cache zones to implement the quantum cache.  Since
uma_zalloc() returns 0 (NULL) to signal an allocation failure, UMA
should not be used to cache resource 0.  Fix this by ensuring that 0 is
never cached in UMA in the first place, and by modifying vmem_alloc()
to fall back to a search of the free lists if the cache is depleted,
rather than blocking in qc_import().

Reported by and discussed with:	Brett Gutstein <bgutstein@rice.edu>
Reviewed by:	alc
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D17483
2018-10-22 16:16:42 +00:00
Warner Losh
e9b5375b04 Retire dpt(4)
Marked as gone in 12 and not relevant since the early 90s. No
sightings in nycbug's dmesg database.

Relnotes: yes
2018-10-22 02:35:12 +00:00
Warner Losh
a1db7455b7 Remove bt(4) driver
The buslogic scsi driver has been tagged as gone in 12 for some time
now. Remove it. The nycbug dmesg database shows only one sighting in 6
for this driver. It was very popular in the early days of the project,
but that popularity seems to have died by 2004 when the nycbug
database started up.

Relnotes: yes
2018-10-22 02:34:59 +00:00
Warner Losh
43b16da804 Remove adv(4) and adw(4)
Remove the advanssy drivers (both adv and adw). They were tagged as
gone in 12 a while qgo. The nycbug dmesg database shows this was last
seen in 6 and there were only a few adv sightings then (none for adw).

Relnotes: yes
2018-10-22 02:34:47 +00:00
Warner Losh
39c362e0b0 Remove aha(4) from the tree.
We tagged aha as gone in 12 a while ago. Proceed with its removal.
Data from nycbug's database shows the last sighting of this driver in
6, with the prior one in 4.x show its popularity had died prior to
4.x.

Relnotes: yes
2018-10-22 02:34:25 +00:00
Warner Losh
c4b23051af Remove stray refernce to pdq. Like the infamous twenty first of Johan
Sebastian Bach's twenty children, it hasn't been seen in many years.
2018-10-21 16:49:49 +00:00
Conrad Meyer
3937ee7557 netdump: Zone mbufs should be allocated before dump
Reviewed by:	markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D17306
2018-10-20 22:24:58 +00:00
Conrad Meyer
767bc248de ZSTDIO: Correctly initialize zstd context with provided 'level'
Prior to this revision, we allocated sufficient context space for 'level'
but never actually set the compress level parameter, so we would always get
the default '3'.

Reviewed by:	markj, vangyzen
MFC after:	12 hours
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D17144
2018-10-20 21:49:44 +00:00
Conrad Meyer
ec86f8b28b dev_refthread: Do not initialize *ref when reference was not acquired
Like the companion API devvn_refthread, leave *ref uninitialized when a
reference was not acquired.  Initializing to 1 provides a vaguely
correct-looking but bogus value for broken callers to (mistakenly) pass to
dev_relthread() when refthread fails.

Make it even more clear to consumers that dev_relthread is only valid when
dev_refthread succeeds.

Reviewed by:	kib, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D16885
2018-10-20 19:42:38 +00:00
Conrad Meyer
d7aa89c363 tty info (^T): Add optional kernel stack(9) traces
It is often useful for developers and administrators to determine a running
thread's stack for debugging purposes.  With this feature, using ^T will
print that information

For now, the feature is disabled by default.  Enable with sysctl
kern.tty_info_kstacks=1.

Discussed with:	markj
Reviewed by:	oshogbo
Relnotes:	yes
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D17621
2018-10-20 18:42:28 +00:00
Conrad Meyer
6858c2cc8f Replace ttyprintf with sbuf_printf and tty drain routine
Add string variants of cnputc and tty_putchar, and use them from the tty
sbuf drain routine.

Suggested by:	ed@
Sponsored by:	Dell EMC Isilon
2018-10-20 18:31:36 +00:00
Conrad Meyer
d158fa4ade Add flags variants to linker_files / stack(9) symbol resolution
Some best-effort consumers may find trylock behavior for stack(9) symbol
resolution acceptable.  Expose that behavior to such consumers.

This API is ugly.  If in the future the modules and linker file list locking
is cleaned up such that the linker_files list can be iterated safely without
acquiring a sleepable lock, this API should be removed.  However, most of
the time nothing will be holding the linker files lock exclusive and the
acquisition can proceed.

Reviewed by:	markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D17620
2018-10-20 18:08:43 +00:00
Mark Johnston
662e7fa8d9 Create some global domainsets and refactor NUMA registration.
Pre-defined policies are useful when integrating the domainset(9)
policy machinery into various kernel memory allocators.

The refactoring will make it easier to add NUMA support for other
architectures.

No functional change intended.

Reviewed by:	alc, gallatin, jeff, kib
Tested by:	pho (part of a larger patch)
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17416
2018-10-20 17:36:00 +00:00
Bjoern A. Zeeb
9ae7bc39c2 In r78161 the lookup_set linker method was introduced which optionally
returns the section start and stop locations as well as a count if the
caller asks for them.
There was only one out-of-file consumer of count which did not actually
use it and hence was eliminated in r339407.
In r194784 parse_dpcpu(), and in r195699 parse_vnet() (a copy of the
former) started to use the link_elf_lookup_set() interface internally
also asking for the count.

count is computed as the difference of the void **stop - void **start
locations and as such, if the absoulte numbers
	(stop - start) % sizeof(void *) != 0
a round-down happens, e.g., **stop 0x1003 - **start 0x1000 => count 0.

To get the section size instead of "count is the number of pointer
elements in the section", the parse_*() functions do a
	count *= sizeof(void *).
They use the result to allocate memory and copy the section data
into the "master" and per-instance memory regions with a size of
count.

As a result of count possibly round-down this can miss the last
bytes of the section.  The good news is that we do not touch
out of bounds memory during these operations (we may at a later stage
if the last bytes would overflow the master sections).
Given relocation in elf_relocaddr() works based on the absolute
numbers of start and stop, this means that we can possibly try to
access relocated data which was never copied and hence we get
random garbage or at best zeroed memory.

Stop the two (last) consumers of count (the parse_*() functions)
from using count as well, and calculate the section size based on
the absolute numbers of stop and start and use the proper size for
the memory allocation and data copies.  This will make the symbols
in the last bytes of the pcpu or vnet sections be presented as
expected.

PR:			232289
Approved by:		re (gjb)
MFC after:		2 weeks
2018-10-18 20:20:41 +00:00
Jamie Gritton
4520f617c9 Fix typos from r339409.
Reported by:	maxim
Approved by:	re (gjb)
2018-10-18 15:02:57 +00:00
Jonathan T. Looney
e77f0bdcb5 r334853 added a "socket destructor" callback. However, as implemented, it
was really a "socket close" callback.

Update the socket destructor functionality to run when a socket is
destroyed (rather than when it is closed). The original submitter has
confirmed that this change satisfies the intended use case.

Suggested by:	rwatson
Submitted by:	Michio Honda <micchie at sfc.wide.ad.jp>
Tested by:	Michio Honda <micchie at sfc.wide.ad.jp>
Approved by:	re (kib)
Differential Revision:	https://reviews.freebsd.org/D17590
2018-10-18 14:20:15 +00:00
Jamie Gritton
b19d66fd5a Add a new jail permission, allow.read_msgbuf. When true, jailed processes
can see the dmesg buffer (this is the current behavior).  When false (the
new default), dmesg will be unavailable to jailed users, whether root or
not.

The security.bsd.unprivileged_read_msgbuf sysctl still works as before,
controlling system-wide whether non-root users can see the buffer.

PR:		211580
Submitted by:	bz
Approved by:	re@ (kib@)
MFC after:	3 days
2018-10-17 16:11:43 +00:00
Bjoern A. Zeeb
0455a92bcb The countp argument passed to linker_file_lookup_set() in
linker_load_dependencies() is unused, so no need to ask for the
value in first place.  Remove the unused "count" variable.

Approved by:	re (kib)
2018-10-17 10:31:08 +00:00
Mark Johnston
ddab8c351a Reparent a child of pdfork(2) to its reaper when the procdesc is closed.
Unconditionally reparenting to PID 1 breaks the procctl(2) reaper
functionality.

Add a regression test for this case.

Reviewed by:	kib
Approved by:	re (gjb)
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17589
2018-10-16 20:06:56 +00:00
Gleb Smirnoff
47f1ea5109 Plug sendfile(2) on a listening socket with proper error code.
Reported by:	ngie
Reviewed by:	ngie
Approved by:	re (delphij)
2018-10-16 15:57:16 +00:00
Kyle Evans
29bf3a7ba8 Correct COMPAT* macro names in syscalls.master
Both ^/sys/compat/freebsd32/syscalls.master and ^/sys/kern/syscalls.master
cited "COMPAT[n] #ifdef" instead of "COMPAT_FREEBSD[n] #ifdef" in places.

Approved by:	re (glebius)
2018-10-15 21:35:57 +00:00
Mateusz Guzik
98fca94d22 capsicum: provide cap_rights_fde_inline
Reading caps is in the hot path (on each successful fd lookup), but
completely unnecessarily requires a function call.

Approved by:	re (gjb)
Sponsored by:	The FreeBSD Foundation
2018-10-12 23:48:10 +00:00
Mateusz Guzik
c9964045a0 Add a file missed in r339321
Approved by:	re (implicit)
Sad face: mjg
2018-10-12 00:32:45 +00:00
Mateusz Guzik
3f102f5881 Provide string functions for use before ifuncs get resolved.
The change is a no-op for architectures which don't ifunc memset,
memcpy nor memmove.

Convert places which need them. Xen bits by royger.

Reviewed by:	kib
Approved by:	re (gjb)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17487
2018-10-11 23:28:04 +00:00
Jamie Gritton
08b4333399 Fix the test prohibiting jails from sharing IP addresses.
It's not supposed to be legal for two jails to contain the same IP address,
unless both jails contain only that one address.  This is the behavior
documented in jail(8), and is there to prevent confusion when multiple
jails are listening on IADDR_ANY.

VIMAGE jails (now the default for GENERIC kernels) test this correctly,
but non-VIMAGE jails have been performing an incomplete test when nested
jails are used.

Approved by:	re@ (kib@)
MFC after:	5 days
2018-10-06 02:10:32 +00:00
Matt Macy
e8bb589d56 eliminate locking surrounding ui_vmsize and swap reserve by using atomics
Change swap_reserve and swap_total to be in units of pages so that
swap reservations can be done using only atomics instead of using a single
global mutex for swap_reserve and a single mutex for all processes running
under the same uid for uid accounting.

Results in mmap speed up and a 70% increase in brk calls / second.

Reviewed by:	alc@, markj@, kib@
Approved by:	re (delphij@)
Differential Revision:	https://reviews.freebsd.org/D16273
2018-10-05 05:50:56 +00:00
Gleb Smirnoff
ad7eb8cad5 In PR 227259, a user is reporting that they have code which is using
shutdown() to wakeup another thread blocked on a stream listen socket.
This code is failing, while it used to work on FreeBSD 10 and still
works on Linux.

It seems reasonable to add another exception to support something users are
actually doing, which used to work on FreeBSD 10, and still works on Linux.
And, it seems like it should be acceptable to POSIX, as we still return
ENOTCONN.

This patch is different to what had been committed to stable/11, since
code around listening sockets is different. Patch in D15019 is written
by jtl@, slightly modified by me.

PR:		227259
Obtained from:	jtl
Approved by:	re (kib)
Differential Revision:  D15019
2018-10-03 17:40:04 +00:00
Andrew Turner
8696dcdacf Add kernel ifunc support on arm64.
Tested with ifunc resolvers in the kernel and module with calls from
kernel to kernel, module to kernel, and module to module.

Reviewed by:	kib (previous version)
Approved by:	re (gjb)
Differential Revision:	https://reviews.freebsd.org/D17370
2018-10-01 18:51:08 +00:00
Andrew Gallatin
30c5525b3c Allow empty NUMA memory domains to support Threadripper2
The AMD Threadripper 2990WX is basically a slightly crippled Epyc.
Rather than having 4 memory controllers, one per NUMA domain, it has
only 2  memory controllers enabled. This means that only 2 of the
4 NUMA domains can be populated with physical memory, and the
others are empty.

Add support to FreeBSD for empty NUMA domains by:

- creating empty memory domains when parsing the SRAT table,
    rather than failing to parse the table
- not running the pageout deamon threads in empty domains
- adding defensive code to UMA to avoid allocating from empty domains
- adding defensive code to cpuset to avoid binding to an empty domain
    Thanks to Jeff for suggesting this strategy.

Reviewed by:	alc, markj
Approved by:	re (gjb@)
Differential Revision:	https://reviews.freebsd.org/D1683
2018-10-01 14:14:21 +00:00
John Baldwin
ff13c0a24f Regenerate after UNIMPL -> OBSOL changes in r339001.
Approved by:	re (gjb)
2018-09-28 17:25:28 +00:00
John Baldwin
46e2054905 Mark various removed system calls as OBSOL instead of UNIMPL.
This is mostly a cosmetic change except that obsolete system calls are
assigned meaningful names in the names arrays which means that using
tools like kdump or truss against binaries invoking these system calls
will print out the name instead of the number.  The script I use to
generate the XML list of syscalls for GDB also ignores UNIMPL but not
OBSOL entries.  In general UNIMPL should only be used to reserve
placeholders for system calls that have never been implemented while
system calls that existed at one time in FreeBSD but were removed
should be marked OBSOL instead.

Reviewed by:	brooks, kib, imp
Approved by:	re (gjb)
Differential Revision:	https://reviews.freebsd.org/D17344
2018-09-28 17:23:54 +00:00
Gordon Tetlow
89f652d149 Clear stack allocated data structure to prevent kernel memory leak.
Reported by:	Thomas Barabosch, Fraunhofer FKIE
Reviewed by:	wes@
Approved by:	re (implicit)
Approved by:	so
Security:	FreeBSD-EN-18:12.mem
Security:	CVE-2018-17155
2018-09-27 18:39:54 +00:00
Mateusz Guzik
9afff6b1c0 Eliminate false sharing in malloc due to statistic collection
Currently stats are collected in a MAXCPU-sized array which is not
aligned and suffers enormous false-sharing. Fix the problem by
utilizing per-cpu allocation.

The counter(9) API is not used here as it is too incomplete and does
not provide a win over per-cpu zone sized for malloc stats struct. In
particular stats are being reported for each cpu separately by just
copying what is supposed to be an array element for given cpu.

This eliminates significant false-sharing during malloc-heavy tests
e.g. on Skylake. See the review for details.

Reviewed by:	markj
Approved by:	re (kib)
Differential Revision:	https://reviews.freebsd.org/D17289
2018-09-23 19:00:06 +00:00
Mateusz Guzik
d6fda03a64 select: stop doing zero-sized memsets
Approved by:	re (kib)
2018-09-21 13:20:41 +00:00
Mark Johnston
969e147aff Ensure that imports into per-domain kmem arenas are KVA_QUANTUM-aligned.
The old code appears to assume that vmem_alloc() would import
size-aligned KVA chunks from the parent kernel_arena, but vmem doesn't
provide this guarantee.

Also remove the unused global RWX arena and add comments explaining why
we have per-domain arenas.

Reported by:	alc
Reviewed by:	alc, kib (previous version)
Approved by:	re (gjb)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17249
2018-09-20 18:29:55 +00:00
Mateusz Guzik
c396945b74 vfs: remove lookup_shared tunable
Reviewed by:	kib, jhb
Approved by:	re (gjb)
Differential Revision:	https://reviews.freebsd.org/D17253
2018-09-20 18:25:26 +00:00
Mateusz Guzik
51e13c93b6 fd: prevent inlining of _fdrop thorough kern_descrip.c
fdrop is used in several places in the file and almost never has to call
_fdrop. Thus inlining it is a pure waste of space.

Approved by:	re (kib)
2018-09-20 13:32:40 +00:00
Konstantin Belousov
bf94d6c78b Fix state of dquot-less vnodes after failed quotaoff.
UFS quotaoff iterates over all mp vnodes, and derefences and clears
the pointers to corresponding dquots. If SU work items transiently
reference some of dquots,quotaoff() would eventually fail, but all
processed vnodes are already stripped from dquots.  The state is
problematic, since quotas are left enabled, but there is no dquots
where blocks and inodes can be accounted.  The result is assertion
failures and NULL pointer dereferences.

Fix it by suspending writes around quotaoff() call.  Since the
filesystem is synced, no dandling references to dquots from SU
workitems can left behind, which means that quotaoff succeeds.

The complication there is that quotaoff VFS op is performed with the
mount point busied, while to suspend, we need to start write on the
mp.  If vn_start_write() is called on busied mp, system might deadlock
against parallel unmount request.  Handle this by unbusy-ing mp before
starting write, which in turn requires changing the quotaoff()
interface to return with the mount point not busied, same as was done
for quotaon().

Reviewed by:	mckusick
Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
Approved by:	re (gjb)
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D17208
2018-09-19 14:36:57 +00:00
Gordon Tetlow
c9e562b188 Correct ELF header parsing code to prevent invalid ELF sections from
disclosing memory.

Submitted by:	markj
Reported by:	Thomas Barabosch, Fraunhofer FKIE
Approved by:	re (implicit)
Approved by:	so
Security:	FreeBSD-SA-18:12.elf
Security:	CVE-2018-6924
Sponsored by:	The FreeBSD Foundation
2018-09-12 04:57:34 +00:00
Mark Johnston
cc4f3d0ae2 Rename hardclock_cnt() to hardclock() and remove the old implementation.
Also remove some related and unused subroutines.  They have long been
replaced by variants that handle multiple coalesced events with a single
call.

No functional change intended.

Reviewed by:	cem, kib
Approved by:	re (gjb)
Differential Revision:	https://reviews.freebsd.org/D17029
2018-09-06 02:10:59 +00:00
Mark Johnston
ce9eea6425 Correct the condition under which we allocate a terminator node.
We will have last_block < blocks if the block count is divisible
by BLIST_BMAP_RADIX, but a terminator node is still needed if the
tree isn't balanced.  In this case we were overruning the blist
array by 16 bytes during initialization.

While here, add a check for the invalid blocks == 0 case.

PR:		231116
Reviewed by:	alc, kib (previous version), Doug Moore <dougm@rice.edu>
Approved by:	re (gjb)
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17020
2018-09-05 19:05:30 +00:00
Konstantin Belousov
1565fb29a7 Add amd64 mdthread fields needed for the upcoming EFI RT exception
handling.

This is split into a separate commit from the main change to make it
easier to handle possible revert after upcoming KBI freeze.

Reviewed by:	kevans
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:    re (rgrimes)
Differential revision:	https://reviews.freebsd.org/D16972
2018-09-02 21:16:43 +00:00
Konstantin Belousov
420382368a Improve error messages from clock_if.m method failures.
Print error message in verbose mode when CLOCK_SETTIME() clock_if.m
method failed.  For EFIRT RTC clock, add error code for the failure of
CLOCK_GETTIME() report.

Reviewed by:	kevans
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:    re (rgrimes)
Differential revision:	https://reviews.freebsd.org/D16972
2018-09-02 20:17:51 +00:00
Mark Murray
19fa89e938 Remove the Yarrow PRNG algorithm option in accordance with due notice
given in random(4).

This includes updating of the relevant man pages, and no-longer-used
harvesting parameters.

Ensure that the pseudo-unit-test still does something useful, now also
with the "other" algorithm instead of Yarrow.

PR:		230870
Reviewed by:	cem
Approved by:	so(delphij,gtetlow)
Approved by:	re(marius)
Differential Revision:	https://reviews.freebsd.org/D16898
2018-08-26 12:51:46 +00:00
Alan Cox
49bfa624ac Eliminate the arena parameter to kmem_free(). Implicitly this corrects an
error in the function hypercall_memfree(), where the wrong arena was being
passed to kmem_free().

Introduce a per-page flag, VPO_KMEM_EXEC, to mark physical pages that are
mapped in kmem with execute permissions.  Use this flag to determine which
arena the kmem virtual addresses are returned to.

Eliminate UMA_SLAB_KRWX.  The introduction of VPO_KMEM_EXEC makes it
redundant.

Update the nearby comment for UMA_SLAB_KERNEL.

Reviewed by:	kib, markj
Discussed with:	jeff
Approved by:	re (marius)
Differential Revision:	https://reviews.freebsd.org/D16845
2018-08-25 19:38:08 +00:00
Warner Losh
d36967bd2b Add a new device flag: DF_ATTACHED_ONCE
This flag is set once the device has been successfully attached. When
set, it inhibits devmatch from trying to match the device. This in
turn allows kldunload to work as expected. Prior to the change, the
driver would immediately reload because devmatch had no notion that
the driver had once been attached, and therefore shouldn't participate
in further matching.

Differential Revision: https://reviews.freebsd.org/D16735
2018-08-23 05:06:16 +00:00
Warner Losh
5fa2979791 Create devctl freeze/thaw.
This adds it to devctl, libdevctl, defines the two IOCTLs and
implements the kernel bits. causes any new drivers that are added via
kldload to be deferred until a 'thaw' comes in. These do not stack: it
is an error to freeze while frozen, or thaw while thawed.

Differential Revision: https://reviews.freebsd.org/D16735
2018-08-23 05:05:47 +00:00
Conrad Meyer
b6c7d9c345 devstat(9): Constify function parameters that can be const
No functional change.

When attempting to document the changed argument types in devstat.9, I
discovered the 20 year old manual page severely mismatched reality even
prior to my simple change.  So I took a first cut pass cleaning that up to
match reality.  I'm sure I've missed some things; the goal was just to leave
it better than when I started.

Sponsored by:	Dell EMC Isilon
2018-08-23 01:42:45 +00:00
Conrad Meyer
4ca8c1efe4 KASSERT: Make runtime optionality optional
Add an option, KASSERT_PANIC_OPTIONAL, that allows runtime KASSERT()
behavior changes.  When this option is not enabled, code that allows
KASSERTs to become optional is not enabled, and all violated assertions
cause termination.

The runtime KASSERT behavior was added in r243980.

One important distinction here is that panic has __dead2
("attribute((noreturn))"), while kassert_panic does not.  Static analyzers
like Coverity understand __dead2.  Without it, KASSERTs go misunderstood,
resulting in many false positives that result from violation of program
invariants.

Reviewed by:	jhb, jtl, np, vangyzen
Relnotes:	yes
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D16835
2018-08-22 22:19:42 +00:00
Mark Johnston
36716fe2e6 Prepare the kernel linker to handle PC-relative ifunc relocations.
The boot-time ifunc resolver assumes that it only needs to apply
IRELATIVE relocations to PLT entries.  With an upcoming optimization,
this assumption no longer holds, so add the support required to handle
PC-relative relocations targeting GNU_IFUNC symbols.
- Provide a custom symbol lookup routine that can be used in early boot.
  The default lookup routine uses kobj, which is not functional at that
  point.
- Apply all existing relocations during boot rather than filtering
  IRELATIVE relocations.
- Ensure that we continue to apply ifunc relocations in a second pass
  when loading a kernel module.

Reviewed by:	kib
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D16749
2018-08-22 20:44:30 +00:00
Michael Tuexen
6b01d4d433 Add SOL_SOCKET level socket option with name SO_DOMAIN to get
the domain of a socket.

This is helpful when testing and Solaris and Linux have the same
socket option using the same name.

Reviewed by:		bcr@, rrs@
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D16791
2018-08-21 14:04:30 +00:00
Alan Cox
44d0efb215 Eliminate kmem_alloc_contig()'s unused arena parameter.
Reviewed by:	hselasky, kib, markj
Discussed with:	jeff
Differential Revision:	https://reviews.freebsd.org/D16799
2018-08-20 15:57:27 +00:00
Kyle Evans
d529de874b res_find: Fix fallback logic
The fallback logic was broken if hints were found in multiple environments.
If we found a hint in either the loader environment or the static
environment, fallback would be incremented excessively when we returned to
the environment-selection bits. These checks should have also been guarded
by the fbacklvl checks. As a result, fbacklvl could quickly get to a point
where we skip either the static environment and/or the static hints
depending on which environments contained valid hints.

The impact of this bug is minimal, mostly affecting mips boards that use
static hints and may have hints in either the loader environment or the
static environment.

There may be better ways to express the searchable environments and
describing their characteristics (immutable, already searched, etc.) but
this may be revisited after 12 branches.

Reported by:	Dan Nelson <dnelson_1901@yahoo.com>
Triaged by:	Dan Nelson <dnelson_1901@yahoo.com>
MFC after:	3 days
2018-08-18 19:45:56 +00:00
Xin LI
ed1fa01ac4 Regen after r337998. 2018-08-18 06:33:51 +00:00
Xin LI
0362ec1e8e getrandom(2) should not be restricted in capability mode. 2018-08-18 06:31:49 +00:00
Mark Johnston
1436ff1ebb Typo.
X-MFC with:	r337974
2018-08-17 16:07:06 +00:00
Mark Johnston
3ccbdc8254 Add INVARIANTS-only fences around lockless vnode refcount updates.
Some internal KASSERTs access the v_iflag field without the vnode
interlock held after such a refcount update.  The fences are needed for
the assertions to be correct in the face of store reordering.

Reported and tested by:	jhibbits
Reviewed by:	kib, mjg
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D16756
2018-08-17 15:41:01 +00:00
Mariusz Zaborski
2fe6aefff8 capsicum: allow the setproctitle(3) function in capability mode
Capsicum in past allowed to change the process title.
This was broken with r335939.

PR:		230584
Submitted by:	Yuichiro NAITO <naito.yuichiro@gmail.com>
Reported by:	ian@niw.com.au
MFC after:	1 week
2018-08-17 14:35:10 +00:00
Kyle Evans
45625675e7 subr_prf: Don't write kern.boot_tag if it's empty
This change allows one to set kern.boot_tag="" and not get a blank line
preceding other boot messages. While this isn't super critical- blank lines
are easy to filter out both mentally and in processing dmesg later- it
allows for a mode of operation that matches previous behavior.

I intend to MFC this whole series to stable/11 by the end of the month with
boot_tag empty by default to make this effectively a nop in the stable
branch.
2018-08-17 03:42:57 +00:00
Jamie Gritton
c542c43ef1 Revert r337922, except for some documention-only bits. This needs to wait
until user is changed to stop using jail(2).

Differential Revision:	D14791
2018-08-16 19:09:43 +00:00
Jamie Gritton
284001a222 Put jail(2) under COMPAT_FREEBSD11. It has been the "old" way of creating
jails since FreeBSD 7.

Along with the system call, put the various security.jail.allow_foo and
security.jail.foo_allowed sysctls partly under COMPAT_FREEBSD11 (or
BURN_BRIDGES).  These sysctls had two disparate uses: on the system side,
they were global permissions for jails created via jail(2) which lacked
fine-grained permission controls; inside a jail, they're read-only
descriptions of what the current jail is allowed to do.  The first use
is obsolete along with jail(2), but keep them for the second-read-only use.

Differential Revision:	D14791
2018-08-16 18:40:16 +00:00
Edward Tomasz Napierala
e77b6cfe34 In the help message at the mountroot prompt, suggest something that
actually works and matches the bsdinstall(8) default.

MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
2018-08-15 12:12:21 +00:00
Alan Cox
c65ed2ff53 Eliminate a redundant assignment.
MFC after:	1 week
2018-08-11 19:21:53 +00:00
Kyle Evans
0915d9d070 subr_prf: remove think-o that had returned to local patch
Reported by:	cognet
2018-08-10 15:35:02 +00:00
Kyle Evans
170bc29131 boot tagging: minor fixes
msgbufinit may be called multiple times as we initialize the msgbuf into a
progressively larger buffer. This doesn't happen as of now on head, but it
may happen in the future and we generally support this. As such, only print
the boot tag if we've just initialized the buffer for the first time.

The boot tag also now has a newline appended to it for better visibility,
and has been switched to a normal printf, by requesto f bde, after we've
denoted that the msgbuf is mapped.
2018-08-10 15:29:06 +00:00
Kyle Evans
240fcda1e8 subr_prf: style(9) the sizeof
Reported by:	jkim, ian
2018-08-09 19:09:06 +00:00
Kyle Evans
4c793b68da subr_prf: Use "sizeof current_boot_tag" instead 2018-08-09 17:53:18 +00:00
Kyle Evans
2a4650cc11 BOOT_TAG: Make a config(5) option, expose as sysctl and loader tunable
BOOT_TAG lived shortly in sys/msgbuf.h, but this wasn't necessarily great
for changing it or removing it. Move it into subr_prf.c and add options for
it to opt_printf.h.

One can specify both the BOOT_TAG and BOOT_TAG_SZ (really, size of the
buffer that holds the BOOT_TAG). We expose it as kern.boot_tag and also add
a loader tunable by the same name that we'll fetch upon initialization of
the msgbuf.

This allows for flexibility and also ensures that there's a consistent way
to figure out the boot tag of the running kernel, rather than relying on
headers to be in-sync.

Prodded super-super-lightly by:	imp
2018-08-09 17:47:47 +00:00
Kyle Evans
21aa6e8345 msgbuf: Light detailing (const'ify and bool'itize) 2018-08-09 17:42:27 +00:00
Leandro Lupori
c8e2123b6a [ppc] Fix kernel panic when using BOOTP_NFSROOT
On PowerPC (and possibly other architectures), that doesn't use
EARLY_AP_STARTUP, the config task queue may be used initialized.
This was observed while trying to mount the root fs from NFS, as
reported here: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=230168.

This patch has 2 main changes:
1- Perform a basic initialization of qgroup_config, similar to
what is done in taskqgroup_adjust, but simpler.
This makes qgroup_config ready to be used during NFS root mount.

2- When EARLY_AP_STARTUP is not used, call inm_init() and
in6m_init() right before SI_SUB_ROOT_CONF, because bootp needs
to send multicast packages to request an IP.

PR:		Bug 230168
Reported by:	sbruno
Reviewed by:	jhibbits, mmacy, sbruno
Approved by:	jhibbits
Differential Revision:	D16633
2018-08-09 14:04:51 +00:00
Matt Macy
9fec45d8e5 epoch_block_wait: don't check TD_RUNNING
struct epoch_thread is not type safe (stack allocated) and thus cannot be dereferenced from another CPU

Reported by: novel@
2018-08-09 05:18:27 +00:00
Kyle Evans
2834d61202 kern: Add a BOOT_TAG marker at the beginning of boot dmesg
From the "newly licensed to drive" PR department, add a BOOT_TAG marker (by
default, --<<BOOT>>--, to the beginning of each boot's dmesg. This makes it
easier to do textproc magic to locate the start of each boot and, of
particular interest to some, the dmesg of the current boot.

The PR has a dmesg(8) component as well that I've opted not to include for
the moment- it was the more contentious part of this PR.

bde@ also made the statement that this boot tag should be written with an
ordinary printf, which I've- for the moment- declined to change about this
patch to keep it more transparent to observer of the boot process.

PR:		43434
Submitted by:	dak <aurelien.nephtali@wanadoo.fr> (basically rewritten)
MFC after:	maybe never
2018-08-09 01:32:09 +00:00
Konstantin Belousov
8f94195022 Followup to r337430: only call elf_reloc_ifunc on x86.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-08-07 20:43:50 +00:00
Konstantin Belousov
289ead7cb0 Add missed handling of local relocs against ifunc target in the obj modules.
Reported and tested by:	wulf
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-08-07 18:26:46 +00:00
Mark Johnston
c7902fbeae Improve handling of control message truncation.
If a recvmsg(2) or recvmmsg(2) caller doesn't provide sufficient space
for all control messages, the kernel sets MSG_CTRUNC in the message
flags to indicate truncation of the control messages.  In the case
of SCM_RIGHTS messages, however, we were failing to dispose of the
rights that had already been externalized into the recipient's file
descriptor table.  Add a new function and mbuf type to handle this
cleanup task, and use it any time we fail to copy control messages
out to the recipient.  To simplify cleanup, control message truncation
is now only performed at control message boundaries.

The change also fixes a few related bugs:
- Rights could be leaked to the recipient process if an error occurred
  while copying out a message's contents.
- We failed to set MSG_CTRUNC if the truncation occurred on a control
  message boundary, e.g., if the caller received two control messages
  and provided only the exact amount of buffer space needed for the
  first.

PR:		131876
Reviewed by:	ed (previous version)
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D16561
2018-08-07 16:36:48 +00:00
Konstantin Belousov
a70e9a1388 Swap in WKILLED processes.
Swapped-out process that is WKILLED must be swapped in as soon as
possible.  The reason is that such process can be killed by OOM and
its pages can be only freed if the process exits.  To exit, the kernel
stack of the process must be mapped.

When allocating pages for the stack of the WKILLED process on swap in,
use VM_ALLOC_SYSTEM requests to increase the chance of the allocation
to succeed.

Add counter of the swapped out processes to avoid unneeded iteration
over the allprocs list when there is no work to do, reducing the
allproc_lock ownership.

Reviewed by:	alc, markj (previous version)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D16489
2018-08-04 20:45:43 +00:00
Mark Johnston
5b0480f2cc Don't check rcv sockbuf limits when sending on a unix stream socket.
sosend_generic() performs an initial comparison of the amount of data
(including control messages) to be transmitted with the send buffer
size. When transmitting on a unix socket, we then compare the amount
of data being sent with the amount of space in the receive buffer size;
if insufficient space is available, sbappendcontrol() returns an error
and the data is lost.  This is easily triggered by sending control
messages together with an amount of data roughly equal to the send
buffer size, since the control message size may change in uipc_send()
as file descriptors are internalized.

Fix the problem by removing the space check in sbappendcontrol(),
whose only consumer is the unix sockets code.  The stream sockets code
uses the SB_STOP mechanism to ensure that senders will block if the
receive buffer fills up.

PR:		181741
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D16515
2018-08-04 20:26:54 +00:00
Mark Johnston
e62ca80bde Style. 2018-08-04 20:16:36 +00:00
Andriy Gapon
e0fa977ea5 safer wait-free iteration of shared interrupt handlers
The code that iterates a list of interrupt handlers for a (shared)
interrupt, whether in the ISR context or in the context of an interrupt
thread, does so in a lock-free fashion.   Thus, the routines that modify
the list need to take special steps to ensure that the iterating code
has a consistent view of the list.  Previously, those routines tried to
play nice only with the code running in the ithread context.  The
iteration in the ISR context was left to a chance.

After commit r336635 atomic operations and memory fences are used to
ensure that ie_handlers list is always safe to navigate with respect to
inserting and removal of list elements.

There is still a question of when it is safe to actually free a removed
element.

The idea of this change is somewhat similar to the idea of the epoch
based reclamation.  There are some simplifications comparing to the
general epoch based reclamation.  All writers are serialized using a
mutex, so we do not need to worry about concurrent modifications.  Also,
all read accesses from the open context are serialized too.

So, we can get away just two epochs / phases.  When a thread removes an
element it switches the global phase from the current phase to the other
and then drains the previous phase.  Only after the draining the removed
element gets actually freed. The code that iterates the list in the ISR
context takes a snapshot of the global phase and then increments the use
count of that phase before iterating the list.  The use count (in the
same phase) is decremented after the iteration.  This should ensure that
there should be no iteration over the removed element when its gets
freed.

This commit also simplifies the coordination with the interrupt thread
context.  Now we always schedule the interrupt thread when removing one
of handlers for its interrupt.  This makes the code both simpler and
safer as the interrupt thread masks the interrupt thus ensuring that
there is no interaction with the ISR context.

P.S.  This change matters only for shared interrupts and I realize that
those are becoming a thing of the past (and quickly).  I also understand
that the problem that I am trying to solve is extremely rare.

PR:		229106
Reviewed by:	cem
Discussed with:	Samy Al Bahra
MFC after:	5 weeks
Differential Revision: https://reviews.freebsd.org/D15905
2018-08-03 14:27:28 +00:00
Alan Somers
da4465506d Fix LOCAL_PEERCRED with socketpair(2)
Enable the LOCAL_PEERCRED socket option for unix domain stream sockets
created with socketpair(2). Previously, it only worked with unix domain
stream sockets created with socket(2)/listen(2)/connect(2)/accept(2).

PR:		176419
Reported by:	Nicholas Wilson <nicholas@nicholaswilson.me.uk>
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D16350
2018-08-03 01:37:00 +00:00
Andriy Gapon
a260971458 fix a typo resulting in a wrong variable in kern_syscall_deregister
The difference is between sysent, a global, and sysents, a function
parameter.
2018-08-02 09:41:55 +00:00
Mark Johnston
4765717321 Remove a redundant check.
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
2018-07-30 17:58:41 +00:00
Alan Somers
6040822c4e Make timespecadd(3) and friends public
The timespecadd(3) family of macros were imported from NetBSD back in
r35029. However, they were initially guarded by #ifdef _KERNEL. In the
meantime, we have grown at least 28 syscalls that use timespecs in some
way, leading many programs both inside and outside of the base system to
redefine those macros. It's better just to make the definitions public.

Our kernel currently defines two-argument versions of timespecadd and
timespecsub.  NetBSD, OpenBSD, and FreeDesktop.org's libbsd, however, define
three-argument versions.  Solaris also defines a three-argument version, but
only in its kernel.  This revision changes our definition to match the
common three-argument version.

Bump _FreeBSD_version due to the breaking KPI change.

Discussed with:	cem, jilles, ian, bde
Differential Revision:	https://reviews.freebsd.org/D14725
2018-07-30 15:46:40 +00:00
Andrew Turner
cd2106eaea Ensure the DPCPU and VNET module spaces are aligned to hold a pointer.
Previously they may have been aligned to a char, leading to misaligned
DPCPU and VNET variables.

Sponsored by:	DARPA, AFRL
2018-07-30 14:25:17 +00:00
David E. O'Brien
455d358977 Correct copyright dates. 2018-07-30 07:01:00 +00:00
Antoine Brodin
ccd6ac9f6e Add allow.mlock to jail parameters
It allows locking or unlocking physical pages in memory within a jail

This allows running elasticsearch with "bootstrap.memory_lock" inside a jail

Reviewed by:	jamie@
Differential Revision:	https://reviews.freebsd.org/D16342
2018-07-29 12:41:56 +00:00
Don Lewis
290d906084 Fix the long term ULE load balancer so that it actually works. The
initial call to sched_balance() during startup is meant to initialize
balance_ticks, but does not actually do that since smp_started is
still zero at that time.  Since balance_ticks does not get set,
there are no further calls to sched_balance().  Fix this by setting
balance_ticks in sched_initticks() since we know the value of
balance_interval at that time, and eliminate the useless startup
call to sched_balance().  We don't need to randomize the intial
value of balance_ticks.

Since there is now only one call to sched_balance(), we can hoist
the tests at the top of this function out to the caller and avoid
the overhead of the function call when running a SMP kernel on UP
hardware.

PR:		223914
Reviewed by:	kib
MFC after:	2 weeks
2018-07-29 00:30:06 +00:00
David Bright
95c05062ec Allow a EVFILT_TIMER kevent to be updated.
If a timer is updated (re-added) with a different time period
(specified in the .data field of the kevent), the new time period has
no effect; the timer will not expire until the original time has
elapsed. This violates the documented behavior as the kqueue(2) man
page says (in part) "Re-adding an existing event will modify the
parameters of the original event, and not result in a duplicate
entry."

This modification, adapted from a patch submitted by cem@ to PR214987,
fixes the kqueue system to allow updating a timer entry. The
kevent timer behavior is changed to:

  * When a timer is re-added, update the timer parameters to and
    re-start the timer using the new parameters.
  * Allow updating both active and already expired timers.
  * When the timer has already expired, dequeue any undelivered events
    and clear the count of expirations.

All of these changes address the original PR and also bring the
FreeBSD and macOS kevent timer behaviors into agreement.

A few other changes were made along the way:

  * Update the kqueue(2) man page to reflect the new timer behavior.
  * Fix man page style issues in kqueue(2) diagnosed by igor.
  * Update the timer libkqueue system test to test for the updated
    timer behavior.
  * Fix the (test) libkqueue common.h file so that it includes
    config.h which defines various HAVE_* feature defines, before the
    #if tests for such variables in common.h. This enables the use of
    the actual err(3) family of functions.
  * Fix the usages of the err(3) functions in the tests for incorrect
    type of variables. Those were formerly undiagnosed due to the
    disablement of the err(3) functions (see previous bullet point).

PR:		214987
Reported by:	Brian Wellington <bwelling@xbill.org>
Reviewed by:	kib
MFC after:	1 week
Relnotes:	yes
Sponsored by:	Dell EMC
Differential Revision:	https://reviews.freebsd.org/D15778
2018-07-27 13:49:17 +00:00
Andriy Gapon
111b043cdf change interrupt event's list of handlers from TAILQ to CK_SLIST
The primary reason for this commit is to separate mechanical and nearly
mechanical code changes from an upcoming fix for unsafe teardown of
shared interrupt handlers that have only filters (see D15905).

The technical rationale is that SLIST is sufficient.  The only operation
that gets worse performance -- O(n) instead of O(1) is a removal of a
handler,  but it is not a critical operation and the list is expected to
be rather short.

Additionally, it is easier to reason about SLIST when considering the
concurrent lock-free access to the list from the interrupt context and
the interrupt thread.

CK_SLIST is used because the upcoming change depends on the memory order
provided by CK_SLIST insert and the fact that CL_SLIST remove does not
trash the linkage in a removed element.

While here, I also fixed a couple of whitespace issues, made code under
ifdef notyet compilable, added a lock assertion to ithread_update() and
made intr_event_execute_handlers() static as it had no external callers.

Reviewed by:	cem (earlier version)
MFC after:	4 weeks
Differential Revision: https://reviews.freebsd.org/D16016
2018-07-23 12:51:23 +00:00
Emmanuel Vadot
c54fe25dcb Raise the size of L3 table for early devmap on arm64
Some driver (like efifb) needs to map more than the current L2_SIZE
Raise the size so we can map the framebuffer setup by the bootloader.

Reviewed by:	cognet
2018-07-19 21:58:06 +00:00
Mark Johnston
bf923a556d Delete an XXX comment addressed by r336505.
X-MFC with:	r336505
Sponsored by:	The FreeBSD Foundation
2018-07-19 20:11:08 +00:00
Mark Johnston
483f692ea6 Have preload_delete_name() free pages backing preloaded data.
On i386 and amd64, add a vm_phys segment for physical memory used to
store the kernel binary and other preloaded data.  This makes it
possible to free such memory back to the system once it is no longer
needed, e.g., when a preloaded kernel module is unloaded.  Previously,
it would have remained unused.

Reviewed by:	kib, royger
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D16330
2018-07-19 20:00:28 +00:00
Mark Johnston
73624a804a Provide the full module path to preload_delete_name().
The basename will never match against the preload metadata, so these
calls previously had no effect.

Reviewed by:	kib, royger
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D16330
2018-07-19 19:50:42 +00:00
Konstantin Belousov
53e20b2702 When reporting an error, print the errno value.
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2018-07-19 19:03:18 +00:00
Emmanuel Vadot
326867616f kern_cpu: When adding abs frequency allow for unordered insertion
Keep the list ordered as some code assume that it is but allow for
unordered cf_settings sets.
2018-07-19 11:28:14 +00:00
Mark Johnston
9295517ac9 Add a FALLTHROUGH comment to kvprintf().
Submitted by:	Sebastian Huber <sebastian.huber@embedded-brains.de>
MFC after:	3 days
2018-07-17 14:56:54 +00:00
Mariusz Zaborski
f1fe1e020f Extend amount of possible coredumps from 10 to 100000 when using index format.
The amount of digits in the name of corefile is assigned dynamically.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D16118
2018-07-15 17:10:12 +00:00
Mateusz Guzik
95ab076d6e lockmgr: tidy up slock/sunlock similar to other locks 2018-07-13 22:40:14 +00:00
Warner Losh
25bc561e68 There's two files in the sys tree named inflate.c, in addition
to it being a common name elsewhere. Rename the old kzip one
to subr_inflate.c.

This actually fixes the build issues on sparc64 that my inclusion of
.PATH ${SYSDIR}/kern created in r336244, so also revert the broken
workaround I committed in r336249.

This slipped passed me because apparently, I never did a clean build.
2018-07-13 17:41:28 +00:00
Warner Losh
52379d36a9 Create helper functions for parsing boot args.
boot_parse_arg		to parse a single arg
boot_parse_cmdline	to parse a command line string
boot_parse_args		to parse all the args in a vector
boot_howto_to_env	Convert howto bits to env vars
boot_env_to_howto	Return howto mask mased on what's set in the environment.

All these routines return an int that's the bitmask of the args
translated to RB_* flags. As a special case, the 'S' flag sets the
comconsole_speed env var. Any arg that looks like a=b will set the env
key 'a' to value 'b'. If =b is omitted, 'a' is set to '1'.  This
should help us reduce the number of redundant copies of these routines
in the tree.  It should also give a more uniform experience between
platforms.

Also, invent a new flag RB_PROBE that's set when 'P' is parsed.  On
x86 + BIOS, this means 'probe for the keyboard, and if it's not there
set both RB_MULTIPLE and RB_SERIAL (which means show the output on
both video and serial consoles, but make serial primary).  Others it
may be some similar concept of probing, but it's loader dependent
what, exactly, it means.

These routines are suitable for /boot/loader and/or the kernel,
though they may not be suitable for the tightly hand-rolled-for-space
environments like boot2.

Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D16205
2018-07-13 16:43:05 +00:00
Brooks Davis
d92da75941 Round down the location of execpathp to slightly improve copyout speed.
In practice, this moves the padding from below the canary to above
execpathp has no impact on stack consumption.

Submitted by:		Wuyang-Chung (via github pull request #159)
MFC after:	1 week
2018-07-13 11:32:27 +00:00
Mateusz Guzik
bcbc8d35eb fd: stop passing M_ZERO to uma_zalloc
The optimisation seen with malloc cannot be used here as zone sizes are
now known at compilation. Thus bzero by hand to get the optimisation
instead.
2018-07-12 22:48:18 +00:00
Kyle Evans
44314c3509 kern_environment: Give the static environment a chance to disable MD env
This variable has been given the name "loader_env.disabled" as it's the
primary way most people will have an MD environment. This restores the
previously-default behavior of ignoring the loader(8) environment, which may
be useful for vendor distributions or other scenarios where inheriting the
loader environment may be considered a security issue or potentially
breaking of a more locked-down environment.

As the change to config(5) indicates, disabling the loader environment
should not be a choice made lightly since it may provide ACPI hints and
other useful things that the system can rely on to boot.

An UPDATING entry has been added to mention an upgrade path for those that
may have relied on the previous behavior.

Discussed with:	bde
Relnotes:	yes (maybe)
2018-07-12 02:51:50 +00:00
Alan Somers
8a894c1aa1 Don't acquire evclass_lock with a spinlock held
When the "pc" audit class is enabled and auditd is running, witness will
panic during thread exit because au_event_class tries to lock an rwlock
while holding a spinlock acquired upstack by thread_exit.

To fix this, move AUDIT_SYSCALL_EXIT futher upstack, before the spinlock is
acquired. Of thread_exit's 16 callers, it's only necessary to call
AUDIT_SYSCALL_EXIT from two, exit1 (for exiting processes) and kern_thr_exit
(for exiting threads). The other callers are all kernel threads, which
needen't call AUDIT_SYSCALL_EXIT because since they can't make syscalls
there will be nothing to audit.  And exit1 already does call
AUDIT_SYSCALL_EXIT, making the second call in thread_exit redundant for that
case.

PR:		228444
Reported by:	aniketp
Reviewed by:	aniketp, kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D16210
2018-07-11 19:38:42 +00:00
Brooks Davis
942ae5c8b8 Regen after r336171. 2018-07-10 14:04:52 +00:00
Brooks Davis
7cc923f8a8 Get rid of netbsd_lchown and netbsd_msync syscall entries.
No valid FreeBSD binary very called them (they would call lchown and
msync directly) and we haven't supported NetBSD binaries in ages.

This is a respin of r335983 with a workaround for the ancient BFD linker
in the libc stubs.

Reviewed by:	kib
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D16193
2018-07-10 13:32:04 +00:00
Brooks Davis
3a20f06a1c Use uintptr_t alone when assigning to kvaddr_t variables.
Suggested by:	jhb
2018-07-10 13:03:06 +00:00
Kyle Evans
c7a82b9c6c kern_environment: bool'itize dynamic_kenv; fix small style(9) nit 2018-07-10 02:43:22 +00:00
Kyle Evans
0f46005e4b subr_hints: Skip static_env and static_hints if they don't contain hints
This is possible because, well, they're static. Both the dynamic environment
and the MD-environment (generally loader(8) environment) can potentially
have room for new variables to be set, and thus do not receive this
treatment.
2018-07-10 00:36:37 +00:00
Kyle Evans
dc4446df5f subr_hints: Convert some bool-like ints to bools 2018-07-10 00:34:19 +00:00
Kyle Evans
5768da6c21 subr_hints: Use goto/label instead of series of conditionals 2018-07-10 00:33:31 +00:00
Mark Johnston
013072f04c Fix pre-SI_SUB_CPU initialization of per-CPU counters.
r336020 introduced pcpu_page_alloc(), replacing page_alloc() as the
backend allocator for PCPU UMA zones.  Unlike page_alloc(), it does
not honour malloc(9) flags such as M_ZERO or M_NODUMP, so fix that.

r336020 also changed counter(9) to initialize each counter using a
CPU_FOREACH() loop instead of an SMP rendezvous.  Before SI_SUB_CPU,
smp_rendezvous() will only execute the callback on the current CPU
(i.e., CPU 0), so only one counter gets zeroed.  The rest are zeroed
by virtue of the fact that UMA gratuitously zeroes slabs when importing
them into a zone.

Prior to SI_SUB_CPU, all_cpus is clear, so with r336020 we weren't
zeroing vm_cnt counters during boot: the CPU_FOREACH() loop had no
effect, and pcpu_page_alloc() didn't honour M_ZERO.  Fix this by
iterating over the full range of CPU IDs when zeroing counters,
ignoring whether the corresponding bits in all_cpus are set.

Reported and tested by:	pho (previous version)
Reviewed by:		kib (previous version)
Differential Revision:	https://reviews.freebsd.org/D16190
2018-07-10 00:18:12 +00:00
Jamie Gritton
0a1724045e Change prison_add_vfs() to the more generic prison_add_allow(), which
can add any dynamic allow.* or allow.*.* parameter.  Also keep
prison_add_vfs() as a wrapper.

Differential Revision:	D16146
2018-07-06 18:50:22 +00:00
Kyle Evans
cae22dd904 kern_environment: Fix SYSINIT ordering
The dynamic environment was being initialized at SI_SUB_KMEM, SI_ORDER_ANY.
I added the hint-merging at SI_SUB_KMEM, SI_ORDER_ANY as well in r335998 -
this can only work by coincidence.

Re-do both to operate at SI_SUB_KMEM + 1, SI_ORDER_FIRST and SI_ORDER_SECOND
respectively to be safe. It's sufficiently obfuscated away as to when in
SU_SUB_KMEM malloc will be available, and the dynamic environment cannot be
relied upon there anyways since it's initialized at SI_ORDER_ANY.

Reported by:	bde
Discussed with:	bde
X-MFC-With: r335998
2018-07-06 16:51:35 +00:00
Brooks Davis
7524b4c14b Correct breakage on 32-bit platforms from r335979. 2018-07-06 10:03:33 +00:00
Matt Macy
822e50e3f6 epoch(9): simplify initialization
replace manual NUMA aware allocation with a pcpu zone
2018-07-06 06:20:03 +00:00
Matt Macy
ab3059a8e7 Back pcpu zone with domain correct pages
- Change pcpu zone consumers to use a stride size of PAGE_SIZE.
  (defined as UMA_PCPU_ALLOC_SIZE to make future identification easier)

- Allocate page from the correct domain for a given cpu.

- Don't initialize pc_domain to non-zero value if NUMA is not defined
  There are some misconceptions surrounding this field. It is the
  _VM_ NUMA domain and should only ever correspond to valid domain
  values as understood by the VM.

The former slab size of sizeof(struct pcpu) was somewhat arbitrary.
The new value is PAGE_SIZE because that's the smallest granularity
which the VM can allocate a slab for a given domain. If you have
fewer than PAGE_SIZE/8 counters on your system there will be some
memory wasted, but this is obviously something where you want the
cache line to be coming from the correct domain.

Reviewed by: jeff
Sponsored by: Limelight Networks
Differential Revision:  https://reviews.freebsd.org/D15933
2018-07-06 02:06:03 +00:00
Andrew Turner
2bf9501287 Create a new macro for static DPCPU data.
On arm64 (and possible other architectures) we are unable to use static
DPCPU data in kernel modules. This is because the compiler will generate
PC-relative accesses, however the runtime-linker expects to be able to
relocate these.

In preparation to fix this create two macros depending on if the data is
global or static.

Reviewed by:	bz, emaste, markj
Sponsored by:	ABT Systems Ltd
Differential Revision:	https://reviews.freebsd.org/D16140
2018-07-05 17:13:37 +00:00
Bjoern A. Zeeb
1534cd19b5 Split up deadlkres() to make it more readable in anticipation of
further changes adding another level of indentation.

Some of the logic got simplified with the break out functions.
There should be no functional changes.

Reviewed by:	kib
Sponsored by:	iXsystems, Inc.
Differential Revision:		https://reviews.freebsd.org/D15914
2018-07-05 17:06:54 +00:00
Kyle Evans
39d44f7f15 kern_environment: use any provided environments, evict hintmode/envmode
At the moment, hintmode and envmode are used to indicate whether static
hints or static env have been provided in the kernel config(5) and the
static versions are mutually exclusive with loader(8)-provided environment.
hintmode *can* be reconfigured later to pull from the dynamic environment,
thus taking advantage of the loader(8) or post-kmem environment setting.

This changeset fixes both problems at once to move us from a semi-confusing
state to a consistent state: if an environment file, hints file, or
loader(8) environment are provided, we use them in a well-known order of
precedence:

- loader(8) environment
- static environment
- static hints file

Once the dynamic environment is setup this becomes a moot point. The
loader(8) and static environments are merged (respecting the above order of
precedence), and the static hints are merged in on an as-needed basis after
the dynamic environment has been setup.

Hints lookup are changed to respect all of the above. Before the dynamic
environment is setup, lookups use the above-mentioned order and fallback to
the next environment if a matching hint is not found. Once the dynamic
environment is setup, that is used on its own since it captures all of the
above information plus any dynamic kenv settings that came up later in boot.

The following tangentially related changes were made to res_find:

- A hintp cookie is now passed in so that related searches continue using
  the chain of environments (or dynamic environment) without relying on
  global state
- All three environments will be searched if they actually have valid hints
  to use, rather than just choosing the first environment that actually had
  a hint and rolling with that only

The hintmode sysctl has been ripped out. static_{env,hints}.disabled are
still honored and will disable their respective environments from being used
for hint lookups and from being merged into the dynamic environment, as
expected.

MFC after:	1 month (maybe)
Differential Revision:	https://reviews.freebsd.org/D15953
2018-07-05 16:30:32 +00:00
Kyle Evans
e28687347f Revert r335995 due to accidental changes snuck in 2018-07-05 16:28:43 +00:00
Kyle Evans
8ef5886303 kern_environment: use any provided environments, evict hintmode/envmode
At the moment, hintmode and envmode are used to indicate whether static
hints or static env have been provided in the kernel config(5) and the
static versions are mutually exclusive with loader(8)-provided environment.
hintmode *can* be reconfigured later to pull from the dynamic environment,
thus taking advantage of the loader(8) or post-kmem environment setting.

This changeset fixes both problems at once to move us from a semi-confusing
state to a consistent state: if an environment file, hints file, or
loader(8) environment are provided, we use them in a well-known order of
precedence:

- loader(8) environment
- static environment
- static hints file

Once the dynamic environment is setup this becomes a moot point. The
loader(8) and static environments are merged (respecting the above order of
precedence), and the static hints are merged in on an as-needed basis after
the dynamic environment has been setup.

Hints lookup are changed to respect all of the above. Before the dynamic
environment is setup, lookups use the above-mentioned order and fallback to
the next environment if a matching hint is not found. Once the dynamic
environment is setup, that is used on its own since it captures all of the
above information plus any dynamic kenv settings that came up later in boot.

The following tangentially related changes were made to res_find:

- A hintp cookie is now passed in so that related searches continue using
  the chain of environments (or dynamic environment) without relying on
  global state
- All three environments will be searched if they actually have valid hints
  to use, rather than just choosing the first environment that actually had
  a hint and rolling with that only

The hintmode sysctl has been ripped out. static_{env,hints}.disabled are
still honored and will disable their respective environments from being used
for hint lookups and from being merged into the dynamic environment, as
expected.

MFC after:	1 month (maybe)
Differential Revision:	https://reviews.freebsd.org/D15953
2018-07-05 16:25:48 +00:00
Bjoern A. Zeeb
0fb9f29bae With the introduction of reapers and reaplists in r275800,
proc0 and init are setup as a circular dependency.

create_init() calls fork1() which calls do_fork(). There the
newproc (initproc) is setup with a reaper of proc0 who's reaper
points to itself. The newproc (initproc) is then put on its
reaper's (proc0) p_reaplist (initproc is a descendants of proc0
for proc0 to reap). Upon return to create_init(), proc0 is
added to initproc's p_reaplist (which would mean proc0 is a
descendant of init, for init to reap). This creates a
circular dependency which eventually leads to LIST corruptions
when trying to kill init and a proc0.

For the base system we never really hit this case during reboot.
The problem only became visible after adding more virtual process
spaces which could go away cleanly (work existing in an experimental
branch).

Reviewed by:	kib
Sponsored by:	iXsystems, Inc.
Differential Revision:	https://reviews.freebsd.org/D15924
2018-07-05 16:16:28 +00:00
Brooks Davis
714c03c81e Revert r335983.
The bfd linker in tree doesn't support multiple names for the same
symbol (at least with current flags).
2018-07-05 16:03:03 +00:00
Brooks Davis
5b04a71dae Get rid of netbsd_lchown and netbsd_msync syscall entries.
No valid FreeBSD binary ever called them (they would call lchown and
msync directly) and we haven't supported NetBSD binaries in ages.

Reviewed by:	kib
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D15814
2018-07-05 14:12:56 +00:00
Konstantin Belousov
dbadb01591 Silence warnings about unused variables when RACCT is defined but RCTL
is not.

Reported by:	Dries Michiels <driesm.michiels@gmail.com>
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2018-07-05 13:37:31 +00:00
Brooks Davis
f38b68ae8a Make struct xinpcb and friends word-size independent.
Replace size_t members with ksize_t (uint64_t) and pointer members
(never used as pointers in userspace, but instead as unique
idenitifiers) with kvaddr_t (uint64_t). This makes the structs
identical between 32-bit and 64-bit ABIs.

On 64-bit bit systems, the ABI is maintained. On 32-bit systems,
this is an ABI breaking change. The ABI of most of these structs
was previously broken in r315662.  This also imposes a small API
change on userspace consumers who must handle kernel pointers
becoming virtual addresses.

PR:		228301 (exp-run by antoine)
Reviewed by:	jtl, kib, rwatson (various versions)
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D15386
2018-07-05 13:13:48 +00:00
Matt Macy
10b8cd7f55 epoch(9): make nesting assert in epoch_wait_preempt more specific
Reported by:	markj
2018-07-04 21:34:08 +00:00
Mariusz Zaborski
6cad1a5d14 Add description to debug.ncores sysctl.
Reviewed by:	bcr
Differential Revision:	https://reviews.freebsd.org/D16123
2018-07-04 17:06:51 +00:00