Commit Graph

16303 Commits

Author SHA1 Message Date
cem
91754b1945 dev_refthread: Do not initialize *ref when reference was not acquired
Like the companion API devvn_refthread, leave *ref uninitialized when a
reference was not acquired.  Initializing to 1 provides a vaguely
correct-looking but bogus value for broken callers to (mistakenly) pass to
dev_relthread() when refthread fails.

Make it even more clear to consumers that dev_relthread is only valid when
dev_refthread succeeds.

Reviewed by:	kib, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D16885
2018-10-20 19:42:38 +00:00
cem
03435f0383 tty info (^T): Add optional kernel stack(9) traces
It is often useful for developers and administrators to determine a running
thread's stack for debugging purposes.  With this feature, using ^T will
print that information

For now, the feature is disabled by default.  Enable with sysctl
kern.tty_info_kstacks=1.

Discussed with:	markj
Reviewed by:	oshogbo
Relnotes:	yes
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D17621
2018-10-20 18:42:28 +00:00
cem
515c6c397b Replace ttyprintf with sbuf_printf and tty drain routine
Add string variants of cnputc and tty_putchar, and use them from the tty
sbuf drain routine.

Suggested by:	ed@
Sponsored by:	Dell EMC Isilon
2018-10-20 18:31:36 +00:00
cem
c92c63268e Add flags variants to linker_files / stack(9) symbol resolution
Some best-effort consumers may find trylock behavior for stack(9) symbol
resolution acceptable.  Expose that behavior to such consumers.

This API is ugly.  If in the future the modules and linker file list locking
is cleaned up such that the linker_files list can be iterated safely without
acquiring a sleepable lock, this API should be removed.  However, most of
the time nothing will be holding the linker files lock exclusive and the
acquisition can proceed.

Reviewed by:	markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D17620
2018-10-20 18:08:43 +00:00
markj
8998a23151 Create some global domainsets and refactor NUMA registration.
Pre-defined policies are useful when integrating the domainset(9)
policy machinery into various kernel memory allocators.

The refactoring will make it easier to add NUMA support for other
architectures.

No functional change intended.

Reviewed by:	alc, gallatin, jeff, kib
Tested by:	pho (part of a larger patch)
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17416
2018-10-20 17:36:00 +00:00
bz
635949fea7 In r78161 the lookup_set linker method was introduced which optionally
returns the section start and stop locations as well as a count if the
caller asks for them.
There was only one out-of-file consumer of count which did not actually
use it and hence was eliminated in r339407.
In r194784 parse_dpcpu(), and in r195699 parse_vnet() (a copy of the
former) started to use the link_elf_lookup_set() interface internally
also asking for the count.

count is computed as the difference of the void **stop - void **start
locations and as such, if the absoulte numbers
	(stop - start) % sizeof(void *) != 0
a round-down happens, e.g., **stop 0x1003 - **start 0x1000 => count 0.

To get the section size instead of "count is the number of pointer
elements in the section", the parse_*() functions do a
	count *= sizeof(void *).
They use the result to allocate memory and copy the section data
into the "master" and per-instance memory regions with a size of
count.

As a result of count possibly round-down this can miss the last
bytes of the section.  The good news is that we do not touch
out of bounds memory during these operations (we may at a later stage
if the last bytes would overflow the master sections).
Given relocation in elf_relocaddr() works based on the absolute
numbers of start and stop, this means that we can possibly try to
access relocated data which was never copied and hence we get
random garbage or at best zeroed memory.

Stop the two (last) consumers of count (the parse_*() functions)
from using count as well, and calculate the section size based on
the absolute numbers of stop and start and use the proper size for
the memory allocation and data copies.  This will make the symbols
in the last bytes of the pcpu or vnet sections be presented as
expected.

PR:			232289
Approved by:		re (gjb)
MFC after:		2 weeks
2018-10-18 20:20:41 +00:00
jamie
47dc3d2edc Fix typos from r339409.
Reported by:	maxim
Approved by:	re (gjb)
2018-10-18 15:02:57 +00:00
jtl
7739ba33c8 r334853 added a "socket destructor" callback. However, as implemented, it
was really a "socket close" callback.

Update the socket destructor functionality to run when a socket is
destroyed (rather than when it is closed). The original submitter has
confirmed that this change satisfies the intended use case.

Suggested by:	rwatson
Submitted by:	Michio Honda <micchie at sfc.wide.ad.jp>
Tested by:	Michio Honda <micchie at sfc.wide.ad.jp>
Approved by:	re (kib)
Differential Revision:	https://reviews.freebsd.org/D17590
2018-10-18 14:20:15 +00:00
jamie
ae3e1ed6d1 Add a new jail permission, allow.read_msgbuf. When true, jailed processes
can see the dmesg buffer (this is the current behavior).  When false (the
new default), dmesg will be unavailable to jailed users, whether root or
not.

The security.bsd.unprivileged_read_msgbuf sysctl still works as before,
controlling system-wide whether non-root users can see the buffer.

PR:		211580
Submitted by:	bz
Approved by:	re@ (kib@)
MFC after:	3 days
2018-10-17 16:11:43 +00:00
bz
ad5fefcc73 The countp argument passed to linker_file_lookup_set() in
linker_load_dependencies() is unused, so no need to ask for the
value in first place.  Remove the unused "count" variable.

Approved by:	re (kib)
2018-10-17 10:31:08 +00:00
markj
6ccc1b583b Reparent a child of pdfork(2) to its reaper when the procdesc is closed.
Unconditionally reparenting to PID 1 breaks the procctl(2) reaper
functionality.

Add a regression test for this case.

Reviewed by:	kib
Approved by:	re (gjb)
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17589
2018-10-16 20:06:56 +00:00
glebius
7e66c109f2 Plug sendfile(2) on a listening socket with proper error code.
Reported by:	ngie
Reviewed by:	ngie
Approved by:	re (delphij)
2018-10-16 15:57:16 +00:00
kevans
cbbb57703a Correct COMPAT* macro names in syscalls.master
Both ^/sys/compat/freebsd32/syscalls.master and ^/sys/kern/syscalls.master
cited "COMPAT[n] #ifdef" instead of "COMPAT_FREEBSD[n] #ifdef" in places.

Approved by:	re (glebius)
2018-10-15 21:35:57 +00:00
mjg
468ae8ae63 capsicum: provide cap_rights_fde_inline
Reading caps is in the hot path (on each successful fd lookup), but
completely unnecessarily requires a function call.

Approved by:	re (gjb)
Sponsored by:	The FreeBSD Foundation
2018-10-12 23:48:10 +00:00
mjg
1b0cd3801a Add a file missed in r339321
Approved by:	re (implicit)
Sad face: mjg
2018-10-12 00:32:45 +00:00
mjg
7346592324 Provide string functions for use before ifuncs get resolved.
The change is a no-op for architectures which don't ifunc memset,
memcpy nor memmove.

Convert places which need them. Xen bits by royger.

Reviewed by:	kib
Approved by:	re (gjb)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17487
2018-10-11 23:28:04 +00:00
jamie
3755eafea0 Fix the test prohibiting jails from sharing IP addresses.
It's not supposed to be legal for two jails to contain the same IP address,
unless both jails contain only that one address.  This is the behavior
documented in jail(8), and is there to prevent confusion when multiple
jails are listening on IADDR_ANY.

VIMAGE jails (now the default for GENERIC kernels) test this correctly,
but non-VIMAGE jails have been performing an incomplete test when nested
jails are used.

Approved by:	re@ (kib@)
MFC after:	5 days
2018-10-06 02:10:32 +00:00
mmacy
9a1ff096d9 eliminate locking surrounding ui_vmsize and swap reserve by using atomics
Change swap_reserve and swap_total to be in units of pages so that
swap reservations can be done using only atomics instead of using a single
global mutex for swap_reserve and a single mutex for all processes running
under the same uid for uid accounting.

Results in mmap speed up and a 70% increase in brk calls / second.

Reviewed by:	alc@, markj@, kib@
Approved by:	re (delphij@)
Differential Revision:	https://reviews.freebsd.org/D16273
2018-10-05 05:50:56 +00:00
glebius
968d142094 In PR 227259, a user is reporting that they have code which is using
shutdown() to wakeup another thread blocked on a stream listen socket.
This code is failing, while it used to work on FreeBSD 10 and still
works on Linux.

It seems reasonable to add another exception to support something users are
actually doing, which used to work on FreeBSD 10, and still works on Linux.
And, it seems like it should be acceptable to POSIX, as we still return
ENOTCONN.

This patch is different to what had been committed to stable/11, since
code around listening sockets is different. Patch in D15019 is written
by jtl@, slightly modified by me.

PR:		227259
Obtained from:	jtl
Approved by:	re (kib)
Differential Revision:  D15019
2018-10-03 17:40:04 +00:00
andrew
2a6fe8c6e8 Add kernel ifunc support on arm64.
Tested with ifunc resolvers in the kernel and module with calls from
kernel to kernel, module to kernel, and module to module.

Reviewed by:	kib (previous version)
Approved by:	re (gjb)
Differential Revision:	https://reviews.freebsd.org/D17370
2018-10-01 18:51:08 +00:00
gallatin
770bffbee6 Allow empty NUMA memory domains to support Threadripper2
The AMD Threadripper 2990WX is basically a slightly crippled Epyc.
Rather than having 4 memory controllers, one per NUMA domain, it has
only 2  memory controllers enabled. This means that only 2 of the
4 NUMA domains can be populated with physical memory, and the
others are empty.

Add support to FreeBSD for empty NUMA domains by:

- creating empty memory domains when parsing the SRAT table,
    rather than failing to parse the table
- not running the pageout deamon threads in empty domains
- adding defensive code to UMA to avoid allocating from empty domains
- adding defensive code to cpuset to avoid binding to an empty domain
    Thanks to Jeff for suggesting this strategy.

Reviewed by:	alc, markj
Approved by:	re (gjb@)
Differential Revision:	https://reviews.freebsd.org/D1683
2018-10-01 14:14:21 +00:00
jhb
8b293ebe2a Regenerate after UNIMPL -> OBSOL changes in r339001.
Approved by:	re (gjb)
2018-09-28 17:25:28 +00:00
jhb
a5ac7b0f85 Mark various removed system calls as OBSOL instead of UNIMPL.
This is mostly a cosmetic change except that obsolete system calls are
assigned meaningful names in the names arrays which means that using
tools like kdump or truss against binaries invoking these system calls
will print out the name instead of the number.  The script I use to
generate the XML list of syscalls for GDB also ignores UNIMPL but not
OBSOL entries.  In general UNIMPL should only be used to reserve
placeholders for system calls that have never been implemented while
system calls that existed at one time in FreeBSD but were removed
should be marked OBSOL instead.

Reviewed by:	brooks, kib, imp
Approved by:	re (gjb)
Differential Revision:	https://reviews.freebsd.org/D17344
2018-09-28 17:23:54 +00:00
gordon
19c946c159 Clear stack allocated data structure to prevent kernel memory leak.
Reported by:	Thomas Barabosch, Fraunhofer FKIE
Reviewed by:	wes@
Approved by:	re (implicit)
Approved by:	so
Security:	FreeBSD-EN-18:12.mem
Security:	CVE-2018-17155
2018-09-27 18:39:54 +00:00
mjg
e3c2d3a906 Eliminate false sharing in malloc due to statistic collection
Currently stats are collected in a MAXCPU-sized array which is not
aligned and suffers enormous false-sharing. Fix the problem by
utilizing per-cpu allocation.

The counter(9) API is not used here as it is too incomplete and does
not provide a win over per-cpu zone sized for malloc stats struct. In
particular stats are being reported for each cpu separately by just
copying what is supposed to be an array element for given cpu.

This eliminates significant false-sharing during malloc-heavy tests
e.g. on Skylake. See the review for details.

Reviewed by:	markj
Approved by:	re (kib)
Differential Revision:	https://reviews.freebsd.org/D17289
2018-09-23 19:00:06 +00:00
mjg
00ab82889c select: stop doing zero-sized memsets
Approved by:	re (kib)
2018-09-21 13:20:41 +00:00
markj
c030a808b9 Ensure that imports into per-domain kmem arenas are KVA_QUANTUM-aligned.
The old code appears to assume that vmem_alloc() would import
size-aligned KVA chunks from the parent kernel_arena, but vmem doesn't
provide this guarantee.

Also remove the unused global RWX arena and add comments explaining why
we have per-domain arenas.

Reported by:	alc
Reviewed by:	alc, kib (previous version)
Approved by:	re (gjb)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17249
2018-09-20 18:29:55 +00:00
mjg
846a8dd029 vfs: remove lookup_shared tunable
Reviewed by:	kib, jhb
Approved by:	re (gjb)
Differential Revision:	https://reviews.freebsd.org/D17253
2018-09-20 18:25:26 +00:00
mjg
4279599452 fd: prevent inlining of _fdrop thorough kern_descrip.c
fdrop is used in several places in the file and almost never has to call
_fdrop. Thus inlining it is a pure waste of space.

Approved by:	re (kib)
2018-09-20 13:32:40 +00:00
kib
610ca65f57 Fix state of dquot-less vnodes after failed quotaoff.
UFS quotaoff iterates over all mp vnodes, and derefences and clears
the pointers to corresponding dquots. If SU work items transiently
reference some of dquots,quotaoff() would eventually fail, but all
processed vnodes are already stripped from dquots.  The state is
problematic, since quotas are left enabled, but there is no dquots
where blocks and inodes can be accounted.  The result is assertion
failures and NULL pointer dereferences.

Fix it by suspending writes around quotaoff() call.  Since the
filesystem is synced, no dandling references to dquots from SU
workitems can left behind, which means that quotaoff succeeds.

The complication there is that quotaoff VFS op is performed with the
mount point busied, while to suspend, we need to start write on the
mp.  If vn_start_write() is called on busied mp, system might deadlock
against parallel unmount request.  Handle this by unbusy-ing mp before
starting write, which in turn requires changing the quotaoff()
interface to return with the mount point not busied, same as was done
for quotaon().

Reviewed by:	mckusick
Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
Approved by:	re (gjb)
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D17208
2018-09-19 14:36:57 +00:00
gordon
8f59931436 Correct ELF header parsing code to prevent invalid ELF sections from
disclosing memory.

Submitted by:	markj
Reported by:	Thomas Barabosch, Fraunhofer FKIE
Approved by:	re (implicit)
Approved by:	so
Security:	FreeBSD-SA-18:12.elf
Security:	CVE-2018-6924
Sponsored by:	The FreeBSD Foundation
2018-09-12 04:57:34 +00:00
markj
02b7bd84a2 Rename hardclock_cnt() to hardclock() and remove the old implementation.
Also remove some related and unused subroutines.  They have long been
replaced by variants that handle multiple coalesced events with a single
call.

No functional change intended.

Reviewed by:	cem, kib
Approved by:	re (gjb)
Differential Revision:	https://reviews.freebsd.org/D17029
2018-09-06 02:10:59 +00:00
markj
b1bab99c84 Correct the condition under which we allocate a terminator node.
We will have last_block < blocks if the block count is divisible
by BLIST_BMAP_RADIX, but a terminator node is still needed if the
tree isn't balanced.  In this case we were overruning the blist
array by 16 bytes during initialization.

While here, add a check for the invalid blocks == 0 case.

PR:		231116
Reviewed by:	alc, kib (previous version), Doug Moore <dougm@rice.edu>
Approved by:	re (gjb)
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17020
2018-09-05 19:05:30 +00:00
kib
0a635fe513 Add amd64 mdthread fields needed for the upcoming EFI RT exception
handling.

This is split into a separate commit from the main change to make it
easier to handle possible revert after upcoming KBI freeze.

Reviewed by:	kevans
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:    re (rgrimes)
Differential revision:	https://reviews.freebsd.org/D16972
2018-09-02 21:16:43 +00:00
kib
c277253350 Improve error messages from clock_if.m method failures.
Print error message in verbose mode when CLOCK_SETTIME() clock_if.m
method failed.  For EFIRT RTC clock, add error code for the failure of
CLOCK_GETTIME() report.

Reviewed by:	kevans
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:    re (rgrimes)
Differential revision:	https://reviews.freebsd.org/D16972
2018-09-02 20:17:51 +00:00
markm
d8723e8b03 Remove the Yarrow PRNG algorithm option in accordance with due notice
given in random(4).

This includes updating of the relevant man pages, and no-longer-used
harvesting parameters.

Ensure that the pseudo-unit-test still does something useful, now also
with the "other" algorithm instead of Yarrow.

PR:		230870
Reviewed by:	cem
Approved by:	so(delphij,gtetlow)
Approved by:	re(marius)
Differential Revision:	https://reviews.freebsd.org/D16898
2018-08-26 12:51:46 +00:00
alc
3799d78beb Eliminate the arena parameter to kmem_free(). Implicitly this corrects an
error in the function hypercall_memfree(), where the wrong arena was being
passed to kmem_free().

Introduce a per-page flag, VPO_KMEM_EXEC, to mark physical pages that are
mapped in kmem with execute permissions.  Use this flag to determine which
arena the kmem virtual addresses are returned to.

Eliminate UMA_SLAB_KRWX.  The introduction of VPO_KMEM_EXEC makes it
redundant.

Update the nearby comment for UMA_SLAB_KERNEL.

Reviewed by:	kib, markj
Discussed with:	jeff
Approved by:	re (marius)
Differential Revision:	https://reviews.freebsd.org/D16845
2018-08-25 19:38:08 +00:00
imp
2608f6dbcf Add a new device flag: DF_ATTACHED_ONCE
This flag is set once the device has been successfully attached. When
set, it inhibits devmatch from trying to match the device. This in
turn allows kldunload to work as expected. Prior to the change, the
driver would immediately reload because devmatch had no notion that
the driver had once been attached, and therefore shouldn't participate
in further matching.

Differential Revision: https://reviews.freebsd.org/D16735
2018-08-23 05:06:16 +00:00
imp
485cde8df8 Create devctl freeze/thaw.
This adds it to devctl, libdevctl, defines the two IOCTLs and
implements the kernel bits. causes any new drivers that are added via
kldload to be deferred until a 'thaw' comes in. These do not stack: it
is an error to freeze while frozen, or thaw while thawed.

Differential Revision: https://reviews.freebsd.org/D16735
2018-08-23 05:05:47 +00:00
cem
c843990202 devstat(9): Constify function parameters that can be const
No functional change.

When attempting to document the changed argument types in devstat.9, I
discovered the 20 year old manual page severely mismatched reality even
prior to my simple change.  So I took a first cut pass cleaning that up to
match reality.  I'm sure I've missed some things; the goal was just to leave
it better than when I started.

Sponsored by:	Dell EMC Isilon
2018-08-23 01:42:45 +00:00
cem
ffa2c7d287 KASSERT: Make runtime optionality optional
Add an option, KASSERT_PANIC_OPTIONAL, that allows runtime KASSERT()
behavior changes.  When this option is not enabled, code that allows
KASSERTs to become optional is not enabled, and all violated assertions
cause termination.

The runtime KASSERT behavior was added in r243980.

One important distinction here is that panic has __dead2
("attribute((noreturn))"), while kassert_panic does not.  Static analyzers
like Coverity understand __dead2.  Without it, KASSERTs go misunderstood,
resulting in many false positives that result from violation of program
invariants.

Reviewed by:	jhb, jtl, np, vangyzen
Relnotes:	yes
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D16835
2018-08-22 22:19:42 +00:00
markj
1d715a5168 Prepare the kernel linker to handle PC-relative ifunc relocations.
The boot-time ifunc resolver assumes that it only needs to apply
IRELATIVE relocations to PLT entries.  With an upcoming optimization,
this assumption no longer holds, so add the support required to handle
PC-relative relocations targeting GNU_IFUNC symbols.
- Provide a custom symbol lookup routine that can be used in early boot.
  The default lookup routine uses kobj, which is not functional at that
  point.
- Apply all existing relocations during boot rather than filtering
  IRELATIVE relocations.
- Ensure that we continue to apply ifunc relocations in a second pass
  when loading a kernel module.

Reviewed by:	kib
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D16749
2018-08-22 20:44:30 +00:00
tuexen
3e0ad7d794 Add SOL_SOCKET level socket option with name SO_DOMAIN to get
the domain of a socket.

This is helpful when testing and Solaris and Linux have the same
socket option using the same name.

Reviewed by:		bcr@, rrs@
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D16791
2018-08-21 14:04:30 +00:00
alc
71b5b012c4 Eliminate kmem_alloc_contig()'s unused arena parameter.
Reviewed by:	hselasky, kib, markj
Discussed with:	jeff
Differential Revision:	https://reviews.freebsd.org/D16799
2018-08-20 15:57:27 +00:00
kevans
ad8cc64e2b res_find: Fix fallback logic
The fallback logic was broken if hints were found in multiple environments.
If we found a hint in either the loader environment or the static
environment, fallback would be incremented excessively when we returned to
the environment-selection bits. These checks should have also been guarded
by the fbacklvl checks. As a result, fbacklvl could quickly get to a point
where we skip either the static environment and/or the static hints
depending on which environments contained valid hints.

The impact of this bug is minimal, mostly affecting mips boards that use
static hints and may have hints in either the loader environment or the
static environment.

There may be better ways to express the searchable environments and
describing their characteristics (immutable, already searched, etc.) but
this may be revisited after 12 branches.

Reported by:	Dan Nelson <dnelson_1901@yahoo.com>
Triaged by:	Dan Nelson <dnelson_1901@yahoo.com>
MFC after:	3 days
2018-08-18 19:45:56 +00:00
delphij
2acd1f2a25 Regen after r337998. 2018-08-18 06:33:51 +00:00
delphij
4f62d03ca0 getrandom(2) should not be restricted in capability mode. 2018-08-18 06:31:49 +00:00
markj
d1a00acf4d Typo.
X-MFC with:	r337974
2018-08-17 16:07:06 +00:00
markj
4e68a99c04 Add INVARIANTS-only fences around lockless vnode refcount updates.
Some internal KASSERTs access the v_iflag field without the vnode
interlock held after such a refcount update.  The fences are needed for
the assertions to be correct in the face of store reordering.

Reported and tested by:	jhibbits
Reviewed by:	kib, mjg
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D16756
2018-08-17 15:41:01 +00:00
oshogbo
1aa9b1400a capsicum: allow the setproctitle(3) function in capability mode
Capsicum in past allowed to change the process title.
This was broken with r335939.

PR:		230584
Submitted by:	Yuichiro NAITO <naito.yuichiro@gmail.com>
Reported by:	ian@niw.com.au
MFC after:	1 week
2018-08-17 14:35:10 +00:00