Commit Graph

244566 Commits

Author SHA1 Message Date
Dimitry Andric
a6f721ece8 Make fractional delays for top(1) work for interactive mode.
In r334906, the -s option was changed to allow fractional times, but
this only functioned correctly for batch mode.  In interactive mode, any
delay below 1.0 would get floored to zero.  This would put top(1) into a
tight loop, which could be difficult to interrupt.

Fix this by storing the -s option value (after validation) into a struct
timeval, and using that struct consistently for delaying with select(2).

Next up is to allow interactive entry of a fractional delay value.

MFC after:	3 days
2019-09-27 20:20:21 +00:00
Andrew Gallatin
b2dba6634b kTLS: Fix a bug where we would not encrypt anon data inplace.
Software Kernel TLS needs to allocate a new destination crypto
buffer when encrypting data from the page cache, so as to avoid
overwriting shared clear-text file data with encrypted data
specific to a single socket. When the data is anonymous, eg, not
tied to a file, then we can encrypt in place and avoid allocating
a new page. This fixes a bug where the existing code always
assumes the data is private, and never encrypts in place. This
results in unneeded page allocations and potentially more memory
bandwidth consumption when doing socket writes.

When the code was written at Netflix, ktls_encrypt() looked at
private sendfile flags to determine if the pages being encrypted
where part of the page cache (coming from sendfile) or
anonymous (coming from sosend). This was broken internally at
Netflix when the sendfile flags were made private, and the
M_WRITABLE() check was added. Unfortunately, M_WRITABLE() will
always be false for M_NOMAP mbufs, since one cannot just mtod()
them.

This change introduces a new flags field to the mbuf_ext_pgs
struct by stealing a byte from the tls hdr. Note that the current
header is still 2 bytes larger than the largest header we
support: AES-CBC with explicit IV. We set MBUF_PEXT_FLAG_ANON
when creating an unmapped mbuf in m_uiotombuf_nomap() (which is
the path that socket writes take), and we check for that flag in
ktls_encrypt() when looking for anon pages.

Reviewed by:	jhb
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21796
2019-09-27 20:08:19 +00:00
Ed Maste
f84ed82834 controlelf: update man page
Some minor corrections, clarifications or rewording.
2019-09-27 19:26:52 +00:00
Andrew Gallatin
6554362c66 kTLS support for TLS 1.3
TLS 1.3 requires a few changes because 1.3 pretends to be 1.2
with a record type of application data. The "real" record type is
then included at the end of the user-supplied plaintext
data. This required adding a field to the mbuf_ext_pgs struct to
save the record type, and passing the real record type to the
sw_encrypt() ktls backend functions.

Reviewed by:	jhb, hselasky
Sponsored by:	Netflix
Differential Revision:	D21801
2019-09-27 19:17:40 +00:00
Mateusz Guzik
708cf7eb6c cache: decrease ncnegfactor to 5
The current mechanism is bogus in several ways:
- the limit is a percentage of total entries added, which means negative
entries get evicted all the time even if there are plenty of resources
- evicting code is almost not concurrent, which makes it unable to
remove entries fast enough when doing something as simple as -j 104
buildworld
- there is no support for performing mass removal if necessary

Vast majority of negative entries never get any hits. Only evicting
them when the filesystem demands it results in a significant growth of
the namecache with almost no improvement in the hit ratio.

Sample result about afer 90 minutes of poudriere -j 104:

           current    no evict   % of the original
numneg     219737     2013157    916
numneghits 266711906  263544562  98 [1]

[1] this may look funny but there is a certain dose of variation to the
build

The number was chosen as something which mostly eliminates spurious
evictions during lighter workloads but still keeps the total at bay.

Sponsored by:	The FreeBSD Foundation
2019-09-27 19:14:03 +00:00
Mateusz Guzik
e643141838 cache: stop requeuing negative entries on the hot list
Turns out it does not improve hit ratio, but it does come with a cost
induces stemming from dirtying hit entries.

Sample result: hit counts of evicted entries after 2 buildworlds

before:

           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@                180865
               1 |@@@@@@@                                  49150
               2 |@@@                                      19067
               4 |@                                        9825
               8 |@                                        7340
              16 |@                                        5952
              32 |@                                        5243
              64 |@                                        4446
             128 |                                         3556
             256 |                                         3035
             512 |                                         1705
            1024 |                                         1078
            2048 |                                         365
            4096 |                                         95
            8192 |                                         34
           16384 |                                         26
           32768 |                                         23
           65536 |                                         8
          131072 |                                         6
          262144 |                                         0

after:
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@                184004
               1 |@@@@@@                                   47577
               2 |@@@                                      19446
               4 |@                                        10093
               8 |@                                        7470
              16 |@                                        5544
              32 |@                                        5475
              64 |@                                        5011
             128 |                                         3451
             256 |                                         3002
             512 |                                         1729
            1024 |                                         1086
            2048 |                                         363
            4096 |                                         86
            8192 |                                         26
           16384 |                                         25
           32768 |                                         24
           65536 |                                         7
          131072 |                                         5
          262144 |                                         0

Sponsored by:	The FreeBSD Foundation
2019-09-27 19:13:22 +00:00
Mateusz Guzik
312196df0f cache: make negative list shrinking a little bit concurrent
Continue protecting demotion from the hotlist and selection of the
target list with the ncneg_shrink_lock lock, but drop it before
relocking to zap the node.

While here count how many times we skipped shrinking due to the lock
being already taken.

Sponsored by:	The FreeBSD Foundation
2019-09-27 19:12:43 +00:00
Mateusz Guzik
95c6dd890a cache: stop recalculating upper limit each time a new entry is added
Sponsored by:	The FreeBSD Foundation
2019-09-27 19:12:20 +00:00
Ed Maste
43b40779db controlelf: exit with error if file endianness does not match host
We need to add support for cross-endian operation, but until that's done
just exit with an error rather than misbehaving.
2019-09-27 19:07:11 +00:00
Ed Maste
6abfc627d6 controlelf: simplify feature string parsing
Also add error handling on failure to seek/write updated value.
2019-09-27 18:49:13 +00:00
Konstantin Belousov
df08823d07 Improve MD page fault handlers.
Centralize calculation of signal and ucode delivered on unhandled page
fault in new function vm_fault_trap().  MD trap_pfault() now almost
always uses the signal numbers and error codes calculated in
consistent MI way.

This introduces the protection fault compatibility sysctls to all
non-x86 architectures which did not have that bug, but apparently they
were already much more wrong in selecting delivered signals on
protection violations.

Change the delivered signal for accesses to mapped area after the
backing object was truncated.  According to POSIX description for
mmap(2):
   The system shall always zero-fill any partial page at the end of an
   object. Further, the system shall never write out any modified
   portions of the last page of an object which are beyond its
   end. References within the address range starting at pa and
   continuing for len bytes to whole pages following the end of an
   object shall result in delivery of a SIGBUS signal.

   An implementation may generate SIGBUS signals when a reference
   would cause an error in the mapped object, such as out-of-space
   condition.
Adjust according to the description, keeping the existing
compatibility code for SIGSEGV/SIGBUS on protection failures.

For situations where kernel cannot handle page fault due to resource
limit enforcement, SIGBUS with a new error code BUS_OBJERR is
delivered.  Also, provide a new error code SEGV_PKUERR for SIGSEGV on
amd64 due to protection key access violation.

vm_fault_hold() is renamed to vm_fault().  Fixed some nits in
trap_pfault()s like mis-interpreting Mach errors as errnos.  Removed
unneeded truncations of the fault addresses reported by hardware.

PR:	211924
Reviewed by:	alc
Discussed with:	jilles, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D21566
2019-09-27 18:43:36 +00:00
Ed Maste
3801c66a97 controlelf: tidy up option parsing
Sponsored by:	The FreeBSD Foundation
2019-09-27 18:39:05 +00:00
Ed Maste
d70d327f0e controlelf: add protmax control
Sponsored by:	The FreeBSD Foundation
2019-09-27 17:28:25 +00:00
Dimitry Andric
1a444441d8 Correct the final argument name in the top(1) manpage.
The description talks about 'number', while the final argument was
'count'.  Since 'count' is already used for the count of displays,
change the final argument name to 'number'.

MFC after:	3 days
2019-09-27 17:11:21 +00:00
Ed Maste
33ac844050 controlelf: some style(9) cleanup
Submitted by:	clang-format
2019-09-27 16:57:32 +00:00
Mark Johnston
7cc833c598 Fix a race in vm_page_swapqueue().
vm_page_swapqueue() atomically transitions a page between queues.  To do
so, it must hold the page queue lock for the old queue.  However, once
the queue index has been updated, the queue lock no longer protects the
page's queue state.  Thus, we must speculatively remove the page from
the old queue before committing the queue state update, and roll back if
the update fails.

Reported and tested by:	pho
Reviewed by:	kib
Sponsored by:	Intel, Netflix
Differential Revision:	https://reviews.freebsd.org/D21791
2019-09-27 16:46:08 +00:00
Ed Maste
02ebf42d80 controlelf: install standard BSD 2 clause license
Reported by:	kaktus
Sponsored by:	The FreeBSD Foundation
2019-09-27 16:44:29 +00:00
Mark Johnston
87e93ea6c3 Fix object locking in vm_object_unwire() after r352174.
Now, vm_page_busy_sleep() expects the page's object to be locked.
vm_object_unwire() does some unusual lazy locking of the object chain
and keeps objects locked until a busy page is encountered or the loop
terminates.  When a busy page is encountered, rather than unlocking all
but the "bottom-level" object, we must instead skip the object to which
"tm" belongs.

Reported and tested by:	pho
Reviewed by:	kib
Discussed with:	jeff
Sponsored by:	Intel, Netflix
Differential Revision:	https://reviews.freebsd.org/D21790
2019-09-27 16:41:34 +00:00
Ed Maste
d6f95d83b4 controlelf: clean up warnings
- use explicit ELF note name when not found
- no trailing . on warnings
- no \n

Sponsored by:	The FreeBSD Foundation
2019-09-27 16:35:08 +00:00
Conrad Meyer
963c89ff4e nvdimm(4): Extract ACPI root bus driver
No functional change intended.

The intent is to add a "legacy" e820 pmem newbus bus for nvdimm device in a
subsequent revision, and it's a little more clear if the parent buses get
independent source files.

Quite a lot of ACPI-specific logic is left in nvdimm.c; disentangling that
is a much larger change (and probably not especially useful).

Reviewed by:	kib
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D21813
2019-09-27 16:32:44 +00:00
Ed Maste
d5cd80c811 Add tool to modify ELF binary feature control bits
This will allow feature control bits (e.g. for ASLR, PROT_MAX) to be
inspected or modified.

Some clean-up and additional work is likely still required, but we can
iterate on this in the tree.

Submitted by:	Bora Özarslan <borako.ozarslan@gmail.com>
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D19290
2019-09-27 16:27:52 +00:00
Andrew Turner
50bb04b750 Check the vfs option length is valid before accessing through
When a VFS option passed to nmount is present but NULL the kernel will
place an empty option in its internal list. This will have a NULL
pointer and a length of 0. When we come to read one of these the kernel
will try to load from the last address of virtual memory. This is
normally invalid so will fault resulting in a kernel panic.

Fix this by checking if the length is valid before dereferencing.

MFC after:	3 days
Sponsored by:	DARPA, AFRL
2019-09-27 16:22:28 +00:00
Warner Losh
4470d73996 Document varadic args as int, since you can't have short varadic args (they are
promoted to ints).

- `mode_t` is `uint16_t` (`sys/sys/_types.h`)
- `openat` takes variadic args
- variadic args cannot be 16-bit, and indeed the code uses int
- the manpage currently kinda implies the argument is 16-bit by saying `mode_t`

Prompted by Rust things: https://github.com/tailhook/openat/issues/21
Submitted by: Greg V at unrelenting
Differential Revision: https://reviews.freebsd.org/D21816
2019-09-27 16:11:47 +00:00
Ed Maste
cdb42801d0 compiler-rt: correct RISC-V struct_kernel_stat64_sz
The value of struct_kernel_stat64_sz introduced by review D5021 for
RISC-V was incorrect.

Also add a __riscv_xlen == 64 conditional as the 32-bit ABI is not yet
finalized.

Submitted by:	Luís Marques
Differential Revision:	https://reviews.freebsd.org/D21684
2019-09-27 13:14:36 +00:00
Pawel Biernacki
944e67b9cb Add myself (kaktus) as a src commiter.
Reviewed by:	kib (mentor)
Approved by:	kib (mentor), mjg (mentor)
Differential Revision:	https://reviews.freebsd.org/D21811
2019-09-27 10:19:28 +00:00
Alexander Motin
630d9800a1 Replace argument checks with assertions.
Those functions are used by kernel, and we can't check all possible argument
errors in production kernel.  Plus according to docs many of those errors
are checked by hardware.  Assertions should just help with code debugging.

MFC after:	2 weeks
2019-09-27 02:09:20 +00:00
Cy Schubert
a97e8d2fe4 Implement the dynamic add (-A) and removal (-R) of ippool pools
from the command line. Prior to this the functionality was mostly there
however since the pool type (-t) was not recognized by the -A and -R
command options -- not recognized by getopt(). Additionally the code to
implement the dynamic add and removal of pools didn't work.

When dynamically adding (-A) a pool a type (-t) to specify if the pool
is a tree or hash pool must  be specified. When dynamically removing (-R)
a pool, omitting -t will cause a search-and-destroy which will remove
both types of pools matching the name given (-m).

PR:		218433
MFC after:	1 week
2019-09-27 00:29:12 +00:00
Cy Schubert
e7257e1499 The no resolve (OPT_NORESOLVE) does nothing. Additionally, it (-R)
conflicts with the command option of the same name (also -R).
Remove the superfluous and confusing non-global non-command -R option.

PR:		218433
MFC after:	1 week
2019-09-27 00:29:09 +00:00
Cy Schubert
80aa6435f0 Sync with source:
Only a role of "ipf" is currentlysupported as the other documented
(and undocumented) roles are #ifdef'd out.

The plan is to complete ippool(8) as it is even in its current state
a powerful feature/tool.

PR:		218433
MFC after:	1 month
2019-09-27 00:29:06 +00:00
Cy Schubert
a263199455 Fix a typo.
MFC after:	3 days
2019-09-27 00:29:03 +00:00
Gleb Smirnoff
05ee75efe7 Move EPOCH_TRACE to opt_global.h, so that any external modules that
use epoch don't need Makefile tweaks.

The downside is that any developer who wants EPOCH_TRACE needs to
rebuild kernel in full, but that's fine.

Reviewed by:	imp
2019-09-26 21:12:47 +00:00
Oleksandr Tymoshenko
17b984a638 snd_hda: Add Intel Cannon Lake support
Add missing header change ommitted in r352775

MFC after:	2 weeks
X-MFC-with:	352775
2019-09-26 21:04:36 +00:00
Oleksandr Tymoshenko
c314e2aff2 snd_hda: Add Intel Cannon Lake support
Add PCI ids for Intel Cannon Lake PCH

Tested on:	HP Spectre x360 13-p0043dx
PR:		240574
Submitted by:	Neel Chauhan <neel@neelc.org>
Reviewed by:	imp, mizhka, ray
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D21789
2019-09-26 21:02:21 +00:00
Kyle Evans
e12ff89136 Further normalize copyright notices
- s/C/c/ where I've been inconsistent about it
- +SPDX tags
- Remove "All rights reserved" where possible

Requested by:	rgrimes (all rights reserved)
2019-09-26 16:19:22 +00:00
David Bright
d4f4430503 Correct mistake in MLINKS introduced in r352747
Messed up a merge conflict resolution and didn't catch that before
commit.

Sponsored by:	Dell EMC Isilon
2019-09-26 16:13:17 +00:00
David Bright
c4571256af sysent: regenerate after r352747.
Sponsored by:	Dell EMC Isilon
2019-09-26 15:41:10 +00:00
Mark Johnston
55248d32f2 Fix handling of invalid pages in exec_map_first_page().
exec_map_first_page() would unconditionally free an unbacked, invalid
page from the executable image.  However, it is possible that the page
is wired, in which case it is incorrect to free the page, so check for
additional wirings first.

Reported by:	syzkaller
Tested by:	pho
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21767
2019-09-26 15:35:35 +00:00
David Bright
9afb12bab4 Add an shm_rename syscall
Add an atomic shm rename operation, similar in spirit to a file
rename. Atomically unlink an shm from a source path and link it to a
destination path. If an existing shm is linked at the destination
path, unlink it as part of the same atomic operation. The caller needs
the same permissions as shm_unlink to the shm being renamed, and the
same permissions for the shm at the destination which is being
unlinked, if it exists. If those fail, EACCES is returned, as with the
other shm_* syscalls.

truss support is included; audit support will come later.

This commit includes only the implementation; the sysent-generated
bits will come in a follow-on commit.

Submitted by:	Matthew Bryan <matthew.bryan@isilon.com>
Reviewed by:	jilles (earlier revision)
Reviewed by:	brueffer (manpages, earlier revision)
Relnotes:	yes
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D21423
2019-09-26 15:32:28 +00:00
Jonathan T. Looney
0b18fb0798 Add new functionality to switch to using cookies exclusively when we the
syn cache overflows. Whether this is due to an attack or due to the system
having more legitimate connections than the syn cache can hold, this
situation can quickly impact performance.

To make the system perform better during these periods, the code will now
switch to exclusively using cookies until the syn cache stops overflowing.
In order for this to occur, the system must be configured to use the syn
cache with syn cookie fallback. If syn cookies are completely disabled,
this change should have no functional impact.

When the system is exclusively using syn cookies (either due to
configuration or the overflow detection enabled by this change), the
code will now skip acquiring a lock on the syn cache bucket. Additionally,
the code will now skip lookups in several places (such as when the system
receives a RST in response to a SYN|ACK frame).

Reviewed by:	rrs, gallatin (previous version)
Discussed with:	tuexen
Sponsored by:	Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D21644
2019-09-26 15:18:57 +00:00
Jonathan T. Looney
0bee4d631a Access the syncache secret directly from the V_tcp_syncache variable,
rather than indirectly through the backpointer to the tcp_syncache
structure stored in the hashtable bucket.

This also allows us to remove the requirement in syncookie_generate()
and syncookie_lookup() that the syncache hashtable bucket must be
locked.

Reviewed by:	gallatin, rrs
Sponsored by:	Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D21644
2019-09-26 15:06:46 +00:00
Jonathan T. Looney
867e98f8ee Remove the unused sch parameter to the syncache_respond() function. The
use of this parameter was removed in r313330. This commit now removes
passing this now-unused parameter.

Reviewed by:	gallatin, rrs
Sponsored by:	Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D21644
2019-09-26 15:02:34 +00:00
Alexander Motin
34a5c41c43 Add kern.cam.da.X.quirks tunable, similar existing for ada.
Submitted by:	Michael Lass
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D20677
2019-09-26 14:48:39 +00:00
Ed Maste
20bd59416d bspatch: add integer overflow checks
Introduce a new add_off_t static function that exits with an error
message if there's an overflow, otherwise returns their sum.  Use this
when adding values obtained from the input patch.

Reviewed by:	delphij, allanjude (earlier)
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D7897
2019-09-26 13:27:25 +00:00
Toomas Soome
11fc80a098 kernel terminal should initialize fg and bg variables before calling TUNABLE_INT_FETCH
We have two ways to check if kenv variable exists - either we check return
value from TUNABLE_INT_FETCH, or we pre-initialize the variable and check
if this value did change. In terminal_init() it is more convinient to
use pre-initialized variables.

Problem was revealed by older loader.efi, which did not set teken.* variables.

Reported by:	tuexen
2019-09-26 07:19:26 +00:00
Toomas Soome
29f7096df9 vt: use proper return value check with TUNABLE_INT_FETCH
The TUNABLE_INT_FETCH is macro around getenv_int() and we will get
return value 0 or 1 for failure or success, we can use it to decide
which background color to use.
2019-09-26 07:14:54 +00:00
Cy Schubert
4fcb870612 Teach the ippool parser about address families. This is a precursor
to implementing IPv6 support within ippool which requires reworking
radix_ipf.c.

MFC after:	1 month
2019-09-26 03:09:45 +00:00
Cy Schubert
d096bd7911 ipf mistakenly regards UDP packets with a checksum of 0xffff as bad.
Obtained from:	NetBSD fil.c r1.30, NetBSD PR/54443
MFC after:	3 days
2019-09-26 03:09:42 +00:00
Rick Macklem
ee7201a725 Replace all mtx_assert() calls for n_mtx and ncl_iod_mutex with macros.
To be consistent with replacing the mtx_lock()/mtx_unlock() calls on
the NFS node mutex (n_mtx) and ncl_iod_mutex, this patch replaces
all mtx_assert() calls on these mutexes with macros as well.
This will simplify changing these locks to sx locks in a future commit.
However, this change may be delayed indefinitely, since it appears there
is a deadlock when vnode_pager_setsize() is called to shrink the size
and the NFS node lock is held.
There is no semantic change as a result of this commit.

Suggested by:	kib
MFC after:	1 week
2019-09-26 02:54:45 +00:00
Conrad Meyer
407c48f060 amd64 pmap: Clarify largemap bootverbose message units
A PML4 covers 512 gigabytes, not gigabits.  Use the typical B suffix for
bytes.  No functional change.

Sponsored by:	Dell EMC Isilon
2019-09-26 01:51:55 +00:00
Conrad Meyer
f7b69dd986 amd64: Expose vm.pmap.large_map_pml4_entries as a sysctl node
It's nice to have sysctl nodes for tunables.

Sponsored by:	Dell EMC Isilon
2019-09-26 01:50:26 +00:00