263006 Commits

Author SHA1 Message Date
mjg
2e8c47f811 amd64 pmap: batch chunk removal in pmap_remove_pages
pv list lock is the main bottleneck during poudriere -j 104 and
pmap_remove_pages is the most impactful consumer. It frees chunks with the lock
held even though it plays no role in correctness. Moreover chunks are often
freed in groups, sample counts during buildkernel (0-sized frees removed):

    value  ------------- Distribution ------------- count
          0 |                                         0
          1 |                                         8
          2 |@@@@@@@                                  19329
          4 |@@@@@@@@@@@@@@@@@@@@@@                   58517
          8 |                                         1085
         16 |                                         71
         32 |@@@@@@@@@@                               24919
         64 |                                         899
        128 |                                         7
        256 |                                         2
        512 |                                         0

Thus:
1. batch freeing
2. move it past unlocking pv list

Reviewed by:	alc (previous version), markj (previous version), kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21832
2019-09-29 20:44:13 +00:00
kevans
a32b126360 memfd_create(3): Don't actually force hugetlb size with MFD_HUGETLB
The size flags are only required to select a size on systems that support
multiple sizes. MFD_HUGETLB by itself is valid.
2019-09-29 17:30:10 +00:00
jilles
2748dea7b2 Adjust tests after page fault changes in r352807
Commit r352807 fixed various signal numbers and codes from page faults;
adjust the tests so they expect the fixes to be present.

PR:		211924
2019-09-29 15:17:58 +00:00
tuexen
5f6d43232d RFC 7112 requires a host to put the complete IP header chain
including the TCP header in the first IP packet.
Enforce this in tcp_output(). In addition make sure that at least
one byte payload fits in the TCP segement to allow making progress.
Without this check, a kernel with INVARIANTS will panic.
This issue was found by running an instance of syzkaller.

Reviewed by:		jtl@
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D21665
2019-09-29 10:45:13 +00:00
kevans
3e6fdbf4be MFD_*: swap ordering
This API is still young enough that I would expect no one to be dependant on
this yet... Swap the ordering while it's young to match Linux values to
potentially ease implementation of linuxolator syscall, being able to reuse
existing constants.
2019-09-29 03:26:29 +00:00
kevans
81f47d4eea fdt_slicer: bump to SI_ORDER_THIRD following r347183
r347183 bumped GEOM classes to SI_ORDER_SECOND to resolve a race between
them and the initialization of devsoftc.mtx in devinit, but missed this
dependency on g_flashmap that may now lose the race against GEOM
classes/g_init.

There's a great comment that describes the situation that has also been
updated with the new ordering of GEOM classes.

Reported by:	bdragon
MFC after:	4 days
2019-09-29 03:12:35 +00:00
manu
de6e65d7c9 Import DTS files from Linux 5.3 2019-09-28 23:08:19 +00:00
manu
f0911296ca arm: allwinner: Add pll_mipi to the files 2019-09-28 23:01:23 +00:00
manu
5b08cb7220 Import DTS files from Linux 5.2 2019-09-28 22:54:56 +00:00
manu
9fad8dc3b2 Import DTS files from Linux 5.3 2019-09-28 22:38:14 +00:00
manu
621885c930 Import DTS from Linux 5.2 2019-09-28 22:35:29 +00:00
manu
c1ed6afbc4 arm64: rockchip: Add usb2phy driver
This driver is for the usb phy present on rockchip SoC.
It only support RK3399 and host mode for now.
The driver expose the usb clock needed by the usb controller.
2019-09-28 22:25:21 +00:00
manu
35f05016ed dwc: Add more delay for chip reset
On rockchip board it seems that the value in the DTS
are not enough for reseting the chip, I don't know if
the value are really incorrect or if DELAY is not precise
enough or if the rockchip gpio driver have some "lag" of some
kind or not.
For now just add more delay.
2019-09-28 22:23:21 +00:00
manu
156f684147 arm64: rockchip: Fix map_gpio
The map_gpio function wasn't correct, the first element is the pin
and not the phandle.
2019-09-28 22:21:16 +00:00
manu
d3d27ce52f arm64: rockchip: Implement resets
Module resets where not implemented when rockchip clocks were commited.
Implement them.
Since all resets registers are contiguous a driver only need to give
the start offset and the number of resets. This avoid to have to declare
every resets.
2019-09-28 22:19:52 +00:00
manu
6e39845513 arm64: rockchip: rk3399: Add usb2 clocks 2019-09-28 22:17:26 +00:00
manu
755e490fa8 arm64: allwinner: a64: Add PLL_MIPI
PLL_MIPI is the last important PLL that we missed.
Add support for it.
Since it's one of the possible parent for TCON0 also add this clock
now that we can.
While here add some info about what video related clocks should be
enabled at boot and with what frequency.
2019-09-28 22:14:33 +00:00
alc
fe710b3242 Eliminate redundant calls to critical_enter() and critical_exit() from
pmap_update_entry().  It suffices that interrupts are blocked.

Reviewed by:	andrew, markj
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D21753
2019-09-28 17:16:03 +00:00
imp
075dc1b3fb Revert the mode_t -> int changes and add a warning in the BUGS section instead.
While FreeBSD's implementation of these expect an int inside of libc, that's an
implementation detail that we can hide from the user as it's the natural
promotion of the current mode_t type and before it is used in the kernel, it's
converted back to the narrower type that's the current definition of mode_t. As
such, documenting int is at best confusing and at worst misleading. Instead add
a note that these args are variadic and as such calling conventions may differ
from non-variadic arguments.
2019-09-28 17:15:48 +00:00
jhb
0b490330a3 Disable build of LOCAL_MODULES for cross-builds by default.
WITHOUT_LOCAL_MODULES can be set to disable LOCAL_MODULES for native
builds.  WITH_LOCAL_MODULES can be set to leave it enabled for cross
builds.

This does not use a knob in kern.opts.mk because the options framework
does not currently support options whose default varies on the build
type.  I discussed a few options there with Warner (e.g. maybe having
a tri-state where the default value is "auto" and having Makefile.inc1
apply logic when MK_LOCAL_MODULES is set to "auto"), but Warner ok'd
this approach for now until a better solution is implemented.

Requested by:	many
Reviewed by:	imp (in person at EuroBSDCon)
Differential Revision:	https://reviews.freebsd.org/D21608
2019-09-28 14:20:28 +00:00
jhb
7c6039a4fe Disable REPRODUCIBLE_BUILD for kernel builds.
The REPRODUCIBLE_BUILD option is actually managed in two separate
files.  src.opts.mk governs the setting for world builds and
kern.opts.mk governs it for kernel builds.  r350550 only changed the
default for world builds.

Reported by:	emaste
Reviewed by:	emaste
Differential Revision:	https://reviews.freebsd.org/D21444
2019-09-28 14:14:42 +00:00
tuexen
dc3d530b6c Replacing MD5 by SipHash improves the performance of the TCP time stamp
initialisation, which is important when the host is dealing with a
SYN flood.
This affects the computation of the initial TCP sequence number for
the client side.
This has been discussed with secteam@.

Reviewed by:		gallatin@
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D21616
2019-09-28 13:13:23 +00:00
tuexen
09c42850dd Ensure that the INP lock is released before leaving [gs]etsockopt()
for RACK specific socket options.
These issues were found by a syzkaller instance.
Reviewed by:		rrs@
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D21825
2019-09-28 13:05:37 +00:00
vmaffione
fc21e0e751 bhyve: support for enabling/disabling the net backend
Extend the net backend interface with two functions, namely netbe_rx_disable()
and netbe_rx_enable(), which can be used by the net device emulators to stop
the backend from invoking the receive callback. This is useful for device
emulators, i.e., on hardware resets or to implement receive backpressure.
The mevent module has been extendede to support the addition of a disabled
event. To prevent race conditions, the net backends will start with receive
operation disabled. A follow-up patch will use the new functionalities in
the virtio-net device.

Reviewed by:	jhb, markj
MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D20973
2019-09-28 12:02:43 +00:00
trasz
4ed8c79723 Fix Q_TOSTR(3) with GCC when it's called with first parameter being const.
Discussed with:	cem
MFC after:	2 weeks
Sponsored by:	Klara Inc, Netflix
Differential Revision:	https://reviews.freebsd.org/D21766
2019-09-28 09:54:03 +00:00
trasz
d046858b85 Rename ARB_REBALANCE(3) to ARB_REINSERT(3) to match tree(3),
and document it.

MFC after:	2 weeks
Sponsored by:	Klara Inc, Netflix
2019-09-28 09:50:01 +00:00
trasz
6ceea0b500 Sort MLINKS for arb(3), and actually make them work by fixing a '=' vs '+='
mixup.

MFC after:	2 weeks
Sponsored by:	Klara Inc, Netflix
2019-09-28 09:37:05 +00:00
trasz
253e3db566 Add RB_REINSERT(3), a low overhead alternative to removing a node
and reinserting it back with an updated key.

This is one of dependencies for the upcoming stats(3) code.

Reviewed by:	cem
Obtained from:	Netflix
MFC after:	2 weeks
Sponsored by:	Klara Inc, Netflix
Differential Revision:	https://reviews.freebsd.org/D21786
2019-09-28 09:22:52 +00:00
trasz
0216ab4d73 Move the SysV IPC stuff out of the 'abi' rc script, into a new one:
'sysvipc' - it has nothing to do with ABIs, and I'd like to later
rename 'abi' to 'linux', which better describes its purpose and also
matches the rcvar name.

Reviewed by:	emaste
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21615
2019-09-28 09:12:41 +00:00
jhibbits
3def8cb160 powerpc/booke64: Align initial stack setting to match that of aim64's
Clang9/LLD9 appears to get quite confused with the instruction stream used
to obtain the tmpstack pointer, almost as though it thinks this is a C
function, so tries to optimize it.  Since the AIM64 method doesn't use the
TOC to obtain the tmpstack, just follow that model, and lld won't get
confused.

Reported by:	bdragon
MFC after:	2 weeks
2019-09-28 03:33:07 +00:00
jhibbits
fe1dbc1156 dpaa(4): Fix memcpy size for threshold copy in NCSW contrib
On 64-bit platforms uintptr_t makes the copy twice as large as it should be.
This code isn't actually used in FreeBSD, since it's for guest mode only,
not hypervisor mode, but fixing it for completeness sake.

Reported by:	bdragon (clang9 build)
2019-09-28 02:49:46 +00:00
markj
d46f64003d Fix some problems with the SPARSE_MAPPING option in the kernel linker.
- Ensure that the end of the mapping passed to vm_page_wire() is
  page-aligned.  vm_page_wire() expects this.
- Wire pages before reading data into them.
- Apply protections specified in the segment descriptor using
  vm_map_protect() once relocation processing is done.
- On amd64, ensure that we load KLDs above KERNBASE, since they
  are compiled with the "kernel" memory model by default.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21756
2019-09-28 01:42:59 +00:00
markj
d130e737a2 Implement pmap_page_is_mapped() correctly on arm64 and riscv.
We must also check for large mappings.  pmap_page_is_mapped() is
mostly used in assertions, so the problem was not very noticeable.

Reviewed by:	alc
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D21824
2019-09-27 23:37:01 +00:00
markj
967300fd1a Correct the scope of several global variables.
They are accessed from multiple compilation units.  No functional change
intended.

MFC after:	1 week
Sponsored by:	Netflix
2019-09-27 21:04:33 +00:00
imp
0bdcc47654 Use set -o xtrace in preference to set -x for consistency with
the rest of nanobsd.sh.
2019-09-27 20:56:49 +00:00
imp
6841dfe820 Push and pop xtrace correctly for run_early_customize
run_early_customize is run as a shell list, not as a subshell, so that the side
effects of setting variables can affect later stages of the build (for better or
worse, it's been like this since it was introduced). It therefore has the side
effect of turning off xtrace always, which limits the usefulness of sh -x
nanobsd.sh. Remember the old setting and only turn off tracing after the command
if tracing was off before. All the other places where we do similar things we use
a subshell, so we don't need to do this.
2019-09-27 20:56:44 +00:00
imp
a2dccd04c5 Remove workaround for building on FreeBSD hosts prior to FreeBSD 10.
rm -x was introduced in the FreeBSD 10 time frame. 4 years ago I added a
function to cope with building nanobsd images on hosts as old FreeBSD 7 that
lacked rm -x. The workaround is no longer needed as FreeBSD 9 hasn't been
supported for almost 3 years. Eliminate the wrapper and use rm -x directly
again.
2019-09-27 20:56:31 +00:00
dim
3f0dbc935b Allow entering fractional delays in top(1) interactive mode.
This uses the same logic as with the -s option, first validating the
entered value, then storing the result in a struct timeval.

MFC after:	3 days
X-MFC-With:	r352818
2019-09-27 20:53:31 +00:00
dim
0457e6c24d Make fractional delays for top(1) work for interactive mode.
In r334906, the -s option was changed to allow fractional times, but
this only functioned correctly for batch mode.  In interactive mode, any
delay below 1.0 would get floored to zero.  This would put top(1) into a
tight loop, which could be difficult to interrupt.

Fix this by storing the -s option value (after validation) into a struct
timeval, and using that struct consistently for delaying with select(2).

Next up is to allow interactive entry of a fractional delay value.

MFC after:	3 days
2019-09-27 20:20:21 +00:00
gallatin
db240be427 kTLS: Fix a bug where we would not encrypt anon data inplace.
Software Kernel TLS needs to allocate a new destination crypto
buffer when encrypting data from the page cache, so as to avoid
overwriting shared clear-text file data with encrypted data
specific to a single socket. When the data is anonymous, eg, not
tied to a file, then we can encrypt in place and avoid allocating
a new page. This fixes a bug where the existing code always
assumes the data is private, and never encrypts in place. This
results in unneeded page allocations and potentially more memory
bandwidth consumption when doing socket writes.

When the code was written at Netflix, ktls_encrypt() looked at
private sendfile flags to determine if the pages being encrypted
where part of the page cache (coming from sendfile) or
anonymous (coming from sosend). This was broken internally at
Netflix when the sendfile flags were made private, and the
M_WRITABLE() check was added. Unfortunately, M_WRITABLE() will
always be false for M_NOMAP mbufs, since one cannot just mtod()
them.

This change introduces a new flags field to the mbuf_ext_pgs
struct by stealing a byte from the tls hdr. Note that the current
header is still 2 bytes larger than the largest header we
support: AES-CBC with explicit IV. We set MBUF_PEXT_FLAG_ANON
when creating an unmapped mbuf in m_uiotombuf_nomap() (which is
the path that socket writes take), and we check for that flag in
ktls_encrypt() when looking for anon pages.

Reviewed by:	jhb
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21796
2019-09-27 20:08:19 +00:00
emaste
2958109758 controlelf: update man page
Some minor corrections, clarifications or rewording.
2019-09-27 19:26:52 +00:00
gallatin
74ca068530 kTLS support for TLS 1.3
TLS 1.3 requires a few changes because 1.3 pretends to be 1.2
with a record type of application data. The "real" record type is
then included at the end of the user-supplied plaintext
data. This required adding a field to the mbuf_ext_pgs struct to
save the record type, and passing the real record type to the
sw_encrypt() ktls backend functions.

Reviewed by:	jhb, hselasky
Sponsored by:	Netflix
Differential Revision:	D21801
2019-09-27 19:17:40 +00:00
mjg
99543c9107 cache: decrease ncnegfactor to 5
The current mechanism is bogus in several ways:
- the limit is a percentage of total entries added, which means negative
entries get evicted all the time even if there are plenty of resources
- evicting code is almost not concurrent, which makes it unable to
remove entries fast enough when doing something as simple as -j 104
buildworld
- there is no support for performing mass removal if necessary

Vast majority of negative entries never get any hits. Only evicting
them when the filesystem demands it results in a significant growth of
the namecache with almost no improvement in the hit ratio.

Sample result about afer 90 minutes of poudriere -j 104:

           current    no evict   % of the original
numneg     219737     2013157    916
numneghits 266711906  263544562  98 [1]

[1] this may look funny but there is a certain dose of variation to the
build

The number was chosen as something which mostly eliminates spurious
evictions during lighter workloads but still keeps the total at bay.

Sponsored by:	The FreeBSD Foundation
2019-09-27 19:14:03 +00:00
mjg
306794b4a0 cache: stop requeuing negative entries on the hot list
Turns out it does not improve hit ratio, but it does come with a cost
induces stemming from dirtying hit entries.

Sample result: hit counts of evicted entries after 2 buildworlds

before:

           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@                180865
               1 |@@@@@@@                                  49150
               2 |@@@                                      19067
               4 |@                                        9825
               8 |@                                        7340
              16 |@                                        5952
              32 |@                                        5243
              64 |@                                        4446
             128 |                                         3556
             256 |                                         3035
             512 |                                         1705
            1024 |                                         1078
            2048 |                                         365
            4096 |                                         95
            8192 |                                         34
           16384 |                                         26
           32768 |                                         23
           65536 |                                         8
          131072 |                                         6
          262144 |                                         0

after:
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@                184004
               1 |@@@@@@                                   47577
               2 |@@@                                      19446
               4 |@                                        10093
               8 |@                                        7470
              16 |@                                        5544
              32 |@                                        5475
              64 |@                                        5011
             128 |                                         3451
             256 |                                         3002
             512 |                                         1729
            1024 |                                         1086
            2048 |                                         363
            4096 |                                         86
            8192 |                                         26
           16384 |                                         25
           32768 |                                         24
           65536 |                                         7
          131072 |                                         5
          262144 |                                         0

Sponsored by:	The FreeBSD Foundation
2019-09-27 19:13:22 +00:00
mjg
a67ed57908 cache: make negative list shrinking a little bit concurrent
Continue protecting demotion from the hotlist and selection of the
target list with the ncneg_shrink_lock lock, but drop it before
relocking to zap the node.

While here count how many times we skipped shrinking due to the lock
being already taken.

Sponsored by:	The FreeBSD Foundation
2019-09-27 19:12:43 +00:00
mjg
0c1eba6ea2 cache: stop recalculating upper limit each time a new entry is added
Sponsored by:	The FreeBSD Foundation
2019-09-27 19:12:20 +00:00
emaste
128abcff95 controlelf: exit with error if file endianness does not match host
We need to add support for cross-endian operation, but until that's done
just exit with an error rather than misbehaving.
2019-09-27 19:07:11 +00:00
emaste
2e1c48e4b2 controlelf: simplify feature string parsing
Also add error handling on failure to seek/write updated value.
2019-09-27 18:49:13 +00:00
kib
957270782d Improve MD page fault handlers.
Centralize calculation of signal and ucode delivered on unhandled page
fault in new function vm_fault_trap().  MD trap_pfault() now almost
always uses the signal numbers and error codes calculated in
consistent MI way.

This introduces the protection fault compatibility sysctls to all
non-x86 architectures which did not have that bug, but apparently they
were already much more wrong in selecting delivered signals on
protection violations.

Change the delivered signal for accesses to mapped area after the
backing object was truncated.  According to POSIX description for
mmap(2):
   The system shall always zero-fill any partial page at the end of an
   object. Further, the system shall never write out any modified
   portions of the last page of an object which are beyond its
   end. References within the address range starting at pa and
   continuing for len bytes to whole pages following the end of an
   object shall result in delivery of a SIGBUS signal.

   An implementation may generate SIGBUS signals when a reference
   would cause an error in the mapped object, such as out-of-space
   condition.
Adjust according to the description, keeping the existing
compatibility code for SIGSEGV/SIGBUS on protection failures.

For situations where kernel cannot handle page fault due to resource
limit enforcement, SIGBUS with a new error code BUS_OBJERR is
delivered.  Also, provide a new error code SEGV_PKUERR for SIGSEGV on
amd64 due to protection key access violation.

vm_fault_hold() is renamed to vm_fault().  Fixed some nits in
trap_pfault()s like mis-interpreting Mach errors as errnos.  Removed
unneeded truncations of the fault addresses reported by hardware.

PR:	211924
Reviewed by:	alc
Discussed with:	jilles, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D21566
2019-09-27 18:43:36 +00:00
emaste
26118133a6 controlelf: tidy up option parsing
Sponsored by:	The FreeBSD Foundation
2019-09-27 18:39:05 +00:00