pv list lock is the main bottleneck during poudriere -j 104 and
pmap_remove_pages is the most impactful consumer. It frees chunks with the lock
held even though it plays no role in correctness. Moreover chunks are often
freed in groups, sample counts during buildkernel (0-sized frees removed):
value ------------- Distribution ------------- count
0 | 0
1 | 8
2 |@@@@@@@ 19329
4 |@@@@@@@@@@@@@@@@@@@@@@ 58517
8 | 1085
16 | 71
32 |@@@@@@@@@@ 24919
64 | 899
128 | 7
256 | 2
512 | 0
Thus:
1. batch freeing
2. move it past unlocking pv list
Reviewed by: alc (previous version), markj (previous version), kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21832
including the TCP header in the first IP packet.
Enforce this in tcp_output(). In addition make sure that at least
one byte payload fits in the TCP segement to allow making progress.
Without this check, a kernel with INVARIANTS will panic.
This issue was found by running an instance of syzkaller.
Reviewed by: jtl@
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D21665
This API is still young enough that I would expect no one to be dependant on
this yet... Swap the ordering while it's young to match Linux values to
potentially ease implementation of linuxolator syscall, being able to reuse
existing constants.
r347183 bumped GEOM classes to SI_ORDER_SECOND to resolve a race between
them and the initialization of devsoftc.mtx in devinit, but missed this
dependency on g_flashmap that may now lose the race against GEOM
classes/g_init.
There's a great comment that describes the situation that has also been
updated with the new ordering of GEOM classes.
Reported by: bdragon
MFC after: 4 days
This driver is for the usb phy present on rockchip SoC.
It only support RK3399 and host mode for now.
The driver expose the usb clock needed by the usb controller.
On rockchip board it seems that the value in the DTS
are not enough for reseting the chip, I don't know if
the value are really incorrect or if DELAY is not precise
enough or if the rockchip gpio driver have some "lag" of some
kind or not.
For now just add more delay.
Module resets where not implemented when rockchip clocks were commited.
Implement them.
Since all resets registers are contiguous a driver only need to give
the start offset and the number of resets. This avoid to have to declare
every resets.
PLL_MIPI is the last important PLL that we missed.
Add support for it.
Since it's one of the possible parent for TCON0 also add this clock
now that we can.
While here add some info about what video related clocks should be
enabled at boot and with what frequency.
The REPRODUCIBLE_BUILD option is actually managed in two separate
files. src.opts.mk governs the setting for world builds and
kern.opts.mk governs it for kernel builds. r350550 only changed the
default for world builds.
Reported by: emaste
Reviewed by: emaste
Differential Revision: https://reviews.freebsd.org/D21444
initialisation, which is important when the host is dealing with a
SYN flood.
This affects the computation of the initial TCP sequence number for
the client side.
This has been discussed with secteam@.
Reviewed by: gallatin@
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D21616
for RACK specific socket options.
These issues were found by a syzkaller instance.
Reviewed by: rrs@
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D21825
and reinserting it back with an updated key.
This is one of dependencies for the upcoming stats(3) code.
Reviewed by: cem
Obtained from: Netflix
MFC after: 2 weeks
Sponsored by: Klara Inc, Netflix
Differential Revision: https://reviews.freebsd.org/D21786
Clang9/LLD9 appears to get quite confused with the instruction stream used
to obtain the tmpstack pointer, almost as though it thinks this is a C
function, so tries to optimize it. Since the AIM64 method doesn't use the
TOC to obtain the tmpstack, just follow that model, and lld won't get
confused.
Reported by: bdragon
MFC after: 2 weeks
On 64-bit platforms uintptr_t makes the copy twice as large as it should be.
This code isn't actually used in FreeBSD, since it's for guest mode only,
not hypervisor mode, but fixing it for completeness sake.
Reported by: bdragon (clang9 build)
- Ensure that the end of the mapping passed to vm_page_wire() is
page-aligned. vm_page_wire() expects this.
- Wire pages before reading data into them.
- Apply protections specified in the segment descriptor using
vm_map_protect() once relocation processing is done.
- On amd64, ensure that we load KLDs above KERNBASE, since they
are compiled with the "kernel" memory model by default.
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21756
We must also check for large mappings. pmap_page_is_mapped() is
mostly used in assertions, so the problem was not very noticeable.
Reviewed by: alc
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D21824
Software Kernel TLS needs to allocate a new destination crypto
buffer when encrypting data from the page cache, so as to avoid
overwriting shared clear-text file data with encrypted data
specific to a single socket. When the data is anonymous, eg, not
tied to a file, then we can encrypt in place and avoid allocating
a new page. This fixes a bug where the existing code always
assumes the data is private, and never encrypts in place. This
results in unneeded page allocations and potentially more memory
bandwidth consumption when doing socket writes.
When the code was written at Netflix, ktls_encrypt() looked at
private sendfile flags to determine if the pages being encrypted
where part of the page cache (coming from sendfile) or
anonymous (coming from sosend). This was broken internally at
Netflix when the sendfile flags were made private, and the
M_WRITABLE() check was added. Unfortunately, M_WRITABLE() will
always be false for M_NOMAP mbufs, since one cannot just mtod()
them.
This change introduces a new flags field to the mbuf_ext_pgs
struct by stealing a byte from the tls hdr. Note that the current
header is still 2 bytes larger than the largest header we
support: AES-CBC with explicit IV. We set MBUF_PEXT_FLAG_ANON
when creating an unmapped mbuf in m_uiotombuf_nomap() (which is
the path that socket writes take), and we check for that flag in
ktls_encrypt() when looking for anon pages.
Reviewed by: jhb
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21796
TLS 1.3 requires a few changes because 1.3 pretends to be 1.2
with a record type of application data. The "real" record type is
then included at the end of the user-supplied plaintext
data. This required adding a field to the mbuf_ext_pgs struct to
save the record type, and passing the real record type to the
sw_encrypt() ktls backend functions.
Reviewed by: jhb, hselasky
Sponsored by: Netflix
Differential Revision: D21801
The current mechanism is bogus in several ways:
- the limit is a percentage of total entries added, which means negative
entries get evicted all the time even if there are plenty of resources
- evicting code is almost not concurrent, which makes it unable to
remove entries fast enough when doing something as simple as -j 104
buildworld
- there is no support for performing mass removal if necessary
Vast majority of negative entries never get any hits. Only evicting
them when the filesystem demands it results in a significant growth of
the namecache with almost no improvement in the hit ratio.
Sample result about afer 90 minutes of poudriere -j 104:
current no evict % of the original
numneg 219737 2013157 916
numneghits 266711906 263544562 98 [1]
[1] this may look funny but there is a certain dose of variation to the
build
The number was chosen as something which mostly eliminates spurious
evictions during lighter workloads but still keeps the total at bay.
Sponsored by: The FreeBSD Foundation
Continue protecting demotion from the hotlist and selection of the
target list with the ncneg_shrink_lock lock, but drop it before
relocking to zap the node.
While here count how many times we skipped shrinking due to the lock
being already taken.
Sponsored by: The FreeBSD Foundation
Centralize calculation of signal and ucode delivered on unhandled page
fault in new function vm_fault_trap(). MD trap_pfault() now almost
always uses the signal numbers and error codes calculated in
consistent MI way.
This introduces the protection fault compatibility sysctls to all
non-x86 architectures which did not have that bug, but apparently they
were already much more wrong in selecting delivered signals on
protection violations.
Change the delivered signal for accesses to mapped area after the
backing object was truncated. According to POSIX description for
mmap(2):
The system shall always zero-fill any partial page at the end of an
object. Further, the system shall never write out any modified
portions of the last page of an object which are beyond its
end. References within the address range starting at pa and
continuing for len bytes to whole pages following the end of an
object shall result in delivery of a SIGBUS signal.
An implementation may generate SIGBUS signals when a reference
would cause an error in the mapped object, such as out-of-space
condition.
Adjust according to the description, keeping the existing
compatibility code for SIGSEGV/SIGBUS on protection failures.
For situations where kernel cannot handle page fault due to resource
limit enforcement, SIGBUS with a new error code BUS_OBJERR is
delivered. Also, provide a new error code SEGV_PKUERR for SIGSEGV on
amd64 due to protection key access violation.
vm_fault_hold() is renamed to vm_fault(). Fixed some nits in
trap_pfault()s like mis-interpreting Mach errors as errnos. Removed
unneeded truncations of the fault addresses reported by hardware.
PR: 211924
Reviewed by: alc
Discussed with: jilles, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D21566
vm_page_swapqueue() atomically transitions a page between queues. To do
so, it must hold the page queue lock for the old queue. However, once
the queue index has been updated, the queue lock no longer protects the
page's queue state. Thus, we must speculatively remove the page from
the old queue before committing the queue state update, and roll back if
the update fails.
Reported and tested by: pho
Reviewed by: kib
Sponsored by: Intel, Netflix
Differential Revision: https://reviews.freebsd.org/D21791
Now, vm_page_busy_sleep() expects the page's object to be locked.
vm_object_unwire() does some unusual lazy locking of the object chain
and keeps objects locked until a busy page is encountered or the loop
terminates. When a busy page is encountered, rather than unlocking all
but the "bottom-level" object, we must instead skip the object to which
"tm" belongs.
Reported and tested by: pho
Reviewed by: kib
Discussed with: jeff
Sponsored by: Intel, Netflix
Differential Revision: https://reviews.freebsd.org/D21790
No functional change intended.
The intent is to add a "legacy" e820 pmem newbus bus for nvdimm device in a
subsequent revision, and it's a little more clear if the parent buses get
independent source files.
Quite a lot of ACPI-specific logic is left in nvdimm.c; disentangling that
is a much larger change (and probably not especially useful).
Reviewed by: kib
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D21813
When a VFS option passed to nmount is present but NULL the kernel will
place an empty option in its internal list. This will have a NULL
pointer and a length of 0. When we come to read one of these the kernel
will try to load from the last address of virtual memory. This is
normally invalid so will fault resulting in a kernel panic.
Fix this by checking if the length is valid before dereferencing.
MFC after: 3 days
Sponsored by: DARPA, AFRL
Those functions are used by kernel, and we can't check all possible argument
errors in production kernel. Plus according to docs many of those errors
are checked by hardware. Assertions should just help with code debugging.
MFC after: 2 weeks
use epoch don't need Makefile tweaks.
The downside is that any developer who wants EPOCH_TRACE needs to
rebuild kernel in full, but that's fine.
Reviewed by: imp
exec_map_first_page() would unconditionally free an unbacked, invalid
page from the executable image. However, it is possible that the page
is wired, in which case it is incorrect to free the page, so check for
additional wirings first.
Reported by: syzkaller
Tested by: pho
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21767
Add an atomic shm rename operation, similar in spirit to a file
rename. Atomically unlink an shm from a source path and link it to a
destination path. If an existing shm is linked at the destination
path, unlink it as part of the same atomic operation. The caller needs
the same permissions as shm_unlink to the shm being renamed, and the
same permissions for the shm at the destination which is being
unlinked, if it exists. If those fail, EACCES is returned, as with the
other shm_* syscalls.
truss support is included; audit support will come later.
This commit includes only the implementation; the sysent-generated
bits will come in a follow-on commit.
Submitted by: Matthew Bryan <matthew.bryan@isilon.com>
Reviewed by: jilles (earlier revision)
Reviewed by: brueffer (manpages, earlier revision)
Relnotes: yes
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D21423
syn cache overflows. Whether this is due to an attack or due to the system
having more legitimate connections than the syn cache can hold, this
situation can quickly impact performance.
To make the system perform better during these periods, the code will now
switch to exclusively using cookies until the syn cache stops overflowing.
In order for this to occur, the system must be configured to use the syn
cache with syn cookie fallback. If syn cookies are completely disabled,
this change should have no functional impact.
When the system is exclusively using syn cookies (either due to
configuration or the overflow detection enabled by this change), the
code will now skip acquiring a lock on the syn cache bucket. Additionally,
the code will now skip lookups in several places (such as when the system
receives a RST in response to a SYN|ACK frame).
Reviewed by: rrs, gallatin (previous version)
Discussed with: tuexen
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D21644
rather than indirectly through the backpointer to the tcp_syncache
structure stored in the hashtable bucket.
This also allows us to remove the requirement in syncookie_generate()
and syncookie_lookup() that the syncache hashtable bucket must be
locked.
Reviewed by: gallatin, rrs
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D21644
use of this parameter was removed in r313330. This commit now removes
passing this now-unused parameter.
Reviewed by: gallatin, rrs
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D21644
We have two ways to check if kenv variable exists - either we check return
value from TUNABLE_INT_FETCH, or we pre-initialize the variable and check
if this value did change. In terminal_init() it is more convinient to
use pre-initialized variables.
Problem was revealed by older loader.efi, which did not set teken.* variables.
Reported by: tuexen
The TUNABLE_INT_FETCH is macro around getenv_int() and we will get
return value 0 or 1 for failure or success, we can use it to decide
which background color to use.