When disabling the MSIX IRQ vectors for a PCI device through the
LinuxKPI, make sure any old MSIX IRQ numbers are no longer visible to
the linux_pci_find_irq_dev() function else IRQs can be requested from
the wrong PCI device.
MFC after: 1 week
Sponsored by: Mellanox Technologies
summits at BSDCan and BSDCam in 2017.
The TCP Blackbox Recorder allows you to capture events on a TCP connection
in a ring buffer. It stores metadata with the event. It optionally stores
the TCP header associated with an event (if the event is associated with a
packet) and also optionally stores information on the sockets.
It supports setting a log ID on a TCP connection and using this to correlate
multiple connections that share a common log ID.
You can log connections in different modes. If you are doing a coordinated
test with a particular connection, you may tell the system to put it in
mode 4 (continuous dump). Or, if you just want to monitor for errors, you
can put it in mode 1 (ring buffer) and dump all the ring buffers associated
with the connection ID when we receive an error signal for that connection
ID. You can set a default mode that will be applied to a particular ratio
of incoming connections. You can also manually set a mode using a socket
option.
This commit includes only basic probes. rrs@ has added quite an abundance
of probes in his TCP development work. He plans to commit those soon.
There are user-space programs which we plan to commit as ports. These read
the data from the log device and output pcapng files, and then let you
analyze the data (and metadata) in the pcapng files.
Reviewed by: gnn (previous version)
Obtained from: Netflix, Inc.
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D11085
Don't limit the copy to the size of the target string *pointer* (always
4 on 32-bit / 8 on 64-bit). Instead, just use strdup().
Reported by: Coverity
CID: 1386912
Reviewed by: cem, imp
MFC after: 1 week
drm_modeset_ctl() takes a signed in from userland, does a boundscheck,
and then uses it to index into a structure and write to it. The
boundscheck only checks upper bound, and never checks for nagative
values. If the int coming from userland is negative [after conversion]
it will bypass the boundscheck, perform a negative index into an array
and write to it, causing memory corruption.
Note that this is in the "old" drm driver; this issue does not exist
in drm2.
Reported by: Ilja Van Sprundel <ivansprundel@ioactive.com>
Reviewed by: cem
MFC after: 1 day
Sponsored by: The FreeBSD Foundation
drm_infobufs() has a structure on the stack, fills it out and copies it
to userland. There are 2 elements in the struct that are not filled out
and left uninitialized. This will leak uninitialized kernel stack data
to userland.
Submitted by: Domagoj Stolfa <ds815@cam.ac.uk>
Reported by: Ilja Van Sprundel <ivansprundel@ioactive.com>
MFC after: 1 day
Security: Kernel memory disclosure (798)
ibcs2_getdents() copies a dirent structure to userland. The ibcs2
dirent structure contains a 2 byte pad element. This element is never
initialized, but copied to userland none-the-less.
Note that ibcs2 has not built on HEAD since r302095.
Submitted by: Domagoj Stolfa <ds815@cam.ac.uk>
Reported by: Ilja Van Sprundel <ivansprundel@ioactive.com>
MFC after: 3 days
Security: Kernel memory disclosure (803)
The object lock was only needed when attempting to free B_DIRECT
buffer pages, and for testing for invalid pages (and freeing them
if so). Handle the latter by instead moving invalid pages near the head
of the inactive queue, where they will be reclaimed quickly.
Reviewed by: alc, kib, jeff
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D14778
from entering the TIME_WAIT state. However, it omits sending the ACK for the
FIN, which results in RST. This becomes a bigger deal if the sysctl
net.inet.tcp.blackhole is 2. In this case RST isn't send, so the other side of
the connection (also local) keeps retransmitting FINs.
To fix that in tcp_twstart() we will not call tcp_close() immediately. Instead
we will allocate a tcptw on stack and proceed to the end of the function all
the way to tcp_twrespond(), to generate the correct ACK, then we will drop the
last PCB reference.
While here, make a few tiny improvements:
- use bools for boolean variable
- staticize nolocaltimewait
- remove pointless acquisiton of socket lock
Reported by: jtl
Reviewed by: jtl
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D14697
The upstream repository is on github BLAKE2/libb2. Files landed in
sys/contrib/libb2 are the unmodified upstream files, except for one
difference: secure_zero_memory's contents have been replaced with
explicit_bzero() only because the previous implementation broke powerpc
link. Preferential use of explicit_bzero() is in progress upstream, so
it is anticipated we will be able to drop this diff in the future.
sys/crypto/blake2 contains the source files needed to port libb2 to our
build system, a wrapped (limited) variant of the algorithm to match the API
of our auth_transform softcrypto abstraction, incorporation into the Open
Crypto Framework (OCF) cryptosoft(4) driver, as well as an x86 SSE/AVX
accelerated OCF driver, blake2(4).
Optimized variants of blake2 are compiled for a number of x86 machines
(anything from SSE2 to AVX + XOP). On those machines, FPU context will need
to be explicitly saved before using blake2(4)-provided algorithms directly.
Use via cryptodev / OCF saves FPU state automatically, and use via the
auth_transform softcrypto abstraction does not use FPU.
The intent of the OCF driver is mostly to enable testing in userspace via
/dev/crypto. ATF tests are added with published KAT test vectors to
validate correctness.
Reviewed by: jhb, markj
Obtained from: github BLAKE2/libb2
Differential Revision: https://reviews.freebsd.org/D14662
An OCF-naive user program could use these primitives to implement HMAC, for
example. This would make the freed context sensitive data.
Probably other bzeros in this file should be explicit_bzeros as well.
Future work.
Reviewed by: jhb, markj
Differential Revision: https://reviews.freebsd.org/D14662 (minor part of a larger work)
"Under my tutelage Nicole did 85% of the work. At the time it seemed
simplest for a number of reasons to put my copyright on it. I now consider
that to have been a mistake."
Submitted by: Matthew Macy <mmacy@mattmacy.io>
Reviewed by: shurd
Approved by: shurd
Differential Revision: https://reviews.freebsd.org/D14766
through the lock-switching hoops.
A few of the INP lookup operations that lock INPs after the lookup do
so using this mechanism (to maintain lock ordering):
1. Lock lookup structure.
2. Find INP.
3. Acquire reference on INP.
4. Drop lock on lookup structure.
5. Acquire INP lock.
6. Drop reference on INP.
This change provides a slightly shorter path for cases where the INP
lock is uncontested:
1. Lock lookup structure.
2. Find INP.
3. Try to acquire the INP lock.
4. If successful, drop lock on lookup structure.
Of course, if the INP lock is contested, the functions will need to
revert to the previous way of switching locks safely.
This saves a few atomic operations when the INP lock is uncontested.
Discussed with: gallatin, rrs, rwatson
MFC after: 2 weeks
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D12911
On the Allwinner SoCs we need to set a custom endpoint configuration. To
allow for this use a table to store the configuration so the attachment
can override it.
Reviewed by: hselasky
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14783
Both sets are sorted in place, and with the introduction of read-only
permissions on the amd64 kernel text, the sorting override depended on
CR0.WP turned off. Make it correct by moving the sets into writeable
part of the KVA, also fixing boot on machines where hand-off from BIOS
to OS occurs with CR0.WP set.
Based on submission by: Peter Lei <peter.lei@ieee.org>
MFC after: 1 week
The general idea here is to provide userspace programs with well-defined
sources of entropy, in a fashion that doesn't require opening a new file
descriptor (ulimits) or accessing paths (/dev/urandom may be restricted
by chroot or capsicum).
getrandom(2) is the more general API, and comes from the Linux world.
Since our urandom and random devices are identical, the GRND_RANDOM flag
is ignored.
getentropy(3) is added as a compatibility shim for the OpenBSD API.
truss(1) support is included.
Tests for both system calls are provided. Coverage is believed to be at
least as comprehensive as LTP getrandom(2) test coverage. Additionally,
instructions for running the LTP tests directly against FreeBSD are provided
in the "Test Plan" section of the Differential revision linked below. (They
pass, of course.)
PR: 194204
Reported by: David CARLIER <david.carlier AT hardenedbsd.org>
Discussed with: cperciva, delphij, jhb, markj
Relnotes: maybe
Differential Revision: https://reviews.freebsd.org/D14500
flag and both the regular and "no" names, instead of two different string
arrays whose indices need to match the flag's bit position. This makes
them similar to the say "jailsys" options are represented.
Loop through either kind of option array with a structure pointer rather
then an integer index.
Currently each bfp descriptor uses u64 variables to maintain its counters.
On interfaces with high packet rate this leads to unnecessary contention
and inaccurate reporting.
PR: kern/205320
Reported by: elofu17 at hotmail.com
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D14726
do this right, except when there's no BP and we do a TUR by request.
In that case, we clear the flag, but don't release the reference,
leaking the reference on rare occasion.
PR: 226510
Sponsored by: Netflix
were the new kids on the block and F00F hacks were all the rage, one
needed to take out Giant to do anything moderately complicated with
the VM, mappings and such. So the pccard / cardbus code held Giant for
the entire insertion or removal process.
Today, the VM is MP safe. The lock is only needed for dealing with
newbus things. Move locking and unlocking Giant to be only around
adding and probing devices in pccard and cardbus.
<machine/stdarg.h> is a kernel-only header. The standard header for
userland is <stdarg.h>. Using the standard header in userland avoids
weird build errors when building with external compilers that include
their own stdarg.h header.
Reviewed by: arichardson, brooks, imp
Sponsored by: DARPA / AFRL
Differential Revision: https://reviews.freebsd.org/D14776
I accidentally swapped 'linux_fixup_elf' to 'linux_elf_fixup' in amd64's
declaration (only), while bringing this change over from git and
encountering a conflict.
assym is only to be included by other .s files, and should never
actually be assembled by itself.
Reviewed by: imp, bdrewery (earlier)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D14180
context switch code.
Some BIOSes give control to the OS with CR0.WP already set, making the
kernel text read-only before cpu_startup().
Reported by: Peter Lei <peter.lei@ieee.org>
Reviewed by: jtl
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D14768
This is a pure syntax patch to create an interface to enable and later
restore write access to the kernel text and other read-only mapped
regions. It is in line with e.g. vm_fault_disable_pagefaults() by
allowing the nesting.
Discussed with: Peter Lei <peter.lei@ieee.org>
Reviewed by: jtl
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D14768
When using hardware crypto engines, the callback functions used to handle
an IPsec packet after it has been encrypted or decrypted can be invoked
asynchronously from a worker thread that is not associated with a vnet.
Extend 'struct xform_data' to include a vnet pointer and save the current
vnet in this new member when queueing crypto requests in IPsec. In the
IPsec callback routines, use the new member to set the current vnet while
processing the modified packet.
This fixes a panic when using hardware offload such as ccr(4) with IPsec
after VIMAGE was enabled in GENERIC.
Reported by: Sony Arpita Das and Harsh Jain @ Chelsio
Reviewed by: bz
MFC after: 1 week
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D14763
It is possible to provide insane values for size in contigmalloc(9)
request, which usually not reaches the phys allocator due to failing
KVA allocation. But with the forthcoming 4/4 i386, where 32bit
architecture has almost 4G KVA, contigmalloc(1G) is not unreasonable
outright and KVA might be available sometimes.
Then, the calculation of pa_end could wrap around, depending on the
physical address, and the checks in vm_phys_alloc_seg_contig() would
pass while the iteration in the loop after the 'done' label goes out
of the vm_page_array bounds.
Fix it by detecting the wrap.
Reported and tested by: pho
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D14767
It is incomplete, has not been adopted in the other locking primitives,
and we have other means of measuring lock contention (lock_profiling,
lockstat, KTR_LOCK). Drop it to slightly de-clutter the mutex code and
free up a precious KTR class index.
Reviewed by: jhb, mjg
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D14771