sv_maxuser specifies the maximum addressable space for user space. Presently
this is all 64-bits worth, which is impossible for a 32-bit process.
This bug has existed since the initial import of powerpc64 in 2010.
MFC after: 2 weeks
This is more consistent with the rest of the function and lets us
unindent most of the function.
Reviewed by: kib
MFC after: 1 month
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D20897
of pmap_load_clear(), in places where we don't care about the page table
entry's prior contents.
Eliminate an unnecessary pmap_load() from pmap_remove_all(). Instead, use
the value returned by the pmap_load_clear() on the very next line. (In the
future, when we support "hardware dirty bit management", using the value
from the pmap_load() rather than the pmap_load_clear() would have actually
been an error because the dirty bit could potentially change between the
pmap_load() and the pmap_load_clear().)
A KASSERT() in pmap_enter(), which originated in the amd64 pmap, was meant
to check the value returned by the pmap_load_clear() on the previous line.
However, we were ignoring the value returned by the pmap_load_clear(), and
so the KASSERT() was not serving its intended purpose. Use the value
returned by the pmap_load_clear() in the KASSERT().
MFC after: 2 weeks
Add VMBus protocol version 4.0. and 5.0 to support Windows 10 and newer HyperV hosts.
For VMBus 4.0 and newer HyperV, the netvsc gpadl teardown must be done after vmbus close.
Submitted by: whu
MFC after: 2 weeks
Sponsored by: Microsoft
r348164 added code to iicbus_request_bus/iicbus_release_bus to automatically
call device_busy()/device_unbusy() as part of aquiring exclusive use of the
bus (so modules can't be unloaded while the bus is exclusively owned and/or
IO is in progress). That broke the ability to do i2c IO from a slave device
probe method, because the slave isn't attached yet, so calling device_busy()
triggers a sanity-check panic for trying to busy a non-attached device.
Now we check whether the device status is < DS_ATTACHING, and if so we busy
the iicbus rather than the slave device. I think this leaves a small window
where a module could be unloaded while probing is in progress. But I think
that's true of all devices, and probably should be fixed by introducing a
DS_PROBING state for devices, and handling that at various points in the
newbus code.
Eliminate the TIMEDOUT state. This state really conveyed two different
concepts: I timed out during recovery (and my command got put on the
recovery queue), and I timed out diring discovery (which doesn't).
Separate those two concepts into two flags. Use the TIMEDOUT flag to
fail requests as timed out. Use the on queue flag to remove them from
the queue.
In mps_intr_locked for MPI2_RPY_DESCRIPT_FLAGS_ADDRESS_REPLY message
type, when completing commands, ignore the ones that are not in state
INQUEUE. They were already completed as part of the recovery
process. When we complete them twice, we wind up with entries on the
free queue that are marked as busy, trigging asserts.
Reviewed by: scottl (earlier version, just for mpr)
Differential Revision: https://reviews.freebsd.org/D20785
The hold_count and wire_count fields of struct vm_page are separate
reference counters with similar semantics. The remaining essential
differences are that holds are not counted as a reference with respect
to LRU, and holds have an implicit free-on-last unhold semantic whereas
vm_page_unwire() callers must explicitly determine whether to free the
page once the last reference to the page is released.
This change removes the KPIs which directly manipulate hold_count.
Functions such as vm_fault_quick_hold_pages() now return wired pages
instead. Since r328977 the overhead of maintaining LRU for wired pages
is lower, and in many cases vm_fault_quick_hold_pages() callers would
swap holds for wirings on the returned pages anyway, so with this change
we remove a number of page lock acquisitions.
No functional change is intended. __FreeBSD_version is bumped.
Reviewed by: alc, kib
Discussed with: jeff
Discussed with: jhb, np (cxgbe)
Tested by: pho (previous version)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19247
table VCTRL registers.
Unconditionally program the MSI-X vector control Mask field for MSI-X
table entries without regarud for Mask's previous value. Some devices
return all zeros on reads of the VCTRL registers, which would cause us
to skip disabling interrupts. This fixes the Samsung SM961/PM961 SSDs
which are return zero starting from offset 0x3084 within the memory
region specified by BAR0, even when they are active MSI-X vectors.
The Illumos kernel writes these unconditionally to 0 or 1. However,
section 6.8.2.9 of the PCI Local Bus 3.0 spec (dated Feb 3, 2004)
states for bits 31::01:
After reset, the state of these bits must be 0. However, for
potential future use, software must preserve the value of
these reserved bits when modifying the value of other Vector
Control bits. If software modifies the value of these reserved
bits, the result is undefined."
so we always set or clear the Mask bit, but otherwise preserves the
old value.
PR: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211713
Reviewed By: imp, jhb
Submitted by: Ka Ho Ng
MFC After: 1 week
Differential Revision: https://reviews.freebsd.org/D20873
While at it fix an invalid memory access issue when attaching external
USB HUBs, which are not mapped by ACPI, due to missing status check
when calling AcpiGetObjectInfo() from acpi_usb_hub_port_probe_cb().
Sponsored by: Mellanox Technologies
Pages with PG_PCPU_CACHE set cannot have been allocated from a
reservation, so as an optimization, skip the call to
vm_reserv_free_page() in this case. Otherwise, the access of
the corresponding reservation structure often results in a cache
miss.
Reviewed by: alc, kib
Discussed with: jeff
MFC after: 2 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20859
Some workloads benefit from having a per-CPU cache for
VM_FREEPOOL_DIRECT pages.
Reviewed by: dougm, kib
Discussed with: alc, jeff
MFC after: 2 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20858
When the system has no graphical console, such as bhyve in common
configurations, ignore kern.vt.splash_cpu, instead of panicking
on INVARIANTS kernels.
Reviewed by: cem dumbbell
MFC after: 1 week
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D20877
Although PPC SLB code doesn't handle allocation failures,
which are rare, in most places it asserts that the pointer
returned by uma_zalloc() is not NULL, making it easier to
identify the failure and avoiding an invalid pointer dereference.
This change simply adds a missing KASSERT in SLB code.
These fields will not be equal only in case if bigalloc filesystem feature is turned on.
This feature is not supported for now.
Reported by: Christopher Krah, Thomas Barabosch, and Jan-Niclas Hilgert of Fraunhofer FKIE
Reported as: FS-27-EXT2-12: Denial of Service in openat-0 (vm_fault_hold/ext2_clusteracct)
MFC after: 2 weeks
The ext2fs fragments are different from ufs fragments.
In case of ext2fs the fragment should be equal or more then block size.
The values more than block size are used only in case of bigalloc feature, which is does not supported for now.
Reported by: Christopher Krah, Thomas Barabosch, and Jan-Niclas Hilgert of Fraunhofer FKIE
Reported as: FS-22-EXT2-9: Denial of service in ftruncate-0 (ext2_balloc)
MFC after: 2 weeks
Reported by: Christopher Krah, Thomas Barabosch, and Jan-Niclas Hilgert of Fraunhofer FKIE
Reported as: FS-11-EXT2-6: Denial Of Service in write-1 (ext2_balloc)
MFC after: 2 weeks
comment. Rewrite that comment to improve its clarity.
Reported by: cem
Reviewed by: alc, cem
Approved by: kib, markj (mentors, implicit)
Differential Revision: https://reviews.freebsd.org/D20871
1. Use _pmap_alloc_l3() instead of pmap_alloc_l3() in order to handle the
possibility that a superpage mapping for "va" was created while we slept.
(This is derived from the amd64 version.)
2. Eliminate code for allocating kernel page table pages. Kernel page
table pages are preallocated by pmap_growkernel().
3. Eliminate duplicated unlock operations when KERN_RESOURCE_SHORTAGE is
returned.
MFC after: 2 weeks
after the one where the possible block allocation begins, and allocate
a larger number of blocks than the current limit. This does not affect
the limit on minimum allocation size, which still cannot exceed
BLIST_MAX_ALLOC.
Use this change to modify swp_pager_getswapspace and its callers, so
that they can allocate more than BLIST_MAX_ALLOC blocks if they are
available.
Tested by: pho
Approved by: markj (mentor)
Differential Revision: https://reviews.freebsd.org/D20579
restructure cache_handle_range so that all of the data cache operations are
performed before any instruction cache operations. Then, we only need one
barrier between the data and instruction cache operations and one barrier
after the instruction cache operations.
On an Amazon EC2 a1.2xlarge instance, this simple change reduces the time
for a "make -j8 buildworld" by 9%.
Reviewed by: andrew
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D20848
swap_pager_swapoff_object and swp_pager_force_pagein so that they can
page in multiple pages at a time to a swap device, rather than doing
one I/O operation for each page.
Tested by: pho
Submitted by: ota_j.email.ne.jp (Yoshihiro Ota)
Reviewed by: alc, markj, kib
Approved by: kib, markj (mentors)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D20635
vm_page_dirty() when, in fact, we are write protecting the page and the L3
entry has PTE_D set. However, pmap_protect() was always calling
vm_page_dirty() when an L2 entry has PTE_D set. Handle L2 entries the
same as L3 entries so that we won't perform unnecessary calls to
vm_page_dirty().
Simplify the loop calling vm_page_dirty() on L2 entries.
Print the adapter name rather than the address of the adapter
to avoid kernel address leakage.
PR: Bug 238642
Submitted by: Fuqian Huang <huangfq.daxian@gmail.com>
Reviewed by: vmaffione
MFC after: 1 week
New system calls between 2.6.32 and 2.6.26 are already implemented.
This should be mostly NFC as far as contemporary Linux applications are
concerned though, as Linux kernel 3.2 is the oldest supported by a
number of popular distros today; work is in progress by others to enable
support for those applications.
Discussed with: trasz
MFC after: 1 month
Linux man(1) calls it for no good reason; this avoids the console spam
(eg '(man): ioctl fd=4, cmd=0x660b ('f',11) is not implemented').
Reviewed by: emaste
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20690
in some cases (strace -f man id > /dev/null).
Reviewed by: dchagin
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20691
return something reasonable, and helps linux binaries which attempt
to close all the files, eg apt(8).
Reviewed by: emaste
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20692
We were otherwise failing to call funsetown() for some descriptors
associated with a tty, such as pts descriptors. Then, if the
descriptor is closed before the owner exits, we may get memory
corruption.
Reported by: syzbot+c9b6206303bf47bac87e@syzkaller.appspotmail.com
Reviewed by: ed
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
All MMCBR bridges have to implement all the MMCBR variables. This
implements them for everybody that currently doesn't.
A common routine for this should be written.
XCHAN_CAP_BOUNCE.
The only application that uses bounce buffering for now is the Government
Furnished Equipment (GFE) P2's dma core (AXIDMA) with its own dedicated
cacheless bounce buffer.
Sponsored by: DARPA, AFRL
There was an issue in pseries llan driver, that resulted in the first 2 bytes
of the MAC address getting stripped, and the last 2 being always 0.
In most cases the network interface still worked, despite the MAC being
different of what was specified to QEMU, but when some other host or DHCP
server expected a specific MAC, this would fail.
This change fixes this by shifting right by 2 the local-mac-address read from
device tree, if its length is 6 instead of 8, as observed in QEMU DT, that
always presents a 6 bytes value for this property.
PR: 237471
Reported by: Alfredo Dal'Ava Junior
Reviewed by: jhibbits
Differential Revision: https://reviews.freebsd.org/D20843
Otherwise there is a window where they may be rescheduled. This
typically manifested as a page fault shortly after unloading if_iwm.ko.
Close the race by draining callouts after calling iwm_stop_device(),
which is also what Dragonfly does.
Change whitespace to reduce gratuitous diffs with Dragonfly.
Reported and tested by: seanc
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
pmap_ts_referenced returns a count, not a boolean, and is supposed to
have int as the return type not boolean_t.
This worked previously because boolean_t is an int typedef.
Discussed with: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Save the last callout function pointer (and its argument) executed
on each CPU for inspection by a debugger. Add a ddb `show callout_last`
command to show these pointers. Add a kernel module that I used
for testing that command.
Relocate `ce_migration_cpu` to reduce padding and therefore preserve
the size of `struct callout_cpu` (320 bytes on amd64) despite the
added members.
This should help diagnose reference-after-free bugs where the
callout's mutex has already been freed when `softclock_call_cc`
tries to unlock it.
You might hope that the pointer would still be available, but it
isn't. The argument to that function is on the stack (because
`softclock_call_cc` uses it later), and that might be enough in
some cases, but even then, it's very laborious. A pointer to the
callout is saved right before these newly added fields, but that
callout might have been freed. We still have the pointer to its
associated mutex, and the name within might be enough, but it might
also have been freed.
Reviewed by: markj jhb
MFC after: 2 weeks
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D20794
When QUEUE_MACRO_DEBUG_TRASH is configured, removing a queue element
invalidates its queue linkage pointers. vm_pageout_collect_batch()
was relying on these pointers remaining valid after a removal, so
modify it to fetch the next queued page before dequeuing the current
page.
Submitted by: Don Morris <dgmorris@earthlink.net>
Reviewed by: cem, vangyzen
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D20842
Previously the TOE code used its own custom unmapped mbufs via
EXT_FLAG_VENDOR1. The old version always wired the entire AIO request
buffer first for the duration of the AIO operation and constructed
multiple mbufs which used the wired buffer as an external buffer.
The new version determines how much room is available in the socket
buffer and only wires the pages needed for the available room building
chains of M_NOMAP mbufs. This means that a large AIO write will now
limit the amount of wired memory it uses to the size of the socket
buffer.
Reviewed by: gallatin, np
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D20839
LINUXKPI_VERSION macro is not defined for any compiled LinuxKPI code
which basically means __GFP_NOTWIRED is never checked when allocating
pages. This should work fine with the existing external DRM code as
long as the page wiring and unwiring is balanced.
MFC after: 3 days
Sponsored by: Mellanox Technologies
This was added for emulation of Linux's CDROMSUBCHNL, but allows
users with read access to a cd(4) device to overwrite kernel memory
provided that the driver detects some media present.
Reimplement CDROMSUBCHNL by bouncing the data from CDIOCREADSUBCHANNEL
through the linux_cdrom_subchnl structure passed from userspace.
admbugs: 768
Reported by: Alex Fortune
Security: CVE-2019-5602
Security: FreeBSD-SA-19:11.cd_ioctl
Fix a mis-merge when extracting the unmapped mbuf changes from
Netflix's in-kernel TLS changes where the call to the function that
freed the backing pages from an unmapped mbuf was missed.
Sponsored by: Chelsio Communications
Only free pages to the cache when they were allocated from that cache.
This mitigates rapid fragmentation of physical memory seen during
poudriere's dependency calculation phase. In particular, pages
belonging to broken reservations are no longer freed to the per-CPU
cache, so they get a chance to coalesce with freed pages during the
break. Otherwise, the optimized CoW handler may create object
chains in which multiple objects contain pages from the same
reservation, and the order in which we do object termination means
that the reservation is broken before all of those pages are freed,
so some of them end up in the per-CPU cache and thus permanently
fragment physical memory.
The flag may also be useful for eliding calls to vm_reserv_free_page(),
thus avoiding memory accesses for data that is likely not present
in the CPU caches.
Reviewed by: alc
Discussed with: jeff
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20763
feature bit.
In particular, allocate the bit to opt-out the image from implicit
PROTMAX enablement. Provide procctl(2) verbs to set and query
implicit PROTMAX handling. The knobs mimic the same per-image flag
and per-process controls for ASLR.
Reviewed by: emaste, markj (previous version)
Discussed with: brooks
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D20795
Previously we would attempt to unlock the socket buffer despite having
failed to lock it. Simply return an error instead: no resources need
to be released at this point, and doing so is consistent with
soreceive_generic().
PR: 238789
Submitted by: Greg Becker <greg@codeconcepts.com>
MFC after: 1 week
that node is also compatible with syscon. For instance,
Rockchip RK3399's GRF (General Register Files) is compatible
with simple-mfd as well as syscon and has devices like
usb2-phy, emmc-phy and pcie-phy etc. under it.
Reviewed by: manu
This patch is the driver for NTB hardware in AMD SoCs (ported from Linux)
and enables the NTB infrastructure like Doorbells, Scratchpads and Memory
window in AMD SoC. This driver has been validated using ntb_transport and
if_ntb driver already available in FreeBSD.
Submitted by: Rajesh Kumar <rajesh1.kumar@amd.com>
MFC after: 1 month
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D18774
is to notify the kernel that the file system is untrusted and it
should use more extensive checks on the file-system's metadata
before using it. This option is intended to be used when mounting
file systems from untrusted media such as USB memory sticks or other
externally-provided media.
It will initially be used by the UFS/FFS file system, but should
likely be expanded to be used by other file systems that may appear
on external media like msdosfs, exfat, and ext2fs.
Reviewed by: kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20786
If g_mirror_taste encountered an error at g_mirror_add_disk, it might
try to g_mirror_destroy the device with the G_MIRROR_DEVICE_FLAG_TASTING
flag still set. This would wait on a worker to complete the destruction
with g_mirror_try_destroy, but that function bails out if the tasting
flag is set, resulting in a deadlock. Clear the tasting flag before
trying to destroy the device.
Test Plan:
sysctl debug.fail_point.mnowait="1%return"
kyua test -k /usr/tests/sys/geom/class/mirror/Kyuafile
Reviewed by: markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D20744
simple change to the control flow. Replace an unnecessary test by a
KASSERT. Add a comment explaining an obscure test.
Reviewed by: kib, markj
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D20812
otherwise we panic.
dwmmc don't handle VCCQ (voltage for the IO line of the SD/eMMC) or
TIMING.
Add the needed accessor in the {read,write}_ivar functions.
Reviewed by: imp (previous version)
This patch factors the code in vn_truncate() that does the actual
VOP_SETATTR() of size into a separate function called vn_truncate_locked().
This will allow the NFS server and the patch that adds a
copy_file_range(2) syscall to call this function instead of duplicating
the code and carrying over changes, such as the recent r347151.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D20808
This patch fixes 2 panics. The first one is due to the current VNET not
being set in the emulated adapter transmission path. The second one
is caused by the M_PKTHDR flag not being set when preallocated mbufs
are recycled in the transmit path.
Submitted by: aleksandr.fedorov@itglobal.com
Reviewed by: vmaffione
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D20824
The goal of this driver is consolidate information about SuperIO chips
and to provide for peaceful coexistence of drivers that need to access
SuperIO configuration registers.
While SuperIO chips can host various functions most of them are
discoverable and accessible without any knowledge of the SuperIO.
Examples are: keyboard and mouse controllers, UARTs, floppy disk
controllers. SuperIO-s also provide non-standard functions such as
GPIO, watchdog timers and hardware monitoring. Such functions do
require drivers with a knowledge of a specific SuperIO.
At this time the driver supports a number of ITE and Nuvoton (fka
Winbond) SuperIO chips.
There is a single driver for all devices. So, I have not done the usual
split between the hardware driver and the bus functionality. Although,
superio does act as a bus for devices that represent known non-standard
functions of a SuperIO chip. The bus provides enumeration of child
devices based on the hardcoded knowledge of such functions. The
knowledge as extracted from datasheets and other drivers.
As there is a single driver, I have not defined a kobj interface for it.
So, its interface is currently made of simple functions.
I think that we can the flexibility (and complications) when we actually
need it.
I am planning to convert nctgpio and wbwd to superio bus very soon.
Also, I am working on itwd driver (watchdog in ITE SuperIO-s).
Additionally, there is ithwm driver based on the reverted sensors
import, but I am not sure how to integrate it given that we still lack
any sensors interface.
Discussed with: imp, jhb
MFC after: 7 weeks
Differential Revision: https://reviews.freebsd.org/D8175
That is, instead of the current GPIO00 - GPIO15 the names will be GPIO00
- GPIO07, GPIO10 - GPIO17. The first digit is a GPIO "bank" / group
number and the second one is a pin number within the bank. Alternative
view is that the pin names are changed from decimal numbering scheme to
octal one (as there are 8 pins per bank).
Discussed with: cem, gonzo
MFC after: 2 weeks
With more ports, some of the registers are shifted a bit to accommodate.
This switch also adds two high speed Serdes/SGMII interfaces (2.5 Gb/s).
Sponsored by: Rubicon Communications, LLC (Netgate)
After this change sys/bus.h includes sys/systm.h when _KERNEL is
defined.
This brings back r349459 but with systm.h hidden from userland.
MFC after: 2 weeks
fget_mmap() translates rights on the descriptor to a VM protection
mask. It was doing so without holding any locks on the descriptor
table, so a writer could simultaneously be modifying those rights.
Such a situation would be detected using a sequence counter, but
not before an inconsistency could trigger assertion failures in
the capability code.
Fix the problem by copying the fd's rights to a structure on the stack,
and perform the translation only once we know that that snapshot is
consistent.
Reported by: syzbot+ae359438769fda1840f8@syzkaller.appspotmail.com
Reviewed by: brooks, mjg
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20800
We use PIPE_DIRECTW as a semaphore for direct writes to a pipe, where
the reader copies data directly from pages mapped into the writer.
However, when a reader finishes such a copy, it previously cleared
PIPE_DIRECTW, allowing multiple writers to race and corrupt the state
used to track wired pages belonging to the writer.
Fix this by having the writer clear PIPE_DIRECTW and instead use the
count of unread bytes to determine whether a write is finished.
Reported by: syzbot+21811cc0a89b2a87a9e7@syzkaller.appspotmail.com
Reviewed by: kib, mjg
Tested by: pho
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20784
Since cxgbe(4) uses sglist instead of bus_dma, this required updates
to the code that generates scatter/gather lists for packets. Also,
unmapped mbufs are always sent via DMA and never as immediate data in
the payload of a work request.
Submitted by: gallatin (earlier version)
Reviewed by: gallatin, hselasky, rrs
Discussed with: np
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20616
Enable IFCAP_NOMAP for a vlan interface if it is supported by the
underlying trunk device.
Reviewed by: gallatin, hselasky, rrs
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20616
Apply similar logic from sbcompress to pending data in the socket
buffer once it is marked ready via sbready. Normally sbcompress
merges small mbufs to reduce the length of mbuf chains in the socket
buffer. However, sbcompress cannot do this for mbufs marked
M_NOTREADY. sbcompress_ready is now called from sbready when mbufs
are marked ready to merge small mbuf chains once the data is available
to copy.
Submitted by: gallatin (earlier version)
Reviewed by: gallatin, hselasky, rrs
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20616
This can be enabled at runtime via the kern.ipc.mb_use_ext_pgs sysctl.
It is disabled by default.
Submitted by: gallatin (earlier version)
Reviewed by: gallatin, hselasky, rrs
Relnotes: yes
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20616
Unmapped mbufs allow sendfile to carry multiple pages of data in a
single mbuf, without mapping those pages. It is a requirement for
Netflix's in-kernel TLS, and provides a 5-10% CPU savings on heavy web
serving workloads when used by sendfile, due to effectively
compressing socket buffers by an order of magnitude, and hence
reducing cache misses.
For this new external mbuf buffer type (EXT_PGS), the ext_buf pointer
now points to a struct mbuf_ext_pgs structure instead of a data
buffer. This structure contains an array of physical addresses (this
reduces cache misses compared to an earlier version that stored an
array of vm_page_t pointers). It also stores additional fields needed
for in-kernel TLS such as the TLS header and trailer data that are
currently unused. To more easily detect these mbufs, the M_NOMAP flag
is set in m_flags in addition to M_EXT.
Various functions like m_copydata() have been updated to safely access
packet contents (using uiomove_fromphys()), to make things like BPF
safe.
NIC drivers advertise support for unmapped mbufs on transmit via a new
IFCAP_NOMAP capability. This capability can be toggled via the new
'nomap' and '-nomap' ifconfig(8) commands. For NIC drivers that only
transmit packet contents via DMA and use bus_dma, adding the
capability to if_capabilities and if_capenable should be all that is
required.
If a NIC does not support unmapped mbufs, they are converted to a
chain of mapped mbufs (using sf_bufs to provide the mapping) in
ip_output or ip6_output. If an unmapped mbuf requires software
checksums, it is also converted to a chain of mapped mbufs before
computing the checksum.
Submitted by: gallatin (earlier version)
Reviewed by: gallatin, hselasky, rrs
Discussed with: ae, kp (firewalls)
Relnotes: yes
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20616