This effectively makes the stack base on the csu _start entry
randomized.
The gap is enabled if ASLR is for the ABI is enabled, and then
kern.elf{64,32}.aslr.stack_gap specify the max percentage of the
initial stack size that can be wasted for gap. Setting it to zero
disables the gap, and max is capped at 50%.
Only amd64 for now.
Reviewed by: cem, markj
Discussed with: emaste
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D21081
mapping and then destroy one of the 4 KB page mappings so that there is a
potential trigger for repromotion. Currently, we destroy the first 4 KB
page mapping that falls within the (current) superpage mapping or the
virtual address range [sva, eva). However, I have found empirically that
destroying the last 4 KB mapping produces slightly better results,
specifically, more promotions and fewer failed promotion attempts.
Accordingly, this revision changes pmap_advise() to destroy the last 4 KB
page mapping. It also replaces some nearby uses of boolean_t with bool.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D21115
It is assembled using "${CC} -x assembler-with-cpp", which by convention
(bsd.suffixes.mk) uses the .asm extension.
This is a portion of the review referenced below (D18344). That review
also renamed linux_support.s to .S, but that is a functional change
(using the compiler's integrated assembler instead of as) and will be
revisited separately.
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18344
The current implementation of gzipped a.out support was based
on a very old version of InfoZIP which ships with an ancient
modified version of zlib, and was removed from the GENERIC
kernel in 1999 when we moved to an ELF world.
PR: 205822
Reviewed by: imp, kib, emaste, Yoshihiro Ota <ota at j.email.ne.jp>
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D21099
if a demotion succeeds, then all of the 4KB page mappings within the
superpage-sized region must be valid, so there is no point in testing the
validity of the 4KB page mapping that is going to be write protected.
Deindent the nearby code.
Reviewed by: kib, markj
Tested by: pho (amd64, i386)
X-MFC after: r350004 (this change depends on arm64 dirty bit emulation)
Differential Revision: https://reviews.freebsd.org/D21027
Use 'struct bintime' instead of 'sbintime_t' to manage times in vPIT
to postpone rounding to final results rather than intermediate
results. In tests performed by Joyent, this reduced the error measured
by Linux guests by 59 ppm.
Smart OS bug: https://smartos.org/bugview/OS-6923
Submitted by: Patrick Mooney
Reviewed by: rgrimes
Obtained from: SmartOS / Joyent
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D20335
syscallret() doesn't use error anymore. Fix a few other places to permit
removing the return value from syscallenter() entirely.
- Remove a duplicated assertion from arm's syscall().
- Use td_errno for amd64_syscall_ret_flush_l1d.
Reviewed by: kib
MFC after: 1 month
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D2090
Casueword(9) on ll/sc architectures must be prepared for userspace
constantly modifying the same cache line as containing the CAS word,
and not loop infinitely. Otherwise, rogue userspace livelocks the
kernel.
To fix the issue, change casueword(9) interface to return new value 1
indicating that either comparision or store failed, instead of relying
on the oldval == *oldvalp comparison. The primitive no longer retries
the operation if it failed spuriously. Modify callers of
casueword(9), all in kern_umtx.c, to handle retries, and react to
stops and requests to terminate between retries.
On x86, despite cmpxchg should not return spurious failures, we can
take advantage of the new interface and just return PSL.ZF.
Reviewed by: andrew (arm64, previous version), markj
Tested by: pho
Reported by: https://xenbits.xen.org/xsa/advisory-295.txt
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D20772
hard-coded value. Don't allocate space for it from the kernel stack.
Account for prefix, suffix, and separator space in the name. This
takes the effective length up to 229 bytes on 13-current, and 37 bytes
on 12-stable. 37 bytes is enough to hold a full GUID string.
PR: 234134
MFC after: 1 week
Differential Revision: http://reviews.freebsd.org/D20924
The hold_count and wire_count fields of struct vm_page are separate
reference counters with similar semantics. The remaining essential
differences are that holds are not counted as a reference with respect
to LRU, and holds have an implicit free-on-last unhold semantic whereas
vm_page_unwire() callers must explicitly determine whether to free the
page once the last reference to the page is released.
This change removes the KPIs which directly manipulate hold_count.
Functions such as vm_fault_quick_hold_pages() now return wired pages
instead. Since r328977 the overhead of maintaining LRU for wired pages
is lower, and in many cases vm_fault_quick_hold_pages() callers would
swap holds for wirings on the returned pages anyway, so with this change
we remove a number of page lock acquisitions.
No functional change is intended. __FreeBSD_version is bumped.
Reviewed by: alc, kib
Discussed with: jeff
Discussed with: jhb, np (cxgbe)
Tested by: pho (previous version)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19247
in some cases (strace -f man id > /dev/null).
Reviewed by: dchagin
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20691
This patch is the driver for NTB hardware in AMD SoCs (ported from Linux)
and enables the NTB infrastructure like Doorbells, Scratchpads and Memory
window in AMD SoC. This driver has been validated using ntb_transport and
if_ntb driver already available in FreeBSD.
Submitted by: Rajesh Kumar <rajesh1.kumar@amd.com>
MFC after: 1 month
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D18774
simple change to the control flow. Replace an unnecessary test by a
KASSERT. Add a comment explaining an obscure test.
Reviewed by: kib, markj
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D20812
The goal of this driver is consolidate information about SuperIO chips
and to provide for peaceful coexistence of drivers that need to access
SuperIO configuration registers.
While SuperIO chips can host various functions most of them are
discoverable and accessible without any knowledge of the SuperIO.
Examples are: keyboard and mouse controllers, UARTs, floppy disk
controllers. SuperIO-s also provide non-standard functions such as
GPIO, watchdog timers and hardware monitoring. Such functions do
require drivers with a knowledge of a specific SuperIO.
At this time the driver supports a number of ITE and Nuvoton (fka
Winbond) SuperIO chips.
There is a single driver for all devices. So, I have not done the usual
split between the hardware driver and the bus functionality. Although,
superio does act as a bus for devices that represent known non-standard
functions of a SuperIO chip. The bus provides enumeration of child
devices based on the hardcoded knowledge of such functions. The
knowledge as extracted from datasheets and other drivers.
As there is a single driver, I have not defined a kobj interface for it.
So, its interface is currently made of simple functions.
I think that we can the flexibility (and complications) when we actually
need it.
I am planning to convert nctgpio and wbwd to superio bus very soon.
Also, I am working on itwd driver (watchdog in ITE SuperIO-s).
Additionally, there is ithwm driver based on the reverted sensors
import, but I am not sure how to integrate it given that we still lack
any sensors interface.
Discussed with: imp, jhb
MFC after: 7 weeks
Differential Revision: https://reviews.freebsd.org/D8175
when, in fact, we are write protecting the page and the PTE has PG_M set.
However, pmap_protect_pde() was always calling vm_page_dirty() when the PDE
has PG_M set. So, adding PG_NX to a writeable PDE could result in
unnecessary (but harmless) calls to vm_page_dirty().
Simplify the loop calling vm_page_dirty() in pmap_protect_pde().
Reviewed by: kib, markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D20793
This adds emulation for:
test r/m16, imm16
test r/m32, imm32
test r/m64, imm32 sign-extended to 64
OpenBSD guests compiled with clang 8.0.0 use TEST directly against a
Local APIC register instead of separate read via MOV followed by a
TEST against the register.
PR: 238794
Submitted by: jhb
Reported by: Jason Tubnor jason@tubnor.net
Tested by: Jason Tubnor jason@tubnor.net
Reviewed by: markj, Patrick Mooney patrick.mooney@joyent.com
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D20755
Use it to indicate whether the page may be safely freed following
its removal from the object. Also change vm_page_remove() to assume
that the page's object pointer is non-NULL, and have callers perform
this check instead.
This is a step towards an implementation of an atomic reference counter
for each physical page structure.
Reviewed by: alc, dougm, kib
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20758
When pmap_pkru_on_remove() is called, the sva argument value was
advanced. Clear PKRU earlier when sva still specifies the start of
the region.
Noted and reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
Port the code to block on turnstile instead of yielding, to lock-less
delayed invalidation. The yield might cause tight loop due to priority
inversion.
Since it is impossible to avoid race between block and wake-up, arm
1-tick callout to wakeup when thread blocks itself.
Reported and tested by: mjg
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 months
Differential revision: https://reviews.freebsd.org/D20636
translation units with differing capabilities
From the author via Bugzilla:
---
When an attempt is made to passthrough a PCI device to a bhyve VM
(causing initialisation of IOMMU) on certain Intel chipsets using
VT-d the PCI bus stops working entirely. This issue occurs on the
E3-1275 v5 processor on C236 chipset and has also been encountered
by others on the forums with different hardware in the Skylake
series.
The chipset has two VT-d translation units. The issue is caused by
an attempt to use the VT-d device-IOTLB capability that is
supported by only the first unit for devices attached to the
second unit which lacks that capability. Only the capabilities of
the first unit are checked and are assumed to be the same for all
units.
Attached is a patch to rectify this issue by determining which
unit is responsible for the device being added to a domain and
then checking that unit's device-IOTLB capability. In addition to
this a few fixes have been made to other instances where the first
unit's capabilities are assumed for all units for domains they
share. In these cases a mutual set of capabilities is determined.
The patch should hopefully fix any bugs for current/future
hardware with multiple translation units supporting different
capabilities.
A description is on the forums at
https://forums.freebsd.org/threads/pci-passthrough-bhyve-usb-xhci.65235
The thread includes observations by other users of the bug
occurring, and description as well as confirmation of the fix.
I'd also like to thank Ordoban for their help.
---
Personally tested on a Skylake laptop, Skylake Xeon server, and
a Xeon-D-1541, passing through XHCI and NVMe functions. Passthru
is hit-or-miss to the point of being unusable without this
patch.
PR: 229852
Submitted by: callum@aitchison.org
MFC after: 1 week
previously addressed in r348246.
This pmap problem also exists on arm64 and riscv. However, the original
solution developed for amd64 and i386 cannot be used on arm64 and riscv. In
particular, arm64 and riscv do not define a PG_PROMOTED flag in their level
2 PTEs. (A PG_PROMOTED flag makes no sense on arm64, where unlike x86 or
riscv we are required to break the old 4KB mappings before making the 2MB
mapping; and on riscv there are no unused bits in the PTE to define a
PG_PROMOTED flag.)
This commit implements an alternative solution that can be used on all four
architectures. Moreover, this solution has two other advantages. First, on
older AMD processors that required the Erratum 383 workaround, it is less
costly. Specifically, it avoids unnecessary calls to pmap_fill_ptp() on a
superpage demotion. Second, it enables the elimination of some calls to
pagezero() in pmap_kernel_remove_{l2,pde}().
In addition, remove a related stale comment from pmap_enter_{l2,pde}().
Reviewed by: kib, markj (an earlier version)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D20538
Convert the array to use C99 initializers.
Make it constant.
Replace MAX_TRAP_MSG with nitems().
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
These calls are not the same in general: the former will dequeue the
page if it is enqueued, while the latter will just leave it alone. But,
all existing uses of the former apply to unmanaged pages, which are
never enqueued in the first place. No functional change intended.
Reviewed by: kib
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20470
When disabling the last enabled userspace probe, fasttrap clears the
function pointers which hook in to the breakpoint handler. If a traced
thread hit a fasttrap breakpoint before it was removed, we must ensure
that it is able to call the hook; otherwise fasttrap will not consume
the trap and SIGTRAP will be delievered to the thread. Synchronize
with such threads by ensuring that they load the hook pointer with
interrupts disabled, and by completing an SMP rendezvous after removing
breakpoints and before clearing the pointers.
Reported by: Alexander Alexeev <Alexander.Alexeev@dell.com>
Tested by: Alexander Alexeev (earlier version)
Reviewed by: cem, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20526
pci_alloc_msix() requires both the table and PBA BARs to be allocated
by the driver. ppt was only allocating the table BAR so would fail
for devices with the PBA in a separate BAR. Fix this by allocating
the PBA BAR before pci_alloc_msix() if it is stored in a separate BAR.
While here, release BARs after calling pci_release_msi() instead of
before. Also, don't call bus_teardown_intr() in error handling code
if bus_setup_intr() has just failed.
Reported by: gallatin
Tested by: gallatin
Reviewed by: rgrimes, markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D20525
in the loss of a KASSERT that guarded against the invalidation a wired
mapping. Restore this KASSERT.
Remove an unnecessary KASSERT from pmap_demote_pde_locked(). It guards
against a state that was already handled at the start of the function.
Reviewed by: kib
X-MFC with: r348476
If service code faulted, we might end up unwinding with interrupts
disabled. Top-level kernel code should have interrupts enabled, which
is enforced by checks.
Save %rflags before entering EFI, and restore to the known good value
on return. This handles situation with disabled interrupts on fault
and perhaps other potential bugs, e.g. invalid value for PSL_D.
Reported and tested by: Jan Martin Mikkelsen <janm@transactionware.com>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
debugging checks.
In particular,
- Move the code to handle failure to allocate page table page into
a helper.
- After the previous item is done, it is possible to distinguish !PG_A
case and case of missed page, in the control flow.
- Make the variable to indicate that in-kernel mapping is demoted.
- Assert that missed page table page can only happen for in-kernel
mapping when demoting direct map.
- If DIAGNOSTIC is enabled, and the page table page should be already
filled, check all ptes instead of only first one.
Reviewed by: alc, markj
Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D20266
syscalls.conf is included using "." which per the Open Group:
If file does not contain a <slash>, the shell shall use the search
path specified by PATH to find the directory containing file.
POSIX shells don't fall back to the current working directory.
Submitted by: Nathaniel Wesley Filardo <nwf20@cl.cam.ac.uk>
Reviewed by: bdrewery
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D20476
tables which affect demotion.
The last last-level page table under 2M mappings below KERNend was
only partially initialized. When that page was used as the hardware
page table for demotion of the 2M mapping, the result was not
consistent. Since pmap_demote_pde() is switched to use PG_PROMOTED as
the test for the validity of the saved last level page table page, we
can keep page table pages zero-initialized instead. Demotion would
fill them as needed.
Only map the created page tables beyond KERNend, there is no need to
pre-promote PTmap after KERNend, because the extra mapping is not used.
Only round up *firstaddr to 2M boundary when it is below rounded
KERNend. Sometimes the allocpages() calls advance *firstaddr past the
end of the last 2MB page mapping. In that case, this conditional
avoids wasting an average of 1MB of physical memory.
Update comments to explain action in more clean and direct language.
Reported and tested by: pho
In collaboration with: alc
Sponsored by: The FreeBSD Foundation (kib)
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D20380
The upper bound for the valid address from the large map used
LARGEMAP_MAX_ADDRESS instead of LARGEMAP_MIN_ADDRESS. Provide a
function-like macro for proper upper value.
Noted by: markj
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D20386
Similar to PG_FRAME and PG_PS_FRAME, it denotes the mask of the
physical address component of 1G superpage PDP entry.
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D20386
It is possible for the kernel mapping to be created with superpage by
directly installing pde using pmap_enter_2mpage() without filling the
corresponding page table page. This can happen e.g. if the range is
already backed by reservation and vm_fault_soft_fast() conditions are
satisfied, which was observed on the pipe_map.
In this case, demotion must fill the page obtained from the pmap
radix, same as if the page is newly allocated. Use PG_PROMOTED bit as
an indicator that the page is valid, instead of the wire count of the
page table page.
Since the PG_PROMOTED bit is set on pde when we leave TLB entries for
4k pages around, which in particular means that the ptes were filled,
it provides more correct indicator. Note that pmap_protect_pde()
clears PG_PROMOTED, which handles the case when protection was changed
on the superpage without adjusting ptes.
Reported by: pho
In collaboration with: alc
Tested by: alc, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D20380
If MDS mitigation is enabled by the tunable but MDS microcode is not
early-loaded, software mitigation is selected. This causes
initializecpu() to try to allocate memory which makes boot process
very unhappy.
Create SYSINIT that runs sufficiently late to succeed.
Reported by: naddy
PR: 237968
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
This allows replacing "sys/eventfilter.h" includes with "sys/_eventfilter.h"
in other header files (e.g., sys/{bus,conf,cpu}.h) and reduces header
pollution substantially.
EVENTHANDLER_DECLARE and EVENTHANDLER_LIST_DECLAREs were moved out of .c
files into appropriate headers (e.g., sys/proc.h, powernv/opal.h).
As a side effect of reduced header pollution, many .c files and headers no
longer contain needed definitions. The remainder of the patch addresses
adding appropriate includes to fix those files.
LOCK_DEBUG and LOCK_FILE_LINE_ARG are moved to sys/_lock.h, as required by
sys/mutex.h since r326106 (but silently protected by header pollution prior
to this change).
No functional change (intended). Of course, any out of tree modules that
relied on header pollution for sys/eventhandler.h, sys/lock.h, or
sys/mutex.h inclusion need to be fixed. __FreeBSD_version has been bumped.
Submitted by: Patrick Mooney <pmooney@pfmooney.com>
Reviewed by: kib
Tested by: Patrick on SmartOS with Linux and Windows guests
Obtained from: Joyent
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D20296
Apparently, there is more code trying to call pmap_remove() early,
mostly to free preloaded memory. Instead of moving all deallocations
to the point where a scheduler is initialized, add missed setup of
thread0 di init at hammer_time().
The code in pmap_delayed_invl_start_u() is modified to not ever take
the thread lock if the thread priority is less or equal to PVM. Since
thread0 starts at priority 0, and then is reset to PVM at
proc0_init(), this eliminates taking the thread lock during early
boot.
While there, fix off by one in comparision of the base priority.
Reported and tested by: bcran (previous version)
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 29 days
In all practical situations, the resolver visibility is static.
Requested by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: so (emaste)
Differential revision: https://reviews.freebsd.org/D20281
This is done mostly for debugging in field. Also added the sysctl of
the same name to report used mode.
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
For machines having cmpxcgh16b instruction, i.e. everything but very
early Athlons, provide lockless implementation of delayed
invalidation.
The implementation maintains lock-less single-linked list with the
trick from the T.L. Harris article about volatile mark of the elements
being removed. Double-CAS is used to atomically update both link and
generation. New thread starting DI appends itself to the end of the
queue, setting the generation to the generation of the last element
+1. On DI finish, thread donates its generation to the previous
element. The generation of the fake head of the list is the last
passed DI generation. Basically, the implementation is a queued
spinlock but without spinlock.
Many thanks both to Peter Holm and Mark Johnson for keeping with me
while I produced intermediate versions of the patch.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
MFC note: td_md.md_invl_gen should go to the end of struct thread
Differential revision: https://reviews.freebsd.org/D19630
of the 'flags' argument to mmap(2) with Linux strace(1).
Reviewed by: dchagin
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20223
Microarchitectural buffers on some Intel processors utilizing
speculative execution may allow a local process to obtain a memory
disclosure. An attacker may be able to read secret data from the
kernel or from a process when executing untrusted code (for example,
in a web browser).
Reference: https://www.intel.com/content/www/us/en/security-center/advisory/intel-sa-00233.html
Security: CVE-2018-12126, CVE-2018-12127, CVE-2018-12130, CVE-2019-11091
Security: FreeBSD-SA-19:07.mds
Reviewed by: jhb
Tested by: emaste, lwhsu
Approved by: so (gtetlow)
of them listed in opt_global.h which is not generated while building
modules outside of a kernel and such modules never match real cofigured
kernel.
So, we should prevent our users from building obviously defective modules.
Therefore, remove the root cause of the building of modules outside of a
kernel - the possibility of building modules with DEBUG or KTR flags.
And remove all of DEBUG printfs as it is incomplete and in threaded
programms not informative, also a half of system call does not have DEBUG
printf. For debuging Linux programms we have dtrace, ktr and ktrace ability.
PR: 222861
Reviewed by: trasz
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D20178
Historically we have not distinguished between kernel wirings and user
wirings for accounting purposes. User wirings (via mlock(2)) were
subject to a global limit on the number of wired pages, so if large
swaths of physical memory were wired by the kernel, as happens with
the ZFS ARC among other things, the limit could be exceeded, causing
user wirings to fail.
The change adds a new counter, v_user_wire_count, which counts the
number of virtual pages wired by user processes via mlock(2) and
mlockall(2). Only user-wired pages are subject to the system-wide
limit which helps provide some safety against deadlocks. In
particular, while sources of kernel wirings typically support some
backpressure mechanism, there is no way to reclaim user-wired pages
shorting of killing the wiring process. The limit is exported as
vm.max_user_wired, renamed from vm.max_wired, and changed from u_int
to u_long.
The choice to count virtual user-wired pages rather than physical
pages was done for simplicity. There are mechanisms that can cause
user-wired mappings to be destroyed while maintaining a wiring of
the backing physical page; these make it difficult to accurately
track user wirings at the physical page layer.
The change also closes some holes which allowed user wirings to succeed
even when they would cause the system limit to be exceeded. For
instance, mmap() may now fail with ENOMEM in a process that has called
mlockall(MCL_FUTURE) if the new mapping would cause the user wiring
limit to be exceeded.
Note that bhyve -S is subject to the user wiring limit, which defaults
to 1/3 of physical RAM. Users that wish to exceed the limit must tune
vm.max_user_wired.
Reviewed by: kib, ngie (mlock() test changes)
Tested by: pho (earlier version)
MFC after: 45 days
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19908
This gets rid of the global cpu_ipi_pending array.
While replace cmpset with fcmpset in the delivery code and opportunistically
check if given IPI is already pending.
Sponsored by: The FreeBSD Foundation
Having IPSEC compiled into the kernel imposes a non-trivial
performance penalty on multi-threaded workloads due to IPSEC
refcounting. In my benchmarks of multi-threaded UDP
transmit (connected sockets), I've seen a roughly 20% performance
penalty when the IPSEC option is included in the kernel (16.8Mpps
vs 13.8Mpps with 32 senders on a 14 core / 28 HTT Xeon
2697v3)). This is largely due to key_addref() incrementing and
decrementing an atomic reference count on the default
policy. This cause all CPUs to stall on the same cacheline, as it
bounces between different CPUs.
Given that relatively few users use ipsec, and that it can be
loaded as a module, it seems reasonable to ask those users to
load the ipsec module so as to avoid imposing this penalty on the
GENERIC kernel. Its my hope that this will make FreeBSD look
better in "out of the box" benchmark comparisons with other
operating systems.
Many thanks to ae for fixing auto-loading of ipsec.ko when
ifconfig tries to configure ipsec, and to cy for volunteering
to ensure the the racoon ports will load the ipsec.ko module
Reviewed by: cem, cy, delphij, gnn, jhb, jpaetzel
Differential Revision: https://reviews.freebsd.org/D20163
tun(4) and tap(4) share the same general management interface and have a lot
in common. Bugs exist in tap(4) that have been fixed in tun(4), and
vice-versa. Let's reduce the maintenance requirements by merging them
together and using flags to differentiate between the three interface types
(tun, tap, vmnet).
This fixes a couple of tap(4)/vmnet(4) issues right out of the gate:
- tap devices may no longer be destroyed while they're open [0]
- VIMAGE issues already addressed in tun by kp
[0] emaste had removed an easy-panic-button in r240938 due to devdrn
blocking. A naive glance over this leads me to believe that this isn't quite
complete -- destroy_devl will only block while executing d_* functions, but
doesn't block the device from being destroyed while a process has it open.
The latter is the intent of the condvar in tun, so this is "fixed" (for
certain definitions of the word -- it wasn't really broken in tap, it just
wasn't quite ideal).
ifconfig(8) also grew the ability to map an interface name to a kld, so
that `ifconfig {tun,tap}0` can continue to autoload the correct module, and
`ifconfig vmnet0 create` will now autoload the correct module. This is a
low overhead addition.
(MFC commentary)
This may get MFC'd if many bugs in tun(4)/tap(4) are discovered after this,
and how critical they are. Changes after this are likely easily MFC'd
without taking this merge, but the merge will be easier.
I have no plans to do this MFC as of now.
Reviewed by: bcr (manpages), tuexen (testing, syzkaller/packetdrill)
Input also from: melifaro
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D20044
While Linux strace(1) doesn't strictly require it - it has a fallback
to PTRACE_GETREGS - it's a newer interface, so we better support it
before the old one is deprecated.
Reviewed by: dchagin
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20152
IPI_STOP is used after panic or when ddb is entered manually. MONITOR/
MWAIT allows CPUs that support the feature to sleep in a low power way
instead of spinning. Something similar is already used at idle.
It is perhaps especially useful in oversubscribed VM environments, and is
safe to use even if the panic/ddb thread is not the BSP. (Except in the
presence of MWAIT errata, which are detected automatically on platforms with
known wakeup problems.)
It can be tuned/sysctled with "machdep.stop_mwait," which defaults to 0
(off). This commit also introduces the tunable
"machdep.mwait_cpustop_broken," which defaults to 0, unless the CPU has
known errata, but may be set to "1" in loader.conf to signal that mwait
wakeup is broken on CPUs FreeBSD does not yet know about.
Unfortunately, Bhyve doesn't yet support MONITOR extensions, so this doesn't
help bhyve hypervisors running FreeBSD guests.
Submitted by: Anton Rang <rang AT acm.org> (earlier version)
Reviewed by: kib
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D20135
Rather than just accessing it via pointer cast.
No functional change intended.
Discussed with: kib (earlier version)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D20135
OVMF's flash variable storage is using add instructions when indexing
the variable store bootrom location.
Submitted by: D Scott Phillips <d.scott.phillips@intel.com>
Reviewed by: rgrimes
MFC after: 1 week
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D19975
move bits that are MI out into the headers in compat/linux.
For that remove bogus _packed attribute from struct l_sockaddr
and use MI types for struct members.
And continue to move into the linux_common module a code that is
intended for both Linuxulator modules (both instruction set - 32 & 64 bit)
or for external modules like linsysfs or linprocfs.
To avoid header pollution introduce new sys/compat/linux_common.h header.
Reviewed by: emaste
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D20137
Use it wherever COMPAT_FREEBSD11 is currently specified, like r309749.
Reviewed by: imp, jhb, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D20120
Replace most VM_MAXCPU constant useses with an accessor function to
vm->maxcpus which for now is initialized and kept at the value of
VM_MAXCPUS.
This is a rework of Fabian Freyer (fabian.freyer_physik.tu-berlin.de)
work from D10070 to adjust it for the cpu topology changes that
occured in r332298
Submitted by: Fabian Freyer (fabian.freyer_physik.tu-berlin.de)
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Approved by: bde (mentor), jhb (maintainer)
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D18755
After the referenced commit, we did not set x87 and sse valid bits in
the xstate_bv bitmask for initial fpu state (stored in memory), when
using XSAVE.
The state is loaded into FPU register file to initialize the process
FPU state, and since both bits were clear, the default x87 and SSE
states were loaded. By chance, FreeBSD ABI SSE2 state is same as FPU
initial state, so the bug is not visible for 64bit processes. But on
i386, the precision control should be set to double (53bit mantissa),
instead of the default double extended (64bit mantissa). For 32bit
processes on amd64, kernel reloads control word with the right mask,
which only left native i386 and amd64 native but using x87 as
affected.
Fix it by setting minimal required xstate_bv mask.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Some early PCIe chipsets are explicitly listed in the white-list to
enable use of the MMIO config space accesses, perhaps because ACPI
tables were not reliable source of the base MCFG address at that time.
For that chipsets, MCFG base was read from the known chipset MCFGbase
config register.
During very early stage of boot, when access to the PCI config space
is performed (see e.g. pci_early_quirks.c), we cannot map 255MB of
registers because the method used with pre-boot pmap overflows initial
kernel page tables.
Move fallback to read MCFGbase to the attachment method of the
x86/legacy device, which removes code duplication, and results in the
use of io accesses until MCFG is parsed or legacy attach called.
For amd64, pre-initialize cfgmech with CFGMECH_1, right now we
dynamically assign CFGMECH_1 to it anyway, and remove checks for
CFGMECH_NONE.
There is a mention in the Intel documentation for corresponding
chipsets that OS must use either io port or MMIO access method, but we
already break this rule by reading MCFGbase register, so one more
access seems to be innocent.
Reported by: longwitz@incore.de
PR: 236838
Reviewed by: avg (other version), jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D19833
For PCI device (i.e. child of a PCI bus), reset tries FLR if
implemented and worked, and falls to power reset otherwise.
For PCIe bus (child of a PCIe bridge or root port), reset
disables PCIe link and then re-trains it, performing what is known as
link-level reset.
Reviewed by: imp (previous version), jhb (previous version)
Sponsored by: Mellanox Technologies
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D19646
Remove redundant npxsave_core definition while here.
Suggested by: Anton Rang
Reviewed by: kib, Anton Rang <rang AT acm.org>
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D19665
For 32-bit Linuxulator, ipc() syscall was historically
the entry point for the IPC API. Starting in Linux 4.18, direct
syscalls are provided for the IPC. Enable it.
MFC after: 1 month
AMD64_SET_**BASE expects a pointer to a pointer, we just passing in the pointer value itself.
Set PCB_FULL_IRET for doreti to restore %fs, %gs and its correspondig base.
PR: 225105
Reported by: trasz@
MFC after: 1 month
There are some unusual cases where a process may cause an mlock()ed
range of memory to be unmapped. If the application subsequently
faults on that region, the handler may attempt to create a superpage
mapping backed by the resident, wired pages. However, the pmap code
responsible for creating such a mapping (pmap_enter_pde() on i386
and amd64) does not ensure that a leaf page table page is available
if the superpage is later demoted; the demotion operation must therefore
perform a non-blocking page allocation and must unmap the entire
superpage if the allocation fails. The pmap layer ensures that this
can never happen for wired mappings, and so the case described above
breaks that invariant.
For now, simply ensure that the MI fault handler never attempts to
create a wired superpage except via promotion.
Reviewed by: kib
Reported by: syzbot+292d3b0416c27c131505@syzkaller.appspotmail.com
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19670
It may happen on some machines, that even if SGX is disabled
in firmware, the driver would still attach despite EPC base and
size equal zero. Such behaviour causes a kernel panic when the
module is unloaded. Add a simple check to make sure we
only attach when these values are correctly set.
Submitted by: Kornel Duleba <mindal@semihalf.com>
Reviewed by: br
Obtained from: Semihalf
Sponsored by: Stormshield
Differential Revision: https://reviews.freebsd.org/D19595
Add the infrastructure to allow MD procctl(2) commands, and use it to
introduce amd64 PTI control and reporting. PTI mode cannot be
modified for existing pmap, the knob controls PTI of the new vmspace
created on exec.
Requested by: jhb
Reviewed by: jhb, markj (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D19514
PTI mode for the process pmap on exec is activated iff P_MD_PTI is set.
On exec, the existing vmspace can be reused only if pti mode of the
pmap matches the P_MD_PTI flag of the process. Add MD
cpu_exec_vmspace_reuse() callback for exec_new_vmspace() which can
vetoed reuse of the existing vmspace.
MFC note: md_flags change struct proc KBI.
Reviewed by: jhb, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D19514
When the pmap with pti disabled (i.e. pm_ucr3 == PMAP_NO_CR3) is
activated, tss.rsp0 was not updated. Any interrupt that happen before
next context switch would use pti trampoline stack for hardware frame
but fault and interrupt handlers are not prepared to this. Correctly
update tss.rsp0 for both PMAP_NO_CR3 and pti pmaps.
Note that this case, pti = 1 but pmap->pm_ucr3 == PMAP_NO_CR3 is not
used at the moment.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D19514