A version of each of the MD files by necessity exists for each CPU
architecture supported by the Linuxolator. Clean these up so that new
architectures do not inherit whitespace issues.
Clean up shared Linuxolator files while here.
Sponsored by: Turing Robotic Industries Inc.
Branch Predictors) mitigation.
DOcument 336996-001 promises that CPUs which implement IBRS but not
STIBP silently ignore setting of the bit instead of trapping.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
It is coded according to the Intel document 336996-001, reading of the
patches posted on lkml, and some additional consultations with Intel.
For existing processors, you need a microcode update which adds IBRS
CPU features, and to manually enable it by setting the tunable/sysctl
hw.ibrs_disable to 0. Current status can be checked in sysctl
hw.ibrs_active. The mitigation might be inactive if the CPU feature
is not patched in, or if CPU reports that IBRS use is not required, by
IA32_ARCH_CAP_IBRS_ALL bit.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D14029
The virtual interrupt method uses V_IRQ, V_INTR_PRIO, and V_INTR_VECTOR
fields of VMCB to inject a virtual interrupt into a guest VM. This
method has many advantages over the direct event injection as it
offloads all decisions of whether and when the interrupt can be
delivered to the guest. But with a purely software emulated vAPIC the
advantage is also a problem. The problem is that the hypervisor does
not have any precise control over when the interrupt is actually
delivered to the guest (or a notification about that). Because of that
the hypervisor cannot update the interrupt vector in IRR and ISR in the
same way as real hardware would. The hypervisor becomes aware that the
interrupt is being serviced only upon the first VMEXIT after the
interrupt is delivered. This creates a window between the actual
interrupt delivery and the update of IRR and ISR. That means that IRR
and ISR might not be correctly set up to the point of the
end-of-interrupt signal.
The described deviation has been observed to cause an interrupt loss in
the following scenario. vCPU0 posts an inter-processor interrupt to
vCPU1. The interrupt is injected as a virtual interrupt by the
hypervisor. The interrupt is delivered to a guest and an interrupt
handler is invoked. The handler performs a requested action and
acknowledges the request by modifying a global variable. So far, there
is no VMEXIT and the hypervisor is unaware of the events. Then, vCPU0
notices the acknowledgment and sends another IPI with the same vector.
The IPI gets collapsed into the previous IPI in the IRR of vCPU1. Only
after that a VMEXIT of vCPU1 occurs. At that time the vector is cleared
in the IRR and is set in the ISR. vCPU1 has vAPIC state as if the
second IPI has never been sent.
The scenario is impossible on the real hardware because IRR and ISR are
updated just before the interrupt handler gets started.
I saw several possibilities of fixing the problem. One is to intercept
the virtual interrupt delivery to update IRR and ISR at the right
moment. The other is to deliver the LAPIC interrupts using the event
injection, same as legacy interrupts. I opted to use the latter
approach for several reasons. It's equivalent to what VMM/Intel does
(in !VMX case). It appears to be what VirtualBox and KVM do. The code
is already there (to support legacy interrupts).
Another possibility was to use a special intermediate state for a vector
after it is injected using a virtual interrupt and before it is known
whether it was accepted or is still pending.
That approach was implemented in https://reviews.freebsd.org/D13828
That method is more complex and does not have any clear advantage.
Please see sections 15.20 and 15.21.4 of "AMD64 Architecture
Programmer's Manual Volume 2: System Programming" (publication 24593,
revision 3.29) for comparison between event injection and virtual
interrupt injection.
PR: 215972
Reported by: ajschot@hotmail.com, grehan
Tested by: anish, grehan, Nils Beyer <nbe@renzel.net>
Reviewed by: anish, grehan
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D13780
This avoids a nested page fault when obtaining a stack trace in DDB if
the address from the first frame does not resolve to a known symbol.
MFC after: 1 week
Sponsored by: Chelsio Communications
Use PCID to avoid complete TLB shootdown when switching between user
and kernel mode with PTI enabled.
I use the model close to what I read about KAISER, user-mode PCID has
1:1 correspondence to the kernel-mode PCID, by setting bit 11 in PCID.
Full kernel-mode TLB shootdown is performed on context switches, since
KVA TLB invalidation only works in the current pmap. User-mode part of
TLB is flushed on the pmap activations as well.
Similarly, IPI TLB shootdowns must handle both kernel and user address
spaces for each address. Note that machines which implement PCID but
do not have INVPCID instructions, cause the usual complications in the
IPI handlers, due to the need to switch to the target PCID temporary.
This is racy, but because for PCID/no-INVPCID we disable the
interrupts in pmap_activate_sw(), IPI handler cannot see inconsistent
state of CPU PCID vs PCPU pmap/kcr3/ucr3 pointers.
On the other hand, on kernel/user switches, CR3_PCID_SAVE bit is set
and we do not clear TLB.
I can imagine alternative use of PCID, where there is only one PCID
allocated for the kernel pmap. Then, there is no need to shootdown
kernel TLB entries on context switch. But copyout(3) would need to
either use method similar to proc_rwmem() to access the userspace
data, or (in reverse) provide a temporal mapping for the kernel buffer
into user mode PCID and use trampoline for copy.
Reviewed by: markj (previous version)
Tested by: pho
Discussed with: alc (some aspects)
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
Differential revision: https://reviews.freebsd.org/D13985
These files previously had a 3-clause license and 'THE REGENTS' text.
Switch to standard 2-clause text with kib's approval, and add the SPDX
tag.
Approved by: kib
Uses of mallocarray(9).
The use of mallocarray(9) has rocketed the required swap to build FreeBSD.
This is likely caused by the allocation size attributes which put extra pressure
on the compiler.
Given that most of these checks are superfluous we have to choose better
where to use mallocarray(9). We still have more uses of mallocarray(9) but
hopefully this is enough to bring swap usage to a reasonable level.
Reported by: wosch
PR: 225197
still active.
Map userspace portion of VA in the PTI kernel-mode page table as
non-executable. This way, if we ever miss reloading ucr3 into %cr3 on
the return to usermode, the process traps instead of executing in
potentially vulnerable setup. Catch the condition of such trap and
verify user-mode %cr3, which is saved by page fault handler.
I peek this trick in some article about Linux implementation.
Reviewed by: alc, markj (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 12 days
DIfferential revision: https://reviews.freebsd.org/D13956
kernel by PHYS_TO_DMAP() as previously present on amd64, arm64, riscv, and
powerpc64. This introduces a new MI macro (PMAP_HAS_DMAP) that can be
evaluated at runtime to determine if the architecture has a direct map;
if it does not (or does) unconditionally and PMAP_HAS_DMAP is either 0 or
1, the compiler can remove the conditional logic.
As part of this, implement PHYS_TO_DMAP() on sparc64 and mips64, which had
similar things but spelled differently. 32-bit MIPS has a partial direct-map
that maps poorly to this concept and is unchanged.
Reviewed by: kib
Suggestions from: marius, alc, kib
Runtime tested on: amd64, powerpc64, powerpc, mips64
Kernel Page Table Isolation (KPTI) was introduced in r328083 as a
mitigation for the 'Meltdown' vulnerability. AMD CPUs are not affected,
per https://www.amd.com/en/corporate/speculative-execution:
We believe AMD processors are not susceptible due to our use of
privilege level protections within paging architecture and no
mitigation is required.
Thus default KPTI to off for AMD CPUs, and to on for others. This may
be refined later as we obtain more specific information on the sets of
CPUs that are and are not affected.
Submitted by: Mitchell Horne
Reviewed by: cem
Relnotes: Yes
Security: CVE-2017-5754
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D13971
Similar to NMIs, machine check exceptions can fire at any time and are
not masked by IF. This means that machine checks can fire when the
kstack is too deep to hold a trap frame, or at critical sections in
trap handlers when a user %gs is used with a kernel %cs. Use the same
strategy used for NMIs of using a dedicated per-CPU stack configured
in IST 3. Store the CPU's pcpu pointer at the stop of the stack so
that the machine check handler can reliably find the proper value for
%gs (also borrowed from NMIs).
This should also fix a similar issue with PTI with a MC# occurring
while the CPU is executing on the trampoline stack.
While here, bypass trap() entirely and just call mca_intr(). This
avoids a bogus call to kdb_reenter() (there's no reason to try to
reenter kdb if a MC# is raised).
Reviewed by: kib
Tested by: avg (on AMD without PTI)
Differential Revision: https://reviews.freebsd.org/D13962
In the !PTI case the NMI handler jumped past the instructions that set
%rdi to point to the current PCB, but the target instructions assumed %rdi
were set.
Reviewed by: kib
Tested by: pho
Apparently machinde/cpu.h is supposed to contain MD implementations of
MI interfaces. Also, remove kernphys declaration from machdep.c,
since it is already provided by md_var.h.
Requested and reviewed by: bde
MFC after: 13 days
Currently most of the debug registers are not saved and restored
during VM transitions allowing guest and host debug register values to
leak into the opposite context. One result is that hardware
watchpoints do not work reliably within a guest under VT-x.
Due to differences in SVM and VT-x, slightly different approaches are
used.
For VT-x:
- Enable debug register save/restore for VM entry/exit in the VMCS for
DR7 and MSR_DEBUGCTL.
- Explicitly save DR0-3,6 of the guest.
- Explicitly save DR0-3,6-7, MSR_DEBUGCTL, and the trap flag from
%rflags for the host. Note that because DR6 is "software" managed
and not stored in the VMCS a kernel debugger which single steps
through VM entry could corrupt the guest DR6 (since a single step
trap taken after loading the guest DR6 could alter the DR6
register). To avoid this, explicitly disable single-stepping via
the trace flag before loading the guest DR6. A determined debugger
could still defeat this by setting a breakpoint after the guest DR6
was loaded and then single-stepping.
For SVM:
- Enable debug register caching in the VMCB for DR6/DR7.
- Explicitly save DR0-3 of the guest.
- Explicitly save DR0-3,6-7, and MSR_DEBUGCTL for the host. Since SVM
saves the guest DR6 in the VMCB, the race with single-stepping
described for VT-x does not exist.
For both platforms, expose all of the guest DRx values via --get-drX
and --set-drX flags to bhyvectl.
Discussed with: avg, grehan
Tested by: avg (SVM), myself (VT-x)
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D13229
The implementation of the Kernel Page Table Isolation (KPTI) for
amd64, first version. It provides a workaround for the 'meltdown'
vulnerability. PTI is turned off by default for now, enable with the
loader tunable vm.pmap.pti=1.
The pmap page table is split into kernel-mode table and user-mode
table. Kernel-mode table is identical to the non-PTI table, while
usermode table is obtained from kernel table by leaving userspace
mappings intact, but only leaving the following parts of the kernel
mapped:
kernel text (but not modules text)
PCPU
GDT/IDT/user LDT/task structures
IST stacks for NMI and doublefault handlers.
Kernel switches to user page table before returning to usermode, and
restores full kernel page table on the entry. Initial kernel-mode
stack for PTI trampoline is allocated in PCPU, it is only 16
qwords. Kernel entry trampoline switches page tables. then the
hardware trap frame is copied to the normal kstack, and execution
continues.
IST stacks are kept mapped and no trampoline is needed for
NMI/doublefault, but of course page table switch is performed.
On return to usermode, the trampoline is used again, iret frame is
copied to the trampoline stack, page tables are switched and iretq is
executed. The case of iretq faulting due to the invalid usermode
context is tricky, since the frame for fault is appended to the
trampoline frame. Besides copying the fault frame and original
(corrupted) frame to kstack, the fault frame must be patched to make
it look as if the fault occured on the kstack, see the comment in
doret_iret detection code in trap().
Currently kernel pages which are mapped during trampoline operation
are identical for all pmaps. They are registered using
pmap_pti_add_kva(). Besides initial registrations done during boot,
LDT and non-common TSS segments are registered if user requested their
use. In principle, they can be installed into kernel page table per
pmap with some work. Similarly, PCPU can be hidden from userspace
mapping using trampoline PCPU page, but again I do not see much
benefits besides complexity.
PDPE pages for the kernel half of the user page tables are
pre-allocated during boot because we need to know pml4 entries which
are copied to the top-level paging structure page, in advance on a new
pmap creation. I enforce this to avoid iterating over the all
existing pmaps if a new PDPE page is needed for PTI kernel mappings.
The iteration is a known problematic operation on i386.
The need to flush hidden kernel translations on the switch to user
mode make global tables (PG_G) meaningless and even harming, so PG_G
use is disabled for PTI case. Our existing use of PCID is
incompatible with PTI and is automatically disabled if PTI is
enabled. PCID can be forced on only for developer's benefit.
MCE is known to be broken, it requires IST stack to operate completely
correctly even for non-PTI case, and absolutely needs dedicated IST
stack because MCE delivery while trampoline did not switched from PTI
stack is fatal. The fix is pending.
Reviewed by: markj (partially)
Tested by: pho (previous version)
Discussed with: jeff, jhb
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Focus on code where we are doing multiplications within malloc(9). None of
these ire likely to overflow, however the change is still useful as some
static checkers can benefit from the allocation attributes we use for
mallocarray.
This initial sweep only covers malloc(9) calls with M_NOWAIT. No good
reason but I started doing the changes before r327796 and at that time it
was convenient to make sure the sorrounding code could handle NULL values.
X-Differential revision: https://reviews.freebsd.org/D13837
domains can be done by the _domain() API variants. UMA also supports a
first-touch policy via the NUMA zone flag.
The slab layer is now segregated by VM domains and is precise. It handles
iteration for round-robin directly. The per-cpu cache layer remains
a mix of domains according to where memory is allocated and freed. Well
behaved clients can achieve perfect locality with no performance penalty.
The direct domain allocation functions have to visit the slab layer and
so require per-zone locks which come at some expense.
Reviewed by: Attilio (a slightly older version)
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
We already clear %RFLAGS.DF on the kernel entry due to the compiler's
ABI requirements.
Suggested by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Hardware already did it for us due to the mask loaded into the
MSR_SF_MASK msr register.
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D13838
The symbol is just an offset in the hardware TSS structure, it is not
limited to the common_tss instance.
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
This is a followup to r307903.
struct svm_softc takes more than 200 kilobytes while what we really need
is 3 contiguous pages for I/O permission map and 2 contiguous pages for
MSR permission map. Other physically mapped structures have a size of
a single page, so a proper alignment is sufficient for their correct
mapping.
Thus, only the permission maps are allocated with contigmalloc now,
the softc is allocated with a regular malloc.
Additionally, this commit adds a check that malloc returns memory with the
expected page alignment and that contigmalloc does not fail.
Unfortunately, at present svm_vminit() is expected to always succeed and
there is no way to report an error.
So, a contigmalloc failure leads to a panic.
We should probably fix this.
MFC after: 2 weeks
Add cpuctl(4) ioctl CPUCTL_EVAL_CPU_FEATURES which forces re-read of
cpu_features, cpu_features2, cpu_stdext_features, and
std_stdext_features2.
The intent is to allow the kernel to see the changes in the CPU
features after micocode update. Of course, the update is not atomic
across variables and not synchronized with readers. See the man page
warning as well.
Reviewed by: imp (previous version), jilles
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D13770
It does not change anything in the behavior of trap_pfault(), while
eliminating obfuscation of jumping to the code which checks for the
condition reversed of the goto cause. Also avoid force initialize the
rv variable, since it is now only accessed after storing vm_fault()
return value.
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D13725
The entry must be logged "manually" using TSRAW rather than TSENTER
since PCPU data structures have not yet been initialized and thus
curthread cannot be accessed; &thread0 is what will become curthread
later in hammer_time.
Other MD initialization code should be similarly instrumented in order
to gain visibility into the time spent before entering mi_startup; this
will require some care and testing from people with access to such
hardware.
They provide relaxed-ordered atomic access semantic. Due to the
FreeBSD memory model, the operations are syntaxical wrappers around
the volatile accesses. The volatile qualifier is used to ensure that
the access not optimized out and in turn depends on the volatile
semantic as implemented by supported compilers.
The motivation for adding the operation is to help people coming from
other systems or knowing the C11/C++ standards where atomics have
special type and require use of the special access operations. It is
still the case that FreeBSD requires plain load and stores of aligned
integer types to be atomic.
Suggested by: jhb
Reviewed by: alc, jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D13534
The DTrace fasttrap entry points expect a struct reg containing the
register values of the calling thread. Perform the conversion in
fasttrap rather than in the trap handler: this reduces the number of
ifdefs and avoids wasting stack space for traps that don't involve
DTrace.
MFC after: 2 weeks
This variable should be pure MI except possibly for reading it in MD
dump routines. Its initialization was pure MD in 4.4BSD, but FreeBSD
changed this in r36441 in 1998. There were many imperfections in
r36441. This commit fixes only a small one, to simplify fixing the
others 1 arch at a time. (r47678 added support for
special/early/multiple message buffer initialization which I want in
a more general form, but this was too fragile to use because hacking
on the msgbufp global corrupted it, and was only used for 5 hours in
-current...)
This is better than directly changing PCI configuration space of the
device because it makes the PCI bus aware of the configuration.
Also, the change allows to drop a bunch of code that duplicated
pci_enable_msi() functionality.
I wonder if it's possible to further simplify the code by using
pci_alloc_msi().
ivhd should attach after the root PCI bus and, thus, after the ACPI
Host-PCI bridge off which the bus hangs. This is because ivhd changes
PCI configuration of a PCI IOMMU device that is located on the root bus.
If the bus attaches after ivhd it clears the MSI portion of the
configuration. As a result IOMMU event interrupts would never be
delivered.
For regular ACPI devices the order is calculated as
ACPI_DEV_BASE_ORDER + level * 10
where level is a depth of the device in the ACPI namespace.
I expect the depth of the Host-PCI bridge to be two or three,
so ACPI_DEV_BASE_ORDER + 10 * 10 should be a sufficiently safe order
for ivhd.
This should fix the setup of the AMD-Vi event interrupt when vmm is
preloaded (as opposed to kldload-ed).
This ensures that we can receive further event interrupts.
See the description of the bits in the specification for
MMIO Offset 2020h IOMMU Status Register.
The bits are defined as set-by-hardware write-1-to-clear, same as all
the bits in the status register.
Discussed with: anish
Stop issuing pre-assigned number to enumerate all page table pages,
the assignment is incorrect. Instead automatically calculate the next
unused index. This index in fact does not serve any purpose except to
be unique to satisfy vm_page_grab() interface, we do not look up the
page by the index later.
Reported and tested by: emaste
Reviewed by: andrew
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
PR: 223906
Differential revision: https://reviews.freebsd.org/D13273
In the linux ENOADATA is frequently #defined as ENOATTR.
The change is required for an xattrs support implementation.
MFC after: 1 week
Discussed with: netchild
Approved by: pfg
Differential Revision: https://reviews.freebsd.org/D13221
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
For FreeBSD/arm64's cloudabi32 support, I'm going to need a TO_PTR() in
this place. Also use it for all of the other source files, so that the
difference remains as minimal as possible.
MFC after: 2 weeks
Ensure that an opening bracket always has a matching closing one.
Ensure that there is always a new-line at the end of a report line.
Also, add a space before the printed event flag.
Reviewed by: anish
The defined bits are the lower bits, not the higher ones.
Also, the specification has been extended to define bits 0:18 and they
all could potentially be interesting to us, so extend the width of the
field accordingly.
Reviewed by: anish
Many 8-byte entries have zero at byte 4, so the second 4-byte part is
skipped as a 4-byte padding entry. But not all 8-byte entries have that
property and they get misinterpreted.
A real example:
48 00 00 00 ff 01 00 01
This an 8-byte ACPI_IVRS_TYPE_SPECIAL entry for IOAPIC with ID 255 (bogus).
It is reported as:
ivhd0: Unknown dev entry:0xff
Fortunately, it was completely harmless.
Also, bail out early if we encounter an entry of a variable length type.
We do not have proper handling for those yet.
Reviewed by: anish
Upon successful completion, the execve() system call invokes
exec_setregs() to initialize the registers of the initial thread of the
newly executed process. What is weird is that when execve() returns, it
still goes through the normal system call return path, clobbering the
registers with the system call's return value (td->td_retval).
Though this doesn't seem to be problematic for x86 most of the times (as
the value of eax/rax doesn't matter upon startup), this can be pretty
frustrating for architectures where function argument and return
registers overlap (e.g., ARM). On these systems, exec_setregs() also
needs to initialize td_retval.
Even worse are architectures where cpu_set_syscall_retval() sets
registers to values not derived from td_retval. On these architectures,
there is no way cpu_set_syscall_retval() can set registers to the way it
wants them to be upon the start of execution.
To get rid of this madness, let sys_execve() return EJUSTRETURN. This
will cause cpu_set_syscall_retval() to leave registers intact. This
makes process execution easier to understand. It also eliminates the
difference between execution of the initial process and successive ones.
The initial call to sys_execve() is not performed through a system call
context.
Reviewed by: kib, jhibbits
Differential Revision: https://reviews.freebsd.org/D13180
Mainly focus on files that use BSD 3-Clause license.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
Initially, only tag files that use BSD 4-Clause "Original" license.
RelNotes: yes
Differential Revision: https://reviews.freebsd.org/D13133
similar to the kernel memory allocator.
This simplifies NUMA allocation because the domain will be known at wait
time and races between failure and sleeping are eliminated. This also
reduces boilerplate code and simplifies callers.
A wait primitive is supplied for uma zones for similar reasons. This
eliminates some non-specific VM_WAIT calls in favor of more explicit
sleeps that may be satisfied without new pages.
Reviewed by: alc, kib, markj
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
In reclaim_pv_chunk(), rotate the pv chunks list so that next
invocations of the reclaim do not scan the same pv chunks that could
not be freed. Only do the rotation when there is no parallel scan,
tracked by active_reclaims counter.
To rotate, move all chunks that are before current iteration marker,
after another marker that is inserted at the list tail on start of the
reclaim.
Reviewed by: alc
Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Some callers of fpusetregs()/npxsetregs(), most importantly
set_fpcontext(), clear reserved bits. But some did not. Do the
clearing in fpusetregs() and remove now redundand operation from
set_fpcontext().
Reported by: Maxime Villard <max@m00nbsd.net>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
This is needed for the HDA emulation with FreeBSD guests.
Reviewed by: marcelo
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D12832
timer frequency a power of two. This changes the frequency from 10 to
16.7 MHz (2 ^ 24 HZ). Using a power of two avoids roundoff errors when
doing arithmetic in sbintime_t units.
Testing shows this can fix erratic ntpd behavior in guests using the
hpet timer (which is the default for multicore guests).
Reported by: bsam@
Specifically, devices that do not support PCI-e FLR and were not
gracefully shutdown by the guest OS could continue to issue DMA
requests after the VM was terminated. The changes in r305485 meant
that those DMA requests were completed against the host's memory which
could result in random memory corruption. Instead, leave ppt devices
that are not attached to a VM disabled in the IOMMU and only restore
the devices to the host domain if the ppt(4) driver is detached from a
device.
As an added safety belt, disable busmastering for a pass-through device
when before adding it to the host domain during ppt(4) detach.
PR: 222937
Tested by: Harry Schmalzbauer <freebsd@omnilan.de>
Reviewed by: grehan
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D12661
While here cache-align chains.
This shortens longest found chain during poudriere -j 80 from 32 to 16.
Pushing this higher up will probably require allocation on boot.
HEAD. Enable VIMAGE in GENERIC kernels and some others (where GENERIC does
not exist) on HEAD.
Disable building LINT-VIMAGE with VIMAGE being default.
This should give it a lot more exposure in the run-up to 12 to help
us evaluate whether to keep it on by default or not.
We are also hoping to get better performance testing.
The feature can be disabled using nooptions.
Requested by: many
Reviewed by: kristof, emaste, hiren
X-MFC after: never
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D12639
Note that pv_list_locks is an array and currently it fits 2 locks per line.
Resizing it and/or putting more locks in different lines requires several tests.
MFC after: 1 week
All of the kernel dump implementations keep track of the current offset
("dumplo") within the dump device. However, except for textdumps, they
all write the dump sequentially, so we can reduce code duplication by
having the MI code keep track of the current offset. The new
dump_append() API can be used to write at the current offset.
This is needed to implement support for kernel dump compression in the
MI kernel dump code.
Also simplify dump_encrypted_write() somewhat: use dump_write() instead
of duplicating its bounds checks, and get rid of the redundant offset
tracking.
Reviewed by: cem
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D11722
For processing, reclaim_pv_chunk() removes the pv_chunk from the lru
list, which makes pc_lru linkage invalid. Then the pmap lock is
released, which allows for other thread to free the last pv entry
allocated from the chunk and call free_pv_chunk(), which tries to
modify the invalid linkage.
Similarly, the chunk is inserted into the private tailq new_tail
temporary. Again, free_pv_chunk() might be run and corrupt the
linkage for the new_tail after the pmap lock is dropped.
This is a consequence of r299788 elimination of pvh_global_lock, which
allowed for reclaim to run in parallel with other pmap calls which
free pv chunks.
As a fix, do not remove the chunk from pc_lru queue, use a marker to
remember the position in the queue iteration. We can safely operate
on the chunks after the chunk's pmap is locked, we fetched the chunk
after the marker, and we checked that chunk pmap is same as we have
locked, because chunk removal from pc_lru requires both pv_chunk_mutex
and the pmap mutex owned.
Note that the fix lost an optimization which was present in the
previous algorithm. Namely, new_tail requeueing rotated the pv chunks
list so that reclaim didn't scan the same pv chunks that couldn't be
freed (because they contained a wired and/or superpage mapping) on
every invocation. An additional change is planned which would improve
this.
Reported and tested by: pho
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
allocated, when requested range of descriptors does not fit into
currently allocated LDT, or trim the return if the range fits
partially. Before, the function returned EINVAL.
Reviewed by: bde
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
use LDT segments immediately.
If the i386_set_ldt() call created a first LDT descriptor (and
consequently created the LDT) for our address space, LDTR is currently
loaded only on the CPU executing the syscall. Other CPUs executing
threads sharing the address space, would only load LDTR after context
switch.
Uncomment set_user_ldt_rv() and call it on all CPUs. Remove critical
section inside set_user_ldt(), it is not needed in the context of call
from smp_rendezvous().
Set md_ldt after md_ldt_sd is initialized using the same code sequence
as in user_ldt_free(). Do the whole initialization in a critical
section, to not race with the context switching while we set LDT.
Discussed with: bde
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
cpu_switch.S uses curproc->p_md.md_ldt value as the flag indicating
presence of the process LDT. The flag is checked and then ldt segment
descriptor is copied into the CPU' GDT slot.
Disallow context switches around clearing of the curproc LDT state by
performing the cleanup in critical section. Ensure that the md_ldt
flag is cleared before md_ldt_sd descriptor content is destroyed by
inserting fence between the operations.
We depend on the x86 memory model strong ordering guarantees, in
particular, that cpu_switch.S observes the writes to md_ldt and
md_ldt_sd in the expected order.
Discussed with: bde
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Provide consistent snapshot of the requested descriptors by preventing
other threads from modifying LDT while we fetch the data, lock dt_lock
around the read. Copy the data into intermediate buffer, which is
copied out after the lock is dropped.
Use guaranteed atomic (aligned volatile) reads of the descriptors to
use same-size atomic as CPU update to set A bit in the descriptor type
field.
Improve overflow checking for the descriptors range calculations and
remove unneeded casts.
Reviewed by: bde
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Compilers are allowed to combine plain reads into group operations,
e.g. 64bit element copies of one array into another can be
legitimately optimized back to a memcpy() call, which r323772 tried to
prevent.
Qualify accesses to LDT descriptors with volatile dereference to
ensure that each write indeed occurs. After that, our usual claim of
native-size aligned writes being atomic applies.
This is equivalent to atomic_store(memory_order_relaxed) C11 accesses,
but our machine/atomic.h does not provide corresponding primitive.
Noted and reviewed by: bde
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
On i386, the function is used from the context switch code and needs
to be accessible externally. Amd64 MD context switch does not lock an
LDT spinlock and inlines switching in assembly.
Discussed with: bde
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
This makes the LDT to use only one page with default settings,
avoiding the need to find contigous 2 pages in KVA. It seems that
most users are fine even with 512 segments.
Discussed with: bde
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
machine independent parts of the existing code to a new file that can be
shared between amd64 and arm64.
Reviewed by: kib (previous version), imp
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D12434
Care must be taken when updating the active LDT, since parallel
threads might try to load a segment descriptor which is currently
updated. Since the results are undefined, this cannot be ignored by
claiming to be an application race.
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D12413
CAM_DEBUG_TRACE results in way too much debug output than needed now.
When debugging, it's always possible to turn on trace level using camcontrol.
Approved by: imp (mentor)
Differential Revision: https://reviews.freebsd.org/D12110
AMD Family 17h CPUs have an internal network used to communicate between
the host CPU and the PSP and SMU coprocessors. It exposes a simple
32-bit register space.
Reviewed by: avg (no +1), mjoras, truckman
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D12217
This driver supports both NTB-to-NTB and NTB-to-Root Port modes (though
the second with predictable complications on hot-plug and reboot events).
I tested it with PEX 8717 and PEX 8733 chips, but expect it should work
with many other compatible ones too. It supports up to two NT bridges
per chip, each of which can have up to 2 64-bit or 4 32-bit memory windows,
6 or 12 scratchpad registers and 16 doorbells. There are also 4 DMA engines
in those chips, but they are not yet supported.
While there, rename Intel NTB driver from generic ntb_hw(4) to more specific
ntb_hw_intel(4), so now it is on par with this new ntb_hw_plx(4) driver and
alike to Linux naming.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
The actual cache line size has always been 64 bytes.
The 128 number arose as an optimization for Core 2 era Intel processors. By
default (configurable in BIOS), these CPUs would prefetch adjacent cache
lines unintelligently. Newer CPUs prefetch more intelligently.
The latest Core 2 era CPU was introduced in September 2008 (Xeon 7400
series, "Dunnington"). If you are still using one of these CPUs, especially
in a multi-socket configuration, consider locating the "adjacent cache line
prefetch" option in BIOS and disabling it.
Reported by: mjg
Reviewed by: np
Discussed with: jhb
Sponsored by: Dell EMC Isilon
Right now, we enable the CR4.FSGSBASE bit on CPUs which support the
facility (Ivy and later), to allow usermode to read fs and gs bases
without syscalls. This bit also controls the write access to bases
from userspace, but WRFSBASE and WRGSBASE instructions currently
cannot be used, because return path from both exceptions or interrupts
overrides bases with the values from pcb.
Supporting the instructions is useful because this means that usermode
can implement green-threads completely in userspace without issuing
syscalls to change all of the machine context.
Support is implemented by saving the fs base and user gs base when
PCB_FULL_IRET flag is set. The flag is set on the context switch,
which potentially causes clobber of the bases due to activation of
another context, and when explicit modification of the user context by
a syscall or exception handler is performed. In particular, the patch
moves setting of the flag before syscalls change context.
The changes to doreti_exit and PUSH_FRAME to clear PCB_FULL_IRET on
entry from userspace can be considered a bug fixes on its own.
Reviewed by: jhb (previous version)
Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
Differential revision: https://reviews.freebsd.org/D12023
- Use more relevant name 'signo' instead of 'i' for the local variable
which contains a signal number to send for the current exception.
- Eliminate two labels 'userout' and 'out' which point to the very end
of the trap() function. Instead use return directly.
- Re-indent the prot_fault_translation block by reducing if() nesting.
- Some more monor style changes.
Requested and reviewed by: bde
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
This helps simplify the code in kern_shutdown.c and reduces the number
of globally visible functions.
No functional change intended.
Reviewed by: cem, def
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D11603
dump_start() and dump_finish() are responsible for writing kernel dump
headers, optionally writing the key when encryption is enabled, and
initializing the initial offset into the dump device.
Also remove the unused dump_pad(), and make some functions static now that
they're only called from kern_shutdown.c.
No functional change intended.
Reviewed by: cem, def
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D11584
Add an option to dynamically rebalance interrupts across cores
(hw.intrbalance); off by default.
The goal is to minimize preemption. By placing interrupt sources on distinct
CPUs, ithreads get preferentially scheduled on distinct CPUs. Overall
preemption is reduced and latency is reduced. In our workflow it reduced
"fighting" between two high-frequency interrupt sources. Reduced latency
was proven by, e.g., SPEC2008.
Submitted by: jeff@ (earlier version)
Reviewed by: kib@
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D10435
Intel SGX allows to manage isolated compartments "Enclaves" in user VA
space. Enclaves memory is part of processor reserved memory (PRM) and
always encrypted. This allows to protect user application code and data
from upper privilege levels including OS kernel.
This includes SGX driver and optional linux ioctl compatibility layer.
Intel SGX SDK for FreeBSD is also available.
Note this requires support from hardware (available since late Intel
Skylake CPUs).
Many thanks to Robert Watson for support and Konstantin Belousov
for code review.
Project wiki: https://wiki.freebsd.org/Intel_SGX.
Reviewed by: kib
Relnotes: yes
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D11113
It is quite useful when double fault is not caused by a stack overflow.
Tested by: pho (as part of the larger patch)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
reduces diff between amd64 and i386. Also, it fixes a regression introduced
in r322076, i.e., identify_hypervisor() failed to identify some hypervisors.
This function assumes cpu_feature2 is already initialized.
Reported by: dexuan
Tested by: dexuan
libefivar expects opening /dev/efi to indicate if the we can make efi
runtime calls. With a null routine, it was always succeeding leading
efi_variables_supported() to return the wrong value. Only succeed if
we have an efi_runtime table. Also, while I'm hear, out of an
abundance of caution, add a likely redundant check to make sure
efi_systbl is not NULL before dereferencing it. I know it can't be
NULL if efi_cfgtbl is non-NULL, but the compiler doesn't.
but it was broken since r273800 (and r278522, its MFC to stable/10) because
identify_cpu() is called too late, i.e., after init_param1().
MFC after: 3 days
Some C wrappers for x86 instructions do not touch global memory and only act
on their arguments; they can be marked __pure2, aka __const__. Without this
annotation, Clang 3.9.1 is not intelligent enough on its own to grok that
these functions are __const__.
Submitted by: Anton Rang <anton.rang AT isilon.com>
Sponsored by: Dell EMC Isilon
Similar to r321899, reduce sv_maxuser by one page inside of CloudABI.
This ensures that the stack, the vDSO and any allocations cannot touch
the top page of user virtual memory.
Considering that CloudABI userspace is completely oblivious to virtual
memory layout, don't bother making this conditional based on the CPU of
the running system.
Reviewed by: kib, truckman
Differential Revision: https://reviews.freebsd.org/D11808
when a signal is not intended to be sent.
The variable holding the signal number to send is left uninitialized,
which sometimes triggers invalid signal checks.
For NMI, a return to usermode without ast processing is done. On the
other hand, for spurious dtrace probe interrupt it is usermode which
triggered the interrupt, so handle it through userret() as any other
fault.
Reported by: Nils Beyer <nbe@renzel.net>
PR: 221151
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
from the top of user memory to one page lower on machines with the
Ryzen (AMD Family 17h) CPU. This pushes ps_strings and the stack
down by one page as well. On Ryzen there is some sort of interaction
between code running at the top of user memory address space and
interrupts that can cause FreeBSD to either hang or silently reset.
This sounds similar to the problem found with DragonFly BSD that
was fixed with this commit:
https://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/b48dd28447fc8ef62fbc963accd301557fd9ac20
but our signal trampoline location was already lower than the address
that DragonFly moved their signal trampoline to. It also does not
appear to be related to SMT as described here:
https://www.phoronix.com/forums/forum/hardware/processors-memory/955368-some-ryzen-linux-users-are-facing-issues-with-heavy-compilation-loads?p=955498#post955498
"Hi, Matt Dillon here. Yes, I did find what I believe to be a
hardware issue with Ryzen related to concurrent operations. In a
nutshell, for any given hyperthread pair, if one hyperthread is
in a cpu-bound loop of any kind (can be in user mode), and the
other hyperthread is returning from an interrupt via IRETQ, the
hyperthread issuing the IRETQ can stall indefinitely until the
other hyperthread with the cpu-bound loop pauses (aka HLT until
next interrupt). After this situation occurs, the system appears
to destabilize. The situation does not occur if the cpu-bound
loop is on a different core than the core doing the IRETQ. The
%rip the IRETQ returns to (e.g. userland %rip address) matters a
*LOT*. The problem occurs more often with high %rip addresses
such as near the top of the user stack, which is where DragonFly's
signal trampoline traditionally resides. So a user program taking
a signal on one thread while another thread is cpu-bound can cause
this behavior. Changing the location of the signal trampoline
makes it more difficult to reproduce the problem. I have not
been because the able to completely mitigate it. When a cpu-thread
stalls in this manner it appears to stall INSIDE the microcode
for IRETQ. It doesn't make it to the return pc, and the cpu thread
cannot take any IPIs or other hardware interrupts while in this
state."
since the system instability has been observed on FreeBSD with SMT
disabled. Interrupts to appear to play a factor since running a
signal-intensive process on the first CPU core, which handles most
of the interrupts on my machine, is far more likely to trigger the
problem than running such a process on any other core.
Also lower sv_maxuser to prevent a malicious user from using mmap()
to load and execute code in the top page of user memory that was made
available when the shared page was moved down.
Make the same changes to the 64-bit Linux emulator.
PR: 219399
Reported by: nbe@renzel.net
Reviewed by: kib
Reviewed by: dchagin (previous version)
Tested by: nbe@renzel.net (earlier version)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D11780
The removed release stores are not needed since stores are totally
ordered on i386 and amd64.
Reviewed by: alc, kib (previous revision)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D11790
add support for explicitly requesting that pmap_enter() create a 2MB page
mapping. (Essentially, this feature allows the machine-independent layer to
create superpage mappings preemptively, and not wait for automatic promotion
to occur.)
Export pmap_ps_enabled() to the machine-independent layer.
Add a flag to pmap_pv_insert_pde() that specifies whether it should fail or
reclaim a PV entry when one is not available.
Refactor pmap_enter_pde() into two functions, one by the same name, that is
a general-purpose function for creating PDE PG_PS mappings, and another,
pmap_enter_2mpage(), that is used to prefault 2MB read- and/or execute-only
mappings for execve(2), mmap(2), and shmat(2).
Submitted by: Yufeng Zhou <yz70@rice.edu> (an earlier version)
Reviewed by: kib, markj
Tested by: pho
MFC after: 10 days
Differential Revision: https://reviews.freebsd.org/D11556
Pollution from counter.h made __pcpu visible in amd64/pmap.c. Delete
the existing extern decl of __pcpu in amd64/pmap.c and avoid referring
to that symbol, instead accessing the pcpu region via PCPU_SET macros.
Also delete an unused extern decl of __pcpu from mp_x86.c.
Reviewed by: kib
Approved by: markj (mentor)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D11666
this file. Previously, half of the pointers to a vm_page being used as
a page directory page were named "pdpg" and the rest were named "mpde".
Discussed with: kib
MFC after: 1 week
pmap_remove_ptes(). (This new function will also be used by an upcoming
change to pmap_enter() that adds support for psind == 1 mappings.)
Submitted by: Yufeng Zhou <yz70@rice.edu> (an earlier version)
Reviewed by: kib, markj
MFC after: 1 week
The mutex protecting access to the registered realtime clock should not be
overloaded to protect access to the atrtc hardware, which might not even be
the registered rtc. More importantly, the resettodr mutex needs to be
eliminated to remove locking/sleeping restrictions on clock drivers, and
that can't happen if MD code for amd64 depends on it. This change moves the
protection into what's really being protected: access to the atrtc date and
time registers.
This change also adds protection when the clock is accessed from
xentimer_settime(), which bypasses the resettodr locking.
Differential Revision: https://reviews.freebsd.org/D11483
Implement the MMC/SD/SDIO protocol within a CAM framework. CAM's
flexible queueing will make it easier to write non-storage drivers
than the legacy stack. SDIO drivers from both the kernel and as
userland daemons are possible, though much of that functionality will
come later.
Some of the CAM integration isn't complete (there are sleeps in the
device probe state machine, for example), but those minor issues can
be improved in-tree more easily than out of tree and shouldn't gate
progress on other fronts. Appologies to reviews if specific items
have been overlooked.
Submitted by: Ilya Bakulin
Reviewed by: emaste, imp, mav, adrian, ian
Differential Review: https://reviews.freebsd.org/D4761
merge with first commit, various compile hacks.
amdvi_cmp_wait: gcc complained about a malformed string behind an ifdef.
struct amdvi_dte: widen the type of the first reserved bitfield so that
the packed representation would not cross an alignment boundary for that
type. Apparently that causes in-tree gcc (4.2) to insert padding
(despite packed, resulting in a wrong structure definition), and causes
more modern gcc to emit a warning.
ivrs_hdr_iterate_tbl: delete a misleading check about header length
being less than 0 (the type is unsigned) and replace it with a check
that the length doesn't exceed the table size.
Reviewed by: anish, grehan
Approved by: markj (mentor)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D11485
start address is not required to be page aligned. However, the loop
within pmap_invalidate_cache_range() that performs the actual cache
line invalidations requires that the starting address be truncated to
a multiple of the cache line size. This change corrects an error in
that truncation.
Submitted by: Brett Gutstein <bgutstein@rice.edu>
Reviewed by: kib
MFC after: 1 week
--Remove special-case handling of sparc64 bus_dmamap* functions.
Replace with a more generic mechanism that allows MD busdma
implementations to generate inline mapping functions by
defining WANT_INLINE_DMAMAP in <machine/bus_dma.h>. This
is currently useful for sparc64, x86, and arm64, which all
implement non-load dmamap operations as simple wrappers
around map objects which may be bus- or device-specific.
--Remove NULL-checked bus_dmamap macros. Implement the
equivalent NULL checks in the inlined x86 implementation.
For non-x86 platforms, these checks are a minor pessimization
as those platforms do not currently allow NULL maps. NULL
maps were originally allowed on arm64, which appears to have
been the motivation behind adding arm[64]-specific barriers
to bus_dma.h, but that support was removed in r299463.
--Simplify the internal interface used by the bus_dmamap_load*
variants and move it to bus_dma_internal.h
--Fix some drivers that directly include sys/bus_dma.h
despite the recommendations of bus_dma(9)
Reviewed by: kib (previous revision), marius
Differential Revision: https://reviews.freebsd.org/D10729
struct thread.
For all architectures, the syscall trap handlers have to allocate the
structure on the stack. The structure takes 88 bytes on 64bit arches
which is not negligible. Also, it cannot be easily found by other
code, which e.g. caused duplication of some members of the structure
to struct thread already. The change removes td_dbg_sc_code and
td_dbg_sc_nargs which were directly copied from syscall_args.
The structure is put into the copied on fork part of the struct thread
to make the syscall arguments information correct in the child after
fork.
This move will also allow several more uses shortly.
Reviewed by: jhb (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
X-Differential revision: https://reviews.freebsd.org/D11080
from machine/proc.h, consistently on all architectures.
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
X-Differential revision: https://reviews.freebsd.org/D11080
pmap_enter() by implementing a single return path. Otherwise, the
duplication will only increase with the upcoming support for psind == 1.
Reviewed by: kib (some time ago)
prevents one from running eg clang built with debug; the new one is
arbitrary (equal to MAXDSIZ) and... well, should be quite future-proof.
Same fix might be applicable to other 64 bit architectures; I'll ask
their respective maintainers to make sure it won't break anything.
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D10758
A long long time ago the register keyword told the compiler to store
the corresponding variable in a CPU register, but it is not relevant
for any compiler used in the FreeBSD world today.
ANSIfy related prototypes while here.
Reviewed by: cem, jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D10193
This was introduced in r290156. It's present in 11.0, but not any 10.x
release unless someone decided to MFC it.
It affects ordinary pages right above the DMAP limit, which is effectively
system memory rounded up to a 1 GB (3rd level superpage) boundary (or up to
a minimum of 4 GB, on small systems).
Reported by: vangyzen
Reviewed by: kib, alc
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D4030
operation after processor is configured to allow all required
features.
In particular, NX must be enabled in EFER, otherwise load of page
table element with nx bit set causes reserved bit page fault. Since
malloc uses direct mapping for small allocations, in particular for
the suspension pcbs, and DMAP is nx after r316767, this commit tripped
fault on resume path.
Restore complete state of EFER while wakeup code is still executing
with custom page table, before calling resumectx, instead of trying to
guess which features might be needed before resumectx restored EFER on
its own.
Bisected and tested by: trasz
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
in place. To do per-cpu stats, convert all fields that previously were
maintained in the vmmeters that sit in pcpus to counter(9).
- Since some vmmeter stats may be touched at very early stages of boot,
before we have set up UMA and we can do counter_u64_alloc(), provide an
early counter mechanism:
o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter.
o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter,
so that at early stages of boot, before counters are allocated we already
point to a counter that can be safely written to.
o For sparc64 that required a whole dummy pcpu[MAXCPU] array.
Further related changes:
- Don't include vmmeter.h into pcpu.h.
- vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit,
to match kernel representation.
- struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exclusion.
This is based on benno@'s 4-year old patch:
https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html
Reviewed by: kib, gallatin, marius, lidl
Differential Revision: https://reviews.freebsd.org/D10156
Demotions preserve PG_NX, so it is enough to set nx bit for initial
lowest-level paging entries.
Suggested and reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
The MFC will include a compat definition of smp_no_rendevous_barrier()
that calls smp_no_rendezvous_barrier().
Reviewed by: gnn, kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D10313
- renaming l_ifreq::ifru_metric to l_ifreq::ifru_ivalue;
- adding a definition for ifr_ifindex which points to l_ifreq::ifru_ivalue.
A quick search indicates that Linux already got the above changes since 2.1.14.
Reviewed by: kib, marcel, dchagin
MFC after: 1 week
The change introduced a dependency between genassym.c and header files
generated from .m files, but that dependency is not specified in the
make files.
Also, the change could be not as useful as I thought it was.
Reported by: dchagin, Manfred Antar <null@pozo.com>, and many others
I fixed this in 1997, but the fix was over-engineered and fragile and
was broken in 2003 if not before. i386 parameters were copied to 8
other arches verbatim, mostly after they stopped working on i386, and
mostly without the large comment saying how the values were chosen on
i386. powerpc has a non-verbatim copy which just changes the uncritical
parameter and seems to add a sign extension bug to it.
Just treat negative offsets as offsets if they are no more negative than
-db_offset_max (default -64K), and remove all the broken parameters.
-64K is not very negative, but it is enough for frame and stack pointer
offsets since kernel stacks are small.
The over-engineering was mainly to go more negative than -64K for the
negative offset format, without affecting printing for more than a
single address.
Addresses in the top 64K of a (full 32-bit or 64-bit) address space
are now printed less well, but there aren't many interesting ones.
For arches that have many interesting ones very near the top (e.g.,
68k has interrupt vectors there), there would be no good limit for
the negative offset format and -64K is a good as anything.
The change is more intrusive than I would like because the feature
requires that a vector number is written to a special register.
Thus, now the vector number has to be provided to lapic_eoi().
It was readily available in the IO-APIC and MSI cases, but the IPI
handlers required more work.
Also, we now store the VMM IPI number in a global variable, so that it
is available to the justreturn handler for the same reason.
Reviewed by: kib
MFC after: 6 weeks
Differential Revision: https://reviews.freebsd.org/D9880
Long ago, perhaps only on i386, kernel text was mapped read-only and
it was necessary to change the mapping to read-write to set breakpoints
in kernel text. Other writes by ddb to kernel text were also allowed.
This write protection is harder to implement with 4MB pages, and was
lost even for 4K pages when 4MB pages were implemented. So changing
the mapping became useless. It was actually worse than useless since
it followed followed various null and otherwise garbage pointers to
not change random memory instead of the mapping. (On i386s, the
pointers became good in pmap_bootstrap(), and on amd64 the pointers
became bad in pmap_bootstrap() if not before.)
Another bug broke detection of following of null pointers on i386,
except early in boot where not detecting this was a feature. When
I fixed the bug, I accidentally broke the feature and soon got traps
in db_write_bytes(). Setting breakpoints early in ddb was broken.
kib pointed out that a clean way to do the adjustment would be to use
a special [sub]map giving a small window on the bytes to be written.
The trap handler didn't know how to fix up errors for pagefaults
accessing the map itself. Such errors rarely need fixups, since most
traps for the map are for the first access which is a read.
Reviewed by: kib
matches static binaries.
Interpretation of the 'static' there is that the binary must not
specify an interpreter. In particular, shared objects are matched by
the brand if BI_CAN_EXEC_DYN is also set.
This improves precision of the brand matching, which should eliminate
surprises due to brand ordering.
Revert r315701.
Discussed with and tested by: ed (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
CloudABI executables are statically linked and don't have an
interpreter. Setting the interpreter path to NULL used to work
previously, but r314851 introduced code that checks the string
unconditionally. Running CloudABI executables now causes a null pointer
dereference.
Looking at the rest of imgact_elf.c, it seems various other codepaths
already leaned on the fact that the interpreter path is set. Let's just
go ahead and pick an obviously incorrect interpreter path to appease
imgact_elf.c.
MFC after: 1 week
On the original i386, %dr[4-5] were unimplemented but not very clearly
reserved, so debuggers read them to print them. i386 was still doing
this.
On the original athlon64, %dr[4-5] are documented as reserved but are
aliased to %dr[6-7] unless CR4_DE is set, when accessing them traps.
On 2 of my systems, accessing %dr[4-5] trapped sometimes. On my Haswell
system, the apparent randomness was because the boot CPU starts with
CR4_DE set while all other CPUs start with CR4_DE clear. FreeBSD
doesn't support the data breakpoints enabled by CR4_DE and it never
changes this flag, so the flag remains different across CPUs and
the behaviour seemed inconsistent except while booting when the CPU
doesn't change.
The invalid accesses broke:
- read access for printing the registers in ddb "show watches" on CPUs
with CR4_DE set
- read accesses in fill_dbregs() on CPUs with CR4_DE set. This didn't
implement panic(3) since the user case always skipped %dr[4-5].
- write accesses in set_dbregs(). This also didn't affect userland.
When it didn't trap, the aliasing made it fragile.
Don't print the dummy (zero) values of %dr[4-5] in "show watches" for
i386 or amd64. Fix style bugs near this printing.
amd64 also has space in the dbregs struct for the reserved %dr[8-15]
and already didn't print the dummy values for these, and never accessed
any of the 10 reserved debug registers.
Remove cpufuncs for making the invalid accesses. Even amd64 had these.
Otherwise, recent Linux guests will use these instructions, resulting
in #UD exceptions since bhyve doesn't implement MONITOR/MWAIT exits.
This fixes boot-time hangs in recent Linux guests on Ryzen CPUs
(and probably Bulldozer aka AMD FX as well).
Reviewed by: kib
MFC after: 1 week
related struct definitions out into the MI path.
Invert the native ipc structs to the Linux ipc structs convesion logic.
Since 64-bit variant of ipc structs has more precision convert native ipc
structs to the 64-bit Linux ipc structs and then truncate 64-bit values
into the non 64-bit if needed. Unlike Linux, return EOVERFLOW if the
values do not fit.
Fix SYSV IPC for 64-bit Linuxulator which never sets IPC_64 bit.
MFC after: 1 month
the syscalls that are not implemented in Linux kernel itself.
Cleanup DUMMY() macros.
Reviewed by: dchagin, trasz
Approved by: dchagin
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D9804
Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.
Submitted by: Jan Schaumann <jschauma@stevens.edu>
Pull Request: https://github.com/freebsd/freebsd/pull/96
Otherwise kernel traps on NULL dereference if fpu_kern(9) is used from the
thread0 context.
Reported by: cem
Reviewed by: cem, jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
PG_PROMOTED, that indicates whether lingering 4KB page mappings might
need to be flushed on a PDE change that restricts or destroys a 2MB
page mapping. This flag allows the pmap to avoid range invalidations
that are both unnecessary and costly.
Reviewed by: kib, markj
MFC after: 6 weeks
Differential Revision: https://reviews.freebsd.org/D9665
show pte from the pmap of the process of the current DDB thread, instead
of necessarily the PCPU pmap.
Submitted by: Ryan Libby <rlibby@gmail.com>
Reviewed by: kib
MFC after: 1 week
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D9645
This also adds support for LINUX_ARCH_SET_GS.
Reviewed by: dchagin
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D9372
For the loop that dirties vm_pages in case superpage was written to,
check the complete condition before the loop.
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Implement get_pcpu() for amd64/sparc64/mips/powerpc, and use it to
replace pcpu_find(curcpu) in MI code.
Reviewed by: andreast, kan, lidl
Tested by: lidl(mips, sparc64), andreast(powerpc)
Differential Revision: https://reviews.freebsd.org/D9587
MTRR handlers are set in {amd64/i686}_mem_drvinit, which is called at
SI_SUB_DRIVERS, and that's too late when EARLY_AP_STARTUP is set because APs
have already started at this point. {amd64/i686}_mrinit is also called too late
for the BSP, since that happens when the memory device is attached, also after
APs have already started.
Move the position to SI_SUB_CPU, and also initialize the state for the BSP, so
that the APs can correctly get to the same state as the BSP.
Sponsored by: Citrix Systems R&D
MFC after: 1 week
Reviewed by: jhb, kib
Differential Revision: https://reviews.freebsd.org/D9630
but it allows to use 64 bit linux strace(1) on 64 bit linux binaries.
Reviewed by: dchagin (earlier version)
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D9406
Refresh upstream driver before impending conversion to iflib.
Major new features:
- Support for Fortville-based 25G adapters
- Support for I2C reads/writes
(To prevent getting or sending corrupt data, you should set
dev.ixl.0.debug.disable_fw_link_management=1 when using I2C
[this will disable link!], then set it to 0 when done. The driver implements
the SIOCGI2C ioctl, so ifconfig -v works for reading I2C data,
but there are read_i2c and write_i2c sysctls under the .debug sysctl tree
[the latter being useful for upper page support in QSFP+]).
- Addition of an iWARP client interface (so the future iWARP driver for
X722 devices can communicate with the base driver).
- Compiling this option in is enabled by default, with "options IXL_IW" in
GENERIC.
Differential Revision: https://reviews.freebsd.org/D9227
Reviewed by: sbruno
MFC after: 2 weeks
Sponsored by: Intel Corporation
and wrong numbering for a few unimplemented syscalls.
For 32-bit Linuxulator, socketcall() syscall was historically
the entry point for the sockets API. Starting in Linux 4.3, direct
syscalls are provided for the sockets API. Enable it.
The initial version of patch was provided by trasz@ and extended by me.
Submitted by: trasz
MFC after: 2 week
Differential Revision: https://reviews.freebsd.org/D9381
The switch to get_pcpu() in MI code seems to cause hangs on MIPS.
Back out until we can get a better idea of what's happening there.
Reported by: kan, lidl
protection change.
On superpage promotion, x86 pmaps do not invalidate existing 4K
entries for the superpage range, because they are compatible with the
promoted 2/4M entry. But the invalidation on superpage removal or
protection change only did single INVLPG with the base address of the
superpage. This reliably flushed superpage TLB entry, and 4K entry
for the first page of the superpage, potentially leaving other 4K TLB
entries lingering. Do the invalidation of the whole superpage range
to correct the problem.
Note that the precise invalidation is done by x86 code for kernel_pmap
only, for user pmaps whole (per-AS) TLB is flushed. This made the bug
well hidden, because promotions of the kernel mappings require
specific load.
Reported and tested by: Jonathan Looney <jtl@netflix.com> (previous version)
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
SDM states that CLFLUSHOPT instructions can be ordered with other
writes by SFENCE, heavier MFENCE is not required.
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
The error is:
vmm_dev.c: In function 'alloc_memseg':
vmm_dev.c:261:11: error: null argument where non-null required (argument 1) [-Werror=nonnull]
Apparently, the gcc is unable to figure out that if a ternary operator
produced a non-NULL value once, then the operator with exactly the same
operands would produce the same value again.
MFC after: 1 week
We would previously invalidate such entries individually, resulting in more
IPIs than necessary.
Reviewed by: alc, kib
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D9094
- em(4) igb(4) and lem(4)
- deprecate the igb device from kernel configurations
- create a symbolic link in /boot/kernel from if_em.ko to if_igb.ko
Devices tested:
- 82574L
- I218-LM
- 82546GB
- 82579LM
- I350
- I217
Please report problems to freebsd-net@freebsd.org
Partial review from jhb and suggestions on how to *not* brick folks who
originally would have lost their igbX device.
Submitted by: mmacy@nextbsd.org
MFC after: 2 weeks
Relnotes: yes
Sponsored by: Limelight Networks and Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8299
hammer_time(). This makes assembler exception handlers not fault
itself when setting PCB flags, and allow normal kernel trap handler to
get control. The pointer is reset after FPU parameters are obtained.
Set thread0.td_critnest to 1 for duration of hammer_time() as well.
In particular, page faults at that early stage panic immediately
instead of trying to call not yet operational VM to resolve it.
As result, faults during second half of the hammer_time() execution
have a chance to be reported instead of silent machine reboot or hang.
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Changes include modifications in kernel crash dump routines, dumpon(8) and
savecore(8). A new tool called decryptcore(8) was added.
A new DIOCSKERNELDUMP I/O control was added to send a kernel crash dump
configuration in the diocskerneldump_arg structure to the kernel.
The old DIOCSKERNELDUMP I/O control was renamed to DIOCSKERNELDUMP_FREEBSD11 for
backward ABI compatibility.
dumpon(8) generates an one-time random symmetric key and encrypts it using
an RSA public key in capability mode. Currently only AES-256-CBC is supported
but EKCD was designed to implement support for other algorithms in the future.
The public key is chosen using the -k flag. The dumpon rc(8) script can do this
automatically during startup using the dumppubkey rc.conf(5) variable. Once the
keys are calculated dumpon sends them to the kernel via DIOCSKERNELDUMP I/O
control.
When the kernel receives the DIOCSKERNELDUMP I/O control it generates a random
IV and sets up the key schedule for the specified algorithm. Each time the
kernel tries to write a crash dump to the dump device, the IV is replaced by
a SHA-256 hash of the previous value. This is intended to make a possible
differential cryptanalysis harder since it is possible to write multiple crash
dumps without reboot by repeating the following commands:
# sysctl debug.kdb.enter=1
db> call doadump(0)
db> continue
# savecore
A kernel dump key consists of an algorithm identifier, an IV and an encrypted
symmetric key. The kernel dump key size is included in a kernel dump header.
The size is an unsigned 32-bit integer and it is aligned to a block size.
The header structure has 512 bytes to match the block size so it was required to
make a panic string 4 bytes shorter to add a new field to the header structure.
If the kernel dump key size in the header is nonzero it is assumed that the
kernel dump key is placed after the first header on the dump device and the core
dump is encrypted.
Separate functions were implemented to write the kernel dump header and the
kernel dump key as they need to be unencrypted. The dump_write function encrypts
data if the kernel was compiled with the EKCD option. Encrypted kernel textdumps
are not supported due to the way they are constructed which makes it impossible
to use the CBC mode for encryption. It should be also noted that textdumps don't
contain sensitive data by design as a user decides what information should be
dumped.
savecore(8) writes the kernel dump key to a key.# file if its size in the header
is nonzero. # is the number of the current core dump.
decryptcore(8) decrypts the core dump using a private RSA key and the kernel
dump key. This is performed by a child process in capability mode.
If the decryption was not successful the parent process removes a partially
decrypted core dump.
Description on how to encrypt crash dumps was added to the decryptcore(8),
dumpon(8), rc.conf(5) and savecore(8) manual pages.
EKCD was tested on amd64 using bhyve and i386, mipsel and sparc64 using QEMU.
The feature still has to be tested on arm and arm64 as it wasn't possible to run
FreeBSD due to the problems with QEMU emulation and lack of hardware.
Designed by: def, pjd
Reviewed by: cem, oshogbo, pjd
Partial review: delphij, emaste, jhb, kib
Approved by: pjd (mentor)
Differential Revision: https://reviews.freebsd.org/D4712
module loading is successful, but attempts to use it will not be
successful. This is similar to what we do (did?) with ACPI on non-ACPI
systems. We succeed if we can't find the necessary information to hook
into EFI, but still fail if we're unable to allocate resources if we
do find EFI.
Not Objected to by: kib@
MFC Afer: 3 days
If we load a binary that is designed to be a library, it produces
relocatable code via assembler directives in the assembly itself
(rather than compiler options). This emits R_X86_64_PLT32 relocations,
which are not handled by the kernel linker.
Submitted by: gallatin
Reviewed by: kib
contain a vm_page_t at the specified index. However, with this
change, vm_radix_remove() no longer panics. Instead, it returns NULL
if there is no vm_page_t at the specified index. Otherwise, it
returns the vm_page_t. The motivation for this change is that it
simplifies the use of radix tries in the amd64, arm64, and i386 pmap
implementations. Instead of performing a lookup before every remove,
the pmap can simply perform the remove.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D8708
Rather than reporting a page fault due to a bad PTE as a protection
violation with the "rsv" flag, treat these faults as a separate type of
fault altogether.
MFC after: 1 month
Upstream the BUF_TRACKING and FULL_BUF_TRACKING buffer debugging code.
This can be handy in tracking down what code touched hung bios and bufs
last. The full history is especially useful, but adds enough bloat that
it shouldn't be enabled in release builds.
Function names (or arbitrary string constants) are tracked in a
fixed-size ring in bufs. Bios gain a pointer to the upper buf for
tracking. SCSI CCBs gain a pointer to the upper bio for tracking.
Reviewed by: markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8366
Previously these were only declared under #ifdef SMP in <machine/smp.h>.
However, these variables are defind in pmap.c unconditionally, and efirt.c
references them unconditionally. This fixes non-SMP kernel builds.
Discussed with: kib
MFC after: 1 week
Just using vm_paddr_t value with all bits set.
That should work as long as the type is unsigned.
While there, fix a couple of whitespace issues nearby.
MFC after: 1 week
X-MFC with: r307903
Reject attempts to read from or memory map offsets in /dev/mem that are
beyond the maximum-supported physical address of the current CPU.
Reviewed by: kib
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D7408
To achieve that the whole svm_softc is allocated with contigmalloc now.
It would be more effient to de-embed those arrays and allocate only them
with contigmalloc.
Previously, if malloc(9) used non-contiguous pages for the arrays, then
random bits in physical pages next to the first page would be used to
determine permissions for I/O port and MSR accesses. That could result
in a guest dangerously modifying the host hardware configuration.
One example is that sometimes NMI watchdog driver in a Linux guest
would be able to configure a performance counter on a host system.
The counter would generate an interrupt and if hwpmc(4) driver is loaded
on the host, then the interrupt would be delivered as an NMI.
Discussed with: jhb
Reviewed by: grehan
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D8321
- Make !KDB config buildable.
- Simplify interface to nmi_handle_intr() by evaluating panic_on_nmi
in one place, namely nmi_call_kdb(). This allows to remove do_panic
argument from the functions, and to remove i386/amd64 duplication of
the variable and sysctl definitions. Note that now NMI causes
panic(9) instead of trap_fatal() reporting and then panic(9),
consistently for NMIs delivered while CPU operated in ring 0 and 3.
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
On several Intel chipsets, diagnostic NMIs sent from BMC or NMIs
reporting hardware errors are broadcasted to all CPUs.
When kernel is configured to enter kdb on NMI, the outcome is
problematic, because each CPU tries to enter kdb. All CPUs are
executing NMI handlers, which set the latches disabling the nested NMI
delivery; this means that stop_cpus_hard(), used by kdb_enter() to
stop other cpus by broadcasting IPI_STOP_HARD NMI, cannot work. One
indication of this is the harmless but annoying diagnostic "timeout
stopping cpus".
Much more harming behaviour is that because all CPUs try to enter kdb,
and if ddb is used as debugger, all CPUs issue prompt on console and
race for the input, not to mention the simultaneous use of the ddb
shared state.
Try to fix this by introducing a pseudo-lock for simultaneous attempts
to handle NMIs. If one core happens to enter NMI trap handler, other
cores see it and simulate reception of the IPI_STOP_HARD. More,
generic_stop_cpus() avoids sending IPI_STOP_HARD and avoids waiting
for the acknowledgement, relying on the nmi handler on other cores
suspending and then restarting the CPU.
Since it is impossible to detect at runtime whether some stray NMI is
broadcast or unicast, add a knob for administrator (really developer)
to configure debugging NMI handling mode.
The updated patch was debugged with the help from Andrey Gapon (avg)
and discussed with him.
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D8249
These two ALU instructions first appeared on Linux. Then, libpcap adopted
and made them available since 1.6.2. Now more platforms including NetBSD
have them in kernel. So do we.
--이 줄 이하는 자동으로 제거됩니다--
Split efirt.ko initialization into early stage where runtime services
KPI environment is created, to be used e.g. for RTC, and the later
devfs node creation stage, per module.
Switch the efi device to use make_dev_s(9) instead of make_dev(9). At
least, this gracefully handles the duplicated device name issue.
Remove ARGSUSED comment from efidev_ioctl(), all unused arguments are
annotated with __unused attribute.
Reported by: ambrisko, O. Hartmann <ohartman@zedat.fu-berlin.de>
Reviewed by: imp
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Using the device pager with /dev/kmem is not stable since KVA mappings
are transient, but the device pager caches the PA associated with a
given offset forever. Interestingly, mips' implementation of
memmap() already refused requests for /dev/kmem.
Note that kvm_read/kvm_write do not use mmap, but use read and write on
/dev/kmem, so this should not affect libkvm users.
Reviewed by: kib
MFC after: 2 months
to add actions that run when a TCP frame is sent or received on a TCP
session in the ESTABLISHED state. In the base tree, this functionality is
only used for the h_ertt module, which is used by the cc_cdg, cc_chd, cc_hd,
and cc_vegas congestion control modules.
Presently, we incur overhead to check for hooks each time a TCP frame is
sent or received on an ESTABLISHED TCP session.
This change adds a new compile-time option (TCP_HHOOK) to determine whether
to include the hhook(9) framework for TCP. To retain backwards
compatibility, I added the TCP_HHOOK option to every configuration file that
already defined "options INET". (Therefore, this patch introduces no
functional change. In order to see a functional difference, you need to
compile a custom kernel without the TCP_HHOOK option.) This change will
allow users to easily exclude this functionality from their kernel, should
they wish to do so.
Note that any users who use a custom kernel configuration and use one of the
congestion control modules listed above will need to add the TCP_HHOOK
option to their kernel configuration.
Reviewed by: rrs, lstewart, hiren (previous version), sjg (makefiles only)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D8185
userland. It supports userland interfaces to UEFI Runtime Services. This is
indended to the the MI portion of EFI RuntimeServices support.
Differential Revision: https://reviews.freebsd.org/D8128
Reviewed by: kib@, wblock@, Ganael Laplanche
Reduce the cost of TLB invalidation on x86 by using per-CPU completion flags
Reduce contention during TLB invalidation operations by using a per-CPU
completion flag, rather than a single atomically-updated variable.
On a Westmere system (2 sockets x 4 cores x 1 threads), dtrace measurements
show that smp_tlb_shootdown is about 50% faster with this patch; observations
with VTune show that the percentage of time spent in invlrng_single_page on an
interrupt (actually doing invalidation, rather than synchronization) increases
from 31% with the old mechanism to 71% with the new one. (Running a basic file
server workload.)
Submitted by: Anton Rang <rang at acm.org>
Reviewed by: cem (earlier version)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8041
Reduce contention during TLB invalidation operations by using a per-CPU
completion flag, rather than a single atomically-updated variable.
On a Westmere system (2 sockets x 4 cores x 1 threads), dtrace measurements
show that smp_tlb_shootdown is about 50% faster with this patch; observations
with VTune show that the percentage of time spent in invlrng_single_page on an
interrupt (actually doing invalidation, rather than synchronization) increases
from 31% with the old mechanism to 71% with the new one. (Running a basic file
server workload.)
Submitted by: Anton Rang <rang at acm.org>
Reviewed by: cem (earlier version), kib
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8041
like other PCI network drivers. The sys/ofed directory is now mainly
reserved for generic infiniband code, with exception of the mthca driver.
- Add new manual page, mlx4en(4), describing how to configure and load
mlx4en.
- All relevant driver C-files are now prefixed mlx4, mlx4_en and
mlx4_ib respectivly to avoid object filename collisions when compiling
the kernel. This also fixes an issue with proper dependency file
generation for the C-files in question.
- Device mlxen is now device mlx4en and depends on device mlx4, see
mlx4en(4). Only the network device name remains unchanged.
- The mlx4 and mlx4en modules are now built by default on i386 and
amd64 targets. Only building the mlx4ib module depends on
WITH_OFED=YES .
Sponsored by: Mellanox Technologies
i.e. SandyBridge and IvyBridge, correct a race between pmap_activate()
and invltlb_pcid_handler().
Reported by and tested by: Slawa Olhovchenkov <slw@zxy.spb.ru>
MFC after: 1 week
early printfs and debugging of vm86 initialization and some other early
initialization in some cases.) Add an option debug.late_console (with
default 1=off) to move console and kdb initialization back where it was.
Do the same for amd64 although there is no vm86 there.
On my test system, debug.late_console=0 works for the syscons, sio and
uart console drivers on amd64 and i386, and for vt on i386 but not on
amd64.
The early printfs fixed by debug.late_console=0 are:
- on i386, the message about lost memory above 4G
- with -v in otherwise normal use, about 20 printfs for SMAP
- other debugging messages for memory sizing. Mostly under -v and
not printed in normal use.
Document in a comment how much earlier the initialization and early
printf()s can be. That is very early for the console. Not much more
than curthread is needed. kdb use obviously needs to be not so early,
since it needs IDT initialization and that is done relatively late
for convenience and historical reasons.
Runtime services require special execution environment for the call.
Besides that, OS must inform firmware about runtime virtual memory map
which will be active during the calls, with the SetVirtualAddressMap()
runtime call, done while the 1:1 mapping is still used. There are two
complication: the SetVirtualAddressMap() effectively must be done from
loader, which needs to know kernel address map in advance. More,
despite not explicitely mentioned in the specification, both 1:1 and
the map passed to SetVirtualAddressMap() must be active during the
SetVirtualAddressMap() call. Second, there are buggy BIOSes which
require both mappings active during runtime calls as well, most likely
because they fail to identify all relocations to perform.
On amd64, we can get rid of both problems by providing 1:1 mapping for
the duration of runtime calls, by temprorary remapping user addresses.
As result, we avoid the need for loader to know about future kernel
address map, and avoid bugs in BIOSes. Typically BIOS only maps
something in low 4G. If not runtime bugs, we would take advantage of
the DMAP, as previous versions of this patch did.
Similar but more complicated trick can be used even for i386 and 32bit
runtime, if and when the EFI boot on i386 is supported. We would need
a trampoline page, since potentially whole 4G of VA would be switched
on calls, instead of only userspace portion on amd64.
Context switches are disabled for the duration of the call, FPU access
is granted, and interrupts are not disabled. The later is possible
because kernel is mapped during calls.
To test, the sysctl mib debug.efi_time is provided, setting it to 1
makes one call to EFI get_time() runtime service, on success the efitm
structure is printed to the control terminal. Load efirt.ko, or add
EFIRT option to the kernel config, to enable code.
Discussed with: emaste, imp
Tested by: emaste (mac, qemu)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
physical address of the EFI System Table. Add _KERNEL guard around
its declaration in sys/efi.h.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Note that lgdt() name is already used for function which, besides
loading GDT, also reloads segment descriptors cache, thus new function
is named bare_lgdt().
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
amd64 pmap.
The new pmap_pinit_pml4() function initializes the level 4 page table
with entries for the kernel mappings. Both functions are needed for
upcoming EFI Runtime Services support.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
i386-only section, and fix a comment about the amd64 kernel trapframe
not having stackregs.
tf_rsp doesn't need decoding on amd64, but had an old clone of i386
code to do this in 1 place, and since the amd64 kernel trapframe does
have stackregs, the result was an off-by-16 error for %rsp in an error
message.
While here, avoid using the old variable 'code' and remove it
in trap(). ('code' was meant for holding things like %dr6,
but is too small to hold %dr6 on amd64 and was reduced to an
obfuscation of tf_err, with early truncation on amd64.)
Submitted by: Michael Butler (imb@...)
This is not very easy to do, since ddb didn't know when traps are
for single-stepping. It more or less assumed that traps are either
breakpoints or single-step, but even for x86 this became inadequate
with the release of the i386 in ~1986, and FreeBSD passes it other
trap types for NMIs and panics.
On x86, teach ddb when a trap is for single stepping using the %dr6
register. Unknown traps are now treated almost the same as breakpoints
instead of as the same as single-steps. Previously, the classification
of breakpoints was almost correct and everything else was unknown so
had to be treated as a single-step. Now the classification of single-
steps is precise, the classification of breakpoints is almost correct
(as before) and everything else is unknown and treated like a
breakpoint.
This fixes:
- breakpoints not set by ddb, including the main one in kdb_enter(),
were treated as single-steps and not stopped on when stepping
(except for the usual, simple case of a step with residual count 1).
As special cases, kdb_enter() didn't stop for fatal traps or panics
- similarly for "hardware breakpoints".
Use a new MD macro IS_SSTEP_TRAP(type, code) to code to classify
single-steps. This is excessively complicated for bug-for-bug and
backwards compatibilty. Design errors apparently started in Mach
in ~1990 or perhaps in the FreeBSD interface in ~1993. Common trap
types like single steps should have a unique MI code (like the TRAP*
codes for user SIGTRAP) so that debuggers don't need macros like
IS_SSTEP_TRAP() to decode them. But 'type' is actually an ambiguous
MD trap number, and code was always 0 (now it is (int)%dr6 on x86).
So it was impossible to determine the trap type from the args.
Global variables had to be used.
There is already a classification macro db_pc_is_single_step(), but
this just gets in the way. It is only used to recover from bugs in
IS_BREAKPOINT_TRAP(). On some arches, IS_BREAKPOINT_TRAP() just
duplicates the ambiguity in 'type' and misclassifies single-steps as
breakpoints. It defaults to 'false', which is the opposite of what is
needed for bug-for-bug compatibility.
When this is cleaned up, MI classification bits should be passed in
'code'. This could be done now for positive-logic bits, since 'code'
was always 0, but some negative logic is needed for compatibility so
a simple MI classificition is not usable yet.
After reading %dr6, clear the single-step bit in it so that the type
of the next debugger trap can be decoded. This is a little
ddb-specific. ddb doesn't understand the need to clear this bit and
doing it before calling kdb is easiest. gdb would need to reverse
this to support hardware breakpoints, but it just doesn't support
them now since gdbstub doesn't support %dr*.
Fix a bug involving %dr6: when emulating a single-step trap for vm86,
set the bit for it in %dr6. Userland debuggers need this. ddb now
needs this for vm86 bios calls. The bit gets copied to 'code' then
cleared again.
Fix related style bugs:
- when clearing bits for hardware breakpoints in %dr6, spell the mask
as ~0xf on both amd64 and i386 to get the correct number of bits
using sign extension and not need a comment about using the wrong
mask on amd64 (amd64 traps for invalid results but clearing the
reserved top bits didn't trap since they are 0).
- rewrite my old wrong comments about using %dr6 for ddb watchpoints.
The 'cpu' and 'cpu_class' variables were always set to the same value
on amd64 and are legacy holdovers from i386. Remove them entirely on
amd64.
Reviewed by: imp, kib (older version)
Differential Revision: https://reviews.freebsd.org/D7888
SEL_UPL and sometimes PSL_VM. This is just a style change on amd64,
but on i386 it fixes 1 unimportant place where the PSL_VM check was
missing and starts fixing 1 important place where the PSL_VM check
had a logic error.
Fix logic errors in treating vm86 bioscall mode as kernel mode. The
main place checked all the necessary flags, but put the necessary
parentheses for the PSL_VM and PCB_VM86CALL checks in the wrong
place. The broken case is only reached if a vm86 bioscall uses a
%cs which is nonzero mod 4, but that is unusual -- most bios calls
start with %cs = 0xc000 or 0xf000 and rarely change it. Another
place was missing the check for PCB_VM86CALL, but was only reachable
if there are bugs virtualizing PSL_I.
Add a macro TF_HAS_STACKREGS() and use this instead of converting
open-coded checks of SEL_UPL, etc. to TRAPF_USERMODE() when we only
care about whether the frame has stack registers. This fixes 3
places in my recent fix for register variables in vm86 mode where I
messed up the PSL_VM check and cleans up other places.
The flag specifies that the block which uses FPU must be executed in
critical section, i.e. take no context switches, and does not need an
FPU save area during the execution.
It is intended to be applied around fast and short code pathes where
save area allocation is impossible or undesirable, due to context or
due to the relative cost of calculation vs. allocation.
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Move PMAP_TS_REFERENCED_MAX out of the various pmap implementations and
into vm/pmap.h, and describe what its purpose is. Eliminate the archaic
"XXX" comment about its value. I don't believe that its exact value, e.g.,
5 versus 6, matters.
Update the arm64 and riscv pmap implementations of pmap_ts_referenced()
to opportunistically update the page's dirty field.
On amd64, use the PDE value already cached in a local variable rather than
dereferencing a pointer again and again.
Reviewed by: kib, markj
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D7836
Add routines to trigger a function level reset (FLR) of a PCI-express
device via the PCI-express device control register. This also includes
support routines to wait for pending transactions to complete as well
as calculating the maximum completion timeout permitted by a device.
Change the ppt(4) driver to reset pass through devices before attaching
to a VM during startup and before detaching from a VM during shutdown.
Reviewed by: imp, wblock (earlier version)
MFC after: 1 month
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D7751
When the I/O MMU is active in bhyve, all PCI devices need valid entries
in the DMAR context tables. The I/O MMU code does a single enumeration
of the available PCI devices during initialization to add all existing
devices to a domain representing the host. The ppt(4) driver then moves
pass through devices in and out of domains for virtual machines as needed.
However, when new PCI devices were added at runtime either via SR-IOV or
HotPlug, the I/O MMU tables were not updated.
This change adds a new set of EVENTHANDLERS that are invoked when PCI
devices are added and deleted. The I/O MMU driver in bhyve installs
handlers for these events which it uses to add and remove devices to
the "host" domain.
Reviewed by: imp
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D7667
This allows a pass through device to be reset to a normal device driver
on the host and reused on the host. ppt devices are now always active in
some I/O MMU domain when the I/O MMU is active, either the host domain
or the domain of a VM they are attached to.
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D7666
Idle page zeroing has been disabled by default on all architectures since
r170816 and has some bugs that make it seemingly unusable. Specifically,
the idle-priority pagezero thread exacerbates contention for the free page
lock, and yields the CPU without releasing it in non-preemptive kernels. The
pagezero thread also does not behave correctly when superpage reservations
are enabled: its target is a function of v_free_count, which includes
reserved-but-free pages, but it is only able to zero pages belonging to the
physical memory allocator.
Reviewed by: alc, imp, kib
Differential Revision: https://reviews.freebsd.org/D7714
dependent pmap_ts_referenced() so that it updates the page's dirty field
if a modified bit is found while counting reference bits. This
opportunistic update can be performed at low cost and can eliminate the
need for some future calls to pmap_is_modified() by the machine-
independent layer.
Reviewed by: kib, markj
MFC after: 3 weeks
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D7722
for zeroing pages in idle where nontemporal writes are clearly best.
This is almost a no-op since zeroing in idle works does nothing good
and is off by default. Fix END() statement forgotten in previous
commit.
Align the loop in sse2_pagezero(). Since it writes to main memory,
the loop doesn't have to be very carefully written to keep up.
Unrolling it was considered useless or harmful and was not done on
i386, but that was too careless.
Timing for i386: the loop was not unrolled at all, and moved only 4
bytes/iteration. So on a 2GHz CPU, it needed to run at 2 cycles/
iteration to keep up with a memory speed of just 4GB/sec. But when
it crossed a 16-byte boundary, on old CPUs it ran at 3 cycles/
iteration so it gave a maximum speed of 2.67GB/sec and couldn't even
keep up with PC3200 memory. Fix the alignment so that it keep up with
4GB/sec memory, and unroll once to get nearer to 8GB/sec. Further
unrolling might be useless or harmful since it would prevent the loop
fitting in 16-bytes. My test system with an old CPU and old DDR1 only
needed 5+ GB/sec. My test system with a new CPU and DDR3 doesn't need
any changes to keep up ~16GB/sec.
Timing for amd64: with 8-byte accesses and newer faster CPUs it is
easy to reach 16GB/sec but not so easy to go much faster. The
alignment doesn't matter much if the CPU is not very old. The loop
was already unrolled 4 times, but needs 32 bytes and uses a fancy
method that doesn't work for 2-way unrolling in 16 bytes. Just
align it to 32-bytes.
same name as for i386). It is not reconnected yet.
Which method is better is too machine-dependent and system-dependent
to replace the old method unconditionally.
Rather than enabling the I/O MMU when the vmm module is loaded,
defer initialization until the first attempt to pass a PCI device
through to a guest. If the I/O MMU fails to initialize or is not
present, than fail the attempt to pass a PCI device through to a
guest.
The hw.vmm.force_iommu tunable has been removed since the I/O MMU is
no longer enabled during boot. However, the I/O MMU support can be
disabled by setting the hw.vmm.iommu.enable tunable to 0 to prevent
use of the I/O MMU on any systems where it is buggy.
Reviewed by: grehan
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D7448
A nice thing about requiring a vDSO is that it makes it incredibly easy
to provide full support for running 32-bit processes on 64-bit systems.
Instead of letting the kernel be responsible for composing/decomposing
64-bit arguments across multiple registers/stack slots, all of this can
now be done in the vDSO. This means that there is no need to provide
duplicate copies of certain system calls, like the sys_lseek() and
freebsd32_lseek() we have for COMPAT_FREEBSD32.
This change imports a new vDSO from the CloudABI repository that has
automatically generated code in it that copies system call arguments
into a buffer, padding them to eight bytes and zero-extending any
pointers/size_t arguments. After returning from the kernel, it does the
inverse: extracting return values, in the process truncating
pointers/size_t values to 32 bits.
Obtained from: https://github.com/NuxiNL/cloudabi
In all of these source files, the userspace pointer size corresponds
with the kernelspace pointer size, meaning that casting directly works.
As I'm planning on making 32-bit execution on 64-bit systems work as
well, use TO_PTR() here as well, so that the changes between source
files remain minimal.
Move msix_disable_migration under #ifdef SMP since it doesn't make sense
for !SMP kernels.
PR: 212014
Reported by: Glyn Grinstead <glyn@grinstead.org>
MFC after: 3 days
The si(4) driver supported multiport serial adapters for ISA, EISA, and
PCI buses. This driver does not use bus_space, instead it depends on
direct use of the pointer returned by rman_get_virtual(). It is also
still locked by Giant and calls for patch testing to convert it to use
bus_space were unanswered.
Relnotes: yes
CloudABI executables already provide support for passing in vDSOs. This
functionality is used by the emulator for OS X to inject system call
handlers. On FreeBSD, we could use it to optimize calls to
gettimeofday(), etc.
Though I don't have any plans to optimize any system calls right now,
let's go ahead and already pass in a vDSO. This will allow us to
simplify the executables, as the traditional "syscall" shims can be
removed entirely. It also means that we gain more flexibility with
regards to adding and removing system calls.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D7438
exception is caught in kernel mode. There are third-party modules
which trigger the issue, and since the problem causes usermode state
corruption at least, panic in production kernels as well.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
difference between files.
For pc98, put x86/mp_x86.c into the same place as used by i386 file list.
Fix typo in comment.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
The current implementation uses non-temporal writes. This turns out to
be detrimental to performance if the page is used shortly after, which
is the typical case with page faults.
Switch to rep stos.
Reviewed by: kib
MFC after: 1 week
Any sensible workflow will include a revision control system from which
to restore the old files if required. In normal usage, developers just
have to clean up the mess.
Reviewed by: jhb
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D7353
Set vm_guest to a new enum value (VM_GUEST_KVM) when kvm is detected and use
vm_guest in conditionals testing for KVM.
Also, fix a conditional checking if we're running in a VM which caught only
the generic VM case, but not more specific VMs (KVM, VMWare, etc.). (Spotted
by: vangyzen).
Differential revision: https://reviews.freebsd.org/D7172
Sponsored by: Dell Inc.
Approved by: kib (mentor), vangyzen (mentor)
Reviewed by: alc
MFC after: 4 weeks
If the hypervisor version is smaller than 4.6.0. Xen commits 74fd00 and
70a3cb are required on the hypervisor side for this to be fixed, and those
are only included in 4.6.0, so stay on the safe side and disable MSI-X
interrupt migration on anything older than 4.6.0.
It should not cause major performance degradation unless a lot of MSI-X
interrupts are allocated.
Sponsored by: Citrix Systems R&D
MFC after: 3 days
Reviewed by: jhb
Differential revision: https://reviews.freebsd.org/D7148
FreeBSD support NX bit on X86_64 processors out of the box, for i386 emulation
use READ_IMPLIES_EXEC flag, introduced in r302515.
While here move common part of mmap() and mprotect() code to the files in compat/linux
to reduce code dupcliation between Linuxulator's.
Reported by: Johannes Jost Meixner, Shawn Webb
MFC after: 1 week
XMFC with: r302515, r302516
In Linux if this flag is set, PROT_READ implies PROT_EXEC for mmap().
Linux/i386 set this flag automatically if the binary requires executable stack.
READ_IMPLIES_EXEC flag will be used in the next Linux mmap() commit.
It turns out that this value is not used within the system call code
under normal conditions, except when using tracing tools like ktrace.
If we forget to set this value, it is set to random garbage. This may
cause ktrace to hang indefinitely, making it impossible to kill.
Reported by: Michael Plass
PR: 210800
MFC before: 11.0-RELEASE
mp_maxid or CPU_FOREACH() as appropriate. This fixes a number of places in
the kernel that assumed CPU IDs are dense in [0, mp_ncpus) and would try,
for example, to run tasks on CPUs that did not exist or to allocate too
few buffers on systems with sparse CPU IDs in which there are holes in the
range and mp_maxid > mp_ncpus. Such circumstances generally occur on
systems with SMT, but on which SMT is disabled. This patch restores system
operation at least on POWER8 systems configured in this way.
There are a number of other places in the kernel with potential problems
in these situations, but where sparse CPU IDs are not currently known
to occur, mostly in the ARM machine-dependent code. These will be fixed
in a follow-up commit after the stable/11 branch.
PR: kern/210106
Reviewed by: jhb
Approved by: re (glebius)
threads, to make it less confusing and using modern kernel terms.
Rename the functions to reflect current use of the functions, instead
of the historic KSE conventions:
cpu_set_fork_handler -> cpu_fork_kthread_handler (for kthreads)
cpu_set_upcall -> cpu_copy_thread (for forks)
cpu_set_upcall_kse -> cpu_set_upcall (for new threads creation)
Reviewed by: jhb (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (hrs)
Differential revision: https://reviews.freebsd.org/D6731
does not cover the dynamically registered ficititious ranges, and
fictitious pages mappings are not promoted. Offer a dummy struct
md_page to fetch constant superpage pv list generation to satisfy
logic. Also, by initializing the pv_dummy pv_list to empty, we can
remove several explicit PG_FICTITIOUS tests.
Reported and tested by: Michael Butler <imb@protected-networks.net>
(previous version)
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D6728
Approved by: re (hrs)
Do not try to change attributes for DMAP when working on a mapping
which is not covered by the DMAP. This was reported on real system
where a BAR of a device (NTB) was mapped outside the PCI window.
Reported and tested by: mav
Reviewed by: jhb, mav
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D6668
The only current purpose of the pvh lock was explained there
On Wed, Jan 09, 2013 at 11:46:13PM -0600, Alan Cox wrote:
> Let me lay out one example for you in detail. Suppose that we have
> three processors and two of these processors are actively using the same
> pmap. Now, one of the two processors sharing the pmap performs a
> pmap_remove(). Suppose that one of the removed mappings is to a
> physical page P. Moreover, suppose that the other processor sharing
> that pmap has this mapping cached with write access in its TLB. Here's
> where the trouble might begin. As you might expect, the processor
> performing the pmap_remove() will acquire the fine-grained lock on the
> PV list for page P before destroying the mapping to page P. Moreover,
> this processor will ensure that the vm_page's dirty field is updated
> before releasing that PV list lock. However, the TLB shootdown for this
> mapping may not be initiated until after the PV list lock is released.
> The processor performing the pmap_remove() is not problematic, because
> the code being executed by that processor won't presume that the mapping
> is destroyed until the TLB shootdown has completed and pmap_remove() has
> returned. However, the other processor sharing the pmap could be
> problematic. Specifically, suppose that the third processor is
> executing the page daemon and concurrently trying to reclaim page P.
> This processor performs a pmap_remove_all() on page P in preparation for
> reclaiming the page. At this instant, the PV list for page P may
> already be empty but our second processor still has a stale TLB entry
> mapping page P. So, changes might still occur to the page after the
> page daemon believes that all mappings have been destroyed. (If the PV
> entry had still existed, then the pmap lock would have ensured that the
> TLB shootdown completed before the pmap_remove_all() finished.) Note,
> however, the page daemon will know that the page is dirty. It can't
> possibly mistake a dirty page for a clean one. However, without the
> current pvh global locking, I don't think anything is stopping the page
> daemon from starting the laundering process before the TLB shootdown has
> completed.
>
> I believe that a similar example could be constructed with a clean page
> P' and a stale read-only TLB entry. In this case, the page P' could be
> "cached" in the cache/free queues and recycled before the stale TLB
> entry is flushed.
TLBs for addresses with updated PTEs are always flushed before pmap
lock is unlocked. On the other hand, amd64 pmap code does not always
flushes TLBs before PV list locks are unlocked, if previously PTEs
were cleared and PV entries removed.
To handle the situations where a thread might notice empty PV list but
third thread still having access to the page due to TLB invalidation
not finished yet, introduce delayed invalidation. Comparing with the
pvh_global_lock, DI does not block entered thread when
pmap_remove_all() or pmap_remove_write() (callers of
pmap_delayed_invl_wait()) are executed in parallel. But _invl_wait()
callers are blocked until all previously noted DI blocks are leaved,
thus ensuring that neccessary TLB invalidations were performed before
returning from pmap_remove_all() or pmap_remove_write().
See comments for detailed description of the mechanism, and also for
the explanations why several pmap methods, most important
pmap_enter(), do not need DI protection.
Reviewed by: alc, jhb (turnstile KPI usage)
Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D5747
bus_get_cpus() returns a specified set of CPUs for a device. It accepts
an enum for the second parameter that indicates the type of cpuset to
request. Currently two valus are supported:
- LOCAL_CPUS (on x86 this returns all the CPUs in the package closest to
the device when DEVICE_NUMA is enabled)
- INTR_CPUS (like LOCAL_CPUS but only returns 1 SMT thread for each core)
For systems that do not support NUMA (or if it is not enabled in the kernel
config), LOCAL_CPUS fails with EINVAL. INTR_CPUS is mapped to 'all_cpus'
by default. The idea is that INTR_CPUS should always return a valid set.
Device drivers which want to use per-CPU interrupts should start using
INTR_CPUS instead of simply assigning interrupts to all available CPUs.
In the future we may wish to add tunables to control the policy of
INTR_CPUS (e.g. should it be local-only or global, should it ignore
SMT threads or not).
The x86 nexus driver exposes the internal set of interrupt CPUs from the
the x86 interrupt code via INTR_CPUS.
The ACPI bus driver and PCI bridge drivers use _PXM to return a suitable
LOCAL_CPUS set when _PXM exists and DEVICE_NUMA is enabled. They also and
the global INTR_CPUS set from the nexus driver with the per-domain set from
_PXM to generate a local INTR_CPUS set for child devices.
Compared to the r298933, this version uses 'struct _cpuset' in
<sys/bus.h> instead of 'cpuset_t' to avoid requiring <sys/param.h>
(<sys/_cpuset.h> still requires <sys/param.h> for MAXCPU even though
<sys/_bitset.h> does not after recent changes).