- Allow the userland hypervisor to intercept breakpoint exceptions
(BP#) in the guest. A new capability (VM_CAP_BPT_EXIT) is used to
enable this feature. These exceptions are reported to userland via
a new VM_EXITCODE_BPT that includes the length of the original
breakpoint instruction. If userland wishes to pass the exception
through to the guest, it must be explicitly re-injected via
vm_inject_exception().
- Export VMCS_ENTRY_INST_LENGTH as a VM_REG_GUEST_ENTRY_INST_LENGTH
pseudo-register. Injecting a BP# on Intel requires setting this to
the length of the breakpoint instruction. AMD SVM currently ignores
writes to this register (but reports success) and fails to read it.
- Rework the per-vCPU state tracked by the debug server. Rather than
a single 'stepping_vcpu' global, add a structure for each vCPU that
tracks state about that vCPU ('stepping', 'stepped', and
'hit_swbreak'). A global 'stopped_vcpu' tracks which vCPU is
currently reporting an event. Event handlers for MTRAP and
breakpoint exits loop until the associated event is reported to the
debugger.
Breakpoint events are discarded if the breakpoint is not present
when a vCPU resumes in the breakpoint handler to retry submitting
the breakpoint event.
- Maintain a linked-list of active breakpoints in response to the GDB
'Z0' and 'z0' packets.
Reviewed by: markj (earlier version)
MFC after: 2 months
Differential Revision: https://reviews.freebsd.org/D20309
This is a 32-bit structure embedded in each vm_page, consisting mostly
of page queue state. The use of a structure makes it easy to store a
snapshot of a page's queue state in a stack variable and use cmpset
loops to update that state without requiring the page lock.
This change merely adds the structure and updates references to atomic
state fields. No functional change intended.
Reviewed by: alc, jeff, kib
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22650
Partially revert r354741 and r354754 and go back to allocating a
fixed-size chunk of stack space for the auxiliary vector. Keep
sv_copyout_auxargs but change it to accept the address at the end of
the environment vector as an input stack address and no longer
allocate room on the stack. It is now called at the end of
copyout_strings after the argv and environment vectors have been
copied out.
This should fix a regression in r354754 that broke the stack alignment
for newer Linux amd64 binaries (and probably broke Linux arm64 as
well).
Reviewed by: kib
Tested on: amd64 (native, linux64 (only linux-base-c7), and i386)
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D22695
... after the initial common TSS is copied into its final location
during PCPU reallocation.
Reported by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Use the power of variable to avoid spelling out source and generated
files too many times. The previous Makefiles were hard to read, hard to
edit, and badly formatted.
Reviewed by: kevans, emaste
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D22714
They're in both the old and new places in HEAD for the moment for
discussion and transition. The old locations will be garbage collected
in 4 weeks. MFCs to 12 an 11 will keep the old and new for transition
purposes.
Reviewed by: kib
MFC after: 4 weeks
Sponsored by: Intel
Differential Revision: https://reviews.freebsd.org/D22590
o Remove All Rights Reserved from my notices
o imp@FreeBSD.org everywhere
o regularize punctiation, eliminate date ranges
o Make sure that it's clear that I don't claim All Rights reserved by listing
All Rights Reserved on same line as other copyright holders (but not
me). Other such holders are also listed last where it's clear.
- Use ustringp for the location of the argv and environment strings
and allow destp to travel further down the stack for the stackgap
and auxv regions.
- Update the Linux copyout_strings variants to move destp down the
stack as was done for the native ABIs in r263349.
- Stop allocating a space for a stack gap in the Linux ABIs. This
used to hold translated system call arguments, but hasn't been used
since r159992.
Reviewed by: kib
Tested on: md64 (amd64, i386, linux64), i386 (i386, linux)
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D22501
tightening constraints on busy as a precursor to lockless page lookup and
should largely be a NOP for these cases.
Reviewed by: alc, kib, markj
Differential Revision: https://reviews.freebsd.org/D22611
No need to log all the commands in command ring but only the last one for which completion failed.
Reported by: np@freebsd.org
Reviewed by: np, markj
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D22566
Update the NetBSD Kernel Concurrency Sanitizer (KCSAN) runtime to work in
the FreeBSD kernel. It is a useful tool for finding data races between
threads executing on different CPUs.
This can be enabled by enabling KCSAN in the kernel config, or by using the
GENERIC-KCSAN amd64 kernel. It works on amd64 and arm64, however the later
needs a compiler change to allow -fsanitize=thread that KCSAN uses.
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D22315
Typical reasons for doublefault faults are either kernel stack
overflow or bugs in the code that manipulates protection CPU state.
The later code is the code which often has to set up gsbase for
kernel. Switching to explicit load of GSBASE MSR in the fault handler
makes it more probable to output a useful information.
Now all IST handlers have nmi_pcpu structure on top of their stacks.
It would be even more useful to save gsbase value at the moment of the
fault. I did not this because I do not want to modify PCB layout now.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
flua is bootstrapped as part of the build for those on older
versions/revisions that don't yet have flua installed. Once upgraded past
r354833, "make sysent" will again naturally work as expected.
Reviewed by: brooks
Differential Revision: https://reviews.freebsd.org/D21894
The purpose of this option is to make it easier to track down memory
corruption bugs by reducing the number of malloc(9) types that might
have recently been associated with a given chunk of memory. However, it
increases fragmentation and is disabled in release kernels.
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
This CVE has already been announced in FreeBSD SA-19:26.mcu.
Mitigation for TAA involves either turning off TSX or turning on the
VERW mitigation used for MDS. Some CPUs will also be self-mitigating
for TAA and require no software workaround.
Control knobs are:
machdep.mitigations.taa.enable:
0 - no software mitigation is enabled
1 - attempt to disable TSX
2 - use the VERW mitigation
3 - automatically select the mitigation based on processor
features.
machdep.mitigations.taa.state:
inactive - no mitigation is active/enabled
TSX disable - TSX is disabled in the bare metal CPU as well as
- any virtualized CPUs
VERW - VERW instruction clears CPU buffers
not vulnerable - The CPU has identified itself as not being
vulnerable
Nothing in the base FreeBSD system uses TSX. However, the instructions
are straight-forward to add to custom applications and require no kernel
support, so the mitigation is provided for users with untrusted
applications and tenants.
Reviewed by: emaste, imp, kib, scottph
Sponsored by: Intel
Differential Revision: 22374
Change the FreeBSD ELF ABIs to use this new hook to copyout ELF auxv
instead of doing it in the sv_fixup hook. In particular, this new
hook allows the stack space to be allocated at the same time the auxv
values are copied out to userland. This allows us to avoid wasting
space for unused auxv entries as well as not having to recalculate
where the auxv vector is by walking back up over the argv and
environment vectors.
Reviewed by: brooks, emaste
Tested on: amd64 (amd64 and i386 binaries), i386, mips, mips64
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D22355
This driver allows to usage of the paravirt SCSI controller
in VMware products like ESXi. The pvscsi driver provides a
substantial performance improvement in block devices versus
the emulated mpt and mps SCSI/SAS controllers.
Error handling in this driver has not been extensively tested
yet.
Submitted by: vbhakta@vmware.com
Relnotes: yes
Sponsored by: VMware, Panzura
Differential Revision: D18613
from usermode.
If CPU supports RDFSBASE, the flag also means that userspace fsbase
and gsbase are already written into pcb, which might be not true when
we handle #gp from kernel.
The offender is rdmsr_safe(), and the visible result is corrupted
userspace TLS base.
Reported by: pstef
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
Disable the use of executable 2M page mappings in EPT-format page
tables on affected CPUs. For bhyve virtual machines, this effectively
disables all use of superpage mappings on affected CPUs. The
vm.pmap.allow_2m_x_ept sysctl can be set to override the default and
enable mappings on affected CPUs.
Alternate approaches have been suggested, but at present we do not
believe the complexity is warranted for typical bhyve's use cases.
Reviewed by: alc, emaste, markj, scottl
Security: CVE-2018-12207
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D21884
to the size of hardware gdt.
Reviewed by: jhb, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D22302
On some AMD CPUs, in particular, machines that do not implement
CLFLUSHOPT but do provide CLFLUSH, the CLFLUSH instruction is only
synchronized with MFENCE.
Code using CLFLUSH typicall needs to brace it with MFENCE both before
and after flush, see for instance pmap_invalidate_cache_range(). If
context switch occurs while inside the protected region, we need to
ensure visibility of flushes done on the old CPU, to new CPU.
For all other machines, locked operation done to lock switched thread,
should be enough. For case of different address spaces, reload of
%cr3 is serializing.
Reviewed by: cem, jhb, scottph
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D22007
Besides the confusing name, this type is effectively unused.
In all cases where it could be set, the INTERRUPT type is set by the
earlier code. The conditions for TRAP_INTERRUPT are a subset of the
conditions for INTERRUPT.
Reviewed by: kib, markj
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D22305
This saves some memory, around 256K I think. It removes some code,
e.g. KPTI does not need to specially map common_tss anymore. Also,
common_tss become domain-local.
Reviewed by: jhb
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D22231
This saves 320 bytes of the precious stack space.
The only negative aspect of the change I can think of is that the
struct thread increased by 320 bytes obviously, and that 320 bytes are
not swapped out anymore. I believe the freed stack space is much more
important than that. Also, current struct thread size is 1392 bytes
on amd64, so UMA will allocate two thread structures per (4KB) slab,
which leaves a space for pcb without increasing zone memory use.
Reviewed by: alc, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D22138
This significantly reduces contention since chunks get created and removed
all the time. See the review for sample results.
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21976
No functional change (in program code). Additional DWARF metadata is
generated in the .eh_frame section. Also, it is now a compile-time
requirement that machine/asm.h ENTRY() and END() macros are paired. (This
is subject to ongoing discussion and may change.)
This DWARF metadata allows llvm-libunwind to unwind program stacks when the
program is executing the function. The goal is to collect accurate
userspace stacktraces when programs have entered syscalls.
(The motivation for "Call Frame Information," or CFI for short -- not to be
confused with Control Flow Integrity -- is to sufficiently annotate assembly
functions such that stack unwinders can unwind out of the local frame
without the requirement of a dedicated framepointer register; i.e.,
-fomit-frame-pointer. This is necessary for C++ exception handling or
collecting backtraces.)
For the curious, a more thorough description of the metadata and some
examples may be found at [1] and documentation at [2]. You can also look at
'cc -S -o - foo.c | less' and search for '.cfi_' to see the CFI directives
generated by your C compiler.
[1]: https://www.imperialviolet.org/2017/01/18/cfi.html
[2]: https://sourceware.org/binutils/docs/as/CFI-directives.html
Reviewed by: emaste, kib (with reservations)
Differential Revision: https://reviews.freebsd.org/D22122
Instead of superpages use. The current code employs superpage-wide locking
regardless and the better locking granularity is welcome with NUMA enabled
even when superpage support is not used.
Requested by: alc
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21982
After removing wmb(), vm_set_rendezvous_func() became super trivial, so
there was no point in keeping it.
The wmb (sfence on amd64, lock nop on i386) was not needed. This can be
explained from several points of view.
First, wmb() is used for store-store ordering (although, the primitive
is undocumented). There was no obvious subsequent store that needed the
barrier.
Second, x86 has a memory model with strong ordering including total
store order. An explicit store barrier may be needed only when working
with special memory (device, special caching mode) or using special
instructions (non-temporal stores). That was not the case for this
code.
Third, I believe that there is a misconception that sfence "flushes" the
store buffer in a sense that it speeds up the propagation of stores from
the store buffer to the global visibility. I think that such
propagation always happens as fast as possible. sfence only makes
subsequent stores wait for that propagation to complete. So, sfence is
only useful for ordering of stores and only in the situations described
above.
Reviewed by: jhb
MFC after: 23 days
Differential Revision: https://reviews.freebsd.org/D21978
- We load the kernel at 0x200000. Memory below that address need not
be executable, so do not map it as such.
- Remove references to .ldata and related sections in the kernel linker
script. They come from ld.bfd's default linker script, but are not
used, and we now use ld.lld to link the amd64 kernel. lld does not
contain a default linker script.
- Pad the .bss to a 2MB as we do between .text and .data. This
forces the loader to load additional files starting in the following
2MB page, preserving the use of superpage mappings for kernel data.
- Map memory above the kernel image with NX. The kernel linker now
upgrades protections as needed, and other preloaded file types
(e.g., entropy, microcode) need not be mapped with execute permissions
in the first place.
Reviewed by: kib
MFC after: 1 month
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21859
Move futex_mtx to linux_common.ko for amd64 and aarch64 along
with respective list/mutex init/destroy.
PR: 240989
Reported by: Alex S <iwtcex@gmail.com>
NetGDB(4) is a component of a system using a panic-time network stack to
remotely debug crashed FreeBSD kernels over the network, instead of
traditional serial interfaces.
There are three pieces in the complete NetGDB system.
First, a dedicated proxy server must be running to accept connections from
both NetGDB and gdb(1), and pass bidirectional traffic between the two
protocols.
Second, the NetGDB client is activated much like ordinary 'gdb' and
similarly to 'netdump' in ddb(4) after a panic. Like other debugnet(4)
clients (netdump(4)), the network interface on the route to the proxy server
must be online and support debugnet(4).
Finally, the remote (k)gdb(1) uses 'target remote <proxy>:<port>' (like any
other TCP remote) to connect to the proxy server.
The NetGDB v1 protocol speaks the literal GDB remote serial protocol, and
uses a 1:1 relationship between GDB packets and sequences of debugnet
packets (fragmented by MTU). There is no encryption utilized to keep
debugging sessions private, so this is only appropriate for local
segments or trusted networks.
Submitted by: John Reimer <john.reimer AT emc.com> (earlier version)
Discussed some with: emaste, markj
Relnotes: sure
Differential Revision: https://reviews.freebsd.org/D21568