Implement vdso - virtual dynamic shared object. Through the vdso, Linux
exposes functions from the kernel with proper DWARF CFI information so that
it becomes easier to unwind through them.
Using the vdso is mandatory for thread cancellation and cleanup
on a modern glibc.
To reduce code duplication introduce linux_copyout_rusage() method.
Use it in linux_wait4() system call and move linux_wait4() to the MI path.
While here add a prototype for the static bsd_to_linux_rusage().
Switch linuxulator to use the native 1:1 threads.
The reasons:
1. Get rid of the stubs/quirks with process dethreading,
process reparenting when the process group leader exits, and the
related problems in wait(), waitpid(), etc.
2. Reuse our kernel code instead of writing excessive thread
management routines in the Linuxulator.
Implementation details:
1. The thread is created via kern_thr_new() in the clone() call with
the CLONE_THREAD parameter. Thus, everything else is a process.
2. The test for whether the process has threads is done via the
P_HADTHREADS bit in p_flag of struct proc.
3. The per-thread emulator state data structure is now located in
struct thread and freed in the thread_dtor() hook.
Holding p_mtx is mandatory when referencing emuldata
from other threads.
4. PID mangling has changed. Now Linux pid is the native tid
and Linux tgid is the native pid, with the exception of the first
thread in the process where tid and pid are one and the same.
Ugliness:
When the Linux thread is the initial thread in the thread
group, its thread id is equal to the process id. Glibc depends on this
magic (assert in pthread_getattr_np.c). So for system calls that
take a thread id as a parameter we should use a special method
to reference struct thread.
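The id mapping from item 4 and the lookup quirk described under "Ugliness" can be modeled in userland. This is a minimal sketch under stated assumptions, not the Linuxulator code: struct model_thread and every model_* function are hypothetical names introduced for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a native thread; not the kernel's struct thread. */
struct model_thread {
	int32_t	native_tid;	/* native thread id */
	int32_t	native_pid;	/* native id of the owning process */
	bool	is_initial;	/* first thread of the thread group */
};

/*
 * Linux thread id: the native tid, except that the initial thread
 * reports the process id, which is what glibc asserts on.
 */
int32_t
model_linux_tid(const struct model_thread *td)
{
	return (td->is_initial ? td->native_pid : td->native_tid);
}

/* Linux tgid: always the native pid. */
int32_t
model_linux_tgid(const struct model_thread *td)
{
	return (td->native_pid);
}

/*
 * Resolve a Linux thread id against the threads of one process.
 * Returns NULL when no thread matches.
 */
const struct model_thread *
model_find_thread(const struct model_thread *tds, size_t n, int32_t linux_tid)
{
	for (size_t i = 0; i < n; i++) {
		if (model_linux_tid(&tds[i]) == linux_tid)
			return (&tds[i]);
	}
	return (NULL);
}
```

In this model a lookup by the initial thread's native tid finds nothing, because that thread is only visible to Linux under the process id. This is why a plain tid lookup is not enough.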
In preparation for switching the Linuxulator to use the native 1:1
threads, refactor kern_sched_rr_get_interval() and sys_sched_rr_get_interval().
Add a kern_sched_rr_get_interval() counterpart which takes a targettd
parameter so that the target thread can be specified directly (new Linuxulator).
The Linuxulator temporarily uses the first thread in the proc.
Move linux_sched_rr_get_interval() to the MI part.
In preparation for switching the Linuxulator to use the native 1:1
threads, introduce a linux_exit() stub instead of the sys_exit() call
(which terminates the process).
In the new Linuxulator the exit() system call terminates the calling
thread (not the whole process).
amd64: allow base memory segment to start at address different than 0
Current code requires that the first physical memory segment starts at 0,
but this is not really needed. We only need to make sure the bootstrap code
and page tables for APs are allocated below 4GB.
This patch removes this requirement and allows booting a Dell R710 from
UEFI, where the first physical memory segment starts at 0x10000.
Sponsored by: Citrix Systems R&D
Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D1417
Various changes to the registers displayed in DDB for x86.
- Fix segment registers to only display the low 16 bits.
- Remove unused handlers and entries for the debug registers.
- Display xcr0 (if valid) in 'show sysregs'.
- Add '0x' prefix to MSR values to match other values in 'show sysregs'.
- MFamd64: Display various MSRs in 'show sysregs'.
- Add a 'show dbregs' to display the value of debug registers.
- Dynamically size the column width for register values to properly
align columns on 64-bit platforms.
- Display %gs for i386 in 'show registers'.
Various fixes for stack unwinding in DDB on x86.
285773:
Remove some dead code from DDB's amd64 stack unwinder.
The amd64 port copied some code from i386 to fetch function arguments and
display them in backtraces. However, it was commented out and can't easily
be implemented since the function arguments are passed in
registers rather than on the stack in amd64. Remove it in preparation for
some bug fixes in this area.
285775:
Improve stack unwinding on i386 and amd64 after an IP fault.
If we can't find a symbol corresponding to the faulting instruction, assume
that the previously-executed instruction is a call and attempt to find the
calling function using the return address on the stack. Otherwise we end
up associating the last stack frame with the current call, which is
incorrect and causes the unwinder to skip printing of the calling function,
resulting in a confusing backtrace.
285776:
Let the unwinder handle faults during function prologues or epilogues.
The i386 and amd64 DDB stack unwinders contain code to detect and handle
the case where the first frame is not completely set up or torn down. This
code was accidentally unused however, since db_backtrace() was never called
with a non-NULL trap frame. This change fixes that.
Also remove get_rsp() from the amd64 code. It appears to have come from
i386, which needs to take into account whether the exception triggered a
CPL switch, since SS:ESP is only pushed onto the stack if so. On amd64,
SS:RSP is pushed regardless, so get_rsp() was doing the wrong thing for
kernel-mode exceptions. As a result, we can also remove custom print
functions for these registers.
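The i386 rule that get_rsp() inherited can be sketched as follows. This is an illustrative model only: struct model_frame is a hypothetical, trimmed-down frame layout, model_get_esp() is not the DDB function, and special cases such as VM86 mode are ignored. On amd64 the saved stack pointer can be read directly, since SS:RSP is always pushed.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical, trimmed-down i386 exception frame. Field names are
 * illustrative and the real trap frame holds more registers.
 */
struct model_frame {
	uint32_t eip;
	uint32_t cs;
	uint32_t eflags;
	uint32_t esp;	/* pushed by hardware only on a CPL switch */
	uint32_t ss;	/* pushed by hardware only on a CPL switch */
};

/* Stack pointer at the time of the exception, following i386 rules. */
uintptr_t
model_get_esp(const struct model_frame *tf)
{
	if ((tf->cs & 3) != 0)		/* RPL != 0: came from user mode */
		return (tf->esp);
	/*
	 * Same privilege level: nothing was pushed for SS:ESP, so the
	 * interrupted stack begins where the esp slot would have been.
	 */
	return ((uintptr_t)tf + offsetof(struct model_frame, esp));
}
```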
Fix integer truncation bug in malloc(9)
A couple of internal functions used by malloc(9) and uma truncated
a size_t down to an int. This could cause any number of issues
(e.g. indefinite sleeps, memory corruption) if any kernel
subsystem tried to allocate 2GB or more through malloc. zfs would
attempt such an allocation when run on a system with 2TB or more
of RAM.
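The class of bug can be illustrated with a small sketch. The function names are hypothetical and this is not the malloc(9) or uma code; it only shows why a size_t request of 2GB or more cannot pass through an int unchanged, assuming a 32-bit int.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Buggy pattern: the request size is narrowed to int on the way down.
 * The value is preserved only while size <= INT_MAX.
 */
int
model_narrowed_size(size_t size)
{
	return ((int)size);
}

/* True when a size_t request would survive being stored in an int. */
bool
model_size_fits_int(size_t size)
{
	return (size <= (size_t)INT_MAX);
}
```

A 4096-byte request passes through unchanged, while a request of exactly 2GB (1 << 31) no longer fits. Keeping the size as size_t along the whole call path avoids the problem.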
- Improve support for Macs that have a stride not equal to the
horizontal resolution (width).
- Support frame buffers that are larger than the default screen
size.
- Support large frame buffers: allocate 24 more page table pages
at boot-up.
PR: 193745
MFC r284918:
Add helper fill_based_sd(9).
MFC r284919:
Add x86 PT_GETFSBASE, PT_GETGSBASE machine-dependent ptrace requests to
obtain the thread %fs and %gs bases. Add x86 PT_SETFSBASE and
PT_SETGSBASE requests to set the bases from debuggers. The set
requests, similarly to the sysarch({I386,AMD64}_SET_FSBASE), override
the corresponding segment registers.
MFC r284965:
Document x86 machine-specific ptrace(2) requests.
MFC r285011:
Disallow a debugger on a 64-bit system from setting the fs/gs bases of a
32-bit process beyond the end of the process address space.
MFC r285104:
Grammar and language fixes.
Pull pmspcv (pms(4)) from GENERIC. It has PCI ID conflicts
with ahd(4), mvs(4), and likely other drivers.
With hat: re
Sponsored by: The FreeBSD Foundation
Make the creation of the free lists dynamic, i.e., it is based on the
available physical memory at boot time. For amd64 systems with 64 GB
or more of physical memory, create free lists for managing pages with
physical addresses below 4 GB.
PR: 185727
Requested by: alc
Approved by: re (gjb)
Emulate the 'bit test' instruction.
MFC r282259:
Re-implement RTC current time calculation to eliminate the possibility of
losing time.
MFC r282281:
Advertise the MTRR feature via CPUID and emulate the minimal set of MTRR MSRs.
MFC r282284:
When an instruction cannot be decoded just return to userspace so bhyve(8)
can dump the instruction bytes.
MFC r282287:
Don't require <sys/cpuset.h> to be always included before <machine/vmm.h>.
MFC r282296:
Emulate MSR_SYSCFG which is accessed by Linux on AMD cpus when MTRRs are
enabled.
MFC r282301:
Relax limits when transitioning a vector from the IRR to the ISR and also
when extinguishing it from the ISR in response to an EOI.
MFC r282335:
Advertise an additional memory BAR in the "dummy" device emulation.
MFC r282336:
Emulate machine check related MSRs to allow guest OSes like Windows to boot.
MFC r282351:
Don't advertise the Intel SMX capability to the guest.
MFC r282407:
Emulate the 'CMP r/m8, imm8' instruction.
MFC r282519:
Add macros for AMD-specific bits in MSR_EFER: LMSLE, FFXSR and TCE.
MFC r282520:
Emulate guest writes to EFER_MSR properly.
MFC r282558:
Deprecate the 3-way return values from vm_gla2gpa() and vm_copy_setup().
MFC r282571:
Check 'td_owepreempt' and yield the vcpu thread if it is set.
MFC r282595:
Allow byte reads of AHCI registers.
MFC r282784:
Handling indirect descriptors is a capability of the host and not one that
needs to be negotiated. Use the host capabilities field and not the negotiated
field when verifying that indirect descriptors are supported.
MFC r282788:
Allow configuration of the sector size advertised to the guest.
MFC r282865:
Set the subvendor field in config space to the vendor ID. This is required
by the Windows virtio drivers to correctly match a device.
MFC r282922:
Bump the size of the blockif scatter-gather list to 67.
MFC r283075:
Fix off-by-one in array index bounds check. bhyveload would allow you to
create 33 entries in an array that only has 32 slots.
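A generic sketch of this kind of off-by-one follows. The names are hypothetical and the actual check in bhyveload may be written differently; only the 32-slot capacity comes from the message above.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_SLOTS	32	/* array capacity from the commit message */

/* Off-by-one: accepts count == 32, so a 33rd entry can be stored. */
bool
model_can_add_buggy(size_t count)
{
	return (count <= MODEL_SLOTS);
}

/* Fixed: a new entry is accepted only while a free slot remains. */
bool
model_can_add_fixed(size_t count)
{
	return (count < MODEL_SLOTS);
}
```

Here count is the number of entries already stored, so the new entry would land at index count. The two checks differ only when count equals 32.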
MFC r283168:
Temporarily revert r282922 which bumped the max descriptors.
MFC r283255:
Emulate the "CMP r/m, reg" instruction (opcode 39H).
MFC r283256:
Add an option "--get-vmcs-exit-inst-length" to display the instruction length
of the instruction that caused the VM-exit.
MFC r283264:
Change the header type of the emulated host-bridge from type 1 to type 0.
MFC r283293:
Don't rely on the 'VM-exit instruction length' field in the VMCS to always
have an accurate length on an EPT violation.
MFC r283299:
Remove bogus verification of instruction length after instruction decode.
MFC r283308:
Exceptions don't deliver an error code in real mode.
MFC r283657:
Fix non-deterministic delays when accessing a vcpu that was in "running" or
"sleeping" state.
MFC r283973:
Use tunable 'hw.vmm.svm.features' to disable specific SVM features even
though they might be available in hardware. Use tunable 'hw.vmm.svm.num_asids'
to limit the number of ASIDs used by the hypervisor.
MFC r284046:
Fix regression in 'verify_gla()' with the RIP-relative addressing mode.
MFC r284174:
Support guest writes to the TSC by enabling the "use TSC offsetting"
execution control.
Allow passthrough devices to be hinted.
MFC r279683:
When ICW1 is issued, the edge sense circuit is reset, which means that
following an initialization a low-to-high transition is necessary to
generate an interrupt.
MFC r279925:
Add -p parameter to list PCI device to pass through to the guest.
MFC r281559:
Fix handling of BUS_PROBE_NOWILDCARD in 'device_probe_child()'.
MFC r280447:
When fetching an instruction in non-64bit mode, consider the value of the
code segment base address.
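The address calculation can be sketched as below. This is a simplified model, not the vmm.ko code: model_fetch_addr() is a hypothetical name, and segment limit and permission checks are left out.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Linear address of the next instruction to fetch.
 * In 64-bit mode the CS base is treated as zero; in all other modes
 * the CS base is added and the result is a 32-bit linear address.
 */
uint64_t
model_fetch_addr(bool mode64, uint64_t cs_base, uint64_t ip)
{
	if (mode64)
		return (ip);
	return ((cs_base + ip) & 0xffffffffULL);
}
```

For example, a real-mode guest with a CS base of 0xf0000 and IP 0xfff0 fetches from 0xffff0, whereas ignoring the base would fetch from 0xfff0.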
MFC r280725:
Move legacy interrupt allocation for virtio devices to common code.
MFC r280775:
Fix the RTC device model to operate correctly in 12-hour mode.
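For context, the hour encoding used by MC146818-style RTCs in 12-hour mode can be sketched as follows. This assumes binary (non-BCD) data mode, and model_rtc_hour_12h() is a hypothetical helper, not the bhyve device model.

```c
#include <stdint.h>

/*
 * Convert an hour in the range 0..23 to the 12-hour register format:
 * values 1..12, with bit 7 set for PM.
 */
uint8_t
model_rtc_hour_12h(unsigned hour24)
{
	unsigned h = hour24 % 12;

	if (h == 0)
		h = 12;			/* midnight and noon read as 12 */
	return ((uint8_t)(h | (hour24 >= 12 ? 0x80 : 0)));
}
```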
MFC r280929:
Fix "MOVS" instruction memory to MMIO emulation.
MFC r280968:
Display instruction bytes and %rip prior to aborting due to an instruction
emulation error.
MFC r281145:
Enhance the support for Group 1 Extended opcodes for CMP, AND, OR instructions.
MFC r281542:
Initialize 'error' before use (Coverity IDs 1249748, 1249747, 1249751, 1249749)
MFC r281561:
Prior to aborting due to an ioport error, it is always interesting to see what
the guest's %rip is.
MFC r281611:
If the number of guest vcpus is less than '1' then flag it as an error.
MFC r281612:
Prefer 'vcpu_should_yield()' over checking 'curthread->td_flags' directly.
MFC r281630:
Relax the check on which vectors can be delivered through the APIC. According
to the Intel SDM vectors 16 through 255 are allowed to be delivered via the
local APIC.
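The relaxed rule amounts to a range check like the one below. The function name is hypothetical and the actual vmm.ko check may be structured differently.

```c
#include <stdbool.h>

/* Vectors 16 through 255 may be delivered via the local APIC. */
bool
model_apic_vector_ok(int vector)
{
	return (vector >= 16 && vector <= 255);
}
```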
MFC r281879:
Missing break in switch case (Coverity ID 1292499)
MFC r281946:
Don't allow guest to modify readonly bits in the PCI config 'status' register.
MFC r281987:
STOS/STOSB/STOSW/STOSD/STOSQ instruction emulation.
MFC r282206:
Implement the century byte in the RTC.
Replace bhyve's minimal RTC emulation with a fully featured one in vmm.ko.
MFC r276432:
Initialize all fields of 'struct vm_exception exception' before passing it
to vm_inject_exception().
MFC r276763:
Clear blocking due to STI or MOV SS in the hypervisor when an instruction is
emulated or when the vcpu incurs an exception.
MFC r277149:
Clean up usage of 'struct vm_exception' so that it is only used to communicate
information from userspace to vmm.ko when injecting an exception.
MFC r277168:
Fix typo (missing comma).
MFC r277309:
Make the error message explicit instead of just printing the usage if the
virtual machine name is not specified.
MFC r277310:
Simplify instruction restart logic in bhyve.
MFC r277359:
Fix a bug in libvmmapi 'vm_copy_setup()' where it would return success even
if the 'gpa' was in the guest MMIO region.
MFC r277360:
MOVS instruction emulation.
MFC r277626:
Add macro to identify AVIC capability (advanced virtual interrupt controller)
in AMD processors.
MFC r279220:
Don't close a block context if it couldn't be opened avoiding a null deref.
MFC r279225:
Add "-u" option to bhyve(8) to indicate that the RTC should maintain UTC time.
MFC r279227:
Emulate MSR 0xC0011024 when running on AMD processors.
MFC r279228:
Always emulate MSR_PAT on Intel processors and don't rely on PAT save/restore
capability of VT-x. This lets bhyve run nested in older VMware versions that
don't support the PAT save/restore capability.
MFC r279540:
Fix warnings/errors when building vmm.ko with gcc.
Revert MFC of r270223, which bumped MAXCPU on amd64 from 64 to 256.
The cpuset_getaffinity(2) and cpuset_setaffinity(2) check minimum set
size, which now fails for binaries compiled on 10.0 with MAXCPU == 64.
Submitted by: jhb
PR: 200802
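The failure mode behind the revert can be modeled in a few lines. This is a sketch under assumptions, not the kernel code: the function names are hypothetical, and the ERANGE result follows the documented behavior of cpuset_getaffinity(2) when the supplied set is smaller than the kernel's.

```c
#include <errno.h>
#include <stddef.h>

/* Bytes needed for a CPU mask with one bit per possible CPU. */
size_t
model_cpuset_bytes(unsigned maxcpu)
{
	return ((maxcpu + 7) / 8);
}

/*
 * Minimum-size check: a caller whose set is smaller than the kernel's
 * set is rejected.
 */
int
model_check_setsize(size_t caller_bytes, unsigned kernel_maxcpu)
{
	if (caller_bytes < model_cpuset_bytes(kernel_maxcpu))
		return (ERANGE);
	return (0);
}
```

A binary built with MAXCPU == 64 passes an 8-byte set. That passes against a kernel with MAXCPU == 64 but is rejected once the kernel expects 32 bytes for 256 CPUs.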
Add kern.racct.enable tunable and RACCT_DISABLED config option.
The point of this is to be able to add RACCT (with RACCT_DISABLED)
to GENERIC, to avoid having to rebuild the kernel to use rctl(8).
MFC r282901:
Build GENERIC with RACCT/RCTL support by default. Note that it still
needs to be enabled by adding "kern.racct.enable=1" to /boot/loader.conf.
Note that these two are MFC-ed together, because the latter one changes the
name of the RACCT_DISABLED option to RACCT_DEFAULT_TO_DISABLED. Should have
committed the renaming separately...
Relnotes: yes
Sponsored by: The FreeBSD Foundation