Add kern.racct.enable tunable and RACCT_DISABLED config option.
The point of this is to be able to add RACCT (with RACCT_DISABLED)
to GENERIC, to avoid having to rebuild the kernel to use rctl(8).
MFC r282901:
Build GENERIC with RACCT/RCTL support by default. Note that it still
needs to be enabled by adding "kern.racct.enable=1" to /boot/loader.conf.
Note those two are MFC-ed together, because the latter one changes the
name of RACCT_DISABLED option to RACCT_DEFAULT_TO_DISABLED. Should have
committed the renaming separately...
Relnotes: yes
Sponsored by: The FreeBSD Foundation
Remove unneeded NULL checks in amd64's trap_fatal().
Since td_name is an array member of struct thread, it can never be NULL,
so the check can be removed. In addition, curproc can never be NULL,
so remove the if statement, and splice the two printfs() together.
While here, remove the u_long cast, and use the correct printf format
specifier for curproc->p_pid.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D2695
- Fix pf(4) to build with MAXCPU set to 256. MAXCPU is actually a count,
not a maximum ID value (so it is a cap on mp_ncpus, not mp_maxid).
- Bump MAXCPU on amd64 from 64 to 256. In practice APIC only permits 255
CPUs (IDs 0 through 254). Getting above that limit requires x2APIC.
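
The count-versus-ID distinction can be sketched as below. This is an illustrative userland fragment, not the pf(4) fix itself; percpu_counter, cpu_layout_fits() and percpu_bump() are hypothetical names. It only assumes that per-CPU arrays are sized by MAXCPU, so valid IDs run from 0 through MAXCPU - 1.

```c
#include <assert.h>
#include <stdbool.h>

#define MAXCPU 256	/* a count of CPUs, not the largest CPU ID */

/* Hypothetical per-CPU array: MAXCPU slots, indexed 0 .. MAXCPU - 1. */
static int percpu_counter[MAXCPU];

/*
 * Check that a CPU layout fits: the number of CPUs may equal MAXCPU,
 * but the largest ID must stay strictly below it.
 */
static bool
cpu_layout_fits(unsigned ncpus, unsigned maxid)
{
	return (ncpus <= MAXCPU && maxid < MAXCPU);
}

/* Bump the counter for one CPU; an ID equal to MAXCPU is out of range. */
static void
percpu_bump(unsigned cpuid)
{
	assert(cpuid < MAXCPU);
	percpu_counter[cpuid]++;
}
```

Treating MAXCPU as a valid ID would index one slot past the end of such an array, which is the kind of off-by-one this distinction guards against.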
Microsoft vmbus, storage and other related driver enhancements for HyperV.
- Vmbus multi channel support.
- Vector interrupt support.
- Signal optimization.
- Storvsc driver performance improvement.
- Scatter and gather support for storvsc driver.
- Minor bug fix for KVP driver.
Thanks to royger, jhb and delphij from the FreeBSD community for the reviews
and comments. Thanks also to Hovy Xu from NetApp for the contributions to
the storvsc driver.
PR: 195238
Submitted by: whu
Reviewed by: royger
Approved by: royger
Relnotes: yes
Sponsored by: Microsoft OSTC
Differential Revision: https://reviews.freebsd.org/D2575
The page presence memory test takes a long time on large memory systems
and has little value on contemporary amd64 hardware.
Relnotes: Yes
Reviewed by: jhb, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D1544
278325:
Revert the IPI startup sequence to match what is described in the
Intel Multiprocessor Specification v1.4. The Intel SDM claims that
the INIT IPIs here are invalid, but other systems follow the MP
spec instead.
While here, fix the IPI wait routine to accept a timeout in microseconds
instead of a raw spin count, and don't spin forever during AP startup.
Instead, panic if a STARTUP IPI is not delivered after 20 us.
280866:
Wait 100 microseconds for a local APIC to dispatch each startup-related IPI
rather than 20. The MP 1.4 specification states in Appendix B.2:
"A period of 20 microseconds should be sufficient for IPI dispatch to
complete under normal operating conditions".
(Note that this appears to be separate from the 10 millisecond (INIT) and
200 microsecond (STARTUP) waits after the IPIs are dispatched.) The
Intel SDM is silent on this issue as far as I can tell.
At least some hardware requires 60 microseconds as noted in the PR, so
bump this to 100 to be on the safe side.
PR: 196542, 197756
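
A wait routine that takes a timeout in microseconds rather than a raw spin count might look roughly like the sketch below. It is a userland simulation under stated assumptions, not the kernel code: ipi_wait_us(), struct fake_apic and the two callbacks are hypothetical stand-ins for the local APIC delivery-status check and a one-microsecond delay.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define IPI_DISPATCH_TIMEOUT_US	100	/* value chosen in r280866 */

/* Hypothetical hooks standing in for hardware access. */
typedef bool (*ipi_pending_fn)(void *ctx);
typedef void (*delay_us_fn)(void *ctx, unsigned us);

/*
 * Poll until the IPI has been dispatched or timeout_us microseconds
 * have passed.  Returns true on dispatch, false on timeout, so the
 * caller can decide to panic instead of spinning forever.
 */
static bool
ipi_wait_us(ipi_pending_fn pending, delay_us_fn delay, void *ctx,
    unsigned timeout_us)
{
	unsigned waited;

	assert(pending != NULL && delay != NULL);
	for (waited = 0; ; waited++) {
		if (!pending(ctx))
			return (true);
		if (waited >= timeout_us)
			return (false);
		delay(ctx, 1);
	}
}

/* Simulated APIC that stays busy for busy_us microseconds. */
struct fake_apic {
	unsigned busy_us;
	unsigned elapsed_us;
};

static bool
fake_pending(void *ctx)
{
	struct fake_apic *a = ctx;

	return (a->elapsed_us < a->busy_us);
}

static void
fake_delay(void *ctx, unsigned us)
{
	struct fake_apic *a = ctx;

	a->elapsed_us += us;
}
```

With a simulated device that needs 60 microseconds, as reported in the PR, a 20 microsecond timeout expires while IPI_DISPATCH_TIMEOUT_US does not.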
On some Intel CPUs with a P-state but not C-state invariant TSC the TSC
may also halt in C2 and not just C3 (it seems that in some cases the BIOS
advertises its C3 state as a C2 state in _CST). Just play it safe and
disable both C2 and C3 states if a user forces the use of the TSC as the
timecounter on such CPUs.
PR: 192316
Add support for managing PCI bus numbers. As with BARs and PCI-PCI bridge
I/O windows, the default is to preserve the firmware-assigned resources.
PCI bus numbers are only managed if NEW_PCIB is enabled and the architecture
defines a PCI_RES_BUS resource type.
- Add a helper API to create top-level PCI bus resource managers for each
PCI domain/segment. Host-PCI bridge drivers use this API to allocate
bus numbers from their associated domain.
- Change the PCI bus and CardBus drivers to allocate a bus resource for
their bus number from the parent PCI bridge device.
- Change the PCI-PCI and PCI-CardBus bridge drivers to allocate the
full range of bus numbers from secbus to subbus from their parent bridge.
The drivers also always program their primary bus register. The bridge
drivers also support growing their bus range by extending the bus resource
and updating subbus to match the larger range.
- Add support for managing PCI bus resources to the Host-PCI bridge drivers
used for amd64 and i386 (acpi_pcib, mptable_pcib, legacy_pcib, and qpi_pcib).
- Define a PCI_RES_BUS resource type for amd64 and i386.
PR: 197076
Report ARAT (APIC-Timer-always-running) feature for virtual CPU.
This stops FreeBSD guests from avoiding the LAPIC timer in favor of HPET
out of concern about deep sleep states, which do not exist for virtual CPUs.
Benchmarks of usleep(1) on guest and host show the following extra latencies:
- 51us for virtual HPET,
- 22us for virtual LAPIC timer,
- 22us for host HPET and
- 3us for host LAPIC timer.
Update kernel inclusions of capability.h to use capsicum.h instead; some
further refinement is required as some device drivers intended to be
portable over FreeBSD versions rely on __FreeBSD_version to decide whether
to include capability.h.
Sponsored by: Google, Inc.
If the boot-time memory test is enabled, output a dot ('.') for
each GB of RAM tested so people watching the console can see that
the machine is making progress and not hung.
PR: 196650
MFC support for PCI Alternate RID Interpretation. ARI is an optional PCIe
feature that allows PCI devices to present up to 256 functions on a bus.
This is effectively a prerequisite for PCI SR-IOV support.
r264007:
Add a method to get the PCI RID for a device.
Reviewed by: kib
MFC after: 2 months
Sponsored by: Sandvine Inc.
r264008:
Re-implement the DMAR I/O MMU code in terms of PCI RIDs
Under the hood the VT-d spec is really implemented in terms of
PCI RIDs instead of bus/slot/function, even though the spec takes
pains to convert back to bus/slot/function in examples. However
working with bus/slot/function is not correct when PCI ARI is
in use, so convert to using RIDs in most cases. bus/slot/function
will only be used when reporting errors to a user.
Reviewed by: kib
MFC after: 2 months
Sponsored by: Sandvine Inc.
r264009:
Re-write bhyve's I/O MMU handling in terms of PCI RID.
Reviewed by: neel
MFC after: 2 months
Sponsored by: Sandvine Inc.
r264011:
Add support for PCIe ARI
PCIe Alternate RID Interpretation (ARI) is an optional feature that
allows devices to have up to 256 different functions. It is
implemented by always setting the PCI slot number to 0 and
re-purposing the 5 bits used to encode the slot number to instead
contain the function number. Combined with the original 3 bits
allocated for the function number, this allows for 256 functions.
This is enabled by default, but it's expected to be a no-op on currently
supported hardware. It's a prerequisite for supporting PCI SR-IOV, and
I want the ARI support to go in early to help shake out any bugs in it.
ARI can be disabled by setting the tunable hw.pci.enable_ari=0.
Reviewed by: kib
MFC after: 2 months
Sponsored by: Sandvine Inc.
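
The RID layout described in r264011 can be sketched as follows. The helper names are hypothetical and this is not the kernel's API; the only assumption beyond the text above is that the 8-bit bus number occupies the upper byte of the 16-bit RID.

```c
#include <assert.h>
#include <stdint.h>

/* Conventional layout: bus[15:8] | slot[7:3] | function[2:0]. */
static uint16_t
rid_encode(unsigned bus, unsigned slot, unsigned func)
{
	assert(bus <= 0xff && slot <= 0x1f && func <= 0x7);
	return ((uint16_t)((bus << 8) | (slot << 3) | func));
}

/* ARI layout: bus[15:8] | function[7:0]; the slot is implicitly 0. */
static uint16_t
rid_encode_ari(unsigned bus, unsigned func)
{
	assert(bus <= 0xff && func <= 0xff);
	return ((uint16_t)((bus << 8) | func));
}

static unsigned
rid_slot(uint16_t rid)
{
	return ((rid >> 3) & 0x1f);
}

static unsigned
rid_func(uint16_t rid)
{
	return (rid & 0x7);
}

/* With ARI the 5 slot bits and 3 function bits form one 8-bit field. */
static unsigned
rid_func_ari(uint16_t rid)
{
	return (rid & 0xff);
}
```

The same 16-bit value decodes differently depending on whether ARI is in use, which is why code that works in terms of RIDs avoids the ambiguity.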
r264012:
Print status of ARI capability in pciconf -c
Teach pciconf how to print out the status (enabled/disabled) of the ARI
capability on PCI Root Complexes and Downstream Ports.
MFC after: 2 months
Sponsored by: Sandvine Inc.
r264013:
Add missing copyright date.
MFC after: 2 months
Improve support for XSAVE with debuggers.
- Dump an NT_X86_XSTATE note if XSAVE is in use. This note is designed
to match what Linux does in that 1) it dumps the entire XSAVE area
including the fxsave state, and 2) it stashes a copy of the current
xsave mask in the unused padding between the fxsave state and the
xstate header at the same location used by Linux.
- Teach readelf(1) to recognize NT_X86_XSTATE notes.
- Change PT_GET/SETXSTATE to take the entire XSAVE state instead of
only the extra portion. This avoids having to always make two
ptrace() calls to get or set the full XSAVE state.
- Add a PT_GET_XSTATE_INFO which returns the length of the current
XSTATE save area (so the size of the buffer needed for PT_GETXSTATE)
and the current XSAVE mask (%xcr0).
Rework virtual machine hypervisor detection.
- Move the existing code to x86/x86/identcpu.c since it is x86-specific.
- If the CPUID2_HV flag is set, assume a hypervisor is present and query
the 0x40000000 leaf to determine the hypervisor vendor ID. Export the
vendor ID and the highest supported hypervisor CPUID leaf via
hv_vendor[] and hv_high variables, respectively. The hv_vendor[]
array is also exported via the hw.hv_vendor sysctl.
- Merge the VMware detection code from tsc.c into the new probe in
identcpu.c. Add a VM_GUEST_VMWARE to identify VMware and use that in
the TSC code.
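
The vendor ID probe can be sketched as below. To keep the fragment portable it takes register values as arguments instead of executing CPUID; hv_probe() is a hypothetical name, and the sketch assumes the usual convention that leaf 0x40000000 returns the highest hypervisor leaf in %eax and the 12-byte vendor string in %ebx, %ecx and %edx.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CPUID2_HV	0x80000000u	/* CPUID leaf 1, %ecx bit 31 */

/*
 * leaf[] holds %eax, %ebx, %ecx, %edx as returned by leaf 0x40000000.
 * Fills vendor (12 characters plus NUL) and *high; returns 1 if a
 * hypervisor is assumed present, 0 otherwise.
 */
static int
hv_probe(uint32_t cpuid1_ecx, const uint32_t leaf[4], char vendor[13],
    uint32_t *high)
{
	int i;

	assert(leaf != NULL && vendor != NULL && high != NULL);
	memset(vendor, 0, 13);
	*high = 0;
	if ((cpuid1_ecx & CPUID2_HV) == 0)
		return (0);
	*high = leaf[0];
	/* Each register contributes four bytes, least significant first. */
	for (i = 0; i < 4; i++) {
		vendor[i] = (char)((leaf[1] >> (8 * i)) & 0xff);
		vendor[4 + i] = (char)((leaf[2] >> (8 * i)) & 0xff);
		vendor[8 + i] = (char)((leaf[3] >> (8 * i)) & 0xff);
	}
	return (1);
}
```

A caller would then compare the resulting string against known vendor IDs to choose a VM_GUEST_* value.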
Disable ACPI and P4TCC throttling by default, following discussion on
freebsd-current. These CPU speed control techniques are usually unhelpful
at best. For now, continue building the relevant code into GENERIC so that
it can trivially be re-enabled at runtime if anyone wants it.
Relnotes: yes
Change the way the lcall $7,$0 is reflected to usermode. Instead of
setting up a call gate, which must be 64-bit, put a code segment descriptor
into LDT slot 0.
By the time that pmap_init() runs, vm_phys_segs[] has been initialized.
Obtaining the end of memory address from vm_phys_segs[] is a little
easier than obtaining it from phys_avail[].
Enable the use of VM_PHYSSEG_SPARSE on amd64 and i386, making it the
default on i386 PAE. (The use of VM_PHYSSEG_SPARSE on i386 PAE saves
us some precious kernel virtual address space that would have been
wasted on unused vm_page structures.)
Move the ACPI PM timer emulation into vmm.ko.
MFC r273706
Change the type of the first argument to the I/O emulation handlers to
'struct vm *'.
MFC r273710
Add a comment explaining the intent behind the I/O reservation [0x72-0x77].
MFC r273744
Add foo_genassym.c files to DPSRCS so dependencies for them are generated.
This ensures these objects are rebuilt to generate an updated header of
assembly constants if needed.
MFC r274045
If the start bit, PxCMD.ST, is cleared and nothing is in-flight then
PxCI, PxSACT, PxCMD.CCS and PxCMD.CR should be 0.
MFC r274076
Improve the ability to cancel an in-flight request by using an interrupt,
via SIGCONT, to force the read or write system call to return prematurely.
MFC r274330
To allow a request to be submitted from within the callback routine of
a completing one, increase the total by 1 but don't advertise it.
MFC r274931
Change the lower bound for guest vmspace allocation to 0 instead of using
the VM_MIN_ADDRESS constant.
MFC r275817
For level triggered interrupts clear the PIC IRR bit when the interrupt pin
is deasserted.
MFC r275850
Fix 8259 IRQ priority resolver.
MFC r275952
Various 8259 device model improvements.
MFC r275965
Emulate writes to the IA32_MISC_ENABLE MSR.
Add support for AMD processors with the SVM/AMD-V hardware extensions.
MFC r273749
Remove bhyve SVM feature printf's now that they are available in the general
CPU feature detection code.
MFC r273766
Add missing 'break' pointed out by Coverity CID 1249760.
MFC r276098
Allow ktr(4) tracing of all guest exceptions via the tunable "hw.vmm.trace_guest_exceptions"
MFC r276392
Inject #UD into the guest when it executes either 'MONITOR' or 'MWAIT' on an
AMD/SVM host.
MFC r276402
Remove "svn:mergeinfo" property that was dragged along when these files were
svn copied in r273375.
Rename the AMD MSR_PERFCTR[0-3] so the Pentium Pro MSR_PERFCTR[0-1] aren't
redefined.
MFC r273214
Fix build to not bogusly always rebuild vmm.ko.
MFC r273338
Add support for AMD's nested page tables in pmap.c:
- Provide the correct bit mask for various bit fields in a PTE (e.g. valid bit)
for a pmap of type PT_RVI.
- Add a function 'pmap_type_guest(pmap)' that returns TRUE if the pmap is of
type PT_EPT or PT_RVI.
Add CPU_SET_ATOMIC_ACQ(num, cpuset):
This is used when activating a vcpu in the nested pmap. Using the 'acquire'
variant guarantees that the load of the 'pm_eptgen' will happen only after
the vcpu is activated in 'pm_active'.
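
The ordering requirement can be approximated in portable C11 atomics as below. This is only a sketch: struct nested_pmap and vcpu_activate() are hypothetical, pm_active is reduced to a single 64-bit mask rather than a cpuset, and the kernel macro itself is not reproduced here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Simplified stand-in for the fields of a nested pmap used here. */
struct nested_pmap {
	_Atomic uint64_t pm_active;	/* one bit per host CPU */
	_Atomic long pm_eptgen;		/* invalidation generation */
};

/*
 * Mark the CPU active with acquire semantics, then read the
 * generation.  The acquire ordering keeps the later load from being
 * reordered ahead of the read-modify-write on pm_active.
 */
static long
vcpu_activate(struct nested_pmap *pmap, unsigned cpu)
{
	assert(cpu < 64);
	atomic_fetch_or_explicit(&pmap->pm_active, UINT64_C(1) << cpu,
	    memory_order_acquire);
	return (atomic_load_explicit(&pmap->pm_eptgen,
	    memory_order_relaxed));
}
```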
Add defines for various AMD-specific MSRs.
Discussed with: kib (r261321)
Fix a recursive lock acquisition in vi_reset_dev().
MFC r270434
Return the spurious interrupt vector (IRQ7 or IRQ15) if the atpic cannot find
any unmasked pin with an interrupt asserted.
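
The spurious-interrupt fallback can be illustrated with a much simplified model of one 8259. The structure and function names are hypothetical, and the sketch ignores the in-service register and nesting rules that the real device model has to handle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced state of a single 8259. */
struct pic_model {
	uint8_t irr;		/* interrupt request register */
	uint8_t imr;		/* interrupt mask register */
	uint8_t lowprio;	/* pin that currently has lowest priority */
};

/*
 * Return the highest-priority unmasked pin with a request pending.
 * If there is none, report pin 7 and set *spurious; pin 7 corresponds
 * to IRQ7 on the master and IRQ15 on the slave.
 */
static int
pic_pending_pin(const struct pic_model *pic, int *spurious)
{
	uint8_t ready;
	int i, pin;

	assert(pic != NULL && spurious != NULL);
	ready = pic->irr & (uint8_t)~pic->imr;
	*spurious = 0;
	/* Priority rotates: the pin after lowprio is checked first. */
	for (i = 1; i <= 8; i++) {
		pin = (pic->lowprio + i) & 0x7;
		if (ready & (1u << pin))
			return (pin);
	}
	*spurious = 1;
	return (7);
}
```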
MFC r270436
Fix a bug in the emulation of CPUID leaf 0x4.
MFC r270437
Add "hw.vmm.topology.threads_per_core" and "hw.vmm.topology.cores_per_package"
tunables to modify the default cpu topology advertised by bhyve.
MFC r270855
Set the 'inst_length' to '0' early on before any error conditions are detected
in the emulation of the task switch. If any exceptions are triggered then the
guest %rip should point to the instruction that caused the task switch as opposed
to the one after it.
MFC r270857
The "SUB" instruction used in getcc() actually does 'x -= y' so use the
proper constraint for 'x'. The "+r" constraint indicates that 'x' is an
input and output register operand.
While here generate code for different variants of getcc() using a macro
GETCC(sz) where 'sz' indicates the operand size.
Update the status bits in %rflags when emulating AND and OR opcodes.
MFC r271439
Initialize 'bc_rdonly' to the right value.
MFC r271451
Optimize the common case of injecting an interrupt into a vcpu after a HLT
by explicitly moving it out of the interrupt shadow.
MFC r271888
Restructure the MSR handling so it is entirely handled by processor-specific
code.
MFC r271890
MSR_KGSBASE is no longer saved and restored from the guest MSR save area. This
behavior was changed in r271888 so update the comment block to reflect this.
MFC r271891
Add some more KTR events to help debugging.
MFC r272197
mmap(2) requires either MAP_PRIVATE or MAP_SHARED for non-anonymous mappings.
MFC r272395
Get rid of code that dealt with the hardware not being able to save/restore
the PAT MSR on guest exit/entry. This workaround was done for a beta release
of VMware Fusion 5 but is no longer needed in later versions.
All Intel CPUs since Nehalem have supported saving and restoring MSR_PAT
in the VM exit and entry controls.
MFC r272670
Inject #UD into the guest when it executes either 'MONITOR' or 'MWAIT'.
MFC r272710
Implement the FLUSH operation in the virtio-block emulation.
MFC r272838
iasl(8) expects integer fields in data tables to be specified as hexadecimal
values. Therefore the bit width of the "PM Timer Block" was actually being
interpreted as 50 bits instead of the expected 32 bits.
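
The arithmetic behind the mismatch is easy to check: the digits "32" read as hexadecimal are 0x32, which is 50 in decimal. The helper below is a hypothetical illustration using strtoul(3), not iasl(8) code.

```c
#include <assert.h>
#include <stdlib.h>

/* Parse a table field as a number in the given base. */
static unsigned long
parse_field(const char *text, int base)
{
	assert(text != NULL);
	return (strtoul(text, NULL, base));
}
```

Under this reading, a field meant to be 32 bits wide would be written as "20" when the parser expects hexadecimal.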
This change eliminates an error message emitted by a Linux 3.17 guest during boot:
"Invalid length for FADT/PmTimerBlock: 50, using default 32"
MFC r272839
Support Intel-specific MSRs that are accessed when booting Linux in bhyve:
- MSR_PLATFORM_INFO
- MSR_TURBO_RATIO_LIMITx
- MSR_RAPL_POWER_UNIT
MFC r273108
Emulate "POP r/m". This is needed to boot OpenBSD/i386 MP kernel in bhyve.
MFC r273212
Support stopping and restarting the AHCI command list via toggling PxCMD.ST
from '1' to '0' and back. This allows the driver a chance to recover if
for instance a timeout occurred due to activity on the host.
- Remove spaces from boot messages when we print the CPU ID/Family/Stepping
- Move prototypes for various functions out of C files and into
<machine/md_var.h>.
- Reduce diffs between i386 and amd64 initcpu.c and identcpu.c files.
- Move blacklists of broken TSCs out of the printcpuinfo() function
and into the TSC probe routine.
- Merge the amd64 and i386 identcpu.c into a single x86 implementation.
The iret instruction may generate #np and #ss faults, in addition to #gp.
When returning to usermode, the handlers for those exceptions are also
executed with the wrong gs base. Handle all three possible faults in the
same way, checking for an iret fault and performing a full iret.