Commit Graph

7523 Commits

Author SHA1 Message Date
jhb
731da97ac4 Enable I/O MMU when PCI pass through is first used.
Rather than enabling the I/O MMU when the vmm module is loaded,
defer initialization until the first attempt to pass a PCI device
through to a guest.  If the I/O MMU fails to initialize or is not
present, than fail the attempt to pass a PCI device through to a
guest.

The hw.vmm.force_iommu tunable has been removed since the I/O MMU is
no longer enabled during boot.  However, the I/O MMU support can be
disabled by setting the hw.vmm.iommu.enable tunable to 0 to prevent
use of the I/O MMU on any systems where it is buggy.

Reviewed by:	grehan
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D7448
2016-08-26 20:15:22 +00:00
ed
d81be03d3f Make execution of 32-bit CloudABI executables work on amd64.
A nice thing about requiring a vDSO is that it makes it incredibly easy
to provide full support for running 32-bit processes on 64-bit systems.
Instead of letting the kernel be responsible for composing/decomposing
64-bit arguments across multiple registers/stack slots, all of this can
now be done in the vDSO. This means that there is no need to provide
duplicate copies of certain system calls, like the sys_lseek() and
freebsd32_lseek() we have for COMPAT_FREEBSD32.

This change imports a new vDSO from the CloudABI repository that has
automatically generated code in it that copies system call arguments
into a buffer, padding them to eight bytes and zero-extending any
pointers/size_t arguments. After returning from the kernel, it does the
inverse: extracting return values, in the process truncating
pointers/size_t values to 32 bits.

Obtained from:	https://github.com/NuxiNL/cloudabi
2016-08-24 10:51:33 +00:00
ed
c0aa6fd209 Convert pointers obtained from the threadattr_t structure with TO_PTR().
In all of these source files, the userspace pointer size corresponds
with the kernelspace pointer size, meaning that casting directly works.
As I'm planning on making 32-bit execution on 64-bit systems work as
well, use TO_PTR() here as well, so that the changes between source
files remain minimal.
2016-08-24 10:13:18 +00:00
jhb
4e659fa057 Fix build for !SMP kernels after the Xen MSIX workaround.
Move msix_disable_migration under #ifdef SMP since it doesn't make sense
for !SMP kernels.

PR:		212014
Reported by:	Glyn Grinstead <glyn@grinstead.org>
MFC after:	3 days
2016-08-22 21:23:17 +00:00
jhb
e24281ea43 Remove the si(4) driver and sicontrol(8) for Specialix serial cards.
The si(4) driver supported multiport serial adapters for ISA, EISA, and
PCI buses.  This driver does not use bus_space, instead it depends on
direct use of the pointer returned by rman_get_virtual().  It is also
still locked by Giant and calls for patch testing to convert it to use
bus_space were unanswered.

Relnotes:	yes
2016-08-19 21:14:27 +00:00
kib
04bce34a47 The pmap_delayed_invl_wait() function blocks on turnstile, it does not
spin, in the committed version.  Remove stray '*' in the text.

Sponsored by:	The FreeBSD Foundation.
MFC after:	3 days
2016-08-11 12:37:11 +00:00
ed
cc2c089a3f Provide the CloudABI vDSO to its executables.
CloudABI executables already provide support for passing in vDSOs. This
functionality is used by the emulator for OS X to inject system call
handlers. On FreeBSD, we could use it to optimize calls to
gettimeofday(), etc.

Though I don't have any plans to optimize any system calls right now,
let's go ahead and already pass in a vDSO. This will allow us to
simplify the executables, as the traditional "syscall" shims can be
removed entirely. It also means that we gain more flexibility with
regards to adding and removing system calls.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D7438
2016-08-10 21:02:41 +00:00
kib
acae466016 Unconditionally perform checks that FPU region was entered, when #NM
exception is caught in kernel mode.  There are third-party modules
which trigger the issue, and since the problem causes usermode state
corruption at least, panic in production kernels as well.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-08-10 13:44:03 +00:00
jhb
1c2702c041 Don't permit mappings of invalid physical addresses on amd64 via /dev/mem.
Discussed with:	kib
2016-08-04 17:55:23 +00:00
jhb
b1de97ff2b Correct assertion on vcpuid argument to vm_gpa_hold().
PR:		208168
Submitted by:	Dave Cameron <daverabbitz@ihug.co.nz>
Reviewed by:	grehan
MFC after:	1 month
2016-08-03 15:20:10 +00:00
kib
9a5f028012 Merge i386 and amd64 variants of mp_watchdog.c into x86/, there is no
difference between files.
For pc98, put x86/mp_x86.c into the same place as used by i386 file list.
Fix typo in comment.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-08-03 13:51:53 +00:00
mjg
b584a8e1ae amd64: implement pagezero using rep stos
The current implementation uses non-temporal writes. This turns out to
be detrimental to performance if the page is used shortly after, which
is the typical case with page faults.

Switch to rep stos.

Reviewed by:	kib
MFC after:	1 week
2016-07-31 11:34:08 +00:00
brooks
017f31c108 Don't create pointless backups of generated files in "make sysent".
Any sensible workflow will include a revision control system from which
to restore the old files if required.  In normal usage, developers just
have to clean up the mess.

Reviewed by:	jhb
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D7353
2016-07-28 21:29:04 +00:00
mav
fcb5c9368b Add more UEFI/e820 memory types from latest specifications.
This is only cosmetics.

MFC after:	2 weeks
2016-07-24 09:15:11 +00:00
mav
480936abd1 Increase number of I/O APIC pins from 24 to 32 to give PCI up to 16 IRQs.
Move HPET to the top of the supported 0-31 range.

Proposed by:	jhb@, grehan@
2016-07-14 14:35:25 +00:00
avg
00ea714475 remove a stray change from r302834
MFC after:	3 weeks
X-MFC with:	r302834
2016-07-14 11:13:26 +00:00
avg
555e531ac1 fix-up for configuration of AMD Family 10h processors borrowed from Linux
http://lxr.free-electrons.com/source/arch/x86/kernel/cpu/amd.c#L643
BIOS may configure Family 10h processors to convert WC+ cache type
to CD.  That can hurt performance of guest VMs using nested paging.

Reviewed by:	kib
MFC after:	3 weeks
Differential Revision: https://reviews.freebsd.org/D6059
2016-07-14 11:03:05 +00:00
badger
5908cb719e Add explicit detection of KVM hypervisor
Set vm_guest to a new enum value (VM_GUEST_KVM) when kvm is detected and use
vm_guest in conditionals testing for KVM.

Also, fix a conditional checking if we're running in a VM which caught only
the generic VM case, but not more specific VMs (KVM, VMWare, etc.).  (Spotted
by: vangyzen).

Differential revision:	https://reviews.freebsd.org/D7172
Sponsored by:	Dell Inc.
Approved by:	kib (mentor), vangyzen (mentor)
Reviewed by:	alc
MFC after:	4 weeks
2016-07-13 19:19:18 +00:00
royger
844ce8697a xen: automatically disable MSI-X interrupt migration
If the hypervisor version is smaller than 4.6.0. Xen commits 74fd00 and
70a3cb are required on the hypervisor side for this to be fixed, and those
are only included in 4.6.0, so stay on the safe side and disable MSI-X
interrupt migration on anything older than 4.6.0.

It should not cause major performance degradation unless a lot of MSI-X
interrupts are allocated.

Sponsored by:		Citrix Systems R&D
MFC after:		3 days
Reviewed by:		jhb
Differential revision:	https://reviews.freebsd.org/D7148
2016-07-12 08:43:09 +00:00
dchagin
c93d4a7bde Fix a copy/paste bug introduced during X86_64 Linuxulator work.
FreeBSD support NX bit on X86_64 processors out of the box, for i386 emulation
use READ_IMPLIES_EXEC flag, introduced in r302515.

While here move common part of mmap() and mprotect() code to the files in compat/linux
to reduce code dupcliation between Linuxulator's.

Reported by:    Johannes Jost Meixner, Shawn Webb

MFC after:	1 week
XMFC with:	r302515, r302516
2016-07-10 08:22:04 +00:00
dchagin
7acd3da18d Regen for r302215 (Linux personality). 2016-07-10 08:17:16 +00:00
dchagin
50efd461d3 Implement Linux personality() system call mainly due to READ_IMPLIES_EXEC flag.
In Linux if this flag is set, PROT_READ implies PROT_EXEC for mmap().
Linux/i386 set this flag automatically if the binary requires executable stack.

READ_IMPLIES_EXEC flag will be used in the next Linux mmap() commit.
2016-07-10 08:15:50 +00:00
ed
887bfdc0a4 Don't forget to set sa->narg for CloudABI system calls.
It turns out that this value is not used within the system call code
under normal conditions, except when using tracing tools like ktrace.
If we forget to set this value, it is set to random garbage. This may
cause ktrace to hang indefinitely, making it impossible to kill.

Reported by: Michael Plass
PR: 210800
MFC before: 11.0-RELEASE
2016-07-08 20:09:21 +00:00
nwhitehorn
89d01c24d1 Replace a number of conflations of mp_ncpus and mp_maxid with either
mp_maxid or CPU_FOREACH() as appropriate. This fixes a number of places in
the kernel that assumed CPU IDs are dense in [0, mp_ncpus) and would try,
for example, to run tasks on CPUs that did not exist or to allocate too
few buffers on systems with sparse CPU IDs in which there are holes in the
range and mp_maxid > mp_ncpus. Such circumstances generally occur on
systems with SMT, but on which SMT is disabled. This patch restores system
operation at least on POWER8 systems configured in this way.

There are a number of other places in the kernel with potential problems
in these situations, but where sparse CPU IDs are not currently known
to occur, mostly in the ARM machine-dependent code. These will be fixed
in a follow-up commit after the stable/11 branch.

PR:		kern/210106
Reviewed by:	jhb
Approved by:	re (glebius)
2016-07-06 14:09:49 +00:00
kib
496a3b1f65 Update comments for the MD functions managing contexts for new
threads, to make it less confusing and using modern kernel terms.

Rename the functions to reflect current use of the functions, instead
of the historic KSE conventions:
  cpu_set_fork_handler -> cpu_fork_kthread_handler (for kthreads)
  cpu_set_upcall -> cpu_copy_thread (for forks)
  cpu_set_upcall_kse -> cpu_set_upcall (for new threads creation)

Reviewed by:	jhb (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (hrs)
Differential revision:	https://reviews.freebsd.org/D6731
2016-06-16 12:05:44 +00:00
kib
7e7c56668a Do not access pv_table array for fictitious pages, since the array
does not cover the dynamically registered ficititious ranges, and
fictitious pages mappings are not promoted.  Offer a dummy struct
md_page to fetch constant superpage pv list generation to satisfy
logic.  Also, by initializing the pv_dummy pv_list to empty, we can
remove several explicit PG_FICTITIOUS tests.

Reported and tested by:	Michael Butler <imb@protected-networks.net>
	(previous version)
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D6728
Approved by:	re (hrs)
2016-06-13 03:45:08 +00:00
kib
d80c39f48c Avoid spurious EINVAL in amd64 pmap_change_attr().
Do not try to change attributes for DMAP when working on a mapping
which is not covered by the DMAP. This was reported on real system
where a BAR of a device (NTB) was mapped outside the PCI window.

Reported and tested by:	mav
Reviewed by:	jhb, mav
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D6668
2016-06-05 17:11:23 +00:00
kib
b049cb19c0 In pmap_advise(), avoid leaking DI start for EPT pmaps which needs A/D
emulation.  Assert that syscalls do not leak DI.

Reported by:	gjb
Sponsored by:	The FreeBSD Foundation
2016-05-27 18:45:11 +00:00
jkim
45ae491494 Both Clang and GCC cannot generate efficient reserve_pv_entries().
http://docs.freebsd.org/cgi/mid.cgi?552BFEB2.8040407

Re-implement it entirely in inline assembly not to let compilers do silly
spilling to memory.  For non-POPCNT case, use newly added bit_count(3).

Reported by:	alc
Reviewed by:	alc, kib
Differential Revision:	https://reviews.freebsd.org/D6541
2016-05-25 23:06:52 +00:00
jkim
63911f4577 Document POPCNT erratum for 6th Generation Intel Core processors. 2016-05-23 23:00:47 +00:00
dchagin
791b4b1122 Add macro to convert errno and use it when appropriate.
MFC after:	1 week
2016-05-22 12:46:34 +00:00
dchagin
d8b7958da0 Regen after r300359 (struct l_sched_param removal).
MFC after:	1 week
2016-05-21 08:03:13 +00:00
dchagin
4f09fcdf95 Correct an argument param of linux_sched_* system calls as a struct l_sched_param
does not defined due to it's nature.

MFC after:	1 week
2016-05-21 08:01:14 +00:00
kib
515614230c Check for overflow and return EINVAL if detected. Backport this and
r300305 to i386.

PR:	209661
Reported and reviewed by:	cturt
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2016-05-20 19:50:32 +00:00
kib
b357843884 Use unsigned type for the loop index to make overflow checks effective.
PR:	209661
Reported by:	cturt
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2016-05-20 15:32:48 +00:00
eadler
156fd4834a Don't repeat the the word 'the'
(one manual change to fix grammar)

Confirmed With: db
Approved by: secteam (not really, but this is a comment typo fix)
2016-05-17 12:52:31 +00:00
sephe
6babf96582 atomic: Add testandclear on i386/amd64
Reviewed by:	kib
Sponsored by:	Microsoft OSTC
Differential Revision:	https://reviews.freebsd.org/D6381
2016-05-16 07:19:33 +00:00
kib
b4ddfd6167 Eliminate pvh_global_lock from the amd64 pmap.
The only current purpose of the pvh lock was explained there
On Wed, Jan 09, 2013 at 11:46:13PM -0600, Alan Cox wrote:
> Let me lay out one example for you in detail.  Suppose that we have
> three processors and two of these processors are actively using the same
> pmap.  Now, one of the two processors sharing the pmap performs a
> pmap_remove().  Suppose that one of the removed mappings is to a
> physical page P.  Moreover, suppose that the other processor sharing
> that pmap has this mapping cached with write access in its TLB.  Here's
> where the trouble might begin.  As you might expect, the processor
> performing the pmap_remove() will acquire the fine-grained lock on the
> PV list for page P before destroying the mapping to page P.  Moreover,
> this processor will ensure that the vm_page's dirty field is updated
> before releasing that PV list lock.  However, the TLB shootdown for this
> mapping may not be initiated until after the PV list lock is released.
> The processor performing the pmap_remove() is not problematic, because
> the code being executed by that processor won't presume that the mapping
> is destroyed until the TLB shootdown has completed and pmap_remove() has
> returned.  However, the other processor sharing the pmap could be
> problematic.  Specifically, suppose that the third processor is
> executing the page daemon and concurrently trying to reclaim page P.
> This processor performs a pmap_remove_all() on page P in preparation for
> reclaiming the page.  At this instant, the PV list for page P may
> already be empty but our second processor still has a stale TLB entry
> mapping page P.  So, changes might still occur to the page after the
> page daemon believes that all mappings have been destroyed.  (If the PV
> entry had still existed, then the pmap lock would have ensured that the
> TLB shootdown completed before the pmap_remove_all() finished.)  Note,
> however, the page daemon will know that the page is dirty.  It can't
> possibly mistake a dirty page for a clean one.  However, without the
> current pvh global locking, I don't think anything is stopping the page
> daemon from starting the laundering process before the TLB shootdown has
> completed.
>
> I believe that a similar example could be constructed with a clean page
> P' and a stale read-only TLB entry.  In this case, the page P' could be
> "cached" in the cache/free queues and recycled before the stale TLB
> entry is flushed.

TLBs for addresses with updated PTEs are always flushed before pmap
lock is unlocked.  On the other hand, amd64 pmap code does not always
flushes TLBs before PV list locks are unlocked, if previously PTEs
were cleared and PV entries removed.

To handle the situations where a thread might notice empty PV list but
third thread still having access to the page due to TLB invalidation
not finished yet, introduce delayed invalidation.  Comparing with the
pvh_global_lock, DI does not block entered thread when
pmap_remove_all() or pmap_remove_write() (callers of
pmap_delayed_invl_wait()) are executed in parallel.  But _invl_wait()
callers are blocked until all previously noted DI blocks are leaved,
thus ensuring that neccessary TLB invalidations were performed before
returning from pmap_remove_all() or pmap_remove_write().

See comments for detailed description of the mechanism, and also for
the explanations why several pmap methods, most important
pmap_enter(), do not need DI protection.

Reviewed by:	alc, jhb (turnstile KPI usage)
Tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D5747
2016-05-14 23:35:11 +00:00
alc
dcd0f3bb88 Eliminate an unused #include. For a brief period of time, _unrhdr.h was
used to implement PCID support on amd64.

Reviewed by:	kib
2016-05-13 20:14:41 +00:00
kib
05241d701e Add locking annotations to amd64 struct md_page members.
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-05-10 09:58:51 +00:00
jhb
6bae79f884 Add a new bus method to fetch device-specific CPU sets.
bus_get_cpus() returns a specified set of CPUs for a device.  It accepts
an enum for the second parameter that indicates the type of cpuset to
request.  Currently two valus are supported:

 - LOCAL_CPUS (on x86 this returns all the CPUs in the package closest to
   the device when DEVICE_NUMA is enabled)
 - INTR_CPUS (like LOCAL_CPUS but only returns 1 SMT thread for each core)

For systems that do not support NUMA (or if it is not enabled in the kernel
config), LOCAL_CPUS fails with EINVAL.  INTR_CPUS is mapped to 'all_cpus'
by default.  The idea is that INTR_CPUS should always return a valid set.

Device drivers which want to use per-CPU interrupts should start using
INTR_CPUS instead of simply assigning interrupts to all available CPUs.
In the future we may wish to add tunables to control the policy of
INTR_CPUS (e.g. should it be local-only or global, should it ignore
SMT threads or not).

The x86 nexus driver exposes the internal set of interrupt CPUs from the
the x86 interrupt code via INTR_CPUS.

The ACPI bus driver and PCI bridge drivers use _PXM to return a suitable
LOCAL_CPUS set when _PXM exists and DEVICE_NUMA is enabled.  They also and
the global INTR_CPUS set from the nexus driver with the per-domain set from
_PXM to generate a local INTR_CPUS set for child devices.

Compared to the r298933, this version uses 'struct _cpuset' in
<sys/bus.h> instead of 'cpuset_t' to avoid requiring <sys/param.h>
(<sys/_cpuset.h> still requires <sys/param.h> for MAXCPU even though
<sys/_bitset.h> does not after recent changes).
2016-05-09 20:50:21 +00:00
dchagin
c62a8ee2d0 Add a forgotten in r283424 .eh_frame section with CFI & FDE records to allow
stack unwinding through signal handler.

Reported by:	Dmitry Sivachenko
MFC after:	2 weeks
2016-05-09 07:38:47 +00:00
jhb
eb663acb54 Native PCI-express HotPlug support.
PCI-express HotPlug support is implemented via bits in the slot
registers of the PCI-express capability of the downstream port along
with an interrupt that triggers when bits in the slot status register
change.

This is implemented for FreeBSD by adding HotPlug support to the
PCI-PCI bridge driver which attaches to the virtual PCI-PCI bridges
representing downstream ports on HotPlug slots. The PCI-PCI bridge
driver registers an interrupt handler to receive HotPlug events. It
also uses the slot registers to determine the current HotPlug state
and drive an internal HotPlug state machine. For simplicty of
implementation, the PCI-PCI bridge device detaches and deletes the
child PCI device when a card is removed from a slot and creates and
attaches a PCI child device when a card is inserted into the slot.

The PCI-PCI bridge driver provides a bus_child_present which claims
that child devices are present on HotPlug-capable slots only when a
card is inserted. Rather than requiring a timeout in the RC for
config accesses to not-present children, the pcib_read/write_config
methods fail all requests when a card is not present (or not yet
ready).

These changes include support for various optional HotPlug
capabilities such as a power controller, mechanical latch,
electro-mechanical interlock, indicators, and an attention button.
It also includes support for devices which require waiting for
command completion events before initiating a subsequent HotPlug
command. However, it has only been tested on ExpressCard systems
which support surprise removal and have none of these optional
capabilities.

PCI-express HotPlug support is conditional on the PCI_HP option
which is enabled by default on arm64, x86, and powerpc.

Reviewed by:	adrian, imp, vangyzen (older versions)
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D6136
2016-05-05 22:26:23 +00:00
alc
52f3fcfa90 Explain why pmap_copy(), pmap_enter_pde(), and pmap_enter_quick_locked()
call pmap_invalidate_page() even though they are not destroying a leaf-
level page table entry.

Eliminate some bogus white-space characters in a comment.

Reviewed by:	kib
2016-05-04 17:54:13 +00:00
pfg
7f85f79cce sys/amd64: Small spelling fixes.
No functional change.
2016-05-03 22:13:04 +00:00
pfg
826c10b2f3 vmm(4): Small spelling fixes.
Reviewed by:	grehan
2016-05-03 22:07:18 +00:00
jhb
c71e075efb Revert bus_get_cpus() for now.
I really thought I had run this through the tinderbox before committing,
but many places need <sys/types.h> -> <sys/param.h> for <sys/bus.h> now.
2016-05-03 01:17:40 +00:00
jhb
2da46e01a0 Add a new bus method to fetch device-specific CPU sets.
bus_get_cpus() returns a specified set of CPUs for a device.  It accepts
an enum for the second parameter that indicates the type of cpuset to
request.  Currently two valus are supported:

 - LOCAL_CPUS (on x86 this returns all the CPUs in the package closest to
   the device when DEVICE_NUMA is enabled)
 - INTR_CPUS (like LOCAL_CPUS but only returns 1 SMT thread for each core)

For systems that do not support NUMA (or if it is not enabled in the kernel
config), LOCAL_CPUS fails with EINVAL.  INTR_CPUS is mapped to 'all_cpus'
by default.  The idea is that INTR_CPUS should always return a valid set.

Device drivers which want to use per-CPU interrupts should start using
INTR_CPUS instead of simply assigning interrupts to all available CPUs.
In the future we may wish to add tunables to control the policy of
INTR_CPUS (e.g. should it be local-only or global, should it ignore
SMT threads or not).

The x86 nexus driver exposes the internal set of interrupt CPUs from the
the x86 interrupt code via INTR_CPUS.

The ACPI bus driver and PCI bridge drivers use _PXM to return a suitable
LOCAL_CPUS set when _PXM exists and DEVICE_NUMA is enabled.  They also and
the global INTR_CPUS set from the nexus driver with the per-domain set from
_PXM to generate a local INTR_CPUS set for child devices.

Reviewed by:	wblock (manpage)
Differential Revision:	https://reviews.freebsd.org/D5519
2016-05-02 18:00:38 +00:00
jhb
050f1049b2 Move 'device pci' for the PCI bus driver to the MI NOTES file.
The PCI bus was already listed in all of the MD NOTES files and the
driver should at least compile on all platforms.
2016-04-29 23:53:55 +00:00
avg
08fbecbeed fix missing variable in r298736
Pointyhat to:	avg
Reported by:	Ivan Klymenko <fidaj@ukr.net>
MFC after:	2 weeks
X-MFC with:	r298736
2016-04-28 09:40:24 +00:00