Commit Graph

7783 Commits

Author SHA1 Message Date
Ed Maste
8a3b44cfc2 Additional linuxolator whitespace cleanup, missed in r328890 2018-02-05 18:39:06 +00:00
Ed Maste
132f90c660 Linuxolator whitespace cleanup
A version of each of the MD files by necessity exists for each CPU
architecture supported by the Linuxolator.  Clean these up so that new
architectures do not inherit whitespace issues.

Clean up shared Linuxolator files while here.

Sponsored by:	Turing Robotic Industries Inc.
2018-02-05 17:29:12 +00:00
Konstantin Belousov
f7f14d9dea When switching IBRS on, also enable STIBP (Single Thread Indirect
Branch Predictors) mitigation.

DOcument 336996-001 promises that CPUs which implement IBRS but not
STIBP silently ignore setting of the bit instead of trapping.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-01-31 16:56:02 +00:00
Konstantin Belousov
319117fd57 IBRS support, AKA Spectre hardware mitigation.
It is coded according to the Intel document 336996-001, reading of the
patches posted on lkml, and some additional consultations with Intel.

For existing processors, you need a microcode update which adds IBRS
CPU features, and to manually enable it by setting the tunable/sysctl
hw.ibrs_disable to 0.  Current status can be checked in sysctl
hw.ibrs_active.  The mitigation might be inactive if the CPU feature
is not patched in, or if CPU reports that IBRS use is not required, by
IA32_ARCH_CAP_IBRS_ALL bit.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D14029
2018-01-31 14:36:27 +00:00
Andriy Gapon
6a8b7aa424 vmm/svm: post LAPIC interrupts using event injection, not virtual interrupts
The virtual interrupt method uses V_IRQ, V_INTR_PRIO, and V_INTR_VECTOR
fields of VMCB to inject a virtual interrupt into a guest VM.  This
method has many advantages over the direct event injection as it
offloads all decisions of whether and when the interrupt can be
delivered to the guest.  But with a purely software emulated vAPIC the
advantage is also a problem.  The problem is that the hypervisor does
not have any precise control over when the interrupt is actually
delivered to the guest (or a notification about that).  Because of that
the hypervisor cannot update the interrupt vector in IRR and ISR in the
same way as real hardware would.  The hypervisor becomes aware that the
interrupt is being serviced only upon the first VMEXIT after the
interrupt is delivered.  This creates a window between the actual
interrupt delivery and the update of IRR and ISR.  That means that IRR
and ISR might not be correctly set up to the point of the
end-of-interrupt signal.

The described deviation has been observed to cause an interrupt loss in
the following scenario.  vCPU0 posts an inter-processor interrupt to
vCPU1.  The interrupt is injected as a virtual interrupt by the
hypervisor.  The interrupt is delivered to a guest and an interrupt
handler is invoked.  The handler performs a requested action and
acknowledges the request by modifying a global variable.  So far, there
is no VMEXIT and the hypervisor is unaware of the events.  Then, vCPU0
notices the acknowledgment and sends another IPI with the same vector.
The IPI gets collapsed into the previous IPI in the IRR of vCPU1.  Only
after that a VMEXIT of vCPU1 occurs.  At that time the vector is cleared
in the IRR and is set in the ISR.  vCPU1 has vAPIC state as if the
second IPI has never been sent.
The scenario is impossible on the real hardware because IRR and ISR are
updated just before the interrupt handler gets started.

I saw several possibilities of fixing the problem.  One is to intercept
the virtual interrupt delivery to update IRR and ISR at the right
moment.  The other is to deliver the LAPIC interrupts using the event
injection, same as legacy interrupts.  I opted to use the latter
approach for several reasons.  It's equivalent to what VMM/Intel does
(in !VMX case).  It appears to be what VirtualBox and KVM do.  The code
is already there (to support legacy interrupts).

Another possibility was to use a special intermediate state for a vector
after it is injected using a virtual interrupt and before it is known
whether it was accepted or is still pending.
That approach was implemented in https://reviews.freebsd.org/D13828
That method is more complex and does not have any clear advantage.

Please see sections 15.20 and 15.21.4 of "AMD64 Architecture
Programmer's Manual Volume 2: System Programming" (publication 24593,
revision 3.29) for comparison between event injection and virtual
interrupt injection.

PR:		215972
Reported by:	ajschot@hotmail.com, grehan
Tested by:	anish, grehan,  Nils Beyer <nbe@renzel.net>
Reviewed by:	anish, grehan
MFC after:	2 weeks
Differential Revision: https://reviews.freebsd.org/D13780
2018-01-31 11:14:26 +00:00
John Baldwin
05d56d83b6 Ensure 'name' is not NULL before passing to strcmp().
This avoids a nested page fault when obtaining a stack trace in DDB if
the address from the first frame does not resolve to a known symbol.

MFC after:	1 week
Sponsored by:	Chelsio Communications
2018-01-30 23:29:27 +00:00
Bryan Drewery
595109196a Don't use an .OBJDIR for 'make sysent'.
Reported by:	emaste, jhb
Sponsored by:	Dell EMC
2018-01-29 19:14:15 +00:00
Warner Losh
d6b6639713 Add ISA PNP tables to ISA drivers. Fix a few incidental comments.
ACPI ISA PBP tables not tagged, there's bigger issues with them.
2018-01-29 00:22:30 +00:00
Konstantin Belousov
c8f9c1f3d9 Use PCID to optimize PTI.
Use PCID to avoid complete TLB shootdown when switching between user
and kernel mode with PTI enabled.

I use the model close to what I read about KAISER, user-mode PCID has
1:1 correspondence to the kernel-mode PCID, by setting bit 11 in PCID.
Full kernel-mode TLB shootdown is performed on context switches, since
KVA TLB invalidation only works in the current pmap. User-mode part of
TLB is flushed on the pmap activations as well.

Similarly, IPI TLB shootdowns must handle both kernel and user address
spaces for each address.  Note that machines which implement PCID but
do not have INVPCID instructions, cause the usual complications in the
IPI handlers, due to the need to switch to the target PCID temporary.
This is racy, but because for PCID/no-INVPCID we disable the
interrupts in pmap_activate_sw(), IPI handler cannot see inconsistent
state of CPU PCID vs PCPU pmap/kcr3/ucr3 pointers.

On the other hand, on kernel/user switches, CR3_PCID_SAVE bit is set
and we do not clear TLB.

I can imagine alternative use of PCID, where there is only one PCID
allocated for the kernel pmap. Then, there is no need to shootdown
kernel TLB entries on context switch. But copyout(3) would need to
either use method similar to proc_rwmem() to access the userspace
data, or (in reverse) provide a temporal mapping for the kernel buffer
into user mode PCID and use trampoline for copy.

Reviewed by:	markj (previous version)
Tested by:	pho
Discussed with:	alc (some aspects)
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
Differential revision:	https://reviews.freebsd.org/D13985
2018-01-27 11:49:37 +00:00
Edward Tomasz Napierala
28f3d8b2c2 Add SPDX identifiers to linux_ptrace.c and cfumass.c.
MFC after:	2 weeks
2018-01-24 17:04:01 +00:00
Ed Maste
7eb2159f6a Use BSD-2-Clause-FreeBSD license on linux_support.s
These files previously had a 3-clause license and 'THE REGENTS' text.
Switch to standard 2-clause text with kib's approval, and add the SPDX
tag.

Approved by:	kib
2018-01-23 20:35:43 +00:00
Pedro F. Giffuni
ac2fffa4b7 Revert r327828, r327949, r327953, r328016-r328026, r328041:
Uses of mallocarray(9).

The use of mallocarray(9) has rocketed the required swap to build FreeBSD.
This is likely caused by the allocation size attributes which put extra pressure
on the compiler.

Given that most of these checks are superfluous we have to choose better
where to use mallocarray(9). We still have more uses of mallocarray(9) but
hopefully this is enough to bring swap usage to a reasonable level.

Reported by:	wosch
PR:		225197
2018-01-21 15:42:36 +00:00
Konstantin Belousov
c398c14664 Use correct symbol name in r328202.
Sponsored by:	The FreeBSD Foundation
MFC after:	11 days
2018-01-20 18:05:14 +00:00
Konstantin Belousov
3a5e472e17 Use predefined symbol for the CR3.PCID mask.
Sponsored by:	The FreeBSD Foundation
MFC after:	11 days
2018-01-20 17:46:09 +00:00
Roger Pau Monné
50a53194f6 xen: fix IDT setup after PTI
On amd64 the IDT handler was not set correctly when using PTI.

While there also fix the selectors to SEL_KPL.

Obtained from:	kib
MFC with:	r328083
2018-01-20 14:59:37 +00:00
Konstantin Belousov
b4dfc9d7ad PTI: Trap if we returned to userspace with kernel (full) page table
still active.

Map userspace portion of VA in the PTI kernel-mode page table as
non-executable. This way, if we ever miss reloading ucr3 into %cr3 on
the return to usermode, the process traps instead of executing in
potentially vulnerable setup.  Catch the condition of such trap and
verify user-mode %cr3, which is saved by page fault handler.

I peek this trick in some article about Linux implementation.

Reviewed by:	alc, markj (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	12 days
DIfferential revision:	https://reviews.freebsd.org/D13956
2018-01-19 22:10:29 +00:00
Nathan Whitehorn
9a8196ce19 Remove SFBUF_OPTIONAL_DIRECT_MAP and such hacks, replacing them across the
kernel by PHYS_TO_DMAP() as previously present on amd64, arm64, riscv, and
powerpc64. This introduces a new MI macro (PMAP_HAS_DMAP) that can be
evaluated at runtime to determine if the architecture has a direct map;
if it does not (or does) unconditionally and PMAP_HAS_DMAP is either 0 or
1, the compiler can remove the conditional logic.

As part of this, implement PHYS_TO_DMAP() on sparc64 and mips64, which had
similar things but spelled differently. 32-bit MIPS has a partial direct-map
that maps poorly to this concept and is unchanged.

Reviewed by:		kib
Suggestions from:	marius, alc, kib
Runtime tested on:	amd64, powerpc64, powerpc, mips64
2018-01-19 17:46:31 +00:00
Ed Maste
b3327f62f0 Enable KPTI by default on amd64 for non-AMD CPUs
Kernel Page Table Isolation (KPTI) was introduced in r328083 as a
mitigation for the 'Meltdown' vulnerability.  AMD CPUs are not affected,
per https://www.amd.com/en/corporate/speculative-execution:

    We believe AMD processors are not susceptible due to our use of
    privilege level protections within paging architecture and no
    mitigation is required.

Thus default KPTI to off for AMD CPUs, and to on for others.  This may
be refined later as we obtain more specific information on the sets of
CPUs that are and are not affected.

Submitted by:	Mitchell Horne
Reviewed by:	cem
Relnotes:	Yes
Security:	CVE-2017-5754
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D13971
2018-01-19 15:42:34 +00:00
John Baldwin
68fd3b0ef5 Use a dedicated per-CPU stack for machine check exceptions.
Similar to NMIs, machine check exceptions can fire at any time and are
not masked by IF.  This means that machine checks can fire when the
kstack is too deep to hold a trap frame, or at critical sections in
trap handlers when a user %gs is used with a kernel %cs.  Use the same
strategy used for NMIs of using a dedicated per-CPU stack configured
in IST 3.  Store the CPU's pcpu pointer at the stop of the stack so
that the machine check handler can reliably find the proper value for
%gs (also borrowed from NMIs).

This should also fix a similar issue with PTI with a MC# occurring
while the CPU is executing on the trampoline stack.

While here, bypass trap() entirely and just call mca_intr().  This
avoids a bogus call to kdb_reenter() (there's no reason to try to
reenter kdb if a MC# is raised).

Reviewed by:	kib
Tested by:	avg (on AMD without PTI)
Differential Revision:	https://reviews.freebsd.org/D13962
2018-01-18 23:50:21 +00:00
John Baldwin
f36b1fe0bd Remove two no-longer-used labels from the NMI interrupt handler.
Reviewed by:	kib
2018-01-18 22:13:53 +00:00
John Baldwin
7f513d17b2 Adjust branch target in NMI handler for the !PTI case.
In the !PTI case the NMI handler jumped past the instructions that set
%rdi to point to the current PCB, but the target instructions assumed %rdi
were set.

Reviewed by:	kib
Tested by:	pho
2018-01-18 20:12:12 +00:00
Konstantin Belousov
3705dda7e4 Move the kernphys declaration to machine/md_var.h.
Apparently machinde/cpu.h is supposed to contain MD implementations of
MI interfaces.  Also, remove kernphys declaration from machdep.c,
since it is already provided by md_var.h.

Requested and reviewed by:	bde
MFC after:	13 days
2018-01-18 15:15:35 +00:00
Konstantin Belousov
ac97ccbab5 Fix compilation with gcc.
etext is already declared in machine/cpu.h, move kernphys declaration
there too.

Based on the patch by:	bde
MFC after:	13 days
2018-01-18 11:21:03 +00:00
Konstantin Belousov
406bc0da95 Fix compilation with gas.
Submitted by:	bde
MFC after:	13 days
2018-01-18 11:19:58 +00:00
Konstantin Belousov
0d6c61ec30 Remove the 'last' argument from the pmap_pti_free_page().
It is in fact unused.

Noted and reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	13 days
2018-01-18 11:01:41 +00:00
John Baldwin
65eefbe422 Save and restore guest debug registers.
Currently most of the debug registers are not saved and restored
during VM transitions allowing guest and host debug register values to
leak into the opposite context.  One result is that hardware
watchpoints do not work reliably within a guest under VT-x.

Due to differences in SVM and VT-x, slightly different approaches are
used.

For VT-x:

- Enable debug register save/restore for VM entry/exit in the VMCS for
  DR7 and MSR_DEBUGCTL.
- Explicitly save DR0-3,6 of the guest.
- Explicitly save DR0-3,6-7, MSR_DEBUGCTL, and the trap flag from
  %rflags for the host.  Note that because DR6 is "software" managed
  and not stored in the VMCS a kernel debugger which single steps
  through VM entry could corrupt the guest DR6 (since a single step
  trap taken after loading the guest DR6 could alter the DR6
  register).  To avoid this, explicitly disable single-stepping via
  the trace flag before loading the guest DR6.  A determined debugger
  could still defeat this by setting a breakpoint after the guest DR6
  was loaded and then single-stepping.

For SVM:
- Enable debug register caching in the VMCB for DR6/DR7.
- Explicitly save DR0-3 of the guest.
- Explicitly save DR0-3,6-7, and MSR_DEBUGCTL for the host.  Since SVM
  saves the guest DR6 in the VMCB, the race with single-stepping
  described for VT-x does not exist.

For both platforms, expose all of the guest DRx values via --get-drX
and --set-drX flags to bhyvectl.

Discussed with:	avg, grehan
Tested by:	avg (SVM), myself (VT-x)
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D13229
2018-01-17 23:11:25 +00:00
Mark Johnston
9cb26f73ea Annotate a couple of changes from r328083.
Reviewed by:	kib
X-MFC with:	r328083
2018-01-17 21:52:12 +00:00
Konstantin Belousov
bd50262f70 PTI for amd64.
The implementation of the Kernel Page Table Isolation (KPTI) for
amd64, first version. It provides a workaround for the 'meltdown'
vulnerability.  PTI is turned off by default for now, enable with the
loader tunable vm.pmap.pti=1.

The pmap page table is split into kernel-mode table and user-mode
table. Kernel-mode table is identical to the non-PTI table, while
usermode table is obtained from kernel table by leaving userspace
mappings intact, but only leaving the following parts of the kernel
mapped:

    kernel text (but not modules text)
    PCPU
    GDT/IDT/user LDT/task structures
    IST stacks for NMI and doublefault handlers.

Kernel switches to user page table before returning to usermode, and
restores full kernel page table on the entry. Initial kernel-mode
stack for PTI trampoline is allocated in PCPU, it is only 16
qwords.  Kernel entry trampoline switches page tables. then the
hardware trap frame is copied to the normal kstack, and execution
continues.

IST stacks are kept mapped and no trampoline is needed for
NMI/doublefault, but of course page table switch is performed.

On return to usermode, the trampoline is used again, iret frame is
copied to the trampoline stack, page tables are switched and iretq is
executed.  The case of iretq faulting due to the invalid usermode
context is tricky, since the frame for fault is appended to the
trampoline frame.  Besides copying the fault frame and original
(corrupted) frame to kstack, the fault frame must be patched to make
it look as if the fault occured on the kstack, see the comment in
doret_iret detection code in trap().

Currently kernel pages which are mapped during trampoline operation
are identical for all pmaps.  They are registered using
pmap_pti_add_kva().  Besides initial registrations done during boot,
LDT and non-common TSS segments are registered if user requested their
use.  In principle, they can be installed into kernel page table per
pmap with some work.  Similarly, PCPU can be hidden from userspace
mapping using trampoline PCPU page, but again I do not see much
benefits besides complexity.

PDPE pages for the kernel half of the user page tables are
pre-allocated during boot because we need to know pml4 entries which
are copied to the top-level paging structure page, in advance on a new
pmap creation.  I enforce this to avoid iterating over the all
existing pmaps if a new PDPE page is needed for PTI kernel mappings.
The iteration is a known problematic operation on i386.

The need to flush hidden kernel translations on the switch to user
mode make global tables (PG_G) meaningless and even harming, so PG_G
use is disabled for PTI case.  Our existing use of PCID is
incompatible with PTI and is automatically disabled if PTI is
enabled.  PCID can be forced on only for developer's benefit.

MCE is known to be broken, it requires IST stack to operate completely
correctly even for non-PTI case, and absolutely needs dedicated IST
stack because MCE delivery while trampoline did not switched from PTI
stack is fatal.  The fix is pending.

Reviewed by:	markj (partially)
Tested by:	pho (previous version)
Discussed with:	jeff, jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2018-01-17 11:44:21 +00:00
Konstantin Belousov
94b011c4bc Amd64 user_ldt_deref() is not used outside sys_machdep.c. Mark it as
static.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-01-17 11:21:03 +00:00
Pedro F. Giffuni
74641f0bc6 x86: make some use of mallocarray(9).
Focus on code where we are doing multiplications within malloc(9). None of
these ire likely to overflow, however the change is still useful as some
static checkers can benefit from the allocation attributes we use for
mallocarray.

This initial sweep only covers malloc(9) calls with M_NOWAIT. No good
reason but I started doing the changes before r327796 and at that time it
was convenient to make sure the sorrounding code could handle NULL values.

X-Differential revision: https://reviews.freebsd.org/D13837
2018-01-15 21:08:22 +00:00
Tycho Nightingale
91fe5fe7e7 Provide some mitigation against CVE-2017-5715 by clearing registers
upon returning from the guest which aren't immediately clobbered by
the host.  This eradicates any remaining guest contents limiting their
usefulness in an exploit gadget.

This was inspired by this linux commit:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5b6c02f38315b720c593c6079364855d276886aa

Reviewed by:	grehan, rgrimes
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D13573
2018-01-15 18:37:03 +00:00
Konstantin Belousov
5f7b9ff2e3 Add STAC and CLAC instructions wrappers.
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D13838
2018-01-14 12:39:50 +00:00
Jeff Roberson
b6715dab8f Move VM_NUMA_ALLOC and DEVICE_NUMA under the single global config option NUMA.
Sponsored by:	Netflix, Dell/EMC Isilon
Discussed with:	jhb
2018-01-14 03:36:03 +00:00
Jeff Roberson
ab3185d15e Implement NUMA support in uma(9) and malloc(9). Allocations from specific
domains can be done by the _domain() API variants.  UMA also supports a
first-touch policy via the NUMA zone flag.

The slab layer is now segregated by VM domains and is precise.  It handles
iteration for round-robin directly.  The per-cpu cache layer remains
a mix of domains according to where memory is allocated and freed.  Well
behaved clients can achieve perfect locality with no performance penalty.

The direct domain allocation functions have to visit the slab layer and
so require per-zone locks which come at some expense.

Reviewed by:	Attilio (a slightly older version)
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
2018-01-12 23:25:05 +00:00
Konstantin Belousov
c751f90c0c Fix grammar.
Submitted by:	alc
MFC after:	3 days
2018-01-11 16:50:03 +00:00
Konstantin Belousov
6da5c56ae5 Remove redundand CLD instructions.
We already clear %RFLAGS.DF on the kernel entry due to the compiler's
ABI requirements.

Suggested by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2018-01-11 13:22:13 +00:00
Konstantin Belousov
4975c202ac Do not clear %RFLAGS.DF on fast syscall entry.
Hardware already did it for us due to the mask loaded into the
MSR_SF_MASK msr register.

Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D13838
2018-01-11 12:54:33 +00:00
Konstantin Belousov
0f7c159f6b Move the hardware setup for fast syscalls into a common function.
Discussed with:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-01-11 12:40:43 +00:00
Konstantin Belousov
4275e16fa9 Rename COMMON_TSS_RSP0 to TSS_RSP0.
The symbol is just an offset in the hardware TSS structure, it is not
limited to the common_tss instance.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2018-01-11 12:28:08 +00:00
Konstantin Belousov
3ee6e65875 Update comment explaining the check, to reality.
Discussed with:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2018-01-11 12:07:24 +00:00
Conrad Meyer
e6fcf7898d x86: Document purpose of _safe variants of {rd,wr}msr()
Sponsored by:	Dell EMC Isilon
2018-01-10 22:41:00 +00:00
Andriy Gapon
091da2dfa5 vmm/svm: contigmalloc of the whole svm_softc is excessive
This is a followup to r307903.

struct svm_softc takes more than 200 kilobytes while what we really need
is 3 contiguous pages for I/O permission map and 2 contiguous pages for
MSR permission map.  Other physically mapped structures have a size of
a single page, so a proper alignment is sufficient for their correct
mapping.

Thus, only the permission maps are allocated with contigmalloc now,
the softc is allocated with a regular malloc.

Additionally, this commit adds a check that malloc returns memory with the
expected page alignment and that contigmalloc does not fail.
Unfortunately, at present svm_vminit() is expected to always succeed and
there is no way to report an error.
So, a contigmalloc failure leads to a panic.
We should probably fix this.

MFC after:	2 weeks
2018-01-09 14:22:18 +00:00
Konstantin Belousov
0530a9360f Make it possible to re-evaluate cpu_features.
Add cpuctl(4) ioctl CPUCTL_EVAL_CPU_FEATURES which forces re-read of
cpu_features, cpu_features2, cpu_stdext_features, and
std_stdext_features2.

The intent is to allow the kernel to see the changes in the CPU
features after micocode update.  Of course, the update is not atomic
across variables and not synchronized with readers.  See the man page
warning as well.

Reviewed by:	imp (previous version), jilles
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D13770
2018-01-05 21:06:19 +00:00
Andriy Gapon
5f3c7d6580 Fix a couple of comments in AMD Virtual Machine Control Block structure
MFC after:	1 week
2018-01-05 19:15:24 +00:00
Konstantin Belousov
84874cc151 Avoid re-check of usermode condition.
It does not change anything in the behavior of trap_pfault(), while
eliminating obfuscation of jumping to the code which checks for the
condition reversed of the goto cause.  Also avoid force initialize the
rv variable, since it is now only accessed after storing vm_fault()
return value.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D13725
2018-01-01 20:47:03 +00:00
Konstantin Belousov
1865d6b851 Remove MP SAFE marks and stray register name in comments.
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2017-12-31 17:07:59 +00:00
Colin Percival
31a55efdc5 Use the TSLOG framework to record entry/exit timestamps for hammer_time.
The entry must be logged "manually" using TSRAW rather than TSENTER
since PCPU data structures have not yet been initialized and thus
curthread cannot be accessed; &thread0 is what will become curthread
later in hammer_time.

Other MD initialization code should be similarly instrumented in order
to gain visibility into the time spent before entering mi_startup; this
will require some care and testing from people with access to such
hardware.
2017-12-31 09:22:07 +00:00
Eitan Adler
caa7e52f3f kernel: Fix several typos and minor errors
- duplicate words
- typos
- references to old versions of FreeBSD

Reviewed by:	imp, benno
2017-12-27 03:23:21 +00:00
Tycho Nightingale
9e33a61693 Recognize a pending virtual interrupt while emulating the halt instruction.
Reviewed by:	grehan, rgrimes
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D13573
2017-12-21 18:30:11 +00:00
Konstantin Belousov
30d4f9e888 Add atomic_load(9) and atomic_store(9) operations.
They provide relaxed-ordered atomic access semantic.  Due to the
FreeBSD memory model, the operations are syntaxical wrappers around
the volatile accesses.  The volatile qualifier is used to ensure that
the access not optimized out and in turn depends on the volatile
semantic as implemented by supported compilers.

The motivation for adding the operation is to help people coming from
other systems or knowing the C11/C++ standards where atomics have
special type and require use of the special access operations.  It is
still the case that FreeBSD requires plain load and stores of aligned
integer types to be atomic.

Suggested by:	jhb
Reviewed by:	alc, jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D13534
2017-12-19 09:59:20 +00:00
Mark Johnston
5bab623438 Pass the trap frame to fasttrap hooks.
The DTrace fasttrap entry points expect a struct reg containing the
register values of the calling thread. Perform the conversion in
fasttrap rather than in the trap handler: this reduces the number of
ifdefs and avoids wasting stack space for traps that don't involve
DTrace.

MFC after:	2 weeks
2017-12-11 19:21:39 +00:00
Bruce Evans
fb3cc1c37d Move instantiation of msgbufp from 9 MD files to subr_prf.c.
This variable should be pure MI except possibly for reading it in MD
dump routines.  Its initialization was pure MD in 4.4BSD, but FreeBSD
changed this in r36441 in 1998.  There were many imperfections in
r36441.  This commit fixes only a small one, to simplify fixing the
others 1 arch at a time.  (r47678 added support for
special/early/multiple message buffer initialization which I want in
a more general form, but this was too fragile to use because hacking
on the msgbufp global corrupted it, and was only used for 5 hours in
-current...)
2017-12-07 07:55:38 +00:00
Andriy Gapon
a7437a3e9d amd-vi: set iommu msi configuration using pci_enable_msi method
This is better than directly changing PCI configuration space of the
device because it makes the PCI bus aware of the configuration.
Also, the change allows to drop a bunch of code that duplicated
pci_enable_msi() functionality.

I wonder if it's possible to further simplify the code by using
pci_alloc_msi().
2017-12-04 17:10:52 +00:00
Andriy Gapon
df92c28d6a vmm/amd: add ivhd device with a higher order
ivhd should attach after the root PCI bus and, thus, after the ACPI
Host-PCI bridge off which the bus hangs.  This is because ivhd changes
PCI configuration of a PCI IOMMU device that is located on the root bus.
If the bus attaches after ivhd it clears the MSI portion of the
configuration.  As a result IOMMU event interrupts would never be
delivered.

For regular ACPI devices the order is calculated as
    ACPI_DEV_BASE_ORDER + level * 10
where level is a depth of the device in the ACPI namespace.
I expect the depth of the Host-PCI bridge to be two or three,
so ACPI_DEV_BASE_ORDER + 10 * 10 should be a sufficiently safe order
for ivhd.

This should fix the setup of the AMD-Vi event interrupt when vmm is
preloaded (as opposed to kldload-ed).
2017-12-04 17:08:03 +00:00
Andriy Gapon
8f09494d1e amd-vi: clear event interrupt and overflow bits upon handling the interrupt
This ensures that we can receive further event interrupts.
See the description of the bits in the specification for
MMIO Offset 2020h IOMMU Status Register.
The bits are defined as set-by-hardware write-1-to-clear, same as all
the bits in the status register.

Discussed with:	anish
2017-12-04 17:02:53 +00:00
Scott Long
c15269ccb8 It's time to retire AHC_REG_PRETTY_PRINT and AHD_REG_PRETTY_PRINT from
the standard kernels.  They are still available as custom compile
options.
2017-11-29 23:41:49 +00:00
Brooks Davis
5cd667e65f Disable vim syntax highlighting.
Vim's default pick doesn't understand that ';' is a comment character
and the result looks horrible.

Reviewed by:	emaste
2017-11-28 18:23:17 +00:00
Konstantin Belousov
dde5602786 Fix index calculation for the page table pages for efirt 1:1 map.
Stop issuing pre-assigned number to enumerate all page table pages,
the assignment is incorrect.  Instead automatically calculate the next
unused index. This index in fact does not serve any purpose except to
be unique to satisfy vm_page_grab() interface, we do not look up the
page by the index later.

Reported and tested by:	emaste
Reviewed by:	andrew
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
PR:	223906
Differential revision:	https://reviews.freebsd.org/D13273
2017-11-28 09:34:43 +00:00
Fedor Uporov
cd76ee1ee3 Remap ENOATTR to ENODATA in the linuxulator.
In the linux ENOADATA is frequently #defined as ENOATTR.
The change is required for an xattrs support implementation.

MFC after: 1 week
Discussed with: netchild
Approved by: pfg

Differential Revision: https://reviews.freebsd.org/D13221
2017-11-27 17:03:11 +00:00
Pedro F. Giffuni
c49761dd57 sys/amd64: further adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
2017-11-27 15:03:07 +00:00
Ed Schouten
ee13ffbe03 Use TO_PTR() to convert integers to pointers.
For FreeBSD/arm64's cloudabi32 support, I'm going to need a TO_PTR() in
this place. Also use it for all of the other source files, so that the
difference remains as minimal as possible.

MFC after:	2 weeks
2017-11-26 14:45:56 +00:00
Hans Petter Selasky
8a53e1340f Merge ^/head r326132 through r326161. 2017-11-24 12:13:27 +00:00
Andriy Gapon
a3bbbd5e40 amd-vi: a small whitespace cleanup
Reviewed by:	anish
2017-11-24 11:37:41 +00:00
Andriy Gapon
685c54fc6a amd-vi: use correct type for pci_rid, start_dev_rid, end_dev_rid sysctls
Previously, the values could look confusing because of unrelated bits from
adjacent memory.

Reviewed by:	anish
2017-11-24 11:36:35 +00:00
Andriy Gapon
eb6c9c128c amd-vi: small improvements to event printing
Ensure that an opening bracket always has a matching closing one.
Ensure that there is always a new-line at the end of a report line.
Also, add a space before the printed event flag.

Reviewed by:	anish
2017-11-24 11:35:43 +00:00
Andriy Gapon
dee38cdc2a amd-vi: print some additional details for INVALID_DEVICE_REQUEST event
Namely, the type of the hardware event and whether the transaction
was a translation request.

Reviewed by:	anish
2017-11-24 11:34:46 +00:00
Andriy Gapon
53d580f984 amd-vi: fix up r326152, the new width requires a wider type
This is my brain-o from extending the width at the last moment.
2017-11-24 11:25:06 +00:00
Andriy Gapon
5a041f2183 amd-vi: fix and extend definition of Command and Event Status Register (0x2020)
The defined bits are the lower bits, not the higher ones.

Also, the specification has been extended to define bits 0:18 and they
all could potentially be interesting to us, so extend the width of the
field accordingly.

Reviewed by:	anish
2017-11-24 11:20:10 +00:00
Andriy Gapon
8523ad24ba vmm/amd: improve iteration over IVHD (type 10h) entries in IVRS table
Many 8-byte entries have zero at byte 4, so the second 4-byte part is
skipped as a 4-byte padding entry.  But not all 8-byte entries have that
property and they get misinterpreted.

A real example:
    48 00 00 00 ff 01 00 01
This an 8-byte ACPI_IVRS_TYPE_SPECIAL entry for IOAPIC with ID 255 (bogus).
It is reported as:
    ivhd0: Unknown dev entry:0xff
Fortunately, it was completely harmless.

Also, bail out early if we encounter an entry of a variable length type.
We do not have proper handling for those yet.

Reviewed by:	anish
2017-11-24 11:10:36 +00:00
Ed Schouten
814629dd64 Don't let cpu_set_syscall_retval() clobber exec_setregs().
Upon successful completion, the execve() system call invokes
exec_setregs() to initialize the registers of the initial thread of the
newly executed process. What is weird is that when execve() returns, it
still goes through the normal system call return path, clobbering the
registers with the system call's return value (td->td_retval).

Though this doesn't seem to be problematic for x86 most of the times (as
the value of eax/rax doesn't matter upon startup), this can be pretty
frustrating for architectures where function argument and return
registers overlap (e.g., ARM). On these systems, exec_setregs() also
needs to initialize td_retval.

Even worse are architectures where cpu_set_syscall_retval() sets
registers to values not derived from td_retval. On these architectures,
there is no way cpu_set_syscall_retval() can set registers to the way it
wants them to be upon the start of execution.

To get rid of this madness, let sys_execve() return EJUSTRETURN. This
will cause cpu_set_syscall_retval() to leave registers intact. This
makes process execution easier to understand. It also eliminates the
difference between execution of the initial process and successive ones.
The initial call to sys_execve() is not performed through a system call
context.

Reviewed by:	kib, jhibbits
Differential Revision:	https://reviews.freebsd.org/D13180
2017-11-24 07:35:08 +00:00
Hans Petter Selasky
82725ba9bf Merge ^/head r325999 through r326131. 2017-11-23 14:28:14 +00:00
Konstantin Belousov
383f241dce Remove lint support from system headers and MD x86 headers.
Reviewed by:	dim, jhb
Discussed with:	imp
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D13156
2017-11-23 11:40:16 +00:00
Pedro F. Giffuni
51369649b0 sys: further adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
2017-11-20 19:43:44 +00:00
Hans Petter Selasky
937d37fc6c Merge ^/head r325842 through r325998. 2017-11-19 12:36:03 +00:00
Pedro F. Giffuni
df57947f08 spdx: initial adoption of licensing ID tags.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.

Initially, only tag files that use BSD 4-Clause "Original" license.

RelNotes:	yes
Differential Revision:	https://reviews.freebsd.org/D13133
2017-11-18 14:26:50 +00:00
Hans Petter Selasky
55b1c6e7e4 Merge ^/head r325663 through r325841. 2017-11-15 11:28:11 +00:00
Hans Petter Selasky
8dee9a7a44 Remove no longer supported mthca driver.
Sponsored by:	Mellanox Technologies
2017-11-13 10:59:38 +00:00
Mateusz Guzik
ca0227933e amd64: stop nesting preemption counter in spinlock_enter
Discussed with:	jhb
2017-11-12 03:13:01 +00:00
Jeff Roberson
8d6fbbb867 Replace manyinstances of VM_WAIT with blocking page allocation flags
similar to the kernel memory allocator.

This simplifies NUMA allocation because the domain will be known at wait
time and races between failure and sleeping are eliminated.  This also
reduces boilerplate code and simplifies callers.

A wait primitive is supplied for uma zones for similar reasons.  This
eliminates some non-specific VM_WAIT calls in favor of more explicit
sleeps that may be satisfied without new pages.

Reviewed by:	alc, kib, markj
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
2017-11-08 02:39:37 +00:00
Konstantin Belousov
b535ed2898 Zero the structure instead of the pointer to it.
Reported by:	Don Morris <Don.Morris@dell.com>
MFC after:	4 days
2017-11-05 20:03:57 +00:00
Konstantin Belousov
5b9a3721e6 x86: Do not emit unused TD_TID symbols.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-11-04 10:51:52 +00:00
Konstantin Belousov
ad4e4ae591 Restore an optimization that was temporary disabled by r324665.
In reclaim_pv_chunk(), rotate the pv chunks list so that next
invocations of the reclaim do not scan the same pv chunks that could
not be freed.  Only do the rotation when there is no parallel scan,
tracked by active_reclaims counter.

To rotate, move all chunks that are before current iteration marker,
after another marker that is inserted at the list tail on start of the
reclaim.

Reviewed by:	alc
Tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-11-01 18:06:44 +00:00
Konstantin Belousov
aa788cc387 Consistently ensure that we do not load MXCSR with reserved bits set.
Some callers of fpusetregs()/npxsetregs(), most importantly
set_fpcontext(), clear reserved bits.  But some did not.  Do the
clearing in fpusetregs() and remove now redundand operation from
set_fpcontext().

Reported by:	Maxime Villard <max@m00nbsd.net>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-11-01 10:32:44 +00:00
Peter Grehan
9d210a4a18 Emulate the "OR reg, r/m" instruction (opcode 0BH).
This is	needed for the HDA emulation with FreeBSD guests.

Reviewed by:	marcelo
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D12832
2017-11-01 03:26:53 +00:00
Tijl Coosemans
f236378b54 Set the return address for stack entry points to zero.
Stack unwinders treat zero as a stop condition.  The value on the stack can
be non-zero because thread stacks may be arbitrary memory provided via
pthread_attr_setstack(3) or may be recycled from previous threads.

Reference:
https://lists.freebsd.org/pipermail/freebsd-current/2017-August/066855.html
https://lists.freebsd.org/pipermail/freebsd-current/2017-October/067254.html

Discussed with:	kib
MFC after:	1 week
2017-10-31 11:51:34 +00:00
Ian Lepore
5deb1573e8 Improve the performance of the hpet timer in bhyve guests by making the
timer frequency a power of two.  This changes the frequency from 10 to
16.7 MHz (2 ^ 24 HZ).  Using a power of two avoids roundoff errors when
doing arithmetic in sbintime_t units.

Testing shows this can fix erratic ntpd behavior in guests using the
hpet timer (which is the default for multicore guests).

Reported by:	bsam@
2017-10-29 20:50:03 +00:00
Eitan Adler
a2aef24aa3 Update several more URLs
- Primarily http -> https
- Primarily FreeBSD project URLs
2017-10-29 08:17:03 +00:00
John Baldwin
6db55a0f3a Rework pass through changes in r305485 to be safer.
Specifically, devices that do not support PCI-e FLR and were not
gracefully shutdown by the guest OS could continue to issue DMA
requests after the VM was terminated.  The changes in r305485 meant
that those DMA requests were completed against the host's memory which
could result in random memory corruption.  Instead, leave ppt devices
that are not attached to a VM disabled in the IOMMU and only restore
the devices to the host domain if the ppt(4) driver is detached from a
device.

As an added safety belt, disable busmastering for a pass-through device
when before adding it to the host domain during ppt(4) detach.

PR:		222937
Tested by:	Harry Schmalzbauer <freebsd@omnilan.de>
Reviewed by:	grehan
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D12661
2017-10-27 14:57:14 +00:00
Mark Johnston
5fca1d90c1 Fix the VM_NRESERVLEVEL == 0 build.
Add VM_NRESERVLEVEL guards in the pmaps that implement transparent
superpage promotion using reservations.

Reviewed by:	alc, kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D12764
2017-10-23 15:34:05 +00:00
Mateusz Guzik
9e68989764 Make the sleepq chain hash size configurable per-arch and bump on amd64.
While here cache-align chains.

This shortens longest found chain during poudriere -j 80 from 32 to 16.

Pushing this higher up will probably require allocation on boot.
2017-10-22 20:43:50 +00:00
Bjoern A. Zeeb
8e94025b41 With r181803 on 2008-08-17 23:27:27Z the first VIMAGE commit went into
HEAD.  Enable VIMAGE in GENERIC kernels and some others (where GENERIC does
not exist) on HEAD.

Disable building LINT-VIMAGE with VIMAGE being default.

This should give it a lot more exposure in the run-up to 12 to help
us evaluate whether to keep it on by default or not.
We are also hoping to get better performance testing.
The feature can be disabled using nooptions.

Requested by:		many
Reviewed by:		kristof, emaste, hiren
X-MFC after:		never
Relnotes:		yes
Differential Revision:	https://reviews.freebsd.org/D12639
2017-10-20 21:40:59 +00:00
Mateusz Guzik
e66167764a amd64: plug missed dt_lock in cpu_fork 2017-10-20 18:58:11 +00:00
Mateusz Guzik
a5db8ade37 amd64: __exclusive_cache_line pv_chunks_mutex and pv_list_locks
Note that pv_list_locks is an array and currently it fits 2 locks per line.
Resizing it and/or putting more locks in different lines requires several tests.

MFC after:	1 week
2017-10-20 03:38:58 +00:00
Mateusz Guzik
d95498d44f amd64: avoid acquiring dt lock if possible (which is the common case)
Discussed with:	kib
MFC after:	1 week
2017-10-20 03:30:02 +00:00
Mark Johnston
46fcd1af63 Move kernel dump offset tracking into MI code.
All of the kernel dump implementations keep track of the current offset
("dumplo") within the dump device. However, except for textdumps, they
all write the dump sequentially, so we can reduce code duplication by
having the MI code keep track of the current offset. The new
dump_append() API can be used to write at the current offset.

This is needed to implement support for kernel dump compression in the
MI kernel dump code.

Also simplify dump_encrypted_write() somewhat: use dump_write() instead
of duplicating its bounds checks, and get rid of the redundant offset
tracking.

Reviewed by:	cem
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D11722
2017-10-18 15:38:05 +00:00
Konstantin Belousov
ca1f624517 Fix the pv_chunks pc_lru tailq handling in reclaim_pv_chunk().
For processing, reclaim_pv_chunk() removes the pv_chunk from the lru
list, which makes pc_lru linkage invalid.  Then the pmap lock is
released, which allows for other thread to free the last pv entry
allocated from the chunk and call free_pv_chunk(), which tries to
modify the invalid linkage.

Similarly, the chunk is inserted into the private tailq new_tail
temporary.  Again, free_pv_chunk() might be run and corrupt the
linkage for the new_tail after the pmap lock is dropped.

This is a consequence of r299788 elimination of pvh_global_lock, which
allowed for reclaim to run in parallel with other pmap calls which
free pv chunks.

As a fix, do not remove the chunk from pc_lru queue, use a marker to
remember the position in the queue iteration.  We can safely operate
on the chunks after the chunk's pmap is locked, we fetched the chunk
after the marker, and we checked that chunk pmap is same as we have
locked, because chunk removal from pc_lru requires both pv_chunk_mutex
and the pmap mutex owned.

Note that the fix lost an optimization which was present in the
previous algorithm.  Namely, new_tail requeueing rotated the pv chunks
list so that reclaim didn't scan the same pv chunks that couldn't be
freed (because they contained a wired and/or superpage mapping) on
every invocation.  An additional change is planned which would improve
this.

Reported and tested by:	pho
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-16 15:16:24 +00:00
Konstantin Belousov
1df04cc069 Change amd64_get_ldt() to return 'EOF' when the LDT is not yet
allocated, when requested range of descriptors does not fit into
currently allocated LDT, or trim the return if the range fits
partially.  Before, the function returned EINVAL.

Reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-09 16:20:39 +00:00
Mateusz Guzik
801eec865f amd64: remove unused variable from pmap_delayed_invl_genp
Reported by:	gcc
MFC after:	1 week
2017-10-05 18:51:48 +00:00
Konstantin Belousov
a6d4b1dc48 Ensure that after sucessfull i386_set_ldt() call, other threads can
use LDT segments immediately.

If the i386_set_ldt() call created a first LDT descriptor (and
consequently created the LDT) for our address space, LDTR is currently
loaded only on the CPU executing the syscall.  Other CPUs executing
threads sharing the address space, would only load LDTR after context
switch.

Uncomment set_user_ldt_rv() and call it on all CPUs.  Remove critical
section inside set_user_ldt(), it is not needed in the context of call
from smp_rendezvous().

Set md_ldt after md_ldt_sd is initialized using the same code sequence
as in user_ldt_free().  Do the whole initialization in a critical
section, to not race with the context switching while we set LDT.

Discussed with:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-05 13:12:59 +00:00
Konstantin Belousov
78d58cb6bc Avoid a race betweem freeing LDT and context switches.
cpu_switch.S uses curproc->p_md.md_ldt value as the flag indicating
presence of the process LDT.  The flag is checked and then ldt segment
descriptor is copied into the CPU' GDT slot.

Disallow context switches around clearing of the curproc LDT state by
performing the cleanup in critical section.  Ensure that the md_ldt
flag is cleared before md_ldt_sd descriptor content is destroyed by
inserting fence between the operations.

We depend on the x86 memory model strong ordering guarantees, in
particular, that cpu_switch.S observes the writes to md_ldt and
md_ldt_sd in the expected order.

Discussed with:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-05 12:50:03 +00:00
Konstantin Belousov
287c718f32 Improve amd64_get_ldt().
Provide consistent snapshot of the requested descriptors by preventing
other threads from modifying LDT while we fetch the data, lock dt_lock
around the read.  Copy the data into intermediate buffer, which is
copied out after the lock is dropped.

Use guaranteed atomic (aligned volatile) reads of the descriptors to
use same-size atomic as CPU update to set A bit in the descriptor type
field.

Improve overflow checking for the descriptors range calculations and
remove unneeded casts.

Reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-05 12:29:34 +00:00
Konstantin Belousov
8fc26d9612 Minor style fix.
Requested by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-05 12:19:55 +00:00
Konstantin Belousov
a58679a93b Complete r323772 on amd64.
Compilers are allowed to combine plain reads into group operations,
e.g. 64bit element copies of one array into another can be
legitimately optimized back to a memcpy() call, which r323772 tried to
prevent.

Qualify accesses to LDT descriptors with volatile dereference to
ensure that each write indeed occurs.  After that, our usual claim of
native-size aligned writes being atomic applies.

This is equivalent to atomic_store(memory_order_relaxed) C11 accesses,
but our machine/atomic.h does not provide corresponding primitive.

Noted and reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-05 12:16:45 +00:00
Konstantin Belousov
98af67c78e Use ANSI C declaration for amd64_get_ldt().
Reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-05 12:07:38 +00:00
Konstantin Belousov
83d55c8ac2 Correct format specifiers in the debug code.
Requested by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-05 12:01:39 +00:00
Konstantin Belousov
687a5be47a Remove useless comments.
Requested by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-05 11:56:04 +00:00
Konstantin Belousov
a1fc6a8c49 On amd64, mark the set_user_ldt() function as static.
On i386, the function is used from the context switch code and needs
to be accessible externally.  Amd64 MD context switch does not lock an
LDT spinlock and inlines switching in assembly.

Discussed with:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-05 11:50:01 +00:00
Konstantin Belousov
37afe7dfd2 Reduce default max_ldt_segment value to 512.
This makes the LDT to use only one page with default settings,
avoiding the need to find contigous 2 pages in KVA.  It seems that
most users are fine even with 512 segments.

Discussed with:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-05 11:36:55 +00:00
Konstantin Belousov
843d5752f5 Update comment to note that we skip LDT reload for kthreads as well.
Noted by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2017-10-05 11:34:51 +00:00
Konstantin Belousov
9674d76346 Hide kernel stuff from userspace.
Sponsored by:	Mellanox Technologies
2017-10-02 08:37:43 +00:00
Andrew Turner
0e73a61997 To prepare for adding EFI runtime services support on arm64 move the
machine independent parts of the existing code to a new file that can be
shared between amd64 and arm64.

Reviewed by:	kib (previous version), imp
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D12434
2017-10-01 19:52:47 +00:00
Konstantin Belousov
3cabd93e26 Do not do torn writes to active LDTs.
Care must be taken when updating the active LDT, since parallel
threads might try to load a segment descriptor which is currently
updated. Since the results are undefined, this cannot be ignored by
claiming to be an application race.

Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D12413
2017-09-19 17:57:04 +00:00
Ilya Bakulin
a9bfc8d2ae Add MMCCAM-enabled kernel config for IMX6, reduce debug noice in MMCCAM kernels
CAM_DEBUG_TRACE results in way too much debug output than needed now.
When debugging, it's always possible to turn on trace level using camcontrol.

Approved by:	imp (mentor)
Differential Revision:	https://reviews.freebsd.org/D12110
2017-09-13 10:56:02 +00:00
Conrad Meyer
907f50fe04 Add smn(4) driver for AMD System Management Network
AMD Family 17h CPUs have an internal network used to communicate between
the host CPU and the PSP and SMU coprocessors.  It exposes a simple
32-bit register space.

Reviewed by:	avg (no +1), mjoras, truckman
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D12217
2017-09-05 15:13:41 +00:00
Josh Paetzel
9d0ec2a920 Revert r323087
This needs more thinking out and consensus, and the commit message
was wrong AND there was a typo in the commit.

pointyhat:	jpaetzel
2017-09-01 17:03:48 +00:00
Josh Paetzel
0be04b100c Take options IPSEC out of GENERIC
PR:	220170
Submitted by:	delphij
Reviewed by:	ae, glebius
MFC after:	2 weeks
Differential Revision:	D11806
2017-09-01 15:54:53 +00:00
Josh Paetzel
3b65550eec Allow kldload tcpmd5
PR:	220170
MFC after:	2 weeks
2017-08-31 20:16:28 +00:00
Alexander Motin
ed9652da5f Add NTB driver for PLX/Avago/Broadcom PCIe switches.
This driver supports both NTB-to-NTB and NTB-to-Root Port modes (though
the second with predictable complications on hot-plug and reboot events).
I tested it with PEX 8717 and PEX 8733 chips, but expect it should work
with many other compatible ones too.  It supports up to two NT bridges
per chip, each of which can have up to 2 64-bit or 4 32-bit memory windows,
6 or 12 scratchpad registers and 16 doorbells.  There are also 4 DMA engines
in those chips, but they are not yet supported.

While there, rename Intel NTB driver from generic ntb_hw(4) to more specific
ntb_hw_intel(4), so now it is on par with this new ntb_hw_plx(4) driver and
alike to Linux naming.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2017-08-30 21:16:32 +00:00
Conrad Meyer
2744a0b69b Drop CACHE_LINE_SIZE to 64 bytes on x86
The actual cache line size has always been 64 bytes.

The 128 number arose as an optimization for Core 2 era Intel processors.  By
default (configurable in BIOS), these CPUs would prefetch adjacent cache
lines unintelligently.  Newer CPUs prefetch more intelligently.

The latest Core 2 era CPU was introduced in September 2008 (Xeon 7400
series, "Dunnington").  If you are still using one of these CPUs, especially
in a multi-socket configuration, consider locating the "adjacent cache line
prefetch" option in BIOS and disabling it.

Reported by:	mjg
Reviewed by:	np
Discussed with:	jhb
Sponsored by:	Dell EMC Isilon
2017-08-28 22:28:41 +00:00
Ryan Libby
0d4e7ec5f3 amd64: drop q suffix from rd[fg]sbase for gas compatibility
Reviewed by:	kib
Approved by:	markj (mentor)
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D12133
2017-08-26 23:13:18 +00:00
Konstantin Belousov
9fc847133e Save KGSBASE in pcb before overriding it with the guest value.
Reported by:	lwhsu, mjoras
Discussed with:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	18 days
2017-08-24 10:49:53 +00:00
Konstantin Belousov
761fb3ef29 Ensure that fs/gs bases are stored in pcb before copying the pcb for
new process or thread.

Reported and tested by:	ae, dhw
Sponsored by:	The FreeBSD Foundation
MFC after:	20 days
2017-08-22 18:15:47 +00:00
Konstantin Belousov
3e902b3d76 Make WRFSBASE and WRGSBASE instructions functional.
Right now, we enable the CR4.FSGSBASE bit on CPUs which support the
facility (Ivy and later), to allow usermode to read fs and gs bases
without syscalls. This bit also controls the write access to bases
from userspace, but WRFSBASE and WRGSBASE instructions currently
cannot be used, because return path from both exceptions or interrupts
overrides bases with the values from pcb.

Supporting the instructions is useful because this means that usermode
can implement green-threads completely in userspace without issuing
syscalls to change all of the machine context.

Support is implemented by saving the fs base and user gs base when
PCB_FULL_IRET flag is set. The flag is set on the context switch,
which potentially causes clobber of the bases due to activation of
another context, and when explicit modification of the user context by
a syscall or exception handler is performed. In particular, the patch
moves setting of the flag before syscalls change context.

The changes to doreti_exit and PUSH_FRAME to clear PCB_FULL_IRET on
entry from userspace can be considered a bug fixes on its own.

Reviewed by:	jhb (previous version)
Tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
Differential revision:	https://reviews.freebsd.org/D12023
2017-08-21 17:38:02 +00:00
Konstantin Belousov
9ed84d55c1 Simplify the code.
Noted by:	Oliver Pinter
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-08-20 11:18:16 +00:00
Konstantin Belousov
43b7b1f29b Simplify amd64 trap().
- Use more relevant name 'signo' instead of 'i' for the local variable
  which contains a signal number to send for the current exception.
- Eliminate two labels 'userout' and 'out' which point to the very end
  of the trap() function.  Instead use return directly.
- Re-indent the prot_fault_translation block by reducing if() nesting.
- Some more monor style changes.

Requested and reviewed by:	bde
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-08-20 09:52:25 +00:00
Konstantin Belousov
4031ebef84 Trim excessive 'extern' and remove unused declaration.
Reviewed by:	bde
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-08-20 09:42:09 +00:00
Konstantin Belousov
dad2e0e420 Use ANSI C declaration for trap_pfault(). Style.
Reviewed by:	bde
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-08-20 09:39:10 +00:00
Ruslan Bukin
5651294282 Fix module unload when SGX support is not present in CPU.
Sponsored by:	DARPA, AFRL
2017-08-18 14:47:06 +00:00
Mark Johnston
01938d3666 Rename mkdumpheader() and group EKCD functions in kern_shutdown.c.
This helps simplify the code in kern_shutdown.c and reduces the number
of globally visible functions.

No functional change intended.

Reviewed by:	cem, def
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D11603
2017-08-18 04:04:09 +00:00
Mark Johnston
50ef60dabe Factor out duplicated kernel dump code into dump_{start,finish}().
dump_start() and dump_finish() are responsible for writing kernel dump
headers, optionally writing the key when encryption is enabled, and
initializing the initial offset into the dump device.

Also remove the unused dump_pad(), and make some functions static now that
they're only called from kern_shutdown.c.

No functional change intended.

Reviewed by:	cem, def
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D11584
2017-08-18 03:52:35 +00:00
Conrad Meyer
dc6a82801d x86: Add dynamic interrupt rebalancing
Add an option to dynamically rebalance interrupts across cores
(hw.intrbalance); off by default.

The goal is to minimize preemption. By placing interrupt sources on distinct
CPUs, ithreads get preferentially scheduled on distinct CPUs.  Overall
preemption is reduced and latency is reduced. In our workflow it reduced
"fighting" between two high-frequency interrupt sources.  Reduced latency
was proven by, e.g., SPEC2008.

Submitted by:	jeff@ (earlier version)
Reviewed by:	kib@
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D10435
2017-08-16 18:48:53 +00:00
Ruslan Bukin
7dea76609b Rename macro DEBUG to SGX_DEBUG.
This fixes LINT kernel build.

Reported by:	lwhsu
Sponsored by:	DARPA, AFRL
2017-08-16 13:44:46 +00:00
Ruslan Bukin
2164af29a0 Add support for Intel Software Guard Extensions (Intel SGX).
Intel SGX allows to manage isolated compartments "Enclaves" in user VA
space. Enclaves memory is part of processor reserved memory (PRM) and
always encrypted. This allows to protect user application code and data
from upper privilege levels including OS kernel.

This includes SGX driver and optional linux ioctl compatibility layer.
Intel SGX SDK for FreeBSD is also available.

Note this requires support from hardware (available since late Intel
Skylake CPUs).

Many thanks to Robert Watson for support and Konstantin Belousov
for code review.

Project wiki: https://wiki.freebsd.org/Intel_SGX.

Reviewed by:	kib
Relnotes:	yes
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D11113
2017-08-16 10:38:06 +00:00
Konstantin Belousov
baec5778ed Print whole machine state on double fault.
It is quite useful when double fault is not caused by a stack overflow.

Tested by:	pho (as part of the larger patch)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-08-14 11:23:07 +00:00
Konstantin Belousov
0fd7ea1f21 Add {rd,wr}{fs,gs}base C wrappers for instructions.
Tested by:	pho (as part of the larger patch)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-08-14 11:20:54 +00:00
Konstantin Belousov
7bf0049e48 Style.
Tested by:	pho (as part of the larger patch)
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2017-08-14 11:20:10 +00:00
Jung-uk Kim
b5669d0aa8 Split identify_cpu() into two functions for amd64 as we do for i386. This
reduces diff between amd64 and i386.  Also, it fixes a regression introduced
in r322076, i.e., identify_hypervisor() failed to identify some hypervisors.
This function assumes cpu_feature2 is already initialized.

Reported by:	dexuan
Tested by:	dexuan
2017-08-09 18:09:09 +00:00
Warner Losh
9057f54d74 Fail to open efirt device when no EFI on system.
libefivar expects opening /dev/efi to indicate if the we can make efi
runtime calls. With a null routine, it was always succeeding leading
efi_variables_supported() to return the wrong value. Only succeed if
we have an efi_runtime table. Also, while I'm hear, out of an
abundance of caution, add a likely redundant check to make sure
efi_systbl is not NULL before dereferencing it. I know it can't be
NULL if efi_cfgtbl is non-NULL, but the compiler doesn't.
2017-08-08 20:44:16 +00:00
Konstantin Belousov
fe04f5e9d0 Avoid DI recursion when reclaim_pv_chunk() is called from
pmap_advise() or pmap_remove().

Reported and tested by:	pho (previous version)
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-08-07 17:29:54 +00:00
Konstantin Belousov
1a47eac0f5 Explain why delayed invalidation is not required in pmap_protect() and
pmap_remove_pages().

Submitted by:	alc
MFC after:	1 week
2017-08-07 17:23:10 +00:00
Jung-uk Kim
0105034487 Detect hypervisors early. We used to set lower hz on hypervisors by default
but it was broken since r273800 (and r278522, its MFC to stable/10) because
identify_cpu() is called too late, i.e., after init_param1().

MFC after:	3 days
2017-08-05 06:56:46 +00:00
Conrad Meyer
6f240e18b5 x86: Tag some intrinsics with __pure2
Some C wrappers for x86 instructions do not touch global memory and only act
on their arguments; they can be marked __pure2, aka __const__.  Without this
annotation, Clang 3.9.1 is not intelligent enough on its own to grok that
these functions are __const__.

Submitted by:	Anton Rang <anton.rang AT isilon.com>
Sponsored by:	Dell EMC Isilon
2017-08-03 22:28:30 +00:00
Ed Schouten
c852847584 Keep top page on CloudABI to work around AMD Ryzen stability issues.
Similar to r321899, reduce sv_maxuser by one page inside of CloudABI.
This ensures that the stack, the vDSO and any allocations cannot touch
the top page of user virtual memory.

Considering that CloudABI userspace is completely oblivious to virtual
memory layout, don't bother making this conditional based on the CPU of
the running system.

Reviewed by:	kib, truckman
Differential Revision:	https://reviews.freebsd.org/D11808
2017-08-02 13:08:10 +00:00
Mateusz Guzik
fd1d4c8159 amd64: annotate the syscall return address check with __predict_false
before:
   0xffffffff80b03ebb <+2059>:	mov    0x460(%r14),%rax
   0xffffffff80b03ec2 <+2066>:	mov    0x98(%rax),%rax
   0xffffffff80b03ec9 <+2073>:	shr    $0x2f,%rax
   0xffffffff80b03ecd <+2077>:	je     0xffffffff80b03edd <amd64_syscall+2093>
   0xffffffff80b03ecf <+2079>:	mov    0x3f8(%r14),%rax
   0xffffffff80b03ed6 <+2086>:	orl    $0x1,0xc8(%rax)
   0xffffffff80b03edd <+2093>:	add    $0xf8,%rsp

after:
   0xffffffff80b03ebb <+2059>:	mov    0x460(%r14),%rax
   0xffffffff80b03ec2 <+2066>:	mov    0x98(%rax),%rax
   0xffffffff80b03ec9 <+2073>:	shr    $0x2f,%rax
   0xffffffff80b03ecd <+2077>:	jne    0xffffffff80b03eef <amd64_syscall+2111>
   0xffffffff80b03ecf <+2079>:	add    $0xf8,%rsp

Reviewed by:	kib
MFC after:	1 week
2017-08-02 11:25:38 +00:00
Konstantin Belousov
6632a4330f Do not call trapsignal() after handling usermode fault or interrupt,
when a signal is not intended to be sent.

The variable holding the signal number to send is left uninitialized,
which sometimes triggers invalid signal checks.

For NMI, a return to usermode without ast processing is done.  On the
other hand, for spurious dtrace probe interrupt it is usermode which
triggered the interrupt, so handle it through userret() as any other
fault.

Reported by:	Nils Beyer <nbe@renzel.net>
PR:	221151
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-08-02 10:12:10 +00:00
Don Lewis
cd155b5603 Lower the amd64 shared page, which contains the signal trampoline,
from the top of user memory to one page lower on machines with the
Ryzen (AMD Family 17h) CPU.  This pushes ps_strings and the stack
down by one page as well.  On Ryzen there is some sort of interaction
between code running at the top of user memory address space and
interrupts that can cause FreeBSD to either hang or silently reset.
This sounds similar to the problem found with DragonFly BSD that
was fixed with this commit:
  https://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/b48dd28447fc8ef62fbc963accd301557fd9ac20
but our signal trampoline location was already lower than the address
that DragonFly moved their signal trampoline to.  It also does not
appear to be related to SMT as described here:
  https://www.phoronix.com/forums/forum/hardware/processors-memory/955368-some-ryzen-linux-users-are-facing-issues-with-heavy-compilation-loads?p=955498#post955498

  "Hi, Matt Dillon here. Yes, I did find what I believe to be a
   hardware issue with Ryzen related to concurrent operations. In a
   nutshell, for any given hyperthread pair, if one hyperthread is
   in a cpu-bound loop of any kind (can be in user mode), and the
   other hyperthread is returning from an interrupt via IRETQ, the
   hyperthread issuing the IRETQ can stall indefinitely until the
   other hyperthread with the cpu-bound loop pauses (aka HLT until
   next interrupt). After this situation occurs, the system appears
   to destabilize. The situation does not occur if the cpu-bound
   loop is on a different core than the core doing the IRETQ. The
   %rip the IRETQ returns to (e.g. userland %rip address) matters a
   *LOT*. The problem occurs more often with high %rip addresses
   such as near the top of the user stack, which is where DragonFly's
   signal trampoline traditionally resides. So a user program taking
   a signal on one thread while another thread is cpu-bound can cause
   this behavior. Changing the location of the signal trampoline
   makes it more difficult to reproduce the problem. I have not
   been because the able to completely mitigate it. When a cpu-thread
   stalls in this manner it appears to stall INSIDE the microcode
   for IRETQ. It doesn't make it to the return pc, and the cpu thread
   cannot take any IPIs or other hardware interrupts while in this
   state."
since the system instability has been observed on FreeBSD with SMT
disabled.  Interrupts to appear to play a factor since running a
signal-intensive process on the first CPU core, which handles most
of the interrupts on my machine, is far more likely to trigger the
problem than running such a process on any other core.

Also lower sv_maxuser to prevent a malicious user from using mmap()
to load and execute code in the top page of user memory that was made
available when the shared page was moved down.

Make the same changes to the 64-bit Linux emulator.

PR:		219399
Reported by:	nbe@renzel.net
Reviewed by:	kib
Reviewed by:	dchagin (previous version)
Tested by:	nbe@renzel.net (earlier version)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D11780
2017-08-02 01:43:35 +00:00
Mark Johnston
2375aaa8e9 Batch updates to v_wire_count when freeing page table pages on x86.
The removed release stores are not needed since stores are totally
ordered on i386 and amd64.

Reviewed by:	alc, kib (previous revision)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D11790
2017-08-01 05:26:30 +00:00
Konstantin Belousov
9adf30b0c3 Remove unused symbols.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-07-30 21:52:22 +00:00
Dmitry Chagin
c151945c86 Avoid using [LINUX_]SHAREDPAGE constant directly in the vdso code.
This is needed for https://reviews.freebsd.org/D11780.

Reported by:	kib@
2017-07-30 21:24:20 +00:00
Alan Cox
782e896088 Add support for pmap_enter(..., psind=1) to the amd64 pmap. In other words,
add support for explicitly requesting that pmap_enter() create a 2MB page
mapping.  (Essentially, this feature allows the machine-independent layer to
create superpage mappings preemptively, and not wait for automatic promotion
to occur.)

Export pmap_ps_enabled() to the machine-independent layer.

Add a flag to pmap_pv_insert_pde() that specifies whether it should fail or
reclaim a PV entry when one is not available.

Refactor pmap_enter_pde() into two functions, one by the same name, that is
a general-purpose function for creating PDE PG_PS mappings, and another,
pmap_enter_2mpage(), that is used to prefault 2MB read- and/or execute-only
mappings for execve(2), mmap(2), and shmat(2).

Submitted by:	Yufeng Zhou <yz70@rice.edu> (an earlier version)
Reviewed by:	kib, markj
Tested by:	pho
MFC after:	10 days
Differential Revision:	https://reviews.freebsd.org/D11556
2017-07-23 06:33:58 +00:00
Ryan Libby
b1a987bb34 __pcpu: gcc -Wredundant-decls
Pollution from counter.h made __pcpu visible in amd64/pmap.c.  Delete
the existing extern decl of __pcpu in amd64/pmap.c and avoid referring
to that symbol, instead accessing the pcpu region via PCPU_SET macros.
Also delete an unused extern decl of __pcpu from mp_x86.c.

Reviewed by:	kib
Approved by:	markj (mentor)
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D11666
2017-07-21 17:11:36 +00:00
Ryan Libby
5e6f40bdef efi: restrict visibility of EFIABI_ATTR-declared functions
In-tree gcc (4.2) doesn't understand __attribute__((ms_abi))
(EFIABI_ATTR).  Avoid declaring functions with that attribute when the
compiler is detected to be gcc < 4.4.

Reviewed by:	kib, imp (previous version)
Approved by:	markj (mentor)
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D11636
2017-07-20 06:47:06 +00:00
Alan Cox
5818f05d39 Style-only change: Consistently use the variable name "pdpg" throughout
this file.  Previously, half of the pointers to a vm_page being used as
a page directory page were named "pdpg" and the rest were named "mpde".

Discussed with:	kib
MFC after:	1 week
2017-07-15 16:42:55 +00:00
Alan Cox
68a870f3bc Extract the innermost loop of pmap_remove() out into its own function,
pmap_remove_ptes().  (This new function will also be used by an upcoming
change to pmap_enter() that adds support for psind == 1 mappings.)

Submitted by:	Yufeng Zhou <yz70@rice.edu> (an earlier version)
Reviewed by:	kib, markj
MFC after:	1 week
2017-07-15 01:49:54 +00:00
Konstantin Belousov
60686c3703 Fix size argument to vm_pager_allocate(), it is in bytes, not in pages.
It is believed to be only cosmetic.

Noted by:	andrew
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-07-13 08:23:37 +00:00
Konstantin Belousov
e766a6bb01 Revert r320936 to recommit with the correct log message. 2017-07-13 08:23:12 +00:00
Konstantin Belousov
89f91fa960 It is believed to be only cosmetic.
Noted by:	andrew
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-07-13 08:19:50 +00:00
Ian Lepore
b524a31593 Protect access to the AT realtime clock with its own mutex.
The mutex protecting access to the registered realtime clock should not be
overloaded to protect access to the atrtc hardware, which might not even be
the registered rtc. More importantly, the resettodr mutex needs to be
eliminated to remove locking/sleeping restrictions on clock drivers, and
that can't happen if MD code for amd64 depends on it. This change moves the
protection into what's really being protected: access to the atrtc date and
time registers.

This change also adds protection when the clock is accessed from
xentimer_settime(), which bypasses the resettodr locking.

Differential Revision:	https://reviews.freebsd.org/D11483
2017-07-12 02:42:57 +00:00
Warner Losh
a94a63f0a6 An MMC/SD/SDIO stack using CAM
Implement the MMC/SD/SDIO protocol within a CAM framework. CAM's
flexible queueing will make it easier to write non-storage drivers
than the legacy stack. SDIO drivers from both the kernel and as
userland daemons are possible, though much of that functionality will
come later.

Some of the CAM integration isn't complete (there are sleeps in the
device probe state machine, for example), but those minor issues can
be improved in-tree more easily than out of tree and shouldn't gate
progress on other fronts. Appologies to reviews if specific items
have been overlooked.

Submitted by: Ilya Bakulin
Reviewed by: emaste, imp, mav, adrian, ian
Differential Review: https://reviews.freebsd.org/D4761

merge with first commit, various compile hacks.
2017-07-09 16:57:24 +00:00
Ryan Libby
5228ad10d4 amd-vi: gcc build errors
amdvi_cmp_wait: gcc complained about a malformed string behind an ifdef.

struct amdvi_dte: widen the type of the first reserved bitfield so that
the packed representation would not cross an alignment boundary for that
type. Apparently that causes in-tree gcc (4.2) to insert padding
(despite packed, resulting in a wrong structure definition), and causes
more modern gcc to emit a warning.

ivrs_hdr_iterate_tbl: delete a misleading check about header length
being less than 0 (the type is unsigned) and replace it with a check
that the length doesn't exceed the table size.

Reviewed by:	anish, grehan
Approved by:	markj (mentor)
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D11485
2017-07-07 06:37:19 +00:00
Sean Bruno
6ef9566177 Garbage collect kernel option TWA_FLASH_FIRMWARE
Submitted by:	 kevin.bowling0kev009.com
Differential Revision:	https://reviews.freebsd.org/D11387
2017-07-03 19:33:50 +00:00
Dmitry Chagin
a0c59c7afd Add support for musl consumers to the Linuxulator.
PR:		213809
Submitted by:	Yonas Yanfa
Reported by:	Yonas Yanfa
MFC after:	1 week
Relnotes:	yes
2017-07-03 10:24:49 +00:00
Alan Cox
510cdf22a2 When "force" is specified to pmap_invalidate_cache_range(), the given
start address is not required to be page aligned.  However, the loop
within pmap_invalidate_cache_range() that performs the actual cache
line invalidations requires that the starting address be truncated to
a multiple of the cache line size.  This change corrects an error in
that truncation.

Submitted by:	Brett Gutstein <bgutstein@rice.edu>
Reviewed by:	kib
MFC after:	1 week
2017-07-01 16:42:09 +00:00
Jason A. Harmening
eb36b1d0bc Clean up MD pollution of bus_dma.h:
--Remove special-case handling of sparc64 bus_dmamap* functions.
  Replace with a more generic mechanism that allows MD busdma
  implementations to generate inline mapping functions by
  defining WANT_INLINE_DMAMAP in <machine/bus_dma.h>.  This
  is currently useful for sparc64, x86, and arm64, which all
  implement non-load dmamap operations as simple wrappers
  around map objects which may be bus- or device-specific.

--Remove NULL-checked bus_dmamap macros.  Implement the
  equivalent NULL checks in the inlined x86 implementation.
  For non-x86 platforms, these checks are a minor pessimization
  as those platforms do not currently allow NULL maps.  NULL
  maps were originally allowed on arm64, which appears to have
  been the motivation behind adding arm[64]-specific barriers
  to bus_dma.h, but that support was removed in r299463.

--Simplify the internal interface used by the bus_dmamap_load*
  variants and move it to bus_dma_internal.h

--Fix some drivers that directly include sys/bus_dma.h
  despite the recommendations of bus_dma(9)

Reviewed by:	kib (previous revision), marius
Differential Revision:	https://reviews.freebsd.org/D10729
2017-07-01 05:35:29 +00:00
Konstantin Belousov
c377ff617b Translate between abridged and full x87 tags for compat32
ptrace(PT_GETFPREGS).

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-06-24 11:38:31 +00:00
Konstantin Belousov
2d88da2f06 Move struct syscall_args syscall arguments parameters container into
struct thread.

For all architectures, the syscall trap handlers have to allocate the
structure on the stack.  The structure takes 88 bytes on 64bit arches
which is not negligible.  Also, it cannot be easily found by other
code, which e.g. caused duplication of some members of the structure
to struct thread already.  The change removes td_dbg_sc_code and
td_dbg_sc_nargs which were directly copied from syscall_args.

The structure is put into the copied on fork part of the struct thread
to make the syscall arguments information correct in the child after
fork.

This move will also allow several more uses shortly.

Reviewed by:	jhb (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
X-Differential revision:	https://reviews.freebsd.org/D11080
2017-06-12 21:03:23 +00:00
Konstantin Belousov
43f41dd393 Make struct syscall_args visible to userspace compilation environment
from machine/proc.h, consistently on all architectures.

Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
X-Differential revision:	https://reviews.freebsd.org/D11080
2017-06-12 20:53:44 +00:00
Alan Cox
d118a00f3c Eliminate duplication of the pmap and pv list unlock operations in
pmap_enter() by implementing a single return path.  Otherwise, the
duplication will only increase with the upcoming support for psind == 1.

Reviewed by:	kib (some time ago)
2017-06-03 17:24:13 +00:00
Dmitry Chagin
9811d215b9 In r246085 some bits that are MI movied out into headers in compat/linux,
but I missed that when I commited x86_64 Linuxulator. So remove the duplicates.

MFC after:	1 week
2017-05-28 08:46:57 +00:00
Edward Tomasz Napierala
43af586011 Bump default MAXTSIZ (kern.maxtsiz) from 128MB to 32GB. The old limit
prevents one from running eg clang built with debug; the new one is
arbitrary (equal to MAXDSIZ) and... well, should be quite future-proof.

Same fix might be applicable to other 64 bit architectures; I'll ask
their respective maintainers to make sure it won't break anything.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D10758
2017-05-17 08:38:41 +00:00
Ed Maste
3e85b721d6 Remove register keyword from sys/ and ANSIfy prototypes
A long long time ago the register keyword told the compiler to store
the corresponding variable in a CPU register, but it is not relevant
for any compiler used in the FreeBSD world today.

ANSIfy related prototypes while here.

Reviewed by:	cem, jhb
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D10193
2017-05-17 00:34:34 +00:00
Conrad Meyer
fcf0952c80 Correct page frame mask constant used in pmap_change_attr_locked
This was introduced in r290156.  It's present in 11.0, but not any 10.x
release unless someone decided to MFC it.

It affects ordinary pages right above the DMAP limit, which is effectively
system memory rounded up to a 1 GB (3rd level superpage) boundary (or up to
a minimum of 4 GB, on small systems).

Reported by:	vangyzen
Reviewed by:	kib, alc
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D4030
2017-05-16 16:20:22 +00:00
Konstantin Belousov
bd101a6648 Ensure that resume path on amd64 only accesses page tables for normal
operation after processor is configured to allow all required
features.

In particular, NX must be enabled in EFER, otherwise load of page
table element with nx bit set causes reserved bit page fault.  Since
malloc uses direct mapping for small allocations, in particular for
the suspension pcbs, and DMAP is nx after r316767, this commit tripped
fault on resume path.

Restore complete state of EFER while wakeup code is still executing
with custom page table, before calling resumectx, instead of trying to
guess which features might be needed before resumectx restored EFER on
its own.

Bisected and tested by:	trasz
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2017-05-15 20:52:43 +00:00
Sepherosa Ziehau
c23a0b35c1 pcicfg: Fix direct calls of pci_cfg{read,write} on systems w/o PCI host bridge.
Reported by:	dexuan@
Reviewed by:	jhb@
MFC after:	1 week
Sponsored by:	Microsoft
Differential Revision: https://reviews.freebsd.org/D10564
2017-05-04 05:28:46 +00:00
Anish Gupta
07ff474a68 Add AMD IOMMU/AMD-Vi support in bhyve for passthrough/direct assignment to VMs. To enable AMD-Vi, set hw.vmm.amdvi.enable=1.
Reviewed by:bcr
Approved by:grehan
Tested by:rgrimes
Differential Revision:https://reviews.freebsd.org/D10049
2017-04-30 02:08:46 +00:00
Jung-uk Kim
af1973281e Use kmem_malloc() instead of malloc(9) for the native amd64 filter.
r316767 broke the BPF JIT compiler for amd64 because malloc()'d space is no
longer executable.

Discussed with:	kib, alc
2017-04-17 22:02:09 +00:00
Jung-uk Kim
e329e330d4 Move declarations for a machine-dependent function to the header file. 2017-04-17 21:51:26 +00:00
Gleb Smirnoff
83c9dea1ba - Remove 'struct vmmeter' from 'struct pcpu', leaving only global vmmeter
in place.  To do per-cpu stats, convert all fields that previously were
  maintained in the vmmeters that sit in pcpus to counter(9).
- Since some vmmeter stats may be touched at very early stages of boot,
  before we have set up UMA and we can do counter_u64_alloc(), provide an
  early counter mechanism:
  o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter.
  o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter,
    so that at early stages of boot, before counters are allocated we already
    point to a counter that can be safely written to.
  o For sparc64 that required a whole dummy pcpu[MAXCPU] array.

Further related changes:
- Don't include vmmeter.h into pcpu.h.
- vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit,
  to match kernel representation.
- struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exclusion.

This is based on benno@'s 4-year old patch:
https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html

Reviewed by:	kib, gallatin, marius, lidl
Differential Revision:	https://reviews.freebsd.org/D10156
2017-04-17 17:34:47 +00:00
Gleb Smirnoff
75c4b0b5ac Remove unused assembly symbols pointing to vmmeter. 2017-04-17 17:18:07 +00:00
Gleb Smirnoff
9ed01c32e0 All these files need sys/vmmeter.h, but now they got it implicitly
included via sys/pcpu.h.
2017-04-17 17:07:00 +00:00
Konstantin Belousov
33c72b24de Map DMAP as nx.
Demotions preserve PG_NX, so it is enough to set nx bit for initial
lowest-level paging entries.

Suggested and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-04-13 15:49:55 +00:00
Patrick Kelsey
67d955aab4 Corrected misspelled versions of rendezvous.
The MFC will include a compat definition of smp_no_rendevous_barrier()
that calls smp_no_rendezvous_barrier().

Reviewed by:	gnn, kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D10313
2017-04-09 02:00:03 +00:00
Tai-hwa Liang
7ece126ed8 Trying to be more compatible with Linux if.h definitions:
- renaming l_ifreq::ifru_metric to l_ifreq::ifru_ivalue;
	- adding a definition for ifr_ifindex which points to l_ifreq::ifru_ivalue.

A quick search indicates that Linux already got the above changes since 2.1.14.

Reviewed by:	kib, marcel, dchagin
MFC after:	1 week
2017-04-08 14:41:39 +00:00
Andriy Gapon
978f3da16f revert r315959 because it causes build problems
The change introduced a dependency between genassym.c and header files
generated from .m files, but that dependency is not specified in the
make files.

Also, the change could be not as useful as I thought it was.

Reported by:	dchagin, Manfred Antar <null@pozo.com>, and many others
2017-03-27 12:34:29 +00:00
Bruce Evans
f434f3515b Fix printing of negative offsets (typically from frame pointers) again.
I fixed this in 1997, but the fix was over-engineered and fragile and
was broken in 2003 if not before.  i386 parameters were copied to 8
other arches verbatim, mostly after they stopped working on i386, and
mostly without the large comment saying how the values were chosen on
i386.  powerpc has a non-verbatim copy which just changes the uncritical
parameter and seems to add a sign extension bug to it.

Just treat negative offsets as offsets if they are no more negative than
-db_offset_max (default -64K), and remove all the broken parameters.

-64K is not very negative, but it is enough for frame and stack pointer
offsets since kernel stacks are small.

The over-engineering was mainly to go more negative than -64K for the
negative offset format, without affecting printing for more than a
single address.

Addresses in the top 64K of a (full 32-bit or 64-bit) address space
are now printed less well, but there aren't many interesting ones.
For arches that have many interesting ones very near the top (e.g.,
68k has interrupt vectors there), there would be no good limit for
the negative offset format and -64K is a good as anything.
2017-03-26 18:46:35 +00:00
Andriy Gapon
a7b4c009e1 specific end of interrupt implementation for AMD Local APIC
The change is more intrusive than I would like because the feature
requires that a vector number is written to a special register.
Thus, now the vector number has to be provided to lapic_eoi().
It was readily available in the IO-APIC and MSI cases, but the IPI
handlers required more work.
Also, we now store the VMM IPI number in a global variable, so that it
is available to the justreturn handler for the same reason.

Reviewed by:	kib
MFC after:	6 weeks
Differential Revision: https://reviews.freebsd.org/D9880
2017-03-25 18:45:09 +00:00
Dmitry Chagin
6c2a934b79 Implement Linux mincore() system call.
This is necessary for the upcoming drm-next.

Suggested by:	hselasky@
MFC after:	1 month
2017-03-25 15:47:29 +00:00
Bruce Evans
4e501eb7cc Remove buggy adjustment of page tables in db_write_bytes().
Long ago, perhaps only on i386, kernel text was mapped read-only and
it was necessary to change the mapping to read-write to set breakpoints
in kernel text.  Other writes by ddb to kernel text were also allowed.
This write protection is harder to implement with 4MB pages, and was
lost even for 4K pages when 4MB pages were implemented.  So changing
the mapping became useless.  It was actually worse than useless since
it followed followed various null and otherwise garbage pointers to
not change random memory instead of the mapping.  (On i386s, the
pointers became good in pmap_bootstrap(), and on amd64 the pointers
became bad in pmap_bootstrap() if not before.)

Another bug broke detection of following of null pointers on i386,
except early in boot where not detecting this was a feature.  When
I fixed the bug, I accidentally broke the feature and soon got traps
in db_write_bytes().  Setting breakpoints early in ddb was broken.

kib pointed out that a clean way to do the adjustment would be to use
a special [sub]map giving a small window on the bytes to be written.

The trap handler didn't know how to fix up errors for pagefaults
accessing the map itself.  Such errors rarely need fixups, since most
traps for the map are for the first access which is a read.

Reviewed by:	kib
2017-03-24 17:34:55 +00:00
Ed Schouten
ebfc28088b Stop providing the compat_3_brand.
As of r315860, the ELF image activator works fine for CloudABI without it.

Reviewed by:	kib
MFC after:	2 weeks
2017-03-23 14:12:21 +00:00
Konstantin Belousov
2274ab3d7b Update r315753 with the proper flag name.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-03-22 22:28:13 +00:00
Konstantin Belousov
1438fe3cf2 Add a flag BI_BRAND_ONLY_STATIC to specify that the brand only
matches static binaries.

Interpretation of the 'static' there is that the binary must not
specify an interpreter.  In particular, shared objects are matched by
the brand if BI_CAN_EXEC_DYN is also set.

This improves precision of the brand matching, which should eliminate
surprises due to brand ordering.

Revert r315701.

Discussed with and tested by:	ed (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-03-22 22:23:01 +00:00
Mark Johnston
3d6732549d Add support for 8- and 16-bit atomic_(f)cmpset to x86.
Reviewed by:	kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D10068
2017-03-22 17:29:04 +00:00
Ed Schouten
ae2373da91 Set the interpreter path to /nonexistent.
CloudABI executables are statically linked and don't have an
interpreter. Setting the interpreter path to NULL used to work
previously, but r314851 introduced code that checks the string
unconditionally. Running CloudABI executables now causes a null pointer
dereference.

Looking at the rest of imgact_elf.c, it seems various other codepaths
already leaned on the fact that the interpreter path is set. Let's just
go ahead and pick an obviously incorrect interpreter path to appease
imgact_elf.c.

MFC after:	1 week
2017-03-22 07:05:27 +00:00
Dmitry Chagin
b1ba0846f1 Implement getrandom() syscall.
Note. GRND_RANDOM option is not supported for now.

MFC after:	1 month
2017-03-18 18:34:29 +00:00
Dmitry Chagin
857129394d To reduce code duplication move socket defines to the MI path.
MFC after:	1 week
2017-03-18 18:23:30 +00:00
Bruce Evans
ff17a6773e Don't access the reserved registers %dr4 and %dr5 on i386.
On the original i386, %dr[4-5] were unimplemented but not very clearly
reserved, so debuggers read them to print them.  i386 was still doing
this.

On the original athlon64, %dr[4-5] are documented as reserved but are
aliased to %dr[6-7] unless CR4_DE is set, when accessing them traps.

On 2 of my systems, accessing %dr[4-5] trapped sometimes.  On my Haswell
system, the apparent randomness was because the boot CPU starts with
CR4_DE set while all other CPUs start with CR4_DE clear.  FreeBSD
doesn't support the data breakpoints enabled by CR4_DE and it never
changes this flag, so the flag remains different across CPUs and
the behaviour seemed inconsistent except while booting when the CPU
doesn't change.

The invalid accesses broke:
- read access for printing the registers in ddb "show watches" on CPUs
  with CR4_DE set
- read accesses in fill_dbregs() on CPUs with CR4_DE set.  This didn't
  implement panic(3) since the user case always skipped %dr[4-5].
- write accesses in set_dbregs().  This also didn't affect userland.
  When it didn't trap, the aliasing made it fragile.

Don't print the dummy (zero) values of %dr[4-5] in "show watches" for
i386 or amd64.  Fix style bugs near this printing.

amd64 also has space in the dbregs struct for the reserved %dr[8-15]
and already didn't print the dummy values for these, and never accessed
any of the 10 reserved debug registers.

Remove cpufuncs for making the invalid accesses.  Even amd64 had these.
2017-03-17 13:49:05 +00:00
Peter Grehan
3da443021f Hide the AMD MONITORX/MWAITX capability.
Otherwise, recent Linux guests will use these instructions, resulting
in #UD exceptions since bhyve doesn't implement MONITOR/MWAIT exits.

This fixes boot-time hangs in recent Linux guests on Ryzen CPUs
(and probably Bulldozer aka AMD FX as well).

Reviewed by:	kib
MFC after:	1 week
2017-03-16 03:21:42 +00:00
Emmanuel Vadot
aa6b345634 Remove i915drm and radeondrm from NOTES and conf.
This unbreak LINT kernel.

Reported by:	lwhsu
2017-03-12 00:52:16 +00:00
Dmitry Chagin
ab60bc8488 Reduce code duplication between MD Linux code by moving SYSV IPC 64-bit
related struct definitions out into the MI path.

Invert the native ipc structs to the Linux ipc structs convesion logic.
Since 64-bit variant of ipc structs has more precision convert native ipc
structs to the 64-bit Linux ipc structs and then truncate 64-bit values
into the non 64-bit if needed. Unlike Linux, return EOVERFLOW if the
values do not fit.

Fix SYSV IPC for 64-bit Linuxulator which never sets IPC_64 bit.

MFC after:	1 month
2017-03-07 17:07:16 +00:00
Mahdi Mokhtari
881b1219aa Regenerated Linuxulator syscall tables for r314782
Approved by:	dchagin
MFC after:	1 month
2017-03-06 18:20:37 +00:00
Mahdi Mokhtari
8049c6bfb8 Add UNIMPLEMENTED() placeholder macro for
the syscalls that are not implemented in Linux kernel itself.
Cleanup DUMMY() macros.

Reviewed by:	dchagin, trasz
Approved by:	dchagin
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D9804
2017-03-06 18:11:38 +00:00
Warner Losh
fbbd9655e5 Renumber copyright clause 4
Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.

Submitted by:	Jan Schaumann <jschauma@stevens.edu>
Pull Request:	https://github.com/freebsd/freebsd/pull/96
2017-02-28 23:42:47 +00:00
Konstantin Belousov
2e6e48fb59 Initialize pcb_save for thread0.
Otherwise kernel traps on NULL dereference if fpu_kern(9) is used from the
thread0 context.

Reported by:	cem
Reviewed by:	cem, jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-02-28 22:54:52 +00:00
Gleb Smirnoff
efe3b0de14 Remove SVR4 (System V Release 4) binary compatibility support.
UNIX System V Release 4 is operating system released in 1988. It ceased
to exist in early 2000-s.
2017-02-28 05:14:42 +00:00
Dmitry Chagin
af68739567 Regen for r314312 (Linux epoll_pwait).
MFC after:	1 month
2017-02-26 19:59:28 +00:00
Dmitry Chagin
f8ae1bb64d Change Linux epoll_pwait syscall definition to match Linux actual one.
MFC after:	1 month
2017-02-26 19:57:18 +00:00
Alan Cox
0314966858 Refine the fix from r312954. Specifically, add a new PDE-only flag,
PG_PROMOTED, that indicates whether lingering 4KB page mappings might
need to be flushed on a PDE change that restricts or destroys a 2MB
page mapping.  This flag allows the pmap to avoid range invalidations
that are both unnecessary and costly.

Reviewed by:	kib, markj
MFC after:	6 weeks
Differential Revision:	https://reviews.freebsd.org/D9665
2017-02-26 19:54:02 +00:00
Dmitry Chagin
dd93b628e9 Implement timerfd family syscalls.
MFC after:	1 month
2017-02-26 09:48:18 +00:00
Dmitry Chagin
354aa2dd56 Regen after r314291 (timerfd definition).
MFC after:	1 month
2017-02-26 09:37:25 +00:00
Dmitry Chagin
1064d53fde Change Linuxulator timerfd syscalls definition to match actual Linux one.
MFC after:	1 month
2017-02-26 09:35:44 +00:00
Edward Tomasz Napierala
e801ac7852 Fix linux_fstatfs() to return proper value for f_frsize. Without it,
linux df(1) binary from Xenial shows garbage.

Reviewed by:	dchagin
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D9692
2017-02-25 20:32:37 +00:00
Mahdi Mokhtari
bd911530b7 Add linux_preadv() and linux_pwritev() syscalls to Linuxulator.
Reviewed by:	dchagin
Approved by:	dchagin, trasz (src committers)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D9722
2017-02-24 20:04:02 +00:00
Dmitry Chagin
8665c4d9cd Revert r314217. Commit is not match that I have approved. 2017-02-24 19:47:27 +00:00
Mahdi Mokhtari
21d23e3249 Add linux_preadv() and linux_pwritev() syscalls to Linuxulator.
Reviewed by:	dchagin
Approved by:	dchagin, trasz (src committers)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D9722
2017-02-24 19:22:17 +00:00
Pedro F. Giffuni
e099b90b80 sys: Replace zero with NULL for pointers.
Found with:	devel/coccinelle
MFC after:	3 weeks
Differential Revision:	https://reviews.freebsd.org/D9694
2017-02-22 02:35:59 +00:00
Mark Johnston
a384a37df8 ddb show pte: use pmap of kdb_thread
show pte from the pmap of the process of the current DDB thread, instead
of necessarily the PCPU pmap.

Submitted by:	Ryan Libby <rlibby@gmail.com>
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D9645
2017-02-21 21:06:12 +00:00
Edward Tomasz Napierala
5a49cd0099 Reimplement linux_arch_prctl() as a wrapper around sysarch(2).
This also adds support for LINUX_ARCH_SET_GS.

Reviewed by:	dchagin
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D9372
2017-02-20 16:13:40 +00:00
Alan Cox
4b4c84cfc3 In pmap_enter(), set the PG_MANAGED flag on the new PTE in one place,
rather two places, and do so before the pmap lock is acquired.

Submitted by:	Yufeng Zhou <yz70@rice.edu>
Reviewed by:	kib
MFC after:	1 week
2017-02-19 18:00:57 +00:00
Dmitry Chagin
486a06bdf0 Implement rt_tgsigqueueinfo system call used by glibc for pthread_sigqueue(3).
MFC after:	2 week
2017-02-19 07:38:11 +00:00
Konstantin Belousov
d9440197b4 Microoptimize amd64/pmap.c pmap_protect_pde().
For the loop that dirties vm_pages in case superpage was written to,
check the complete condition before the loop.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-02-19 03:33:20 +00:00
Jason A. Harmening
e2a8d17887 Bring back r313037, with fixes for mips:
Implement get_pcpu() for amd64/sparc64/mips/powerpc, and use it to
replace pcpu_find(curcpu) in MI code.

Reviewed by:	andreast, kan, lidl
Tested by:	lidl(mips, sparc64), andreast(powerpc)
Differential Revision:	https://reviews.freebsd.org/D9587
2017-02-19 02:03:09 +00:00
Konstantin Belousov
b1fa987835 Merge i386 and amd64 mtrr drivers.
Reviewed by:	royger, jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D9648
2017-02-17 21:08:32 +00:00
Roger Pau Monné
43b00aeb88 x86: fix MTRR initialization if EARLY_AP_STARTUP is used
MTRR handlers are set in {amd64/i686}_mem_drvinit, which is called at
SI_SUB_DRIVERS, and that's too late when EARLY_AP_STARTUP is set because APs
have already started at this point. {amd64/i686}_mrinit is also called too late
for the BSP, since that happens when the memory device is attached, also after
APs have already started.

Move the position to SI_SUB_CPU, and also initialize the state for the BSP, so
that the APs can correctly get to the same state as the BSP.

Sponsored by:		Citrix Systems R&D
MFC after:		1 week
Reviewed by:		jhb, kib
Differential Revision:	https://reviews.freebsd.org/D9630
2017-02-17 12:47:51 +00:00
Edward Tomasz Napierala
d82de05480 Implement linux version of ptrace(2). It's nowhere near complete,
but it allows to use 64 bit linux strace(1) on 64 bit linux binaries.

Reviewed by:	dchagin (earlier version)
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D9406
2017-02-16 13:32:15 +00:00
Edward Tomasz Napierala
c6639ffe4e Regen after r313769.
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
2017-02-15 14:25:50 +00:00
Edward Tomasz Napierala
4ac1825ce3 Fix definition of linux64 ptrace syscall.
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
2017-02-15 14:12:39 +00:00
John Baldwin
bb9b710477 Regenerate all the system call tables to drop "created from" lines.
One of the ibcs2 files contains some actual changes (new headers) as
it hasn't been regenerated after older changes to makesyscalls.sh.
2017-02-10 19:45:02 +00:00
Eric Joyner
cb6b8299fd ixl(4): Update to 1.7.12-k
Refresh upstream driver before impending conversion to iflib.

Major new features:

- Support for Fortville-based 25G adapters
- Support for I2C reads/writes

(To prevent getting or sending corrupt data, you should set
dev.ixl.0.debug.disable_fw_link_management=1 when using I2C
[this will disable link!], then set it to 0 when done. The driver implements
the SIOCGI2C ioctl, so ifconfig -v works for reading I2C data,
but there are read_i2c and write_i2c sysctls under the .debug sysctl tree
[the latter being useful for upper page support in QSFP+]).

- Addition of an iWARP client interface (so the future iWARP driver for
  X722 devices can communicate with the base driver).
  - Compiling this option in is enabled by default, with "options IXL_IW" in
    GENERIC.

Differential Revision:	https://reviews.freebsd.org/D9227
Reviewed by:	sbruno
MFC after:	2 weeks
Sponsored by:	Intel Corporation
2017-02-10 01:04:11 +00:00
Dmitry Chagin
12bc0fb56f Regen after r313284.
MFC after:	2 week
2017-02-05 14:19:19 +00:00
Dmitry Chagin
8b756d40a7 Update syscall.master to 4.10-rc6. Also fix comments, a typo,
and wrong numbering for a few unimplemented syscalls.

For 32-bit Linuxulator, socketcall() syscall was historically
the entry point for the sockets API. Starting in Linux 4.3, direct
syscalls are provided for the sockets API. Enable it.

The initial version of patch was provided by trasz@ and extended by me.

Submitted by:	trasz
MFC after:	2 week
Differential Revision:	https://reviews.freebsd.org/D9381
2017-02-05 14:17:09 +00:00
Jason A. Harmening
ad62ba6e96 Revert r313037
The switch to get_pcpu() in MI code seems to cause hangs on MIPS.
Back out until we can get a better idea of what's happening there.

Reported by:	kan, lidl
2017-02-04 06:24:49 +00:00
Jason A. Harmening
65ed483615 Implement get_pcpu() for the remaining architectures and use it to
replace pcpu_find(curcpu) in MI code.
2017-02-01 03:32:49 +00:00
Edward Tomasz Napierala
ae6b6ef6cb Replace sys_ftruncate() with kern_ftruncate() in various compats.
Reviewed by:	kib@
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D9368
2017-01-30 11:50:54 +00:00
Konstantin Belousov
a0f64f38a1 Do not leave stale 4K TLB entries on pde (superpage) removal or
protection change.

On superpage promotion, x86 pmaps do not invalidate existing 4K
entries for the superpage range, because they are compatible with the
promoted 2/4M entry.  But the invalidation on superpage removal or
protection change only did single INVLPG with the base address of the
superpage.  This reliably flushed superpage TLB entry, and 4K entry
for the first page of the superpage, potentially leaving other 4K TLB
entries lingering.  Do the invalidation of the whole superpage range
to correct the problem.

Note that the precise invalidation is done by x86 code for kernel_pmap
only, for user pmaps whole (per-AS) TLB is flushed.  This made the bug
well hidden, because promotions of the kernel mappings require
specific load.

Reported and tested by:	Jonathan Looney <jtl@netflix.com> (previous version)
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-01-29 19:14:48 +00:00
Baptiste Daroussin
b4b4b5304b Revert crap accidentally committed 2017-01-28 16:31:23 +00:00
Baptiste Daroussin
814aaaa7da Revert r312923 a better approach will be taken later 2017-01-28 16:30:14 +00:00
Tijl Coosemans
86e01d5add Apply r210555 to 64 bit linux support:
The interpreter name should no longer be treated as a buffer that can be
overwritten.

PR:		216346
MFC after:	3 days
2017-01-24 16:13:59 +00:00
Konstantin Belousov
5611aaa195 Use SFENCE for ordering CLFLUSHOPT.
SDM states that CLFLUSHOPT instructions can be ordered with other
writes by SFENCE, heavier MFENCE is not required.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2017-01-20 19:08:44 +00:00
Andriy Gapon
b4a5a4d0d9 vmm_dev: work around a bogus error with gcc 6.3.0
The error is:
vmm_dev.c: In function 'alloc_memseg':
vmm_dev.c:261:11: error: null argument where non-null required (argument 1) [-Werror=nonnull]

Apparently, the gcc is unable to figure out that if a ternary operator
produced a non-NULL value once, then the operator with exactly the same
operands would produce the same value again.

MFC after:	1 week
2017-01-20 13:21:27 +00:00
Ed Schouten
4423244072 Catch up with changes to structure member names.
Pointer/length pairs are now always named ${name} and ${name}_len.
2017-01-17 22:05:52 +00:00
Conrad Meyer
1d64db52f3 Fix a variety of cosmetic typos and misspellings
No functional change.

PR:		216096, 216097, 216098, 216101, 216102, 216106, 216109, 216110
Reported by:	Bulat <bltsrc at mail.ru>
Sponsored by:	Dell EMC Isilon
2017-01-15 18:00:45 +00:00
Mark Johnston
bd7abab0c9 Coalesce TLB shootdowns of global PTEs in pmap_advise() on x86.
We would previously invalidate such entries individually, resulting in more
IPIs than necessary.

Reviewed by:	alc, kib
MFC after:	3 weeks
Differential Revision:	https://reviews.freebsd.org/D9094
2017-01-10 21:52:48 +00:00
Sean Bruno
f2d6ace4a6 Migrate e1000 to the IFLIB framework:
- em(4) igb(4) and lem(4)
- deprecate the igb device from kernel configurations
- create a symbolic link in /boot/kernel from if_em.ko to if_igb.ko

Devices tested:
- 82574L
- I218-LM
- 82546GB
- 82579LM
- I350
- I217

Please report problems to freebsd-net@freebsd.org

Partial review from jhb and suggestions on how to *not* brick folks who
originally would have lost their igbX device.

Submitted by:	mmacy@nextbsd.org
MFC after:	2 weeks
Relnotes:	yes
Sponsored by:	Limelight Networks and Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8299
2017-01-10 03:23:22 +00:00
Mateusz Guzik
f7c6177038 amd64: add atomic_fcmpset
Reviewed by:	kib, jhb
2017-01-03 21:00:24 +00:00
Konstantin Belousov
98db43f4e2 Fix typo. Remove spurious blank line.
MFC after:	3 days
2016-12-18 09:32:23 +00:00
John Baldwin
b663816443 Enable EARLY_AP_STARTUP on amd64 and i386 kernels by default.
PR:		199321, 203682
MFC after:	2 months
Sponsored by:	Netflix
2016-12-16 21:10:37 +00:00
Konstantin Belousov
396a688bd9 Provide non-final but valid PCB pointer for thread0 for duration of
hammer_time().  This makes assembler exception handlers not fault
itself when setting PCB flags, and allow normal kernel trap handler to
get control.  The pointer is reset after FPU parameters are obtained.

Set thread0.td_critnest to 1 for duration of hammer_time() as well.
In particular, page faults at that early stage panic immediately
instead of trying to call not yet operational VM to resolve it.

As result, faults during second half of the hammer_time() execution
have a chance to be reported instead of silent machine reboot or hang.

Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-12-14 11:40:31 +00:00
Konrad Witaszczyk
480f31c214 Add support for encrypted kernel crash dumps.
Changes include modifications in kernel crash dump routines, dumpon(8) and
savecore(8). A new tool called decryptcore(8) was added.

A new DIOCSKERNELDUMP I/O control was added to send a kernel crash dump
configuration in the diocskerneldump_arg structure to the kernel.
The old DIOCSKERNELDUMP I/O control was renamed to DIOCSKERNELDUMP_FREEBSD11 for
backward ABI compatibility.

dumpon(8) generates an one-time random symmetric key and encrypts it using
an RSA public key in capability mode. Currently only AES-256-CBC is supported
but EKCD was designed to implement support for other algorithms in the future.
The public key is chosen using the -k flag. The dumpon rc(8) script can do this
automatically during startup using the dumppubkey rc.conf(5) variable.  Once the
keys are calculated dumpon sends them to the kernel via DIOCSKERNELDUMP I/O
control.

When the kernel receives the DIOCSKERNELDUMP I/O control it generates a random
IV and sets up the key schedule for the specified algorithm. Each time the
kernel tries to write a crash dump to the dump device, the IV is replaced by
a SHA-256 hash of the previous value. This is intended to make a possible
differential cryptanalysis harder since it is possible to write multiple crash
dumps without reboot by repeating the following commands:
# sysctl debug.kdb.enter=1
db> call doadump(0)
db> continue
# savecore

A kernel dump key consists of an algorithm identifier, an IV and an encrypted
symmetric key. The kernel dump key size is included in a kernel dump header.
The size is an unsigned 32-bit integer and it is aligned to a block size.
The header structure has 512 bytes to match the block size so it was required to
make a panic string 4 bytes shorter to add a new field to the header structure.
If the kernel dump key size in the header is nonzero it is assumed that the
kernel dump key is placed after the first header on the dump device and the core
dump is encrypted.

Separate functions were implemented to write the kernel dump header and the
kernel dump key as they need to be unencrypted. The dump_write function encrypts
data if the kernel was compiled with the EKCD option. Encrypted kernel textdumps
are not supported due to the way they are constructed which makes it impossible
to use the CBC mode for encryption. It should be also noted that textdumps don't
contain sensitive data by design as a user decides what information should be
dumped.

savecore(8) writes the kernel dump key to a key.# file if its size in the header
is nonzero. # is the number of the current core dump.

decryptcore(8) decrypts the core dump using a private RSA key and the kernel
dump key. This is performed by a child process in capability mode.
If the decryption was not successful the parent process removes a partially
decrypted core dump.

Description on how to encrypt crash dumps was added to the decryptcore(8),
dumpon(8), rc.conf(5) and savecore(8) manual pages.

EKCD was tested on amd64 using bhyve and i386, mipsel and sparc64 using QEMU.
The feature still has to be tested on arm and arm64 as it wasn't possible to run
FreeBSD due to the problems with QEMU emulation and lack of hardware.

Designed by:	def, pjd
Reviewed by:	cem, oshogbo, pjd
Partial review:	delphij, emaste, jhb, kib
Approved by:	pjd (mentor)
Differential Revision:	https://reviews.freebsd.org/D4712
2016-12-10 16:20:39 +00:00
Warner Losh
8bece6062d Permit loading of efirt module even when there's no EFI to call. The
module loading is successful, but attempts to use it will not be
successful. This is similar to what we do (did?) with ACPI on non-ACPI
systems. We succeed if we can't find the necessary information to hook
into EFI, but still fail if we're unable to allocate resources if we
do find EFI.

Not Objected to by: kib@
MFC Afer: 3 days
2016-12-09 23:37:11 +00:00
Mark Johnston
7f68a896dc Add a COMPAT_FREEBSD11 kernel option.
Use it wherever COMPAT_FREEBSD10 is currently specified.

Reviewed by:	glebius, imp, jhb
Differential Revision:	https://reviews.freebsd.org/D8736
2016-12-09 18:54:12 +00:00