Commit Graph

8376 Commits

Author SHA1 Message Date
Mateusz Guzik
2318ed2508 amd64: provide custom zpcpu set/add/sub routines
Note that clobbers are highly overzealous, can be cleaned up later.
2020-02-12 11:15:33 +00:00
Mateusz Guzik
fb886947d9 amd64: store per-cpu allocations subtracted by __pcpu
This eliminates a runtime subtraction from counter_u64_add.

before:
mov    0x4f00ed(%rip),%rax        # 0xffffffff80c01788 <numfullpathfail4>
sub    0x808ff6(%rip),%rax        # 0xffffffff80f1a698 <__pcpu>
addq   $0x1,%gs:(%rax)

after:
mov    0x4f02fd(%rip),%rax        # 0xffffffff80c01788 <numfullpathfail4>
addq   $0x1,%gs:(%rax)

Reviewed by:	jeff
Differential Revision:	https://reviews.freebsd.org/D23570
2020-02-12 11:12:13 +00:00
Mateusz Guzik
e7ce9c32a7 amd64: remove redundant sa->code assignment from cpu_fetch_syscall_args_fallback
It is already set in the only caller.
2020-02-11 18:15:23 +00:00
Mateusz Guzik
e2b81f518a amd64: clean up counter(9)
- stop open-coding access to per-cpu data, use common macros instead
- consistently use counter_t type where appropriate
2020-02-07 16:22:02 +00:00
Mark Johnston
c3d326fd44 Define MAXCPU consistently between the kernel and KLDs.
This reverts r177661.  The change is no longer very useful since
out-of-tree KLDs will be built to target SMP kernels anyway.  Moveover
it breaks the KBI in !SMP builds since cpuset_t's layout depends on the
value of MAXCPU, and several kernel interfaces, notably
smp_rendezvous_cpus(), take a cpuset_t as a parameter.

PR:		243711
Reviewed by:	jhb, kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D23512
2020-02-05 19:08:21 +00:00
Ed Maste
690a8a6acd regen linuxulator sysent after r357577 2020-02-05 16:54:16 +00:00
Ed Maste
fc7510aef7 linuxulator: implement sendfile
Submitted by:	Bora Özarslan <borako.ozarslan@gmail.com>
Submitted by:	Yang Wang <2333@outlook.jp>
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D19917
2020-02-05 16:53:02 +00:00
Mark Johnston
7c390929d9 Fix map locking in the CLEAR_PKRU sysarch(2) handler.
Reported and tested by:	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2020-02-05 16:09:02 +00:00
Edward Tomasz Napierala
eff8d99fb3 Regen after r357503.
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2020-02-04 16:02:04 +00:00
Edward Tomasz Napierala
369c4633c1 Add missing linux(4) syscall entries. This fixes missing debug
messages for some of the unimplemented syscalls, in particular
the AIO-related ones.

MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D23231
2020-02-04 16:01:06 +00:00
Mark Johnston
1c29da0279 Reimplement stack capture of running threads on i386 and amd64.
After r355784 the td_oncpu field is no longer synchronized by the thread
lock, so the stack capture interrupt cannot be delievered precisely.
Fix this using a loop which drops the thread lock and restarts if the
wrong thread was sampled from the stack capture interrupt handler.

Change the implementation to use a regular interrupt instead of an NMI.
Now that we drop the thread lock, there is no advantage to the latter.

Simplify the KPIs.  Remove stack_save_td_running() and add a return
value to stack_save_td().  On platforms that do not support stack
capture of running threads, stack_save_td() returns EOPNOTSUPP.  If the
target thread is running in user mode, stack_save_td() returns EBUSY.

Reviewed by:	kib
Reported by:	mjg, pho
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D23355
2020-01-31 15:43:33 +00:00
Mateusz Guzik
aa88cc44f3 amd64: speed up failing case for memcmp
Instead of branching on up to 8 bytes, drop the size to 4.

Assorted clean ups while here.

Validated with glibc test suite.
2020-01-30 19:56:22 +00:00
Mateusz Guzik
f0ddecd745 amd64: revamp memcmp
Borrow the trick from memset and memmove and use the scale/index/base addressing
to avoid branches.

If a mismatch is found, the routine has to calculate the difference. Make sure
there is always up to 8 bytes to inspect. This replaces the previous loop which
would operate over up to 16 bytes with an unrolled list of 8 tests.

Speed varies a lot, but this is a net win over the previous routine with probably
a lot more to gain.

Validated with glibc test suite.
2020-01-28 17:48:17 +00:00
Mark Johnston
f01fe060e6 Regen.
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2020-01-21 17:28:36 +00:00
Mark Johnston
149afbf3ba Fix 64-bit syscall argument fetching in 32-bit Linux syscall handlers.
The Linux32 system call argument fetcher places each argument (passed in
registers in the Linux x86 system call convention) into an entry in the
generic system call args array.  Each member of this array is 8 bytes
wide, so this approach is broken for system calls that take off_t
arguments.

Fix the problem by splitting l_loff_t arguments in the 32-bit system
call descriptions, the same as we do for FreeBSD32.  Change entry points
to handle this using the PAIR32TO64 macro.

Move linux_ftruncate64() into compat/linux.

PR:		243155
Reported by:	Alex S <iwtcex@gmail.com>
Reviewed by:	kib (previous version)
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D23210
2020-01-21 17:28:22 +00:00
Konstantin Belousov
2ee49fac82 Add support for Hygon Dhyana Family 18h processor.
As a new x86 CPU vendor, Chengdu Haiguang IC Design Co., Ltd (Hygon)
is a joint venture between AMD and Haiguang Information Technology Co.,
Ltd., aims at providing x86 processors for China server market.

The first generation Hygon processor(Dhyana) shares most architecture
with AMD's family 17h, but with different CPU vendor ID("HygonGenuine")
and PCI vendor ID(0x1d94) and family series number 18h(Hygon negotiated
with AMD to confirm that only Hygon use family 18h).

To enable Hygon Dhyana support in FreeBSD, add new definitions
HYGON_VENDOR_ID("HygonGenuine") and X86_VENDOR_HYGON(0x1d94) to identify
Hygon Dhyana CPU.

Initialize the CPU features(topology, local APIC ext, MSI, TSC, hwpstate,
MCA, DEBUG_CTL, etc) for amd64 and i386 mode by sharing the code path of
AMD family 17h.

The changes have been applied on FreeBSD 13.0-CURRENT and tested
successfully on Hygon Dhyana processor.

References:
[1] Linux kernel patches for Hygon Dhyana, merged in 4.20:

https://git.kernel.org/tip/c9661c1e80b609cd038db7c908e061f0535804ef

[2] MSR and CPUID definition:

https://www.amd.com/system/files/TechDocs/54945_PPR_Family_17h_Models_00h-0Fh.pdf

Submitted by:	Pu Wen <puwen@hygon.cn>
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D23163
2020-01-21 13:22:35 +00:00
Kyle Evans
05d7dd739c sysent targets: further cleanup and deduplication
r355473 vastly improved the readability and cleanliness of these Makefiles.
Every single one of them follows the same pattern and duplicates the exact
same logic.

Now that we have GENERATED/SRCS, split SRCS up into the two parameters we'll
use for ${MAKESYSCALLS} rather than assuming a specific ordering of SRCS and
include a common sysent.mk to handle the rest. This makes it less tedious to
make sweeping changes.

Some default values are provided for GENERATED/SYSENT_*; almost all of these
just use a 'syscalls.master' and 'syscalls.conf' in cwd, and they all use
effectively the same filenames with an arbitrary prefix. Most ABIs will be
able to get away with just setting GENERATED_PREFIX and including
^/sys/conf/sysent.mk, while others only need light additions. kern/Makefile
is the notable exception, as it doesn't take a SYSENT_CONF and the generated
files are spread out between ^/sys/kern and ^/sys/sys, but it otherwise fits
the pattern enough to use the common version.

Reviewed by:	brooks, imp
Nice!:		emaste
Differential Revision:	https://reviews.freebsd.org/D23197
2020-01-18 20:37:45 +00:00
Kyle Evans
1171c633fb Set .ORDER for makesyscalls generated files
When either makesyscalls.lua or syscalls.master changes, all of the
${GENERATED} targets are now out-of-date. With make jobs > 1, this means we
will run the makesyscalls script in parallel for the same ABI, generating
the same set of output files.

Prior to r356603 , there is a large window for interlacing output for some
of the generated files that we were generating in-place rather than staging
in a temp dir. After that, we still should't need to run the script more
than once per-ABI as the first invocation should update all of them. Add
.ORDER to do so cleanly.

Reviewed by:	brooks
Discussed with:	sjg
Differential Revision:	https://reviews.freebsd.org/D23099
2020-01-10 18:24:17 +00:00
Pawel Biernacki
a1d7296784 sysctl: mark more nodes as MPSAFE
vm.kvm_size and vm.kvm_free are read only and marked as MPSAFE on i386
already. Mark them as that on amd64 and arm64 too to avoid locking Giant.

Reviewed by:	kib (mentor)
Approved by:	kib (mentor)
Differential Revision:	https://reviews.freebsd.org/D23039
2020-01-06 10:52:13 +00:00
Mateusz Guzik
2e77cad11d locks: add default delay struct
Use it for all primitives. This makes everything fit in 8 bytes.
2020-01-05 12:48:19 +00:00
Alan Cox
1c3a241032 When a copy-on-write fault occurs, pmap_enter() is called on to replace the
mapping to the old read-only page with a mapping to the new read-write page.
To destroy the old mapping, pmap_enter() must destroy its page table and PV
entries and invalidate its TLB entry.  This change simply invalidates that
TLB entry a little earlier, specifically, on amd64 and arm64, before the PV
list lock is held.

Reviewed by:	kib, markj
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D23027
2020-01-04 19:50:25 +00:00
Konstantin Belousov
b837dadd87 bhyve: terminate waiting loops if thread suspension is requested.
PR:	242724
Reviewed by:	markj
Reported and tested by:	Aleksandr Fedorov <aleksandr.fedorov@itglobal.com>
	 (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D22881
2020-01-02 22:37:04 +00:00
Edward Tomasz Napierala
cc50333011 Add basic getcpu(2) support to linuxulator. The purpose of this
syscall is to query the CPU number and the NUMA domain the calling
thread is currently running on.  The third argument is ignored.
It doesn't do anything regarding scheduling - it's literally
just a way to query the current state, without any guarantees
you won't get rescheduled an opcode later.

This unbreaks Java from CentOS 8
(java-11-openjdk-11.0.5.10-0.el8_0.x86_64).

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22972
2019-12-31 22:01:08 +00:00
Edward Tomasz Napierala
da7627d797 Regen after r356229.
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2019-12-31 16:01:37 +00:00
Edward Tomasz Napierala
a8bfc7a85c Fix definitions for Linux getcpu(2).
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2019-12-31 15:57:29 +00:00
Pawel Biernacki
54666dffa8 linux(4): implement copy_file_range(2)
copy_file_range(2) is implemented natively since r350315, make it available
for Linux binaries too.

Reviewed by:	kib (mentor), trasz (previous version)
Approved by:	kib (mentor)
Differential Revision:	https://reviews.freebsd.org/D22959
2019-12-30 18:11:06 +00:00
Edward Tomasz Napierala
ee0fe82ee2 Implement Linux syslog(2) syscall; just enough to make Linux dmesg(8)
utility work.

MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22465
2019-12-29 15:53:55 +00:00
Brandon Bergren
38f69a619e Unbreak build. It seems that mips and amd64 still pull in link_elf.c, so
we need to have elf_cpu_parse_dynamic() everywhere after all to avoid
an undefined symbol.
2019-12-24 16:52:10 +00:00
Alan Cox
50079417a5 Micro-optimize the control flow in _pmap_unwire_ptp(), and eliminate
unnecessary parentheses.

Reviewed by:	kib, markj
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D22893
2019-12-21 22:32:24 +00:00
Alan Cox
7c237b7c3a Correct a mistakenly inverted condition in r355833.
Noticed by:	kib
X-MFC with:	r355833
2019-12-20 20:46:26 +00:00
Alan Cox
48ef33180a When pmap_enter_{l2,pde}() are called to create a kernel mapping, they are
incrementing (and decrementing) the ref_count on kernel page table pages.
They should not do this.  Kernel page table pages are expected to have a
fixed ref_count.  Address this problem by refactoring pmap_alloc{_l2,pde}()
and their callers.  This also eliminates some duplicated code from the
callers.

Correctly implement PMAP_ENTER_NOREPLACE in pmap_enter_{l2,pde}() on kernel
mappings.

Reduce code duplication by defining a function, pmap_abort_ptp(), for
handling a common error case.

Handle a possible page table page leak in pmap_copy().  Suppose that we are
determining whether to copy a superpage mapping.  If we abort because there
is already a mapping in the destination pmap at the current address, then
simply decrementing the page table page's ref_count is correct, because the
page table page must have a ref_count > 1.  However, if we abort because we
failed to allocate a PV entry, this might be a just allocated page table
page that has a ref_count = 1, so we should call pmap_abort_ptp().

Simplify error handling in pmap_enter_quick_locked().

Reviewed by:	kib, markj (an earlier)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D22763
2019-12-18 18:21:39 +00:00
Edward Tomasz Napierala
b5f20658ee Add compat.linux.emul_path, so it can be set to something other
than "/compat/linux".  Useful when you have several compat directories
with different Linux versions and you don't want to clash with files
installed by linux-c7 packages.

Reviewed by:	bcr (manpages)
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22574
2019-12-16 20:07:04 +00:00
Edward Tomasz Napierala
cf69fe66d4 Add sync_file_range(2) implementation to linux(4); it's a thin wrapper
over the usual fsync(2).

This silences some warnings when running "apt-get upgrade".

Reviewed by:	brooks, emaste
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22371
2019-12-14 13:37:17 +00:00
Edward Tomasz Napierala
0cde2b3239 Regen after r355752.
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22371
2019-12-14 13:32:37 +00:00
Edward Tomasz Napierala
0610f417a4 Fix definitions for linuxulator's sync_file_range(2).
Reviewed by:	brooks, emaste
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22371
2019-12-14 13:30:43 +00:00
John Baldwin
cbd03a9df2 Support software breakpoints in the debug server on Intel CPUs.
- Allow the userland hypervisor to intercept breakpoint exceptions
  (BP#) in the guest.  A new capability (VM_CAP_BPT_EXIT) is used to
  enable this feature.  These exceptions are reported to userland via
  a new VM_EXITCODE_BPT that includes the length of the original
  breakpoint instruction.  If userland wishes to pass the exception
  through to the guest, it must be explicitly re-injected via
  vm_inject_exception().

- Export VMCS_ENTRY_INST_LENGTH as a VM_REG_GUEST_ENTRY_INST_LENGTH
  pseudo-register.  Injecting a BP# on Intel requires setting this to
  the length of the breakpoint instruction.  AMD SVM currently ignores
  writes to this register (but reports success) and fails to read it.

- Rework the per-vCPU state tracked by the debug server.  Rather than
  a single 'stepping_vcpu' global, add a structure for each vCPU that
  tracks state about that vCPU ('stepping', 'stepped', and
  'hit_swbreak').  A global 'stopped_vcpu' tracks which vCPU is
  currently reporting an event.  Event handlers for MTRAP and
  breakpoint exits loop until the associated event is reported to the
  debugger.

  Breakpoint events are discarded if the breakpoint is not present
  when a vCPU resumes in the breakpoint handler to retry submitting
  the breakpoint event.

- Maintain a linked-list of active breakpoints in response to the GDB
  'Z0' and 'z0' packets.

Reviewed by:	markj (earlier version)
MFC after:	2 months
Differential Revision:	https://reviews.freebsd.org/D20309
2019-12-13 19:21:58 +00:00
Mark Johnston
5cff1f4dc3 Introduce vm_page_astate.
This is a 32-bit structure embedded in each vm_page, consisting mostly
of page queue state.  The use of a structure makes it easy to store a
snapshot of a page's queue state in a stack variable and use cmpset
loops to update that state without requiring the page lock.

This change merely adds the structure and updates references to atomic
state fields.  No functional change intended.

Reviewed by:	alc, jeff, kib
Sponsored by:	Netflix, Intel
Differential Revision:	https://reviews.freebsd.org/D22650
2019-12-10 18:14:50 +00:00
John Baldwin
23a5b4ed65 Use 4 byte stack alignment instead of 8 byte.
This was an old bug prior to r355373 and mostly harmless as it would
waste at most a handful of bytes on the stack.
2019-12-09 19:18:05 +00:00
John Baldwin
d8010b1175 Copy out aux args after the argument and environment vectors.
Partially revert r354741 and r354754 and go back to allocating a
fixed-size chunk of stack space for the auxiliary vector.  Keep
sv_copyout_auxargs but change it to accept the address at the end of
the environment vector as an input stack address and no longer
allocate room on the stack.  It is now called at the end of
copyout_strings after the argv and environment vectors have been
copied out.

This should fix a regression in r354754 that broke the stack alignment
for newer Linux amd64 binaries (and probably broke Linux arm64 as
well).

Reviewed by:	kib
Tested on:	amd64 (native, linux64 (only linux-base-c7), and i386)
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D22695
2019-12-09 19:17:28 +00:00
Konstantin Belousov
3e5b13991c amd64: properly set the start of the io permission bitmap for BSP
... after the initial common TSS is copied into its final location
during PCPU reallocation.

Reported by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-12-07 00:23:19 +00:00
Brooks Davis
af796bfa71 sysent: Reduce duplication and improve readability.
Use the power of variable to avoid spelling out source and generated
files too many times.  The previous Makefiles were hard to read, hard to
edit, and badly formatted.

Reviewed by:	kevans, emaste
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D22714
2019-12-06 23:59:23 +00:00
Scott Long
961aacb107 Move the mds, irbs, and ssb mitigation knobs into machdep.mitigations.
They're in both the old and new places in HEAD for the moment for
discussion and transition.  The old locations will be garbage collected
in 4 weeks.  MFCs to 12 an 11 will keep the old and new for transition
purposes.

Reviewed by:	kib
MFC after:	4 weeks
Sponsored by:	Intel
Differential Revision:	https://reviews.freebsd.org/D22590
2019-12-06 02:43:05 +00:00
Warner Losh
f86e60008b Regularize my copyright notice
o Remove All Rights Reserved from my notices
o imp@FreeBSD.org everywhere
o regularize punctiation, eliminate date ranges
o Make sure that it's clear that I don't claim All Rights reserved by listing
  All Rights Reserved on same line as other copyright holders (but not
  me). Other such holders are also listed last where it's clear.
2019-12-04 16:56:11 +00:00
John Baldwin
31174518d2 Use uintptr_t instead of register_t * for the stack base.
- Use ustringp for the location of the argv and environment strings
  and allow destp to travel further down the stack for the stackgap
  and auxv regions.
- Update the Linux copyout_strings variants to move destp down the
  stack as was done for the native ABIs in r263349.
- Stop allocating a space for a stack gap in the Linux ABIs.  This
  used to hold translated system call arguments, but hasn't been used
  since r159992.

Reviewed by:	kib
Tested on:	md64 (amd64, i386, linux64), i386 (i386, linux)
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D22501
2019-12-03 23:17:54 +00:00
Jeff Roberson
0f9e06e18b Fix a few places that free a page from an object without busy held. This is
tightening constraints on busy as a precursor to lockless page lookup and
should largely be a NOP for these cases.

Reviewed by:	alc, kib, markj
Differential Revision:	https://reviews.freebsd.org/D22611
2019-12-02 22:42:05 +00:00
Anish Gupta
84474332d3 bhyve amd: amdvi_dump_cmds() log the command for which the command completion failed. Completion is checked in poll mode although it can be done using interrupts.
No need to log all the commands in command ring but only the last one for which completion failed.

Reported by: np@freebsd.org
Reviewed by: np, markj
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D22566
2019-12-01 04:00:08 +00:00
Scott Long
33ce28d137 Remove the trm(4) driver
Differential Revision:	https://reviews.freebsd.org/D22575
2019-11-28 02:32:17 +00:00
Konstantin Belousov
13189065cb amd64: assert that EARLY_COUNTER does not corrupt memory.
Reviewed by:	imp
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D22514
2019-11-24 19:02:13 +00:00
Andrew Turner
68cad68149 Add kcsan_md_unsupported from NetBSD.
It's used to ignore virtual addresses that may have a different physical
address depending on the CPU.

Sponsored by:	DARPA, AFRL
2019-11-21 13:22:23 +00:00
Andrew Turner
1b8c58f283 Fix for style(9): use parentheses around return statements.
Reported by:	kib
Sponsored by:	DARPA, AFRL
2019-11-21 12:29:20 +00:00
Andrew Turner
849aef496d Port the NetBSD KCSAN runtime to FreeBSD.
Update the NetBSD Kernel Concurrency Sanitizer (KCSAN) runtime to work in
the FreeBSD kernel. It is a useful tool for finding data races between
threads executing on different CPUs.

This can be enabled by enabling KCSAN in the kernel config, or by using the
GENERIC-KCSAN amd64 kernel. It works on amd64 and arm64, however the later
needs a compiler change to allow -fsanitize=thread that KCSAN uses.

Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D22315
2019-11-21 11:22:08 +00:00
Konstantin Belousov
da248a69aa amd64: in double fault handler, do not rely on sane gsbase value.
Typical reasons for doublefault faults are either kernel stack
overflow or bugs in the code that manipulates protection CPU state.
The later code is the code which often has to set up gsbase for
kernel.  Switching to explicit load of GSBASE MSR in the fault handler
makes it more probable to output a useful information.

Now all IST handlers have nmi_pcpu structure on top of their stacks.

It would be even more useful to save gsbase value at the moment of the
fault.  I did not this because I do not want to modify PCB layout now.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-11-20 11:12:19 +00:00
Kyle Evans
f22a592111 Convert in-tree sysent targets to use new makesyscalls.lua
flua is bootstrapped as part of the build for those on older
versions/revisions that don't yet have flua installed. Once upgraded past
r354833, "make sysent" will again naturally work as expected.

Reviewed by:	brooks
Differential Revision:	https://reviews.freebsd.org/D21894
2019-11-18 23:28:23 +00:00
John Baldwin
03b0d68c72 Check for errors from copyout() and suword*() in sv_copyout_args/strings.
Reviewed by:	brooks, kib
Tested on:	amd64 (amd64, i386, linux64), i386 (i386, linux)
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D22401
2019-11-18 20:07:43 +00:00
Mark Johnston
85e06c728c Set MALLOC_DEBUG_MAXZONES=1 in GENERIC-NODEBUG configurations.
The purpose of this option is to make it easier to track down memory
corruption bugs by reducing the number of malloc(9) types that might
have recently been associated with a given chunk of memory.  However, it
increases fragmentation and is disabled in release kernels.

MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2019-11-18 20:03:28 +00:00
Konstantin Belousov
b2e1b88984 amd64 copyout: remove irrelevant comment.
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2019-11-17 14:41:47 +00:00
Scott Long
e372160177 TSX Asynchronous Abort mitigation for Intel CVE-2019-11135.
This CVE has already been announced in FreeBSD SA-19:26.mcu.

Mitigation for TAA involves either turning off TSX or turning on the
VERW mitigation used for MDS. Some CPUs will also be self-mitigating
for TAA and require no software workaround.

Control knobs are:
machdep.mitigations.taa.enable:
        0 - no software mitigation is enabled
        1 - attempt to disable TSX
        2 - use the VERW mitigation
        3 - automatically select the mitigation based on processor
	    features.

machdep.mitigations.taa.state:
        inactive        - no mitigation is active/enabled
        TSX disable     - TSX is disabled in the bare metal CPU as well as
                        - any virtualized CPUs
        VERW            - VERW instruction clears CPU buffers
	not vulnerable	- The CPU has identified itself as not being
			  vulnerable

Nothing in the base FreeBSD system uses TSX.  However, the instructions
are straight-forward to add to custom applications and require no kernel
support, so the mitigation is provided for users with untrusted
applications and tenants.

Reviewed by:	emaste, imp, kib, scottph
Sponsored by:	Intel
Differential Revision:	22374
2019-11-16 00:26:42 +00:00
John Baldwin
5caa67fa84 Use a sv_copyout_auxargs hook in the Linux ELF ABIs.
Reviewed by:	emaste
Tested on:	amd64 (linux64 only), i386
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D22356
2019-11-15 23:01:43 +00:00
John Baldwin
e353233118 Add a sv_copyout_auxargs() hook in sysentvec.
Change the FreeBSD ELF ABIs to use this new hook to copyout ELF auxv
instead of doing it in the sv_fixup hook.  In particular, this new
hook allows the stack space to be allocated at the same time the auxv
values are copied out to userland.  This allows us to avoid wasting
space for unused auxv entries as well as not having to recalculate
where the auxv vector is by walking back up over the argv and
environment vectors.

Reviewed by:	brooks, emaste
Tested on:	amd64 (amd64 and i386 binaries), i386, mips, mips64
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D22355
2019-11-15 18:42:13 +00:00
Josh Paetzel
052e12a508 Add the pvscsi driver to the tree.
This driver allows to usage of the paravirt SCSI controller
in VMware products like ESXi.  The pvscsi driver provides a
substantial performance improvement in block devices versus
the emulated mpt and mps SCSI/SAS controllers.

Error handling in this driver has not been extensively tested
yet.

Submitted by:	vbhakta@vmware.com
Relnotes:	yes
Sponsored by:	VMware, Panzura
Differential Revision:	D18613
2019-11-14 23:31:20 +00:00
Konstantin Belousov
c4f056e8ea amd64: only set PCB_FULL_IRET pcb flag when #gp or similar exception comes
from usermode.

If CPU supports RDFSBASE, the flag also means that userspace fsbase
and gsbase are already written into pcb, which might be not true when
we handle #gp from kernel.

The offender is rdmsr_safe(), and the visible result is corrupted
userspace TLS base.

Reported by:	pstef
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2019-11-13 22:39:46 +00:00
Konstantin Belousov
c08973d09c Workaround for Intel SKL002/SKL012S errata.
Disable the use of executable 2M page mappings in EPT-format page
tables on affected CPUs.  For bhyve virtual machines, this effectively
disables all use of superpage mappings on affected CPUs.  The
vm.pmap.allow_2m_x_ept sysctl can be set to override the default and
enable mappings on affected CPUs.

Alternate approaches have been suggested, but at present we do not
believe the complexity is warranted for typical bhyve's use cases.

Reviewed by:	alc, emaste, markj, scottl
Security:	CVE-2018-12207
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D21884
2019-11-12 18:01:33 +00:00
Konstantin Belousov
a7af4a3e7d amd64: move GDT into PCPU area.
Reviewed by:	jhb, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D22302
2019-11-12 15:51:47 +00:00
Konstantin Belousov
de6f295446 amd64: assert that size of the software prototype table for gdt is equal
to the size of hardware gdt.

Reviewed by:	jhb, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D22302
2019-11-12 15:47:46 +00:00
Andriy Gapon
78f1851613 teach db_nextframe/x86 about [X]xen_intr_upcall interrupt handler
Discussed with:	kib, royger
MFC after:	3 weeks
Sponsored by:	Panzura
2019-11-12 11:00:01 +00:00
Konstantin Belousov
6cd492bcd4 amd64: Issue MFENCE on context switch on AMD CPUs when reusing address space.
On some AMD CPUs, in particular, machines that do not implement
CLFLUSHOPT but do provide CLFLUSH, the CLFLUSH instruction is only
synchronized with MFENCE.

Code using CLFLUSH typicall needs to brace it with MFENCE both before
and after flush, see for instance pmap_invalidate_cache_range().  If
context switch occurs while inside the protected region, we need to
ensure visibility of flushes done on the old CPU, to new CPU.

For all other machines, locked operation done to lock switched thread,
should be enough.  For case of different address spaces, reload of
%cr3 is serializing.

Reviewed by:	cem, jhb, scottph
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D22007
2019-11-11 21:59:20 +00:00
Andriy Gapon
2961e6efeb db_nextframe/amd64: remove TRAP_INTERRUPT frame type
Besides the confusing name, this type is effectively unused.
In all cases where it could be set, the INTERRUPT type is set by the
earlier code.  The conditions for TRAP_INTERRUPT are a subset of the
conditions for INTERRUPT.

Reviewed by:	kib, markj
MFC after:	2 weeks
Differential Revision: https://reviews.freebsd.org/D22305
2019-11-11 17:11:49 +00:00
Konstantin Belousov
415d23ebfd amd64: change r_gdt to the local variable in hammer_time().
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-11-10 10:03:22 +00:00
Konstantin Belousov
d70bab39f2 amd64: Change SFENCE to locked op for synchronizing with CLFLUSHOPT on Intel.
Reviewed by:	cem, jhb
Discussed with:	alc, scottph
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D22007
2019-11-10 09:41:29 +00:00
Konstantin Belousov
98158c753d amd64: move common_tss into pcpu.
This saves some memory, around 256K I think.  It removes some code,
e.g. KPTI does not need to specially map common_tss anymore.  Also,
common_tss become domain-local.

Reviewed by:	jhb
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D22231
2019-11-10 09:28:18 +00:00
Eric van Gyzen
854e90da4e vmm: pass M_WAITOK to uma_zalloc when allocating FPU save area
Submitted by:	patrick.sullivan3@dell.com
Reviewed by:	markj
MFC after:	2 weeks
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D22276
2019-11-08 16:30:55 +00:00
Konstantin Belousov
83ba1468ab amd64: Store %cr3 into pcpu saved_ucr3 on double fault.
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-11-03 11:52:50 +00:00
Konstantin Belousov
7ccd639deb amd64 ddb: Add printing of kernel/user and saved user %cr3 values from pcpu.
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-11-03 11:51:53 +00:00
Edward Tomasz Napierala
2ae3f52cee There's nothing architecture specific in "options STATS"; move it from
sys/amd64/conf/NOTES to sys/conf/NOTES.

Suggested by:	jhb@
Sponsored by:	Klara Inc, Netflix
2019-10-30 10:16:28 +00:00
Konstantin Belousov
af592d0465 Fix reset of the kernel stack pointer in TSS for !PTI case on pmap activation
after r354095.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2019-10-28 10:50:37 +00:00
Konstantin Belousov
20795e252a Provide dummy definition of the amd64 struct pcb for -m32 compilation.
I do not see a need in the proper x86/include/pcb.h header.

Reported and tested by:	antoine
MFC after:	1 week
2019-10-26 18:22:52 +00:00
Konstantin Belousov
5e921ff49e amd64: move pcb out of kstack to struct thread.
This saves 320 bytes of the precious stack space.

The only negative aspect of the change I can think of is that the
struct thread increased by 320 bytes obviously, and that 320 bytes are
not swapped out anymore. I believe the freed stack space is much more
important than that.  Also, current struct thread size is 1392 bytes
on amd64, so UMA will allocate two thread structures per (4KB) slab,
which leaves a space for pcb without increasing zone memory use.

Reviewed by:	alc, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D22138
2019-10-25 20:09:42 +00:00
Mateusz Guzik
08ded448cf amd64 pmap: per-domain pv chunk list
This significantly reduces contention since chunks get created and removed
all the time. See the review for sample results.

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21976
2019-10-23 19:17:10 +00:00
Conrad Meyer
639ec13157 amd64: Add CFI directives for libc syscall stubs
No functional change (in program code).  Additional DWARF metadata is
generated in the .eh_frame section.  Also, it is now a compile-time
requirement that machine/asm.h ENTRY() and END() macros are paired.  (This
is subject to ongoing discussion and may change.)

This DWARF metadata allows llvm-libunwind to unwind program stacks when the
program is executing the function.  The goal is to collect accurate
userspace stacktraces when programs have entered syscalls.

(The motivation for "Call Frame Information," or CFI for short -- not to be
confused with Control Flow Integrity -- is to sufficiently annotate assembly
functions such that stack unwinders can unwind out of the local frame
without the requirement of a dedicated framepointer register; i.e.,
-fomit-frame-pointer.  This is necessary for C++ exception handling or
collecting backtraces.)

For the curious, a more thorough description of the metadata and some
examples may be found at [1] and documentation at [2].  You can also look at
'cc -S -o - foo.c | less' and search for '.cfi_' to see the CFI directives
generated by your C compiler.

[1]: https://www.imperialviolet.org/2017/01/18/cfi.html
[2]: https://sourceware.org/binutils/docs/as/CFI-directives.html

Reviewed by:	emaste, kib (with reservations)
Differential Revision:	https://reviews.freebsd.org/D22122
2019-10-23 19:03:03 +00:00
Mateusz Guzik
61b8430f38 amd64 pmap: conditionalize per-superpage locks on NUMA
Instead of superpages use. The current code employs superpage-wide locking
regardless and the better locking granularity is welcome with NUMA enabled
even when superpage support is not used.

Requested by:	alc
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21982
2019-10-22 22:55:46 +00:00
Mateusz Guzik
15e33b5493 amd64 pmap: fixup invlgen lookup for fictitious mappings
Similarly to r353438, use dummy entry.

Reported and tested by:	Neel Chauhan
Sponsored by:	The FreeBSD Foundation
2019-10-22 22:54:41 +00:00
Andriy Gapon
869dbab7ba vmm: remove a wmb() call
After removing wmb(), vm_set_rendezvous_func() became super trivial, so
there was no point in keeping it.

The wmb (sfence on amd64, lock nop on i386) was not needed.  This can be
explained from several points of view.

First, wmb() is used for store-store ordering (although, the primitive
is undocumented).  There was no obvious subsequent store that needed the
barrier.

Second, x86 has a memory model with strong ordering including total
store order.  An explicit store barrier may be needed only when working
with special memory (device, special caching mode) or using special
instructions (non-temporal stores).  That was not the case for this
code.

Third, I believe that there is a misconception that sfence "flushes" the
store buffer in a sense that it speeds up the propagation of stores from
the store buffer to the global visibility.  I think that such
propagation always happens as fast as possible.  sfence only makes
subsequent stores wait for that propagation to complete.  So, sfence is
only useful for ordering of stores and only in the situations described
above.

Reviewed by:	jhb
MFC after:	23 days
Differential Revision: https://reviews.freebsd.org/D21978
2019-10-19 07:10:15 +00:00
Mark Johnston
14327f5334 Tighten mapping protections on preloaded files on amd64.
- We load the kernel at 0x200000.  Memory below that address need not
  be executable, so do not map it as such.
- Remove references to .ldata and related sections in the kernel linker
  script.  They come from ld.bfd's default linker script, but are not
  used, and we now use ld.lld to link the amd64 kernel.  lld does not
  contain a default linker script.
- Pad the .bss to a 2MB as we do between .text and .data.  This
  forces the loader to load additional files starting in the following
  2MB page, preserving the use of superpage mappings for kernel data.
- Map memory above the kernel image with NX.  The kernel linker now
  upgrades protections as needed, and other preloaded file types
  (e.g., entropy, microcode) need not be mapped with execute permissions
  in the first place.

Reviewed by:	kib
MFC after:	1 month
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21859
2019-10-18 14:05:13 +00:00
Yuri Pankov
a161fba992 linux: futex_mtx should follow futex_list
Move futex_mtx to linux_common.ko for amd64 and aarch64 along
with respective list/mutex init/destroy.

PR:		240989
Reported by:	Alex S <iwtcex@gmail.com>
2019-10-18 12:25:33 +00:00
Conrad Meyer
dda17b3672 Implement NetGDB(4)
NetGDB(4) is a component of a system using a panic-time network stack to
remotely debug crashed FreeBSD kernels over the network, instead of
traditional serial interfaces.

There are three pieces in the complete NetGDB system.

First, a dedicated proxy server must be running to accept connections from
both NetGDB and gdb(1), and pass bidirectional traffic between the two
protocols.

Second, the NetGDB client is activated much like ordinary 'gdb' and
similarly to 'netdump' in ddb(4) after a panic.  Like other debugnet(4)
clients (netdump(4)), the network interface on the route to the proxy server
must be online and support debugnet(4).

Finally, the remote (k)gdb(1) uses 'target remote <proxy>:<port>' (like any
other TCP remote) to connect to the proxy server.

The NetGDB v1 protocol speaks the literal GDB remote serial protocol, and
uses a 1:1 relationship between GDB packets and sequences of debugnet
packets (fragmented by MTU).  There is no encryption utilized to keep
debugging sessions private, so this is only appropriate for local
segments or trusted networks.

Submitted by:	John Reimer <john.reimer AT emc.com> (earlier version)
Discussed some with:	emaste, markj
Relnotes:	sure
Differential Revision:	https://reviews.freebsd.org/D21568
2019-10-17 21:33:01 +00:00
Conrad Meyer
7790c8c199 Split out a more generic debugnet(4) from netdump(4)
Debugnet is a simplistic and specialized panic- or debug-time reliable
datagram transport.  It can drive a single connection at a time and is
currently unidirectional (debug/panic machine transmit to remote server
only).

It is mostly a verbatim code lift from netdump(4).  Netdump(4) remains
the only consumer (until the rest of this patch series lands).

The INET-specific logic has been extracted somewhat more thoroughly than
previously in netdump(4), into debugnet_inet.c.  UDP-layer logic and up, as
much as possible as is protocol-independent, remains in debugnet.c.  The
separation is not perfect and future improvement is welcome.  Supporting
INET6 is a long-term goal.

Much of the diff is "gratuitous" renaming from 'netdump_' or 'nd_' to
'debugnet_' or 'dn_' -- sorry.  I thought keeping the netdump name on the
generic module would be more confusing than the refactoring.

The only functional change here is the mbuf allocation / tracking.  Instead
of initiating solely on netdump-configured interface(s) at dumpon(8)
configuration time, we watch for any debugnet-enabled NIC for link
activation and query it for mbuf parameters at that time.  If they exceed
the existing high-water mark allocation, we re-allocate and track the new
high-water mark.  Otherwise, we leave the pre-panic mbuf allocation alone.
In a future patch in this series, this will allow initiating netdump from
panic ddb(4) without pre-panic configuration.

No other functional change intended.

Reviewed by:	markj (earlier version)
Some discussion with:	emaste, jhb
Objection from:	marius
Differential Revision:	https://reviews.freebsd.org/D21421
2019-10-17 16:23:03 +00:00
Mark Johnston
6e6d41f20d Introduce pmap_change_prot() for amd64.
This updates the protection attributes of subranges of the kernel map.
Unlike pmap_protect(), which is typically used for user mappings,
pmap_change_prot() does not perform lazy upgrades of protections.
pmap_change_prot() also updates the aliasing range of the direct map.

Reviewed by:	kib
MFC after:	1 month
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21758
2019-10-16 22:12:34 +00:00
Mark Johnston
01cef4caa7 Remove page locking from pmap_mincore().
After r352110 the page lock no longer protects a page's identity, so
there is no purpose in locking the page in pmap_mincore().  Instead,
if vm.mincore_mapped is set to the non-default value of 0, re-lookup
the page after acquiring its object lock, which holds the page's
identity stable.

The change removes the last callers of vm_page_pa_tryrelock(), so
remove it.

Reviewed by:	kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21823
2019-10-16 22:03:27 +00:00
Conrad Meyer
f677fed5a2 ddb: Add support for disassembling 'crc32' on amd64 2019-10-16 18:27:27 +00:00
Andriy Gapon
edca4938f7 itwd(4): driver for watchdog function in ITE Super I/O chips
The chips are commonly named with "IT" prefix.

MFC after:	19 days
2019-10-16 14:57:38 +00:00
Jeff Roberson
638f867814 (6/6) Convert pmap to expect busy in write related operations now that all
callers hold it.

This simplifies pmap code and removes a dependency on the object lock.

Reviewed by:    kib, markj
Tested by:      pho
Sponsored by:   Netflix, Intel
Differential Revision:	https://reviews.freebsd.org/D21596
2019-10-15 03:51:46 +00:00
Jeff Roberson
0012f373e4 (4/6) Protect page valid with the busy lock.
Atomics are used for page busy and valid state when the shared busy is
held.  The details of the locking protocol and valid and dirty
synchronization are in the updated vm_page.h comments.

Reviewed by:    kib, markj
Tested by:      pho
Sponsored by:   Netflix, Intel
Differential Revision:        https://reviews.freebsd.org/D21594
2019-10-15 03:45:41 +00:00
Jeff Roberson
205be21d99 (3/6) Add a shared object busy synchronization mechanism that blocks new page
busy acquires while held.

This allows code that would need to acquire and release a very large number
of page busy locks to use the old mechanism where busy is only checked and
not held.  This comes at the cost of false positives but never false
negatives which the single consumer, vm_fault_soft_fast(), handles.

Reviewed by:    kib
Tested by:      pho
Sponsored by:   Netflix, Intel
Differential Revision:	https://reviews.freebsd.org/D21592
2019-10-15 03:41:36 +00:00
Mateusz Guzik
f48d651dd9 amd64 pmap: handle fictitious mappigns with addresses beyond pv_table
There are provisions to do it already with pv_dummy, but new locking code
did not account for it. Previous one did not have the problem because
it hashed the address into the lock array.

While here annotate common vars with __read_mostly and __exclusive_cache_line.

Reported by:	Thomas Laus
Tesetd by:	jkim, Thomas Laus
Fixes: r353149 ("amd64 pmap: implement per-superpage locks")
Sponsored by:	The FreeBSD Foundation
2019-10-11 14:57:47 +00:00
Doug Ambrisko
f2521a76ed This driver attaches to the Intel VMD drive and connects a new PCI domain
starting at the max. domain, and then work down.  Then existing FreeBSD
drivers will attach.  Interrupt routing from the VMD MSI-X to the NVME
drive is not well known, so any interrupt is sent to all children that
register.

VROC used Intel meta data so graid(8) works with it. However, graid(8)
supports RAID 0,1,10 for read and write. I have some early code to
support writes with RAID 5.  Note that RAID 5 can have life issues
with SSDs since it can cause write amplification from updating the parity
data.

Hot plug support needs a change to skip the following check to work:
	if (pcib_request_feature(dev, PCI_FEATURE_HP) != 0) {
in sys/dev/pci/pci_pci.c.

Looked at by: imp, rpokala, bcr
Differential Revision:	https://reviews.freebsd.org/D21383
2019-10-10 03:12:17 +00:00
Mateusz Guzik
fa43c5d49e amd64: plug spurious cld instructions
ABI already guarantees the direction is forward. Note this does not take care
of i386-specific cld's.

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21906
2019-10-08 21:14:11 +00:00
Mark Johnston
8f52459857 Simplify pmap_page_array_startup() a bit.
No functional change intended.

Sponsored by:	The FreeBSD Foundation
2019-10-08 16:42:50 +00:00
Mateusz Guzik
06d25b07c6 amd64 pmap: allocate pv table entries for gaps in PA
This matches the state prior to r353149 and fixes crashes with DRM
modules.

Reported and tested by:	cy, garga, Krasznai Andras
Fixes: r353149 ("amd64 pmap: implement per-superpage locks")
Sponsored by:	The FreeBSD Foundation
2019-10-08 14:59:50 +00:00
Edward Tomasz Napierala
1a13f2e6b4 Introduce stats(3), a flexible statistics gathering API.
This provides a framework to define a template describing
a set of "variables of interest" and the intended way for
the framework to maintain them (for example the maximum, sum,
t-digest, or a combination thereof).  Afterwards the user
code feeds in the raw data, and the framework maintains
these variables inside a user-provided, opaque stats blobs.
The framework also provides a way to selectively extract the
stats from the blobs.  The stats(3) framework can be used in
both userspace and the kernel.

See the stats(3) manual page for details.

This will be used by the upcoming TCP statistics gathering code,
https://reviews.freebsd.org/D20655.

The stats(3) framework is disabled by default for now, except
in the NOTES kernel (for QA); it is expected to be enabled
in amd64 GENERIC after a cool down period.

Reviewed by:	sef (earlier version)
Obtained from:	Netflix
Relnotes:	yes
Sponsored by:	Klara Inc, Netflix
Differential Revision:	https://reviews.freebsd.org/D20477
2019-10-07 19:05:05 +00:00
Mateusz Guzik
6114fc8b85 amd64 pmap: implement per-superpage locks
The current 256-lock sized array is a problem in the following ways:
- it's way too small
- there are 2 locks per cacheline
- it is not NUMA-aware

Solve these issues by introducing per-superpage locks backed by pages
allocated from respective domains.

This significantly reduces contention e.g. during poudriere -j 104.
See the review for results.

Reviewed by:	kib
Discussed with:	jeff
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21833
2019-10-06 22:13:35 +00:00
Ed Maste
9923b64177 Remove host binary object drivers from GENERIC
Four drivers (hpt27xx, hptmv, hptnr, hptrr, hpt27xx) include precompiled
binary objects; have users load them as modules if they are needed.

Additional work (i.e., integrating devmatch) required before MFC.

Reviewed by:	markj
Relnotes:	Yes
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21865
2019-10-03 12:51:57 +00:00
Mark Johnston
0bed9d03b4 Remove more identifiers orphaned by r351742.
Reviewed by:	alc, kib
Differential Revision:	https://reviews.freebsd.org/D21642
2019-09-30 20:39:25 +00:00
Mateusz Guzik
4093e719b4 amd64 pmap: batch chunk removal in pmap_remove_pages
pv list lock is the main bottleneck during poudriere -j 104 and
pmap_remove_pages is the most impactful consumer. It frees chunks with the lock
held even though it plays no role in correctness. Moreover chunks are often
freed in groups, sample counts during buildkernel (0-sized frees removed):

    value  ------------- Distribution ------------- count
          0 |                                         0
          1 |                                         8
          2 |@@@@@@@                                  19329
          4 |@@@@@@@@@@@@@@@@@@@@@@                   58517
          8 |                                         1085
         16 |                                         71
         32 |@@@@@@@@@@                               24919
         64 |                                         899
        128 |                                         7
        256 |                                         2
        512 |                                         0

Thus:
1. batch freeing
2. move it past unlocking pv list

Reviewed by:	alc (previous version), markj (previous version), kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21832
2019-09-29 20:44:13 +00:00
Mark Johnston
d3588766e1 Correct the scope of several global variables.
They are accessed from multiple compilation units.  No functional change
intended.

MFC after:	1 week
Sponsored by:	Netflix
2019-09-27 21:04:33 +00:00
Konstantin Belousov
df08823d07 Improve MD page fault handlers.
Centralize calculation of signal and ucode delivered on unhandled page
fault in new function vm_fault_trap().  MD trap_pfault() now almost
always uses the signal numbers and error codes calculated in
consistent MI way.

This introduces the protection fault compatibility sysctls to all
non-x86 architectures which did not have that bug, but apparently they
were already much more wrong in selecting delivered signals on
protection violations.

Change the delivered signal for accesses to mapped area after the
backing object was truncated.  According to POSIX description for
mmap(2):
   The system shall always zero-fill any partial page at the end of an
   object. Further, the system shall never write out any modified
   portions of the last page of an object which are beyond its
   end. References within the address range starting at pa and
   continuing for len bytes to whole pages following the end of an
   object shall result in delivery of a SIGBUS signal.

   An implementation may generate SIGBUS signals when a reference
   would cause an error in the mapped object, such as out-of-space
   condition.
Adjust according to the description, keeping the existing
compatibility code for SIGSEGV/SIGBUS on protection failures.

For situations where kernel cannot handle page fault due to resource
limit enforcement, SIGBUS with a new error code BUS_OBJERR is
delivered.  Also, provide a new error code SEGV_PKUERR for SIGSEGV on
amd64 due to protection key access violation.

vm_fault_hold() is renamed to vm_fault().  Fixed some nits in
trap_pfault()s like mis-interpreting Mach errors as errnos.  Removed
unneeded truncations of the fault addresses reported by hardware.

PR:	211924
Reviewed by:	alc
Discussed with:	jilles, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D21566
2019-09-27 18:43:36 +00:00
Conrad Meyer
407c48f060 amd64 pmap: Clarify largemap bootverbose message units
A PML4 covers 512 gigabytes, not gigabits.  Use the typical B suffix for
bytes.  No functional change.

Sponsored by:	Dell EMC Isilon
2019-09-26 01:51:55 +00:00
Conrad Meyer
f7b69dd986 amd64: Expose vm.pmap.large_map_pml4_entries as a sysctl node
It's nice to have sysctl nodes for tunables.

Sponsored by:	Dell EMC Isilon
2019-09-26 01:50:26 +00:00
Kyle Evans
d19f028e33 sysent: regenerate after r352693 2019-09-25 17:30:28 +00:00
Mark Johnston
b119329d81 Complete the removal of the "wire_count" field from struct vm_page.
Convert all remaining references to that field to "ref_count" and update
comments accordingly.  No functional change intended.

Reviewed by:	alc, kib
Sponsored by:	Intel, Netflix
Differential Revision:	https://reviews.freebsd.org/D21768
2019-09-25 16:11:35 +00:00
Mark Johnston
2ff730191d Set NX on some non-leaf direct map page table entries.
The direct map is never used for execution of code, so we might as well
set NX in the direct map's PML4Es.  Also clarify the intent of the code
in create_pagetables() that restricts access protections on the region
of the direct map mapping the kernel text.

Reviewed by:	alc, kib (previous version)
MFC after:	1 week
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21759
2019-09-23 14:19:41 +00:00
Mark Johnston
38dae42c26 Use elf_relocaddr() when handling R_X86_64_RELATIVE relocations.
This is required for DPCPU and VNET data variable definitions to work when
KLDs are linked as DSOs.  R_X86_64_RELATIVE relocations should not appear
in object files, so assert this in elf_relocaddr().

Reviewed by:	kib
MFC after:	1 month
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21755
2019-09-23 14:14:43 +00:00
Mark Johnston
751727948a Set NX in mappings created by pmap_kenter() and pmap_kenter_attr().
There does not appear to be any existing need for such mappings to be
executable.

Reviewed by:	alc, kib
MFC after:	1 month
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21754
2019-09-23 14:11:59 +00:00
Konstantin Belousov
a5181a86a2 amd64: minor tweaks to pat decoding in sysctl vm.pmap.kernel_maps.
Decode PAT_UNCACHED.
When unknown pat mode is encountered, print the pte bits combination
instead of the index, which is always 8.

Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D21738
2019-09-22 19:20:37 +00:00
Konstantin Belousov
6320faed10 amd64 pmap: Fix formats for 64bit addresses in ddb and sysctl output.
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D21737
2019-09-21 17:59:15 +00:00
Mark Johnston
38a20ba16f Fix a couple of nits in r352110.
- Remove a dead variable from the amd64 pmap_extract_and_hold().
- Fix grammar in the vm_page_wire man page.

Reported by:	alc
Reviewed by:	alc, kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21639
2019-09-16 15:06:19 +00:00
Mark Johnston
e8bcf6966b Revert r352406, which contained changes I didn't intend to commit. 2019-09-16 15:04:45 +00:00
Mark Johnston
41fd4b9422 Fix a couple of nits in r352110.
- Remove a dead variable from the amd64 pmap_extract_and_hold().
- Fix grammar in the vm_page_wire man page.

Reported by:	alc
Reviewed by:	alc, kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21639
2019-09-16 15:03:12 +00:00
Ed Maste
a8ce8f0ae5 Update comments and ordering in linux*_dummy.c
- sort alphabetically
- getcpu arrived in Linux 2.6.19
- fanotify_* arrived in 2.6.36
2019-09-11 17:56:48 +00:00
Ed Maste
eff9d7b799 linuxulator: memfd_create first appeared in Linux 3.17
Reference: http://man7.org/linux/man-pages/man2/memfd_create.2.html
2019-09-11 17:05:49 +00:00
Ed Maste
1515fe49f2 linuxulator: seccomp syscall first appeared in Linux 3.17
Reference: http://man7.org/linux/man-pages/man2/seccomp.2.html
2019-09-11 17:04:13 +00:00
Ed Maste
2eb6ef203a linux: add trivial renameat2 implementation
Just return EINVAL if flags != 0.  The Linux man page documents one
case of EINVAL as "The filesystem does not support one of the flags in
flags."

After r351723 userland binaries will try using new system calls.

Reported by:	mjg
Reviewed by:	mjg, trasz
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21590
2019-09-11 13:01:59 +00:00
Ed Maste
65ab1fdd21 regen linuxulator sysent after r352208 2019-09-11 12:58:53 +00:00
Ed Maste
427b1baec0 make linux_renameat2 args consistent with linux_renameat
Use 'dfd' consistently for a directory fd.
2019-09-11 12:58:06 +00:00
Mark Johnston
fee2a2fa39 Change synchonization rules for vm_page reference counting.
There are several mechanisms by which a vm_page reference is held,
preventing the page from being freed back to the page allocator.  In
particular, holding the page's object lock is sufficient to prevent the
page from being freed; holding the busy lock or a wiring is sufficent as
well.  These references are protected by the page lock, which must
therefore be acquired for many per-page operations.  This results in
false sharing since the page locks are external to the vm_page
structures themselves and each lock protects multiple structures.

Transition to using an atomically updated per-page reference counter.
The object's reference is counted using a flag bit in the counter.  A
second flag bit is used to atomically block new references via
pmap_extract_and_hold() while removing managed mappings of a page.
Thus, the reference count of a page is guaranteed not to increase if the
page is unbusied, unmapped, and the object's write lock is held.  As
a consequence of this, the page lock no longer protects a page's
identity; operations which move pages between objects are now
synchronized solely by the objects' locks.

The vm_page_wire() and vm_page_unwire() KPIs are changed.  The former
requires that either the object lock or the busy lock is held.  The
latter no longer has a return value and may free the page if it releases
the last reference to that page.  vm_page_unwire_noq() behaves the same
as before; the caller is responsible for checking its return value and
freeing or enqueuing the page as appropriate.  vm_page_wire_mapped() is
introduced for use in pmap_extract_and_hold().  It fails if the page is
concurrently being unmapped, typically triggering a fallback to the
fault handler.  vm_page_wire() no longer requires the page lock and
vm_page_unwire() now internally acquires the page lock when releasing
the last wiring of a page (since the page lock still protects a page's
queue state).  In particular, synchronization details are no longer
leaked into the caller.

The change excises the page lock from several frequently executed code
paths.  In particular, vm_object_terminate() no longer bounces between
page locks as it releases an object's pages, and direct I/O and
sendfile(SF_NOCACHE) completions no longer require the page lock.  In
these latter cases we now get linear scalability in the common scenario
where different threads are operating on different files.

__FreeBSD_version is bumped.  The DRM ports have been updated to
accomodate the KPI changes.

Reviewed by:	jeff (earlier version)
Tested by:	gallatin (earlier version), pho
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20486
2019-09-09 21:32:42 +00:00
Edward Tomasz Napierala
36c03d045a Improve debugging output.
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2019-09-04 18:00:03 +00:00
Mark Johnston
f97bf60469 Fix some nits in pmap_page_array_startup().
- Use ptoa() instead of the archaic ctob().
- Use pagezero() to zero a PDP page.
- Remove PA_MIN_ADDRESS, orphaned by r351742.
- Remove unneeded parens and an unnecessary control flow statement.

Reported by:	alc
Reviewed by:	alc, kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21495
2019-09-03 22:26:01 +00:00
Edward Tomasz Napierala
ee6da5cee7 Unbreak Linux binaries linked against new glibc, such as the ones
from recent Ubuntu versions.  Without it they segfault on startup.

Reviewed by:	emaste
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20687
2019-09-03 19:48:23 +00:00
Mark Johnston
9d75f0dc75 Map the vm_page array into KVA on amd64.
r351198 allows the kernel to use domain-local memory to back the vm_page
array (up to 2MB boundaries) and reserves a separate PML4 entry for that
purpose.  One consequence of that change is that the vm_page array is no
longer present in minidumps, which only adds pages mapped above
VM_MIN_KERNEL_ADDRESS.

To avoid the friction caused by having kernel data structures mapped
below VM_MIN_KERNEL_ADDRESS, map the vm_page array starting at
VM_MIN_KERNEL_ADDRESS instead of using a dedicated PML4 entry.

Reviewed by:	kib
Discussed with:	jeff
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21491
2019-09-03 13:18:51 +00:00
Mark Johnston
209f2e9838 Add a sysctl to dump kernel mappings and their properties on amd64.
The sysctl is called vm.pmap.kernel_maps.  It dumps address ranges
and their corresponding protection and mapping mode, as well as
counts of 2MB and 1GB pages in the range.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21380
2019-09-02 21:57:57 +00:00
Mark Johnston
87044fca73 Replace PMAP_LARGEMAP_MAX_ADDRESS() with a more general predicate.
No functional change intended.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2019-09-02 21:54:08 +00:00
John Baldwin
6a1e1c2c48 Simplify bhyve vlapic ESR logic.
The bhyve virtual local APIC uses an instance-global flag to indicate
when an error LVT is being delivered to prevent infinite recursion.
Use a function argument instead to reduce the amount of instance-global
state.

This was inspired by reviewing the bhyve save/restore work, which
saves a copy of the instance-global state for each vlapic.

Smart OS bug:	https://smartos.org/bugview/OS-7777
Submitted by:	Patrick Mooney
Reviewed by:	markj, rgrimes
Obtained from:	SmartOS / Joyent
Differential Revision:	https://reviews.freebsd.org/D20365
2019-08-29 18:23:38 +00:00
Konstantin Belousov
a2a0f90654 Centralize __pcpu definitions.
Many extern struct pcpu <something>__pcpu declarations were
copied/pasted in sources.  The issue is that the definition is MD, but
it cannot be provided by machine/pcpu.h due to actual struct pcpu
defined in sys/pcpu.h later than the inclusion of machine/pcpu.h.
This forced the copying when other code needed direct access to
__pcpu.  There is no way around it, due to machine/pcpu.h supplying
part of struct pcpu fields.

To work around the problem, add a new machine/pcpu_aux.h header, which
should fill any needed MD definitions after struct pcpu definition is
completed. This allows to remove copies of __pcpu spread around the
source.  Also on x86 it makes it possible to remove work arounds like
OFFSETOF_CURTHREAD or clang specific warnings supressions.

Reported and tested by:	lwhsu, bcran
Reviewed by:	imp, markj (previous version)
Discussed with:	jhb
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D21418
2019-08-29 07:25:27 +00:00
John Baldwin
e08087ee43 Use get_pcpu() to fetch the current CPU's pcpu pointer.
This avoids encoding knowledge about how pcpu objects are allocated and is
also a few instructions shorter.

MFC after:	2 weeks
2019-08-28 23:40:57 +00:00
Mateusz Guzik
73dd84a7a5 amd64: clean up cpu_switch.S
- LK macro (conditional on SMP for the lock prefix) is unused
- SETLK unnecessarily performs xchg. obtained value is never used and the
  implicit lock prefix adds avoidable cost. Barrier provided by it does
  not appear to be of any use.
- the lock waited for is almost never blocked, yet the loop starts with
  a pause. Move it out of the common case.

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D19563
2019-08-28 19:40:57 +00:00
Konstantin Belousov
c6a810919b amd64: loose constraints on the APs dpcpu and nmi/dbg stack allocations.
Use DOMAINSET_PREF() instead of DOMAINSET_FIXED(), to gracefully
fallback in case of memory-less domain.

Reported and tested by:	bcran
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
2019-08-25 21:01:40 +00:00
Konstantin Belousov
46213cc528 amd64: If domain-local page for pcpu cannot be allocated, keep use
existing one.

Allocation failure is possible for instance when cpu domain has no memory.

Reported and tested by:	bcran
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
2019-08-25 20:57:39 +00:00
Konstantin Belousov
10ff5eeb04 amd64: rework PCPU allocation
Move pcpu KVA out of .bss into dynamically allocated VA at
pmap_bootstrap().  This avoids demoting superpage mapping .data/.bss.
Also it makes possible to use pmap_qenter() for installation of
domain-local pcpu page on NUMA configs.

Refactor pcpu and IST initialization by moving it to helper functions.

Reviewed by:	markj
Tested by:	pho
Discussed with:	jeff
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D21320
2019-08-24 15:31:31 +00:00
Konstantin Belousov
ec9662fe61 Do not constrain allocations for doublefault, boot, and mce stacks.
All these stacks are used only once (doublefault, boot) or very rare
(mce).

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D21320
2019-08-24 15:28:40 +00:00
Konstantin Belousov
cb0e752f8e Style. 2019-08-24 15:25:53 +00:00
Konstantin Belousov
5865e1b91b Remove unecessary VM_ALLOC_ZERO from allocation of the domain-local
page for pcpu.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D21320
2019-08-24 15:22:18 +00:00
Conrad Meyer
799176810a gdb(4):amd64: Bump MI GDB_BUFSZ for more efficient transfers
A bigger buffer reduces the RTTs to transfer long messages and is otherwise
relatively harmless, especially on systems with plenty of memory.
2019-08-22 00:35:17 +00:00
Jeff Roberson
a5e5548c88 Allocate all per-cpu datastructures in domain correct memory.
Reviewed by:	kib, gallatin (some objections)
Tested by:	pho
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21242
2019-08-18 23:44:23 +00:00
Jeff Roberson
3e5e1b5135 Allocate amd64's page array using pages and page directory pages from the
NUMA domain that the pages describe.  Patch original from gallatin.

Reviewed by:	kib
Tested by:	pho
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21252
2019-08-18 23:07:56 +00:00
Jeff Roberson
2194393787 Move phys_avail definition into MI code. It is consumed in the MI layer and
doing so adds more flexibility with less redundant code.

Reviewed by:	jhb, markj, kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21250
2019-08-16 00:45:14 +00:00
Ed Maste
ba084c18de sys/{x86,amd64}: remove one of doubled ;s
MFC after:	1 week
2019-08-13 19:39:36 +00:00
Warner Losh
0d89c934cb Start to split out the really x86 specific NOTES from the global notes file.
Start with COMPAT_43, since it's really only relevant to x86.

Reviewed by: jhb@
Differential Revision: https://reviews.freebsd.org/D21203
2019-08-12 22:58:13 +00:00
Mark Johnston
13a7c4d478 Use designated initializers for vmm_ops.
MFC after:	3 days
2019-08-07 19:45:44 +00:00
Konstantin Belousov
90e35b0a98 amd64: prevents speculations over swapgs reload of %gs base.
Such speculations could use user-controlled %gs base, esp. since
FreeBSD supports WRGSBASE instructions.

Place LFENCEs on entry for each basic block after the test for
previous kernel/user mode on the kernel entry, which prevents the
speculation.  Code accesses %gs-based PCPU before any serialization
instructions are executed, like %cr3 reload for KPTI.

With pti disabled, on haswell i7-4770S machine, "syscall_timings getppid"
shows when no lfence is added to syscall path:
test	loop	time	iterations	periteration
getppid	0	1.040918865	4643611	0.000000224
getppid	1	1.004985962	4481816	0.000000224
getppid	2	1.005196483	4482363	0.000000224
with lfence:
getppid	0	1.043701091	4554779	0.000000229
getppid	1	1.016930328	4438094	0.000000229
getppid	2	1.023223117	4466640	0.000000229
and ministat reports 'No difference proven at 95.0% confidence.'

Security:	CVE-2019-1125
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-08-06 16:53:25 +00:00
Konstantin Belousov
1947b29861 amd64: Streamline exceptions and interrupts handlers.
PTI-mode entry points were coded to set up the environment identical
to non-PTI entry and then fall-through to non-PTI handlers, mostly.
This has the drawback of requiring two more SWAPGS, first to access
PCPU, and then to return to the state expected by the non-PTI entry
point.

Eliminate the duplication by doing more in entry stubs both for PTI
and non-PTI, and adjusting the common code to expect that SWAPGS and
some minimal registers saving is done by entries.

Some less often used entries, in particular, #GP, #NP, and #SS, which
can fault on doreti, are left as is because there are basically four
variants of entrance, and they are not performance-critical,
esp. comparing with e.g. #PF or interrupts.

Reviewed by:	markj (previous version)
Tested by:	pho (previous version)
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2019-08-03 17:07:04 +00:00
Konstantin Belousov
e550631697 bhyve: Ignore MSI/MSI-X interrupts sent to non-active vCPUs in
physical destination mode.

This is mostly a nop, because the vmm initializes all vCPUs up to
vm_maxcpus, so even if the target CPU is not active, lapic/vlapic code
still has the valid data to use.  As John notes, dropping such
interrupts more closely matches the real harware, which ignores all
interrupts for not started APs.

Reviewed by:	jhb
admbugs:	837
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2019-08-03 16:57:14 +00:00
John Baldwin
c45cbc7a1f Don't reset memory attributes when mapping physical addresses for ACPI.
Previously, AcpiOsMemory was using pmap_mapbios which would always map
the requested address Write-Back (WB).  For several AMD Ryzen laptops,
the BIOS uses AcpiOsMemory to directly access the PCI MCFG region in
order to access PCI config registers.  This has the side effect of
remapping the MCFG region in the direct map as WB instead of UC
hanging the laptops during boot.

On the one laptop I examined in detail, the _PIC global method used to
switch from 8259A PICs to I/O APICs uses a pair of PCI config space
registers at offset 0x84 in the device at 0:0:0 to as a pair of
address/data registers to access an indirect register in the chipset
and clear a single bit to switch modes.

To fix, alter the semantics of pmap_mapbios() such that it does not
modify the attributes of any existing mappings and instead uses the
existing attributes.  If a new mapping is created, this new mapping
uses WB (the default memory attribute).

Special thanks to the gentleman whose name I don't have who brought
two affected laptops to the hacker lounge at BSDCan.  Direct access to
the affected systems permitted finding the root cause within an hour
or so.

PR:		231760, 236899
Reviewed by:	kib, alc
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D20327
2019-08-03 01:36:05 +00:00
Ed Maste
490d56c527 vmx: use C99 bool, not boolean_t
Bhyve's vmm is a self-contained modern component and thus a good
candidate for use of C99 types.

Reviewed by:	jhb, kib, markj, Patrick Mooney
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21036
2019-08-01 02:16:48 +00:00
Konstantin Belousov
fc83c5a7d0 Make randomized stack gap between strings and pointers to argv/envs.
This effectively makes the stack base on the csu _start entry
randomized.

The gap is enabled if ASLR is for the ABI is enabled, and then
kern.elf{64,32}.aslr.stack_gap specify the max percentage of the
initial stack size that can be wasted for gap.  Setting it to zero
disables the gap, and max is capped at 50%.

Only amd64 for now.

Reviewed by:	cem, markj
Discussed with:	emaste
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D21081
2019-07-31 20:23:10 +00:00
Alan Cox
43ded0a321 In pmap_advise(), when we encounter a superpage mapping, we first demote the
mapping and then destroy one of the 4 KB page mappings so that there is a
potential trigger for repromotion.  Currently, we destroy the first 4 KB
page mapping that falls within the (current) superpage mapping or the
virtual address range [sva, eva).  However, I have found empirically that
destroying the last 4 KB mapping produces slightly better results,
specifically, more promotions and fewer failed promotion attempts.
Accordingly, this revision changes pmap_advise() to destroy the last 4 KB
page mapping.  It also replaces some nearby uses of boolean_t with bool.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D21115
2019-07-31 05:38:39 +00:00
Ed Maste
305b9efefc linuxulator: rename linux_locore.s to .asm
It is assembled using "${CC} -x assembler-with-cpp", which by convention
(bsd.suffixes.mk) uses the .asm extension.

This is a portion of the review referenced below (D18344).  That review
also renamed linux_support.s to .S, but that is a functional change
(using the compiler's integrated assembler instead of as) and will be
revisited separately.

MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18344
2019-07-30 17:18:31 +00:00
Xin LI
d4565741c6 Remove gzip'ed a.out support.
The current implementation of gzipped a.out support was based
on a very old version of InfoZIP which ships with an ancient
modified version of zlib, and was removed from the GENERIC
kernel in 1999 when we moved to an ELF world.

PR:		205822
Reviewed by:	imp, kib, emaste, Yoshihiro Ota <ota at j.email.ne.jp>
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D21099
2019-07-30 05:13:16 +00:00
Alan Cox
5d18382b72 Simplify the handling of superpages in pmap_clear_modify(). Specifically,
if a demotion succeeds, then all of the 4KB page mappings within the
superpage-sized region must be valid, so there is no point in testing the
validity of the 4KB page mapping that is going to be write protected.

Deindent the nearby code.

Reviewed by:	kib, markj
Tested by:	pho (amd64, i386)
X-MFC after:	r350004 (this change depends on arm64 dirty bit emulation)
Differential Revision:	https://reviews.freebsd.org/D21027
2019-07-25 22:02:55 +00:00
John Baldwin
87c39157c6 Improve the precision of bhyve's vPIT.
Use 'struct bintime' instead of 'sbintime_t' to manage times in vPIT
to postpone rounding to final results rather than intermediate
results.  In tests performed by Joyent, this reduced the error measured
by Linux guests by 59 ppm.

Smart OS bug:	https://smartos.org/bugview/OS-6923
Submitted by:	Patrick Mooney
Reviewed by:	rgrimes
Obtained from:	SmartOS / Joyent
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D20335
2019-07-20 15:59:49 +00:00
John Baldwin
c18ca74916 Don't pass error from syscallenter() to syscallret().
syscallret() doesn't use error anymore.  Fix a few other places to permit
removing the return value from syscallenter() entirely.
- Remove a duplicated assertion from arm's syscall().
- Use td_errno for amd64_syscall_ret_flush_l1d.

Reviewed by:	kib
MFC after:	1 month
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D2090
2019-07-15 21:25:16 +00:00
Konstantin Belousov
026e450262 Fix syntax.
Nod from:	jhb
Sponsored by:	The FreeBSD Foundation
2019-07-12 19:14:52 +00:00
Konstantin Belousov
30b3018d48 Provide protection against starvation of the ll/sc loops when accessing userpace.
Casueword(9) on ll/sc architectures must be prepared for userspace
constantly modifying the same cache line as containing the CAS word,
and not loop infinitely.  Otherwise, rogue userspace livelocks the
kernel.

To fix the issue, change casueword(9) interface to return new value 1
indicating that either comparision or store failed, instead of relying
on the oldval == *oldvalp comparison.  The primitive no longer retries
the operation if it failed spuriously.  Modify callers of
casueword(9), all in kern_umtx.c, to handle retries, and react to
stops and requests to terminate between retries.

On x86, despite cmpxchg should not return spurious failures, we can
take advantage of the new interface and just return PSL.ZF.

Reviewed by:	andrew (arm64, previous version), markj
Tested by:	pho
Reported by:	https://xenbits.xen.org/xsa/advisory-295.txt
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D20772
2019-07-12 18:43:24 +00:00
Scott Long
422a8a4d3a Tie the name limit of a VM to SPECNAMELEN from devfs instead of a
hard-coded value. Don't allocate space for it from the kernel stack.
Account for prefix, suffix, and separator space in the name. This
takes the effective length up to 229 bytes on 13-current, and 37 bytes
on 12-stable. 37 bytes is enough to hold a full GUID string.

PR:		234134
MFC after:	1 week
Differential Revision:	http://reviews.freebsd.org/D20924
2019-07-12 18:37:56 +00:00
Mark Johnston
eeacb3b02f Merge the vm_page hold and wire mechanisms.
The hold_count and wire_count fields of struct vm_page are separate
reference counters with similar semantics.  The remaining essential
differences are that holds are not counted as a reference with respect
to LRU, and holds have an implicit free-on-last unhold semantic whereas
vm_page_unwire() callers must explicitly determine whether to free the
page once the last reference to the page is released.

This change removes the KPIs which directly manipulate hold_count.
Functions such as vm_fault_quick_hold_pages() now return wired pages
instead.  Since r328977 the overhead of maintaining LRU for wired pages
is lower, and in many cases vm_fault_quick_hold_pages() callers would
swap holds for wirings on the returned pages anyway, so with this change
we remove a number of page lock acquisitions.

No functional change is intended.  __FreeBSD_version is bumped.

Reviewed by:	alc, kib
Discussed with:	jeff
Discussed with:	jhb, np (cxgbe)
Tested by:	pho (previous version)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D19247
2019-07-08 19:46:20 +00:00
Edward Tomasz Napierala
8037385293 Add support for PTRACE_O_TRACEEXIT to linuxulator ptrace(2).
This fixes strace 4.25 from Ubuntu 19.04.

Reviewed by:	emaste
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20689
2019-07-04 19:46:58 +00:00
Edward Tomasz Napierala
de7ea4e414 Implement PTRACE_GETSIGINFO. This makes Linux strace(1) quieter
in some cases (strace -f man id > /dev/null).

Reviewed by:	dchagin
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20691
2019-07-04 19:44:13 +00:00
Alexander Motin
6683132d54 Add driver for NTB in AMD SoC.
This patch is the driver for NTB hardware in AMD SoCs (ported from Linux)
and enables the NTB infrastructure like Doorbells, Scratchpads and Memory
window in AMD SoC. This driver has been validated using ntb_transport and
if_ntb driver already available in FreeBSD.

Submitted by:	Rajesh Kumar <rajesh1.kumar@amd.com>
MFC after:	1 month
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D18774
2019-07-02 05:25:18 +00:00
Alan Cox
b6ce9ba9c3 Tidy up pmap_copy(). Notably, deindent the innermost loop by making a
simple change to the control flow.  Replace an unnecessary test by a
KASSERT.  Add a comment explaining an obscure test.

Reviewed by:	kib, markj
MFC after:	3 weeks
Differential Revision:	https://reviews.freebsd.org/D20812
2019-07-01 22:00:42 +00:00
Andriy Gapon
e3722b788e add superio driver
The goal of this driver is consolidate information about SuperIO chips
and to provide for peaceful coexistence of drivers that need to access
SuperIO configuration registers.

While SuperIO chips can host various functions most of them are
discoverable and accessible without any knowledge of the SuperIO.
Examples are: keyboard and mouse controllers, UARTs, floppy disk
controllers.  SuperIO-s also provide non-standard functions such as
GPIO, watchdog timers and hardware monitoring.  Such functions do
require drivers with a knowledge of a specific SuperIO.

At this time the driver supports a number of ITE and Nuvoton (fka
Winbond) SuperIO chips.
There is a single driver for all devices.  So, I have not done the usual
split between the hardware driver and the bus functionality.  Although,
superio does act as a bus for devices that represent known non-standard
functions of a SuperIO chip.  The bus provides enumeration of child
devices based on the hardcoded knowledge of such functions.  The
knowledge as extracted from datasheets and other drivers.
As there is a single driver, I have not defined a kobj interface for it.
So, its interface is currently made of simple functions.
I think that we can the flexibility (and complications) when we actually
need it.

I am planning to convert nctgpio and wbwd to superio bus very soon.
Also, I am working on itwd driver (watchdog in ITE SuperIO-s).
Additionally, there is ithwm driver based on the reverted sensors
import, but I am not sure how to integrate it given that we still lack
any sensors interface.

Discussed with:	imp, jhb
MFC after:	7 weeks
Differential Revision: https://reviews.freebsd.org/D8175
2019-07-01 17:05:41 +00:00
Navdeep Parhar
57f317e60a Display the approximate space needed when a minidump fails due to lack
of space.

Reviewed by:	kib@
MFC after:	2 weeks
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D20801
2019-06-30 03:14:04 +00:00
Alan Cox
c134ef742f When we protect PTEs (as opposed to PDEs), we only call vm_page_dirty()
when, in fact, we are write protecting the page and the PTE has PG_M set.
However, pmap_protect_pde() was always calling vm_page_dirty() when the PDE
has PG_M set.  So, adding PG_NX to a writeable PDE could result in
unnecessary (but harmless) calls to vm_page_dirty().

Simplify the loop calling vm_page_dirty() in pmap_protect_pde().

Reviewed by:	kib, markj
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D20793
2019-06-28 22:40:34 +00:00
Rodney W. Grimes
e4da41f932 Emulate the "TEST r/m{16,32,64}, imm{16,32,32}" instructions (opcode F7H).
This adds emulation for:
	test r/m16, imm16
	test r/m32, imm32
	test r/m64, imm32 sign-extended to 64

OpenBSD guests compiled with clang 8.0.0 use TEST directly against a
Local APIC register instead of separate read via MOV followed by a
TEST against the register.

PR:		238794
Submitted by:	jhb
Reported by:	Jason Tubnor jason@tubnor.net
Tested by:	Jason Tubnor jason@tubnor.net
Reviewed by:	markj, Patrick Mooney patrick.mooney@joyent.com
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D20755
2019-06-26 21:19:43 +00:00
Mark Johnston
0fd977b3fa Add a return value to vm_page_remove().
Use it to indicate whether the page may be safely freed following
its removal from the object.  Also change vm_page_remove() to assume
that the page's object pointer is non-NULL, and have callers perform
this check instead.

This is a step towards an implementation of an atomic reference counter
for each physical page structure.

Reviewed by:	alc, dougm, kib
MFC after:	1 week
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20758
2019-06-26 17:37:51 +00:00
Konstantin Belousov
7256d0fcfd amd64 pmap: Fix pkru handling in pmap_remove().
When pmap_pkru_on_remove() is called, the sva argument value was
advanced.  Clear PKRU earlier when sva still specifies the start of
the region.

Noted and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2019-06-26 17:16:26 +00:00
Konstantin Belousov
d8ddb98a5e amd64 pmap: block on turnstile for lock-less DI.
Port the code to block on turnstile instead of yielding, to lock-less
delayed invalidation. The yield might cause tight loop due to priority
inversion.

Since it is impossible to avoid race between block and wake-up, arm
1-tick callout to wakeup when thread blocks itself.

Reported and tested by:	mjg
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 months
Differential revision:	https://reviews.freebsd.org/D20636
2019-06-23 21:21:11 +00:00
Conrad Meyer
c363b16c63 sys: Remove DEV_RANDOM device option
Remove 'device random' from kernel configurations that reference it (most).
Replace perhaps mistaken 'nodevice random' in two MIPS configs with 'options
RANDOM_LOADABLE' instead.  Document removal in UPDATING; update NOTES and
random.4.

Reviewed by:	delphij, markm (previous version)
Approved by:	secteam(delphij)
Differential Revision:	https://reviews.freebsd.org/D19918
2019-06-21 00:16:30 +00:00
Scott Long
da761f3b1f Implement VT-d capability detection on chipsets that have multiple
translation units with differing capabilities

From the author via Bugzilla:
---
When an attempt is made to passthrough a PCI device to a bhyve VM
(causing initialisation of IOMMU) on certain Intel chipsets using
VT-d the PCI bus stops working entirely. This issue occurs on the
E3-1275 v5 processor on C236 chipset and has also been encountered
by others on the forums with different hardware in the Skylake
series.

The chipset has two VT-d translation units. The issue is caused by
an attempt to use the VT-d device-IOTLB capability that is
supported by only the first unit for devices attached to the
second unit which lacks that capability. Only the capabilities of
the first unit are checked and are assumed to be the same for all
units.

Attached is a patch to rectify this issue by determining which
unit is responsible for the device being added to a domain and
then checking that unit's device-IOTLB capability. In addition to
this a few fixes have been made to other instances where the first
unit's capabilities are assumed for all units for domains they
share. In these cases a mutual set of capabilities is determined.
The patch should hopefully fix any bugs for current/future
hardware with multiple translation units supporting different
capabilities.

A description is on the forums at
https://forums.freebsd.org/threads/pci-passthrough-bhyve-usb-xhci.65235
The thread includes observations by other users of the bug
occurring, and description as well as confirmation of the fix.
I'd also like to thank Ordoban for their help.

---
Personally tested on a Skylake laptop, Skylake Xeon server, and
a Xeon-D-1541, passing through XHCI and NVMe functions.  Passthru
is hit-or-miss to the point of being unusable without this
patch.

PR: 229852
Submitted by: callum@aitchison.org
MFC after: 1 week
2019-06-19 06:41:07 +00:00
Alan Cox
fd2dae0a30 Implement an alternative solution to the amd64 and i386 pmap problem that we
previously addressed in r348246.

This pmap problem also exists on arm64 and riscv.  However, the original
solution developed for amd64 and i386 cannot be used on arm64 and riscv.  In
particular, arm64 and riscv do not define a PG_PROMOTED flag in their level
2 PTEs.  (A PG_PROMOTED flag makes no sense on arm64, where unlike x86 or
riscv we are required to break the old 4KB mappings before making the 2MB
mapping; and on riscv there are no unused bits in the PTE to define a
PG_PROMOTED flag.)

This commit implements an alternative solution that can be used on all four
architectures.  Moreover, this solution has two other advantages.  First, on
older AMD processors that required the Erratum 383 workaround, it is less
costly.  Specifically, it avoids unnecessary calls to pmap_fill_ptp() on a
superpage demotion.  Second, it enables the elimination of some calls to
pagezero() in pmap_kernel_remove_{l2,pde}().

In addition, remove a related stale comment from pmap_enter_{l2,pde}().

Reviewed by:	kib, markj (an earlier version)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D20538
2019-06-09 03:36:10 +00:00
Konstantin Belousov
2f73a6e9c4 Correct definition for PGEX_SGX.
At the moment it is only used for page fault error code textual
representation.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2019-06-08 20:26:04 +00:00
Konstantin Belousov
8b49f4dd80 Make trap_msg array constant as well.
Suggested by:	tijl
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-06-08 19:50:57 +00:00
Konstantin Belousov
99b81dcb9e Remove lazy FPU switch support from amd64.
It is incompatible with some future features.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-06-08 16:03:34 +00:00
Konstantin Belousov
c7228026a8 amd64 trap.c: Modernize syntax around trap_msg[].
Convert the array to use C99 initializers.
Make it constant.
Replace MAX_TRAP_MSG with nitems().

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-06-08 13:40:57 +00:00
Mark Johnston
88ea538a98 Replace uses of vm_page_unwire(m, PQ_NONE) with vm_page_unwire_noq(m).
These calls are not the same in general: the former will dequeue the
page if it is enqueued, while the latter will just leave it alone.  But,
all existing uses of the former apply to unmanaged pages, which are
never enqueued in the first place.  No functional change intended.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20470
2019-06-07 18:23:29 +00:00
Mark Johnston
c080655467 Fix a race between fasttrap and the user breakpoint handler.
When disabling the last enabled userspace probe, fasttrap clears the
function pointers which hook in to the breakpoint handler.  If a traced
thread hit a fasttrap breakpoint before it was removed, we must ensure
that it is able to call the hook; otherwise fasttrap will not consume
the trap and SIGTRAP will be delievered to the thread.  Synchronize
with such threads by ensuring that they load the hook pointer with
interrupts disabled, and by completing an SMP rendezvous after removing
breakpoints and before clearing the pointers.

Reported by:	Alexander Alexeev <Alexander.Alexeev@dell.com>
Tested by:	Alexander Alexeev (earlier version)
Reviewed by:	cem, kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20526
2019-06-06 16:03:25 +00:00
John Baldwin
0d1fd6e541 Support MSI-X for passthrough devices with a separate PBA BAR.
pci_alloc_msix() requires both the table and PBA BARs to be allocated
by the driver.  ppt was only allocating the table BAR so would fail
for devices with the PBA in a separate BAR.  Fix this by allocating
the PBA BAR before pci_alloc_msix() if it is stored in a separate BAR.

While here, release BARs after calling pci_release_msi() instead of
before.  Also, don't call bus_teardown_intr() in error handling code
if bus_setup_intr() has just failed.

Reported by:	gallatin
Tested by:	gallatin
Reviewed by:	rgrimes, markj
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D20525
2019-06-05 19:30:32 +00:00
Alan Cox
6f1412f800 The changes to pmap_demote_pde_locked()'s control flow in r348476 resulted
in the loss of a KASSERT that guarded against the invalidation a wired
mapping.  Restore this KASSERT.

Remove an unnecessary KASSERT from pmap_demote_pde_locked().  It guards
against a state that was already handled at the start of the function.

Reviewed by:	kib
X-MFC with:	r348476
2019-06-04 16:21:14 +00:00
Konstantin Belousov
185f7e0a9d amd64 ef_rt_arch_call: Preserve %rflags around call into EFI RT service.
If service code faulted, we might end up unwinding with interrupts
disabled.  Top-level kernel code should have interrupts enabled, which
is enforced by checks.

Save %rflags before entering EFI, and restore to the known good value
on return.  This handles situation with disabled interrupts on fault
and perhaps other potential bugs, e.g. invalid value for PSL_D.

Reported and tested by:	Jan Martin Mikkelsen <janm@transactionware.com>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-06-03 15:32:42 +00:00
Konstantin Belousov
85d76c38ba Simplify flow of pmap_demote_pde_locked() and add more comprehensive
debugging checks.

In particular,
- Move the code to handle failure to allocate page table page into
  a helper.
- After the previous item is done, it is possible to distinguish !PG_A
  case and case of missed page, in the control flow.
- Make the variable to indicate that in-kernel mapping is demoted.
- Assert that missed page table page can only happen for in-kernel
  mapping when demoting direct map.
- If DIAGNOSTIC is enabled, and the page table page should be already
  filled, check all ptes instead of only first one.

Reviewed by:	alc, markj
Tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D20266
2019-05-31 18:53:04 +00:00
Brooks Davis
4af6033324 makesyscalls.sh: always use absolute path for syscalls.conf
syscalls.conf is included using "." which per the Open Group:

 If file does not contain a <slash>, the shell shall use the search
 path specified by PATH to find the directory containing file.

POSIX shells don't fall back to the current working directory.

Submitted by:	Nathaniel Wesley Filardo <nwf20@cl.cam.ac.uk>
Reviewed by:	bdrewery
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D20476
2019-05-30 20:56:23 +00:00
Konstantin Belousov
fcd0c06eee Correct some inconsistencies in the earliest created kernel page
tables which affect demotion.

The last last-level page table under 2M mappings below KERNend was
only partially initialized.  When that page was used as the hardware
page table for demotion of the 2M mapping, the result was not
consistent.  Since pmap_demote_pde() is switched to use PG_PROMOTED as
the test for the validity of the saved last level page table page, we
can keep page table pages zero-initialized instead.  Demotion would
fill them as needed.

Only map the created page tables beyond KERNend, there is no need to
pre-promote PTmap after KERNend, because the extra mapping is not used.

Only round up *firstaddr to 2M boundary when it is below rounded
KERNend.  Sometimes the allocpages() calls advance *firstaddr past the
end of the last 2MB page mapping. In that case, this conditional
avoids wasting an average of 1MB of physical memory.

Update comments to explain action in more clean and direct language.

Reported and tested by:	pho
In collaboration with:	alc
Sponsored by:	The FreeBSD Foundation (kib)
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D20380
2019-05-27 15:21:26 +00:00
Konstantin Belousov
59659e78e1 Fix too loose assert in pmap_large_unmap().
The upper bound for the valid address from the large map used
LARGEMAP_MAX_ADDRESS instead of LARGEMAP_MIN_ADDRESS.  Provide a
function-like macro for proper upper value.

Noted by:	markj
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D20386
2019-05-24 23:28:11 +00:00
Konstantin Belousov
f3dbf2ca4a Add PG_PS_PDP_FRAME symbol.
Similar to PG_FRAME and PG_PS_FRAME, it denotes the mask of the
physical address component of 1G superpage PDP entry.

Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D20386
2019-05-24 23:26:17 +00:00
Konstantin Belousov
a9c7546a1d Fix a corner case in demotion of kernel mappings.
It is possible for the kernel mapping to be created with superpage by
directly installing pde using pmap_enter_2mpage() without filling the
corresponding page table page.  This can happen e.g. if the range is
already backed by reservation and vm_fault_soft_fast() conditions are
satisfied, which was observed on the pipe_map.

In this case, demotion must fill the page obtained from the pmap
radix, same as if the page is newly allocated.  Use PG_PROMOTED bit as
an indicator that the page is valid, instead of the wire count of the
page table page.

Since the PG_PROMOTED bit is set on pde when we leave TLB entries for
4k pages around, which in particular means that the ptes were filled,
it provides more correct indicator.  Note that pmap_protect_pde()
clears PG_PROMOTED, which handles the case when protection was changed
on the superpage without adjusting ptes.

Reported by:	pho
In collaboration with:	alc
Tested by:	alc, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D20380
2019-05-24 17:19:06 +00:00
John Baldwin
bebcdc0073 Add a constant for the LS config MSR on AMD CPUs.
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D19506
2019-05-23 23:37:11 +00:00
Konstantin Belousov
48ec6d3bc9 Do not call hw_mds_recalculate() from initializecpu().
If MDS mitigation is enabled by the tunable but MDS microcode is not
early-loaded, software mitigation is selected.  This causes
initializecpu() to try to allocate memory which makes boot process
very unhappy.

Create SYSINIT that runs sufficiently late to succeed.

Reported by:	naddy
PR:	237968
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-05-21 22:56:21 +00:00
Edward Tomasz Napierala
fb74820239 Make linux_ptrace() use linux_msg() instead of printf().
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2019-05-21 08:23:24 +00:00
Conrad Meyer
e2e050c8ef Extract eventfilter declarations to sys/_eventfilter.h
This allows replacing "sys/eventfilter.h" includes with "sys/_eventfilter.h"
in other header files (e.g., sys/{bus,conf,cpu}.h) and reduces header
pollution substantially.

EVENTHANDLER_DECLARE and EVENTHANDLER_LIST_DECLAREs were moved out of .c
files into appropriate headers (e.g., sys/proc.h, powernv/opal.h).

As a side effect of reduced header pollution, many .c files and headers no
longer contain needed definitions.  The remainder of the patch addresses
adding appropriate includes to fix those files.

LOCK_DEBUG and LOCK_FILE_LINE_ARG are moved to sys/_lock.h, as required by
sys/mutex.h since r326106 (but silently protected by header pollution prior
to this change).

No functional change (intended).  Of course, any out of tree modules that
relied on header pollution for sys/eventhandler.h, sys/lock.h, or
sys/mutex.h inclusion need to be fixed.  __FreeBSD_version has been bumped.
2019-05-20 00:38:23 +00:00
Edward Tomasz Napierala
d49fb289c8 Implement PTRACE_O_TRACESYSGOOD. This makes Linux strace(1) work.
Reviewed by:	dchagin
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20200
2019-05-19 12:58:44 +00:00
John Baldwin
e519cee307 Expose the MD_CLEAR capability used by Intel MDS mitigations to guests.
Submitted by:	Patrick Mooney <pmooney@pfmooney.com>
Reviewed by:	kib
Tested by:	Patrick on SmartOS with Linux and Windows guests
Obtained from:	Joyent
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D20296
2019-05-18 21:20:38 +00:00
Konstantin Belousov
bd0b1de42f Make lock-less delayed invalidation operational very early.
Apparently, there is more code trying to call pmap_remove() early,
mostly to free preloaded memory.  Instead of moving all deallocations
to the point where a scheduler is initialized, add missed setup of
thread0 di init at hammer_time().

The code in pmap_delayed_invl_start_u() is modified to not ever take
the thread lock if the thread priority is less or equal to PVM.  Since
thread0 starts at priority 0, and then is reset to PVM at
proc0_init(), this eliminates taking the thread lock during early
boot.

While there, fix off by one in comparision of the base priority.

Reported and tested by:	bcran (previous version)
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	29 days
2019-05-18 16:19:31 +00:00
Brooks Davis
02fae06a11 FCP-101: Remove wb(4)
Relnotes:	yes
FCP:		https://github.com/freebsd/fcp/blob/master/fcp-0101.md
Reviewed by:	jhb, imp
Differential Revision:	https://reviews.freebsd.org/D20230
2019-05-17 15:24:34 +00:00
Brooks Davis
e8504bf9e7 FCP-101: Remove vx(4).
Relnotes:	yes
FCP:		https://github.com/freebsd/fcp/blob/master/fcp-0101.md
Reviewed by:	jhb, imp
Differential Revision:	https://reviews.freebsd.org/D20230
2019-05-17 15:24:26 +00:00
Brooks Davis
be345ff023 FCP-101: Remove txp(4).
Relnotes:	yes
FCP:		https://github.com/freebsd/fcp/blob/master/fcp-0101.md
Reviewed by:	jhb, imp
Differential Revision:	https://reviews.freebsd.org/D20230
2019-05-17 15:24:17 +00:00
Brooks Davis
b1b1c2fe38 FCP-101: Remove tx(4).
Relnotes:	yes
FCP:		https://github.com/freebsd/fcp/blob/master/fcp-0101.md
Reviewed by:	jhb, imp
Differential Revision:	https://reviews.freebsd.org/D20230
2019-05-17 15:24:08 +00:00
Brooks Davis
7c897ca91f FCP-101: Remove tl(4).
Relnotes:	yes
FCP:		https://github.com/freebsd/fcp/blob/master/fcp-0101.md
Reviewed by:	jhb, imp
Differential Revision:	https://reviews.freebsd.org/D20230
2019-05-17 15:24:00 +00:00
Brooks Davis
3b70dd81f5 FCP-101: Remove sf(4).
Relnotes:	yes
FCP:		https://github.com/freebsd/fcp/blob/master/fcp-0101.md
Reviewed by:	jhb, imp
Differential Revision:	https://reviews.freebsd.org/D20230
2019-05-17 15:23:43 +00:00
Brooks Davis
607790d10f FCP-101: Remove pcn(4).
Relnotes:	yes
FCP:		https://github.com/freebsd/fcp/blob/master/fcp-0101.md
Reviewed by:	jhb, imp
Differential Revision:	https://reviews.freebsd.org/D20230
2019-05-17 15:23:34 +00:00
Brooks Davis
05aa6e583b FCP-101: Remove ed(4).
Relnotes:	yes
FCP:		https://github.com/freebsd/fcp/blob/master/fcp-0101.md
Reviewed by:	jhb, imp
Differential Revision:	https://reviews.freebsd.org/D20230
2019-05-17 15:23:02 +00:00
Brooks Davis
08ac01a92c FCP-101: Remove de(4).
Relnotes:	yes
FCP:		https://github.com/freebsd/fcp/blob/master/fcp-0101.md
Reviewed by:	jhb, imp
Differential Revision:	https://reviews.freebsd.org/D20230
2019-05-17 15:22:54 +00:00
Konstantin Belousov
7c5a46a1bc Remove resolver_qual from DEFINE_IFUNC/DEFINE_UIFUNC macros.
In all practical situations, the resolver visibility is static.

Requested by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:	so (emaste)
Differential revision:	https://reviews.freebsd.org/D20281
2019-05-16 22:20:54 +00:00
Konstantin Belousov
07d7e24cf7 amd64 pmap: sysctl vm.pmap.pcid_save_cnt should be read-only.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-05-16 14:33:32 +00:00
Konstantin Belousov
5b2c83cb59 amd64 pmap: Add tunable vm.pmap.di_locked to set DI mode.
This is done mostly for debugging in field.  Also added the sysctl of
the same name to report used mode.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
2019-05-16 14:29:09 +00:00
Konstantin Belousov
1febb0b0ae amd64 pmap: Rename DI functions.
pmap_delayed_invl_started -> pmap_delayed_invl_start
pmap_delayed_invl_finished -> pmap_delayed_invl_finish

Requested by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
2019-05-16 13:40:54 +00:00
Konstantin Belousov
4d3b28bcdc amd64 pmap: rework delayed invalidation, removing global mutex.
For machines having cmpxcgh16b instruction, i.e. everything but very
early Athlons, provide lockless implementation of delayed
invalidation.

The implementation maintains lock-less single-linked list with the
trick from the T.L. Harris article about volatile mark of the elements
being removed. Double-CAS is used to atomically update both link and
generation.  New thread starting DI appends itself to the end of the
queue, setting the generation to the generation of the last element
+1.  On DI finish, thread donates its generation to the previous
element.  The generation of the fake head of the list is the last
passed DI generation.  Basically, the implementation is a queued
spinlock but without spinlock.

Many thanks both to Peter Holm and Mark Johnson for keeping with me
while I produced intermediate versions of the patch.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
MFC note:	td_md.md_invl_gen should go to the end of struct thread
Differential revision:	https://reviews.freebsd.org/D19630
2019-05-16 13:28:48 +00:00
Ryan Libby
d375016d8d x86: spell vpxor %zmm0 as vpxord
Fix gcc/gas amd64 & i386 build after r347566.

Reviewed by:	kib
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D20264
2019-05-15 18:13:43 +00:00
Edward Tomasz Napierala
060d0b57b8 Fix handling of r10 in Linux ptrace(2). This fixes decoding
of the 'flags' argument to mmap(2) with Linux strace(1).

Reviewed by:	dchagin
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20223
2019-05-14 20:59:44 +00:00
Konstantin Belousov
7355a02bdd Mitigations for Microarchitectural Data Sampling.
Microarchitectural buffers on some Intel processors utilizing
speculative execution may allow a local process to obtain a memory
disclosure.  An attacker may be able to read secret data from the
kernel or from a process when executing untrusted code (for example,
in a web browser).

Reference: https://www.intel.com/content/www/us/en/security-center/advisory/intel-sa-00233.html
Security:	CVE-2018-12126, CVE-2018-12127, CVE-2018-12130, CVE-2019-11091
Security:	FreeBSD-SA-19:07.mds
Reviewed by:	jhb
Tested by:	emaste, lwhsu
Approved by:	so (gtetlow)
2019-05-14 17:02:20 +00:00
Mark Johnston
0ac6ef663b Fix formatting.
MFC after:	3 days
2019-05-14 15:19:48 +00:00
Dmitry Chagin
c5156c7785 Linuxulator depends on a fundamental kernel settings such as SMP. Many
of them listed in opt_global.h which is not generated while building
modules outside of a kernel and such modules never match real cofigured
kernel.

So, we should prevent our users from building obviously defective modules.

Therefore, remove the root cause of the building of modules outside of a
kernel - the possibility of building modules with DEBUG or KTR flags.
And remove all of DEBUG printfs as it is incomplete and in threaded
programms not informative, also a half of system call does not have DEBUG
printf. For debuging Linux programms we have dtrace, ktr and ktrace ability.

PR:		222861
Reviewed by:	trasz
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D20178
2019-05-13 18:24:29 +00:00
Mark Johnston
54a3a11421 Provide separate accounting for user-wired pages.
Historically we have not distinguished between kernel wirings and user
wirings for accounting purposes.  User wirings (via mlock(2)) were
subject to a global limit on the number of wired pages, so if large
swaths of physical memory were wired by the kernel, as happens with
the ZFS ARC among other things, the limit could be exceeded, causing
user wirings to fail.

The change adds a new counter, v_user_wire_count, which counts the
number of virtual pages wired by user processes via mlock(2) and
mlockall(2).  Only user-wired pages are subject to the system-wide
limit which helps provide some safety against deadlocks.  In
particular, while sources of kernel wirings typically support some
backpressure mechanism, there is no way to reclaim user-wired pages
shorting of killing the wiring process.  The limit is exported as
vm.max_user_wired, renamed from vm.max_wired, and changed from u_int
to u_long.

The choice to count virtual user-wired pages rather than physical
pages was done for simplicity.  There are mechanisms that can cause
user-wired mappings to be destroyed while maintaining a wiring of
the backing physical page; these make it difficult to accurately
track user wirings at the physical page layer.

The change also closes some holes which allowed user wirings to succeed
even when they would cause the system limit to be exceeded.  For
instance, mmap() may now fail with ENOMEM in a process that has called
mlockall(MCL_FUTURE) if the new mapping would cause the user wiring
limit to be exceeded.

Note that bhyve -S is subject to the user wiring limit, which defaults
to 1/3 of physical RAM.  Users that wish to exceed the limit must tune
vm.max_user_wired.

Reviewed by:	kib, ngie (mlock() test changes)
Tested by:	pho (earlier version)
MFC after:	45 days
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D19908
2019-05-13 16:38:48 +00:00
Mateusz Guzik
b72515e129 amd64: tidy up pagezero*/pagecopy (movq -> movl)
Sponsored by:	The FreeBSD Foundation
2019-05-12 07:11:44 +00:00
Mateusz Guzik
45372f1a6f amd64: fixup MEMMOVE comment (10 -> r10)
Sponsored by:	The FreeBSD Foundation
2019-05-12 06:42:17 +00:00
Mateusz Guzik
a8c2fcb287 x86: store pending bitmapped IPIs in per-cpu areas
This gets rid of the global cpu_ipi_pending array.

While replace cmpset with fcmpset in the delivery code and opportunistically
check if given IPI is already pending.

Sponsored by:	The FreeBSD Foundation
2019-05-12 06:36:54 +00:00
Mateusz Guzik
8eae2be460 amd64: stop re-reading curpc in suword
Plugs re-reads missed in r341719

Sponsored by:	The FreeBSD Foundation
2019-05-12 06:34:58 +00:00
Andrew Gallatin
542970fa2d Remove IPSEC from GENERIC due to performance issues
Having IPSEC compiled into the kernel imposes a non-trivial
performance penalty on multi-threaded workloads due to IPSEC
refcounting. In my benchmarks of multi-threaded UDP
transmit (connected sockets), I've seen a roughly 20% performance
penalty when the IPSEC option is included in the kernel (16.8Mpps
vs 13.8Mpps with 32 senders on a 14 core / 28 HTT Xeon
2697v3)). This is largely due to key_addref() incrementing and
decrementing an atomic reference count on the default
policy. This cause all CPUs to stall on the same cacheline, as it
bounces between different CPUs.

Given that relatively few users use ipsec, and that it can be
loaded as a module, it seems reasonable to ask those users to
load the ipsec module so as to avoid imposing this penalty on the
GENERIC kernel. Its my hope that this will make FreeBSD look
better in "out of the box" benchmark comparisons with other
operating systems.

Many thanks to ae for fixing auto-loading of ipsec.ko when
ifconfig tries to configure ipsec, and to cy for volunteering
to ensure the the racoon ports will load the ipsec.ko module

Reviewed by:	cem, cy, delphij, gnn, jhb, jpaetzel
Differential Revision:	https://reviews.freebsd.org/D20163
2019-05-09 22:38:15 +00:00
Kyle Evans
251a32b5b2 tun/tap: merge and rename to tuntap
tun(4) and tap(4) share the same general management interface and have a lot
in common. Bugs exist in tap(4) that have been fixed in tun(4), and
vice-versa. Let's reduce the maintenance requirements by merging them
together and using flags to differentiate between the three interface types
(tun, tap, vmnet).

This fixes a couple of tap(4)/vmnet(4) issues right out of the gate:
- tap devices may no longer be destroyed while they're open [0]
- VIMAGE issues already addressed in tun by kp

[0] emaste had removed an easy-panic-button in r240938 due to devdrn
blocking. A naive glance over this leads me to believe that this isn't quite
complete -- destroy_devl will only block while executing d_* functions, but
doesn't block the device from being destroyed while a process has it open.
The latter is the intent of the condvar in tun, so this is "fixed" (for
certain definitions of the word -- it wasn't really broken in tap, it just
wasn't quite ideal).

ifconfig(8) also grew the ability to map an interface name to a kld, so
that `ifconfig {tun,tap}0` can continue to autoload the correct module, and
`ifconfig vmnet0 create` will now autoload the correct module. This is a
low overhead addition.

(MFC commentary)

This may get MFC'd if many bugs in tun(4)/tap(4) are discovered after this,
and how critical they are. Changes after this are likely easily MFC'd
without taking this merge, but the merge will be easier.

I have no plans to do this MFC as of now.

Reviewed by:	bcr (manpages), tuexen (testing, syzkaller/packetdrill)
Input also from:	melifaro
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D20044
2019-05-08 02:32:11 +00:00
Conrad Meyer
fce2d624ea vmm(4): Pass through RDSEED feature bit to guests
Reviewed by:	jhb
Approved by:	#bhyve (jhb)
MFC after:	2 leapseconds
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D20194
2019-05-08 00:40:08 +00:00
Edward Tomasz Napierala
faf2fa21d7 Support PTRACE_GETREGSET w/ NT_PRSTATUS in Linux ptrace(2).
While Linux strace(1) doesn't strictly require it - it has a fallback
to PTRACE_GETREGS - it's a newer interface, so we better support it
before the old one is deprecated.

Reviewed by:	dchagin
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20152
2019-05-07 19:06:41 +00:00
Ed Maste
0e26cd440f make sysent after r347228
Regenerate to add @generated tag in generated files.
2019-05-07 18:10:21 +00:00
Conrad Meyer
665919aaaf x86: Implement MWAIT support for stopping a CPU
IPI_STOP is used after panic or when ddb is entered manually.  MONITOR/
MWAIT allows CPUs that support the feature to sleep in a low power way
instead of spinning.  Something similar is already used at idle.

It is perhaps especially useful in oversubscribed VM environments, and is
safe to use even if the panic/ddb thread is not the BSP.  (Except in the
presence of MWAIT errata, which are detected automatically on platforms with
known wakeup problems.)

It can be tuned/sysctled with "machdep.stop_mwait," which defaults to 0
(off).  This commit also introduces the tunable
"machdep.mwait_cpustop_broken," which defaults to 0, unless the CPU has
known errata, but may be set to "1" in loader.conf to signal that mwait
wakeup is broken on CPUs FreeBSD does not yet know about.

Unfortunately, Bhyve doesn't yet support MONITOR extensions, so this doesn't
help bhyve hypervisors running FreeBSD guests.

Submitted by:   Anton Rang <rang AT acm.org> (earlier version)
Reviewed by:	kib
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D20135
2019-05-04 20:34:26 +00:00
Conrad Meyer
83dc49beaf x86: Define pc_monitorbuf as a logical structure
Rather than just accessing it via pointer cast.

No functional change intended.

Discussed with:	kib (earlier version)
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D20135
2019-05-04 17:35:13 +00:00
John Baldwin
c2b4cedd78 Emulate the "ADD reg, r/m" instruction (opcode 03H).
OVMF's flash variable storage is using add instructions when indexing
the variable store bootrom location.

Submitted by:	D Scott Phillips <d.scott.phillips@intel.com>
Reviewed by:	rgrimes
MFC after:	1 week
Sponsored by:	Intel Corporation
Differential Revision:	https://reviews.freebsd.org/D19975
2019-05-03 21:48:42 +00:00
Dmitry Chagin
d151344dbf In order to reduce duplication between MD parts of the Linuxulator
move bits that are MI out into the headers in compat/linux.
For that remove bogus _packed attribute from struct l_sockaddr
and use MI types for struct members.

And continue to move into the linux_common module a code that is
intended for both Linuxulator modules (both instruction set - 32 & 64 bit)
or for external modules like linsysfs or linprocfs.

To avoid header pollution introduce new sys/compat/linux_common.h header.

Reviewed by:	emaste
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D20137
2019-05-03 08:42:49 +00:00
Conrad Meyer
d6745408c7 Add a COMPAT_FREEBSD12 kernel option.
Use it wherever COMPAT_FREEBSD11 is currently specified, like r309749.

Reviewed by:	imp, jhb, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D20120
2019-05-02 18:10:23 +00:00
Rodney W. Grimes
a488c9c99a Add accessor function for vm->maxcpus
Replace most VM_MAXCPU constant useses with an accessor function to
vm->maxcpus which for now is initialized and kept at the value of
VM_MAXCPUS.

This is a rework of Fabian Freyer (fabian.freyer_physik.tu-berlin.de)
work from D10070 to adjust it for the cpu topology changes that
occured in r332298

Submitted by:		Fabian Freyer (fabian.freyer_physik.tu-berlin.de)
Reviewed by:		Patrick Mooney <patrick.mooney@joyent.com>
Approved by:		bde (mentor), jhb (maintainer)
MFC after:		3 days
Differential Revision:	https://reviews.freebsd.org/D18755
2019-04-25 22:51:36 +00:00
Dmitry Chagin
c034ecf316 Since r339624 HEAD does not need for backslashes in syscalls.master,
however to make a merge r345471 to the stable add backslashes
to the syscalls.master.

MFC after:	3 days
2019-04-23 18:10:46 +00:00
Konstantin Belousov
fdfe249b63 Fix initial x87 state after r345562.
After the referenced commit, we did not set x87 and sse valid bits in
the xstate_bv bitmask for initial fpu state (stored in memory), when
using XSAVE.

The state is loaded into FPU register file to initialize the process
FPU state, and since both bits were clear, the default x87 and SSE
states were loaded.  By chance, FreeBSD ABI SSE2 state is same as FPU
initial state, so the bug is not visible for 64bit processes.  But on
i386, the precision control should be set to double (53bit mantissa),
instead of the default double extended (64bit mantissa). For 32bit
processes on amd64, kernel reloads control word with the right mask,
which only left native i386 and amd64 native but using x87 as
affected.

Fix it by setting minimal required xstate_bv mask.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-04-16 19:46:02 +00:00
Warner Losh
f7ab01581a Move mpr/mps drivers from per-arch NOTES files into the MI notes
file. They are in more arches they they aren't. Add appropriate
nodevice directives in powerpc and arm.
2019-04-13 06:30:45 +00:00
Konstantin Belousov
2a508645b4 pci_cfgreg.c: Use io port config access for early boot time.
Some early PCIe chipsets are explicitly listed in the white-list to
enable use of the MMIO config space accesses, perhaps because ACPI
tables were not reliable source of the base MCFG address at that time.
For that chipsets, MCFG base was read from the known chipset MCFGbase
config register.

During very early stage of boot, when access to the PCI config space
is performed (see e.g. pci_early_quirks.c), we cannot map 255MB of
registers because the method used with pre-boot pmap overflows initial
kernel page tables.

Move fallback to read MCFGbase to the attachment method of the
x86/legacy device, which removes code duplication, and results in the
use of io accesses until MCFG is parsed or legacy attach called.

For amd64, pre-initialize cfgmech with CFGMECH_1, right now we
dynamically assign CFGMECH_1 to it anyway, and remove checks for
CFGMECH_NONE.

There is a mention in the Intel documentation for corresponding
chipsets that OS must use either io port or MMIO access method, but we
already break this rule by reading MCFGbase register, so one more
access seems to be innocent.

Reported by:	longwitz@incore.de
PR:	236838
Reviewed by:	avg (other version), jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D19833
2019-04-09 18:07:17 +00:00
Konstantin Belousov
5db2a4a812 Implement resets for PCI buses and PCIe bridges.
For PCI device (i.e. child of a PCI bus), reset tries FLR if
implemented and worked, and falls to power reset otherwise.

For PCIe bus (child of a PCIe bridge or root port), reset
disables PCIe link and then re-trains it, performing what is known as
link-level reset.

Reviewed by:	imp (previous version), jhb (previous version)
Sponsored by:	Mellanox Technologies
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D19646
2019-04-05 19:25:26 +00:00
Warner Losh
d99880cf46 Add mpr, mps, mpt to NOTES file
Add these to all the architectures that these are in the GENERIC
kernel.
2019-04-05 02:54:02 +00:00
Jung-uk Kim
278f0de60d Merge ACPICA 20190329. 2019-03-29 20:21:28 +00:00
Conrad Meyer
8207def158 x86: Use XSAVEOPT for fpusave(), when available
Remove redundant npxsave_core definition while here.

Suggested by:	Anton Rang
Reviewed by:	kib, Anton Rang <rang AT acm.org>
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D19665
2019-03-26 22:45:41 +00:00
Dmitry Chagin
1f66cb5154 Regen from r345471.
MFC after:	1 month
2019-03-24 14:51:17 +00:00
Dmitry Chagin
f730d606d5 Update syscall.master to 5.0.
For 32-bit Linuxulator, ipc() syscall was historically
the entry point for the IPC API. Starting in Linux 4.18, direct
syscalls are provided for the IPC. Enable it.

MFC after:	1 month
2019-03-24 14:50:02 +00:00
Dmitry Chagin
d9be8b39a5 Regen for r345469 (shmat()).
MFC after:	1 month
2019-03-24 14:46:07 +00:00
Dmitry Chagin
7dabf89bcf Linux between 4.18 and 5.0 split IPC system calls.
In preparation for doing this in the Linuxulator modify our linux_shmat()
to match actual Linux shmat() system call.

MFC after:	1 month
2019-03-24 14:44:35 +00:00
Dmitry Chagin
a7b87a2d95 Revert r313993.
AMD64_SET_**BASE expects a pointer to a pointer, we just passing in the pointer value itself.

Set PCB_FULL_IRET for doreti to restore %fs, %gs and its correspondig base.

PR:		225105
Reported by:	trasz@
MFC after:	1 month
2019-03-24 14:02:57 +00:00
Mark Johnston
64087fd7f3 Disallow preemptive creation of wired superpage mappings.
There are some unusual cases where a process may cause an mlock()ed
range of memory to be unmapped.  If the application subsequently
faults on that region, the handler may attempt to create a superpage
mapping backed by the resident, wired pages.  However, the pmap code
responsible for creating such a mapping (pmap_enter_pde() on i386
and amd64) does not ensure that a leaf page table page is available
if the superpage is later demoted; the demotion operation must therefore
perform a non-blocking page allocation and must unmap the entire
superpage if the allocation fails.  The pmap layer ensures that this
can never happen for wired mappings, and so the case described above
breaks that invariant.

For now, simply ensure that the MI fault handler never attempts to
create a wired superpage except via promotion.

Reviewed by:	kib
Reported by:	syzbot+292d3b0416c27c131505@syzkaller.appspotmail.com
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D19670
2019-03-21 19:52:50 +00:00
Marcin Wojtas
3caad0b8f4 Prevent loading SGX with incorrect EPC data
It may happen on some machines, that even if SGX is disabled
in firmware, the driver would still attach despite EPC base and
size equal zero. Such behaviour causes a kernel panic when the
module is unloaded. Add a simple check to make sure we
only attach when these values are correctly set.

Submitted by: Kornel Duleba <mindal@semihalf.com>
Reviewed by: br
Obtained from: Semihalf
Sponsored by: Stormshield
Differential Revision: https://reviews.freebsd.org/D19595
2019-03-19 02:33:58 +00:00
Konstantin Belousov
fd8d844f76 amd64 KPTI: add control from procctl(2).
Add the infrastructure to allow MD procctl(2) commands, and use it to
introduce amd64 PTI control and reporting.  PTI mode cannot be
modified for existing pmap, the knob controls PTI of the new vmspace
created on exec.

Requested by:	jhb
Reviewed by:	jhb, markj (previous version)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D19514
2019-03-16 11:44:33 +00:00