Commit Graph

8376 Commits

Author SHA1 Message Date
Konstantin Belousov
6f1fe3305a amd64: Add md process flags and first P_MD_PTI flag.
PTI mode for the process pmap on exec is activated iff P_MD_PTI is set.

On exec, the existing vmspace can be reused only if pti mode of the
pmap matches the P_MD_PTI flag of the process.  Add MD
cpu_exec_vmspace_reuse() callback for exec_new_vmspace() which can
vetoed reuse of the existing vmspace.

MFC note: md_flags change struct proc KBI.

Reviewed by:	jhb, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D19514
2019-03-16 11:31:01 +00:00
Konstantin Belousov
c1c120b2cb amd64: fix switching to the pmap with pti disabled.
When the pmap with pti disabled (i.e. pm_ucr3 == PMAP_NO_CR3) is
activated, tss.rsp0 was not updated.  Any interrupt that happen before
next context switch would use pti trampoline stack for hardware frame
but fault and interrupt handlers are not prepared to this.  Correctly
update tss.rsp0 for both PMAP_NO_CR3 and pti pmaps.

Note that this case, pti = 1 but pmap->pm_ucr3 == PMAP_NO_CR3 is not
used at the moment.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D19514
2019-03-16 11:16:09 +00:00
Konstantin Belousov
a9262f497a amd64: rewrite cpu_switch.S fragment to reload tss.rsp0 on context switch.
New code avoids jumps.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D19514
2019-03-16 11:12:02 +00:00
Konstantin Belousov
39d70f6b80 Provide deterministic (and somewhat useful) value for RDPID result,
and for %ecx after RDTSCP.

Initialize TSC_AUX MSR with CPUID.  It allows for userspace to cheaply
identify CPU it was executed on some time ago, which is sometimes useful.

Note: The values returned might be changed in future.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-03-15 16:43:28 +00:00
Warner Losh
329f0aa952 Kill tz_minuteswest and tz_dsttime.
Research Unix, 7th Edition introduced TIMEZONE and DSTFLAG
compile-time constants in sys/param.h to communicate these values for
the machine. 4.2BSD moved from the compile-time to run-time and
introduced these variables and used for localtime() to return the
right offset from UTC (sometimes referred to as GMT, for this purpose
is the same). 4.4BSD migrated to using the tzdata code/database and
these variables were basically unused.

FreeBSD removed the real need for these with adjkerntz in
1995. However, some RTC clocks continued to use these variables,
though they were largely unused otherwise.  Later, phk centeralized
most of the uses in utc_offset, but left it using both tz_minuteswest
and adjkerntz.

POSIX (IEEE Std 1003.1-2017) states in the gettimeofday specification
"If tzp is not a null pointer, the behavior is unspecified" so there's
no standards reason to retain it anymore. In fact, gettimeofday has
been marked as obsolecent, meaning it could be removed from a future
release of the standard. It is the only interface defined in POSIX
that references these two values. All other references come from the
tzdata database via tzset().

These were used to more faithfully implement early unix ABIs which
have been removed from FreeBSD.  NetBSD has completely eliminated
these variables years ago. Linux has migrated to tzdata as well,
though these variables technically still exist for compatibility
with unspecified older programs.

So, there's no real reason to have them these days. They are a
historical vestige that's no longer used in any meaningful way.

Reviewed By: jhb@, brooks@
Differential Revision: https://reviews.freebsd.org/D19550
2019-03-12 04:49:47 +00:00
Andrey V. Elsukov
40025d42fd Fix typo.
MFC after:	1 week
2019-03-07 10:01:32 +00:00
Matt Macy
030963c090 add gcov to LINT build
MFC after:	1 week
2019-03-07 03:50:34 +00:00
John Baldwin
2e43efd0bb Drop "All rights reserved" from my copyright statements.
Reviewed by:	rgrimes
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D19485
2019-03-06 22:11:45 +00:00
John Baldwin
2c352feb3b Fix missed posted interrupts in VT-x in bhyve.
When a vCPU is HLTed, interrupts with a priority below the processor
priority (PPR) should not resume the vCPU while interrupts at or above
the PPR should.  With posted interrupts, bhyve maintains a bitmap of
pending interrupts in PIR descriptor along with a single 'pending'
bit.  This bit is checked by a CPU running in guest mode at various
places to determine if it should be checked.  In addition, another CPU
can force a CPU in guest mode to check for pending interrupts by
sending an IPI to a special IDT vector reserved for this purpose.

bhyve had a bug in that it would only notify a guest vCPU of an
interrupt (e.g. by sending the special IPI or by resuming it if it was
idle due to HLT) if an interrupt arrived that was higher priority than
PPR and no interrupts were currently pending.  This assumed that if
the 'pending' bit was set, any needed notification was already in
progress.  However, if the first interrupt sent to a HLTed vCPU was
lower priority than PPR and the second was higher than PPR, the first
interrupt would set 'pending' but not notify the vCPU, and the second
interrupt would not notify the vCPU because 'pending' was already set.
To fix this, track the priority of pending interrupts in a separate
per-vCPU bitmask and notify a vCPU anytime an interrupt arrives that
is above PPR and higher than any previously-received interrupt.

This was found and debugged in the bhyve port to SmartOS maintained by
Joyent.  Relevant SmartOS bugs with more background:

https://smartos.org/bugview/OS-6829
https://smartos.org/bugview/OS-6930
https://smartos.org/bugview/OS-7354

Submitted by:	Patrick Mooney <pmooney@pfmooney.com>
Reviewed by:	tychon, rgrimes
Obtained from:	SmartOS / Joyent
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D19299
2019-03-01 20:43:48 +00:00
Edward Tomasz Napierala
1699546def Remove sv_pagesize, originally introduced with r100384.
In all of the architectures we have today, we always use PAGE_SIZE.
While in theory one could define different things, none of the
current architectures do, even the ones that have transitioned from
32-bit to 64-bit like i386 and arm. Some ancient mips binaries on
other systems used 8k instead of 4k, but we don't support running
those and likely never will due to their age and obscurity.

Reviewed by:	imp (who also contributed the commit message)
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D19280
2019-03-01 16:16:38 +00:00
Konstantin Belousov
e7a9df16e6 Add kernel support for Intel userspace protection keys feature on
Skylake Xeons.

See SDM rev. 68 Vol 3 4.6.2 Protection Keys and the description of the
RDPKRU and WRPKRU instructions.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D18893
2019-02-20 09:51:13 +00:00
Konstantin Belousov
87b1bf4f31 amd64: add defines and decode protection keys and SGX page faults reasons.
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D18893
2019-02-20 09:46:44 +00:00
Warner Losh
b1ece24388 Remove drm from LINT kernels
drm was accidentally left in the LINT kernels.

Pointy hat to: imp
2019-02-19 21:20:50 +00:00
Konstantin Belousov
5ddeaf67c6 Provide convenience C wrappers for RDPKRU and WRPKRU instructions.
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
Differential revision:	https://reviews.freebsd.org/D18893
2019-02-19 19:17:20 +00:00
Konstantin Belousov
8cbe929be5 amd64: cleanup pmap_init_pat().
The pmap_works variable is always true for amd64.  Remove it, the
branch in the initialization taken when false, and corresponding
sysctl.

Remove pat_table[] local array, work on pat_index[] directly.

Collapse whole initialization to not override already assigned values.

Add comment explaining the choice for PAT4 and PAT7.

Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
MFC note:	Leave the sysctl around
Differential revision:	https://reviews.freebsd.org/D19225
2019-02-18 16:02:00 +00:00
Mark Johnston
8cbc89c7d2 Fix refcount leaks in the SGX Linux compat ioctl handler.
Some argument validation error paths would return without releasing the
file reference obtained at the beginning of the function.

While here, fix some style bugs and remove trivial debug prints.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D19214
2019-02-17 16:43:44 +00:00
Konstantin Belousov
fa50a3552d Implement Address Space Layout Randomization (ASLR)
With this change, randomization can be enabled for all non-fixed
mappings.  It means that the base address for the mapping is selected
with a guaranteed amount of entropy (bits). If the mapping was
requested to be superpage aligned, the randomization honours the
superpage attributes.

Although the value of ASLR is diminshing over time as exploit authors
work out simple ASLR bypass techniques, it elimintates the trivial
exploitation of certain vulnerabilities, at least in theory.  This
implementation is relatively small and happens at the correct
architectural level.  Also, it is not expected to introduce
regressions in existing cases when turned off (default for now), or
cause any significant maintaince burden.

The randomization is done on a best-effort basis - that is, the
allocator falls back to a first fit strategy if fragmentation prevents
entropy injection.  It is trivial to implement a strong mode where
failure to guarantee the requested amount of entropy results in
mapping request failure, but I do not consider that to be usable.

I have not fine-tuned the amount of entropy injected right now. It is
only a quantitive change that will not change the implementation.  The
current amount is controlled by aslr_pages_rnd.

To not spoil coalescing optimizations, to reduce the page table
fragmentation inherent to ASLR, and to keep the transient superpage
promotion for the malloced memory, locality clustering is implemented
for anonymous private mappings, which are automatically grouped until
fragmentation kicks in.  The initial location for the anon group range
is, of course, randomized.  This is controlled by vm.cluster_anon,
enabled by default.

The default mode keeps the sbrk area unpopulated by other mappings,
but this can be turned off, which gives much more breathing bits on
architectures with small address space, such as i386.  This is tied
with the question of following an application's hint about the mmap(2)
base address. Testing shows that ignoring the hint does not affect the
function of common applications, but I would expect more demanding
code could break. By default sbrk is preserved and mmap hints are
satisfied, which can be changed by using the
kern.elf{32,64}.aslr.honor_sbrk sysctl.

ASLR is enabled on per-ABI basis, and currently it is only allowed on
FreeBSD native i386 and amd64 (including compat 32bit) ABIs.  Support
for additional architectures will be added after further testing.

Both per-process and per-image controls are implemented:
- procctl(2) adds PROC_ASLR_CTL/PROC_ASLR_STATUS;
- NT_FREEBSD_FCTL_ASLR_DISABLE feature control note bit makes it possible
  to force ASLR off for the given binary.  (A tool to edit the feature
  control note is in development.)
Global controls are:
- kern.elf{32,64}.aslr.enable - for non-fixed mappings done by mmap(2);
- kern.elf{32,64}.aslr.pie_enable - for PIE image activation mappings;
- kern.elf{32,64}.aslr.honor_sbrk - allow to use sbrk area for mmap(2);
- vm.cluster_anon - enables anon mapping clustering.

PR:	208580 (exp runs)
Exp-runs done by:	antoine
Reviewed by:	markj (previous version)
Discussed with:	emaste
Tested by:	pho
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D5603
2019-02-10 17:19:45 +00:00
Conrad Meyer
7e804fd5c5 Revert r343713 temporarily
The COVERAGE option breaks xtoolchain-gcc GENERIC kernel early boot
extremely badly and hasn't been fixed for the ~week since it was committed.
Please enable for GENERIC only when it doesn't do that.

Related fallout reported by:	lwhsu, tuexen (pr 235611)
2019-02-10 07:54:46 +00:00
Ed Maste
e26563b8c7 Retire SPX_HACK option unused after r342244 2019-02-06 17:21:25 +00:00
Konstantin Belousov
762138f78f amd64: clear callee-preserved registers on syscall exit.
%r8, %r10, and on non-KPTI configuration %r9 were not restored on fast
return from a syscall.

Reviewed by:	markj
Approved by:	so
Security:	CVE-2019-5595
Sponsored by:	The FreeBSD Foundation
MFC after:	0 minutes
2019-02-05 17:49:27 +00:00
Andrew Turner
634a8a8873 Enable COVERAGE and KCOV by default on arm64 and amd64.
This allows userspace to trace the kernel using the coverage sanitizer
found in clang. It will also allow other coverage tools to be built as
modules and attach into the same framework.

Sponsored by:	DARPA, AFRL
2019-02-03 12:46:27 +00:00
Konstantin Belousov
c75f49f7d8 Make iflib a loadable module.
iflib is already a module, but it is unconditionally compiled into the
kernel.  There are drivers which do not need iflib(4), and there are
situations where somebody might not want iflib in kernel because of
using the corresponding driver as module.

Reviewed by:	marius
Discussed with:	erj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D19041
2019-01-31 19:05:56 +00:00
Andrew Turner
524553f56d Extract the coverage sanitizer KPI to a new file.
This will allow multiple consumers of the coverage data to be compiled
into the kernel together. The only requirement is only one can be
registered at a given point in time, however it is expected they will
only register when the coverage data is needed.

A new kernel conflig option COVERAGE is added. This will allow kcov to
become a module that can be loaded as needed, or compiled into the
kernel.

While here clean up the #include style a little.

Reviewed by:	kib
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D18955
2019-01-29 11:04:17 +00:00
Andriy Voskoboinyk
86d535ab47 Garbage collect AH_SUPPORT_AR5416 config option.
It does nothing since r318857.
2019-01-25 13:48:40 +00:00
Ed Maste
1b1f24b936 linuxulator: fix stack memory disclosure in linux_sigaltstack
admbugs:	765
Reported by:	Vlad Tsyrklevich <vlad@tsyrklevich.net>
Reviewed by:	andrew
MFC after:	1 day
Security:	Kernel memory disclosure
Sponsored by:   The FreeBSD Foundation
2019-01-21 16:25:40 +00:00
Andriy Voskoboinyk
4945f79a4c Remove IEEE80211_AMPDU_AGE config option.
It is noop since r297774.
2019-01-20 15:17:56 +00:00
Conrad Meyer
d0c7cde53e vmm(4): Mask Spectre feature bits on AMD hosts
For parity with Intel hosts, which already mask out the CPUID feature
bits that indicate the presence of the SPEC_CTRL MSR, do the same on
AMD.

Eventually we may want to have a better support story for guests, but
for now, limit the damage of incorrectly indicating an MSR we do not yet
support.

Eventually, we may want a generic CPUID override system for
administrators, or for minimum supported feature set in heterogenous
environments with failover.  That is a much larger scope effort than
this bug fix.

PR:		235010
Reported by:	Rys Sommefeldt <rys AT sommefeldt.com>
Sponsored by:	Dell EMC Isilon
2019-01-18 23:54:51 +00:00
Konstantin Belousov
f1dc49f33a Trim whitespace at EoL, use tabs instead of spaces for indent.
PR:  235004
Submitted by:	Jose Luis Duran <jlduran@gmail.com>
MFC after:	3 days
2019-01-17 05:15:25 +00:00
Conrad Meyer
15b7da10ac vmm(4): Take steps towards multicore bhyve AMD support
vmm's CPUID emulation presented Intel topology information to the guest, but
disabled AMD topology information and in some cases passed through garbage.
I.e., CPUID leaves 0x8000_001[de] were passed through to the guest, but
guest CPUs can migrate between host threads, so the information presented
was not consistent.  This could easily be observed with 'cpucontrol -i 0xfoo
/dev/cpuctl0'.

Slightly improve this situation by enabling the AMD topology feature flag
and presenting at least the CPUID fields used by FreeBSD itself to probe
topology on more modern AMD64 hardware (Family 15h+).  Older stuff is
probably less interesting.  I have not been able to empirically confirm it
is sufficient, but it should not regress anything either.

Reviewed by:	araujo (previous version)
Relnotes:	sure
2019-01-16 02:19:04 +00:00
Andrew Turner
b3c0d957a2 Add support for the Clang Coverage Sanitizer in the kernel (KCOV).
When building with KCOV enabled the compiler will insert function calls
to probes allowing us to trace the execution of the kernel from userspace.
These probes are on function entry (trace-pc) and on comparison operations
(trace-cmp).

Userspace can enable the use of these probes on a single kernel thread with
an ioctl interface. It can allocate space for the probe with KIOSETBUFSIZE,
then mmap the allocated buffer and enable tracing with KIOENABLE, with the
trace mode being passed in as the int argument. When complete KIODISABLE
is used to disable tracing.

The first item in the buffer is the number of trace event that have
happened. Userspace can write 0 to this to reset the tracing, and is
expected to do so on first use.

The format of the buffer depends on the trace mode. When in PC tracing just
the return address of the probe is stored. Under comparison tracing the
comparison type, the two arguments, and the return address are traced. The
former method uses on entry per trace event, while the later uses 4. As
such they are incompatible so only a single mode may be enabled.

KCOV is expected to help fuzzing the kernel, and while in development has
already found a number of issues. It is required for the syzkaller system
call fuzzer [1]. Other kernel fuzzers could also make use of it, either
with the current interface, or by extending it with new modes.

A man page is currently being worked on and is expected to be committed
soon, however having the code in the kernel now is useful for other
developers to use.

[1] https://github.com/google/syzkaller

Submitted by:	Mitchell Horne <mhorne063@gmail.com> (Earlier version)
Reviewed by:	kib
Testing by:	tuexen
Sponsored by:	DARPA, AFRL
Sponsored by:	The FreeBSD Foundation (Mitchell Horne)
Differential Revision:	https://reviews.freebsd.org/D14599
2019-01-12 11:21:28 +00:00
Fedor Uporov
6651cf410c Fix errno values returned from DUMMY_XATTR linuxulator calls
Reported by: weiss@uni-mainz.de
Reviewed by: markj
MFC after: 1 day
Differential Revision: https://reviews.freebsd.org/D18812
2019-01-11 07:58:25 +00:00
Konstantin Belousov
f0d85a5dc5 x86: Report per-cpu IPI TLB shootdown generation in ddb 'show pcpu' output.
It is useful for inspecting tlb shootdown hangs.  The smp_tlb_generation value
is available using regular ddb data inspection commands.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2019-01-04 17:25:47 +00:00
Mark Johnston
9bfc7fa41d Avoid setting PG_U unconditionally in pmap_enter_quick_locked().
This KPI may in principle be used to create kernel mappings, in which
case we certainly should not be setting PG_U.  In any case, PG_U must be
set on all layers in the page tables to grant user mode access, and we
were only setting it on leaf entries.  Thus, this change should have no
functional impact.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2019-01-02 15:36:35 +00:00
Mateusz Guzik
628888f0e0 Remove iBCS2, part2: general kernel
Reviewed by:	kib (previous version)
Sponsored by:	The FreeBSD Foundation
2018-12-19 21:57:58 +00:00
Mateusz Guzik
3c76ace36b amd64: stop re-reading curpc on subyte/suword
Originally read value is still safely kept. Re-reading code was there
for previous iterations which were partially shared with i386.

Sponsored by:	The FreeBSD Foundation
2018-12-08 04:53:08 +00:00
Mark Johnston
352aaa5122 Plug memory disclosures via ptrace(2).
On some architectures, the structures returned by PT_GET*REGS were not
fully populated and could contain uninitialized stack memory.  The same
issue existed with the register files in procfs.

Reported by:	Thomas Barabosch, Fraunhofer FKIE
Reviewed by:	kib
MFC after:	3 days
Security:	kernel stack memory disclosure
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18421
2018-12-03 20:54:17 +00:00
Mateusz Guzik
ddf6571230 amd64: align target memmove buffer to 16 bytes before using rep movs
See the review for sample test results.

Reviewed by:	kib (kernel part)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18401
2018-12-01 14:20:32 +00:00
Mateusz Guzik
94243af2da amd64: handle small memmove buffers with overlapping stores
Handling sizes of > 32 backwards will be updated later.

Reviewed by:	kib (kernel part)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18387
2018-11-30 20:58:08 +00:00
Mateusz Guzik
2847cfce54 amd64: remove stale attribution for memmove work
While the routine started as expanded bcopy, it is now entirely rewritten.

Sponsored by:	The FreeBSD Foundation
2018-11-30 00:47:36 +00:00
Mateusz Guzik
dd219e5ea5 amd64: tidy up copying backwards in memmove
For non-ERMS case the code used handle possible trailing bytes with
movsb first and then followed it up with movsq. This also happened
to alter how calculations were done for other cases.

Handle the tail with regular movs, just like when copying forward.
Use leaq to calculate the right offset from the get go, instead of
doing separate add and sub.

This adjusts the offset for non-rep cases so that they can be used
to handle the tail.

The routine is still a work in progress.

Sponsored by:	The FreeBSD Foundation
2018-11-30 00:45:10 +00:00
Konstantin Belousov
32b083531f Fix assert condition in pmap_large_unmap().
pmap_large_unmap() asserts that an unmapping request covers the
entirety of a 2M or 1G page.  The logic in the asserts was out of date
with the loop logic.  Correct the test to actually check that
destroying the current superpage mapping does not unmap addresses
beyond those requested by the caller.

Submitted by:	D Scott Phillips <d.scott.phillips@intel.com>
Reviewed by:	alc
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D18345
2018-11-27 21:40:51 +00:00
Eric van Gyzen
607a0eb2f1 Remove superfluous bzero in getcontext/swapcontext/sendsig
We zero the whole structure; we don't need to zero the __spare__ field again.

Remove trailing whitespace.

MFC after:	2 weeks
Sponsored by:	Dell EMC Isilon
2018-11-26 20:56:05 +00:00
Eric van Gyzen
f5e7d8bdb5 Prevent kernel stack disclosure in getcontext/swapcontext
Expand r338982 to cover freebsd32 interfaces on amd64, mips, and powerpc.

MFC after:	2 days
Security:	FreeBSD-EN-18:12.mem
Security:	CVE-2018-17155
Sponsored by:	Dell EMC Isilon
2018-11-26 20:50:55 +00:00
Mark Johnston
2910a16124 Clear unused bytes in ia32_osendsig().
Mirror the fix for the native i386 implementation from r218327.  This
code is compiled only when the non-default COMPAT_43 option is
configured.

Reported by:	Ilja Van Sprundel <ivansprundel@ioactive.com>
Reviewed by:	kib
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18298
2018-11-22 17:51:19 +00:00
Konstantin Belousov
2343757338 Align IA32_ARCH_CAP MSR definitions and use with SDM rev. 068.
SDM rev. 068 was released yesterday and it contains the description of
the MSR 0x10a IA32_ARCH_CAP. This change adds symbolic definitions for
all bits present in the document, and decode them in the CPU
identification lines printed on boot.

But also, the document defines SSB_NO as bit 4, while FreeBSD used but
2 to detect the need to work-around Speculative Store Bypass
issue.  Change code to use the bit from SDM.

Similarly, the document describes bit 3 as an indicator that L1TF
issue is not present, in particular, no L1D flush is needed on
VMENTRY.  We used RDCL_NO to avoid flushing, and again I changed the
code to follow new spec from SDM.

In fact my Apollo Lake machine with latest ucode shows this:
    IA32_ARCH_CAPS=0x19<RDCL_NO,SKIP_L1DFL_VME,SSB_NO>

Reviewed by:	bwidawsk
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
Differential revision:	https://reviews.freebsd.org/D18006
2018-11-16 21:27:11 +00:00
Mateusz Guzik
088ac3ef4b amd64: handle small memset buffers with overlapping stores
Instead of jumping to locations which store the exact number of bytes,
use displacement to move the destination.

In particular the following clears an area between 8-16 (inclusive)
branch-free:

movq    %r10,(%rdi)
movq    %r10,-8(%rdi,%rcx)

For instance for rcx of 10 the second line is rdi + 10 - 8 = rdi + 2.
Writing 8 bytes starting at that offset overlaps with 6 bytes written
previously and writes 2 new, giving 10 in total.

Provides a nice win for smaller stores. Other ones are erratic depending
on the microarchitecture.

General idea taken from NetBSD (restricted use of the trick) and bionic
string functions (use for various ranges like in this patch).

Reviewed by:	kib (previous version)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17660
2018-11-16 00:44:22 +00:00
Matt Macy
f1bac7bb74 Add ZFS to amd64 NOTES to catch future breakage of static linking 2018-11-13 23:08:46 +00:00
Niclas Zeising
af14df7703 Add evdev support to amd64 and i386 kernels
Include evdev support and drivers in the amd64 and i386 GENERIC and MINIMAL
kernels.  Evdev is used by X and wayland to handle input devices, and this
change, together with upcomming changes in ports will make us handle input
devices better in graphical UIs.

Reviewed by:	wulf, bapt, imp
Approved by:	imp
Differential Revision:	https://reviews.freebsd.org/D17912
2018-11-12 21:01:28 +00:00
Konstantin Belousov
83813c6696 Apply fix to un-cripple max cpu id on BSP earlier.
We need to know actual value for the standard extended features before
ifuncs are resolved.

Reported and tested by:	madpilot
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-11-12 19:17:26 +00:00
Mateusz Guzik
f1161465f4 amd64: align memset buffers to 16 bytes before using rep stos
Both Intel manual and Agner Fog's docs suggest aligning to 16.

See the review for benchmark results.

Reviewed by:	kib (previous version)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17661
2018-11-08 15:12:36 +00:00
Andrew Turner
3869df5d71 Add the KUBSAN options to the arm64 and amd64 GENERIC kernel config files.
As the kernel file size may be too large to run with a stock loader comment
them out for now.

Sponsored by:	DARPA, AFRL
2018-11-06 17:47:58 +00:00
Tijl Coosemans
02bf7e5e40 Fix builds with COMPAT_LINUX32 in the kernel config.
MFC after:	3 days
2018-11-06 15:29:44 +00:00
Tijl Coosemans
8fc08087a1 On amd64 both Linux compat modules, linux.ko and linux64.ko, provide
linux_ioctl_(un)register_handler that allows other driver modules to
register ioctl handlers.  The ioctl syscall implementation in each Linux
compat module iterates over the list of handlers and forwards the call to
the appropriate driver.  Because the registration functions have the same
name in each module it is not possible for a driver to support both 32 and
64 bit linux compatibility.

Move the list of ioctl handlers to linux_common.ko so it is shared by
both Linux modules and all drivers receive both 32 and 64 bit ioctl calls
with one registration.  These ioctl handlers normally forward the call
to the FreeBSD ioctl handler which can handle both 32 and 64 bit.

Keep the special COMPAT_LINUX32 ioctl handlers in linux.ko in a separate
list for now and let the ioctl syscall iterate over that list first.
Later, COMPAT_LINUX32 support can be added to the 64 bit ioctl handlers
via a runtime check for ILP32 like is done for COMPAT_FREEBSD32 and then
this separate list would disappear again.  That is a much bigger effort
however and this commit is meant to be MFCable.

This enables linux64 support in x11/nvidia-driver*.

PR:		206711
Reviewed by:	kib
MFC after:	3 days
2018-11-06 13:51:08 +00:00
John Baldwin
7f7f6f85a1 Add a custom implementation of cpu_lock_delay() for x86.
Avoid using DELAY() since it can try to use spin locks on CPUs without
a P-state invariant TSC.  For cpu_lock_delay(), always use the TSC if
it exists (even if it is not P-state invariant) to delay for a
microsecond.  If the TSC does not exist, read from I/O port 0x84 to
delay instead.

PR:		228768
Reported by:	Roger Hammerstein <cheeky.m@live.com>
Reviewed by:	kib
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D17851
2018-11-05 22:54:03 +00:00
John Baldwin
4cbbb74888 Add a KPI for the delay while spinning on a spin lock.
Replace a call to DELAY(1) with a new cpu_lock_delay() KPI.  Currently
cpu_lock_delay() is defined to DELAY(1) on all platforms.  However,
platforms with a DELAY() implementation that uses spin locks should
implement a custom cpu_lock_delay() doesn't use locks.

Reviewed by:	kib
MFC after:	3 days
2018-11-05 21:34:17 +00:00
John Baldwin
b317cfd4c0 Don't enter DDB for fatal traps before panic by default.
Add a new 'debugger_on_trap' knob separate from 'debugger_on_panic'
and make the calls to kdb_trap() in MD fatal trap handlers prior to
calling panic() conditional on this new knob instead of
'debugger_on_panic'.  Disable the new knob by default.  Developers who
wish to recover from a fatal fault by adjusting saved register state
and retrying the faulting instruction can still do so by enabling the
new knob.  However, for the more common case this makes the user
experience for panics due to a fatal fault match the user experience
for other panics, e.g. 'c' in DDB will generate a crash dump and
reboot the system rather than being stuck in an infinite loop of fatal
fault messages and DDB prompts.

Reviewed by:	kib, avg
MFC after:	2 months
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D17768
2018-11-01 21:34:17 +00:00
Konstantin Belousov
6bc6a54280 Add pci_early function to detect Intel stolen memory.
On some Intel devices BIOS does not properly reserve memory (called
"stolen memory") for the GPU.  If the stolen memory is claimed by the
OS, functions that depend on stolen memory (like frame buffer
compression) can't be used.

A function called pci_early_quirks that is called before the virtual
memory system is started was added. In Linux, this PCI early quirks
function iterates through all PCI slots to check for any device that
require quirks.  While this more generic solution is preferable I only
ported the Intel graphics specific parts because I think my
implementation would be too similar to Linux GPL'd solution after
looking at the Linux code too much.

The code regarding Intel graphics stolen memory was ported from
Linux. In the case of Intel graphics stolen memory this
pci_early_quirks will read the stolen memory base and size from north
bridge registers.  The values are stored in global variables that is
later read by linuxkpi_gplv2. Linuxkpi stores these values in a
Linux-specific structure that is read by the drm driver.

Relevant linuxkpi code is here:
https://github.com/FreeBSDDesktop/kms-drm/blob/drm-v4.16/linuxkpi/gplv2/src/linux_compat.c#L37

For now, only amd64 arch is suppor ted since that is the only arch
supported by the new drm drivers. I was told that Intel GPUs are
always located on 0:2:0 so these values are hard coded for now.

Note that the structure and early execution of the detection code is
not required in its current form, but we expect that the code will be
added shortly which fixes the potential BIOS bugs by reserving the
stolen range in phys_avail[].  This must be done as early as possible
to avoid conflicts with the potential usage of the memory in kernel.

Submitted by:	Johannes Lundberg <johalun0@gmail.com>
Reviewed by:	bwidawsk, imp
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D16719
Differential revision:	https://reviews.freebsd.org/D17775
2018-10-31 23:17:00 +00:00
Kyle Evans
be352d20d5 Compile in VERBOSE_SYSINIT support by default, remain silent by default
The loader tunable 'debug.verbose_sysinit' may be used to toggle verbosity.
This is added to the debugging section of these kernconfs to be turned off
in stable branches for clarity of intent.

MFC after:	never
2018-10-31 22:38:19 +00:00
Marcelo Araujo
ec9e3fb095 Merge cases with upper block.
This is a cosmetic change only to simplify code.

Reported by:	anish
Sponsored by:	iXsystems Inc.
2018-10-31 01:27:44 +00:00
Marcelo Araujo
5bae7542d4 Emulate machine check related MSR_EXTFEATURES to allow guest OSes to
boot on AMD FX Series.

PR:		224476
Submitted by:	Keita Uchida <m@jgz.jp>
Reviewed by:	rgrimes
Sponsored by:	iXsystems Inc.
Differential Revision:	https://reviews.freebsd.org/D17713
2018-10-30 10:02:23 +00:00
Konstantin Belousov
9775a6ebd2 amd64: Use ifuncs to select suitable implementation of set_pcb_flags().
There is no reason to check for PCB_FULL_IRET if FSGSBASE instructions
are not supported.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-10-29 23:52:31 +00:00
Konstantin Belousov
93177620ee Style.
Wrap long lines, use +4 spaces for continuation indent.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-10-29 23:45:17 +00:00
Yuri Pankov
8d56c80545 Provide basic descriptions for VMX exit reason (from "Intel 64 and IA-32
Architectures Software Developer’s Manual Volume 3").  Add the document
to SEE ALSO in bhyve.8 (and pet manlint here a bit).

Reviewed by:	jhb, rgrimes, 0mp
Approved by:	kib (mentor)
Differential Revision:	https://reviews.freebsd.org/D17531
2018-10-27 21:24:28 +00:00
Mateusz Guzik
099c6f6d45 amd64: finish the tail in memset with an overlapping store
Instead of finding the exact size to fit in we can just shift the target
by -8 + tail. Doing a blind write to a previously rep stosq'ed area comes
with a penalty so do it conditionally.

Sample win on EPYC when zeroing a 257 sized buffer (tail = 1) aligned to
16 bytes:
before: 44782846 ops/s
after:  46118614 ops/s

Idea stolen from NetBSD.

Sponsored by:	The FreeBSD Foundation
2018-10-22 06:44:20 +00:00
Warner Losh
6a18678249 Remove the ncr(4) drive.
This driver has been obsolete since the FreeBSD 4.x. It should have
been removed then since the sym(4) driver had subsumed it. The driver
was commented out of GENERIC in 2000.

RelNotes: Yes
2018-10-22 02:36:18 +00:00
Warner Losh
49a93324fe Remove stg(4) driver
stg(4) is marked as gone in 12. Remove it. There are no sightings of
it in the nycbug dmesg database. It was for an obscure SCSI card that
sold mostly in Japan, and was especially popilar among pc98 hackers in
the 4.x time frame. It was also only enabled on i386.

Relnote: Yes
2018-10-22 02:35:50 +00:00
Warner Losh
08204c2cc3 Remove nsp(4) driver
nsp(4) is marked as gone in 12. Remove it. There are no sightings of
it in the nycbug dmesg database. It was for an obscure SCSI card that
sold mostly in Japan, and was especially popilar among pc98 hackers in
the 4.x time frame. It was also only enabled on i386.

Relnote: Yes
2018-10-22 02:35:38 +00:00
Warner Losh
2dfd358865 Remove ncv(4) driver
ncv(4) is marked as gone in 12. Remove it. There are no sightings of
it in the nycbug dmesg database. It was for an obscure SCSI card that
sold mostly in Japan, and was especially popilar among pc98 hackers in
the 4.x time frame..

Relnote: Yes
2018-10-22 02:35:26 +00:00
Warner Losh
e9b5375b04 Retire dpt(4)
Marked as gone in 12 and not relevant since the early 90s. No
sightings in nycbug's dmesg database.

Relnotes: yes
2018-10-22 02:35:12 +00:00
Warner Losh
48ac1a9566 Remove the gone_in(12) devices.
We're planning on removing adv, adw, aha, aic, bt, ncv, nsp, and stg
soon. They have been tagged for removal in 12. At least get them out
of GENERIC.

MFC after: 3 days
Relnotes: yes
2018-10-22 02:28:18 +00:00
Mateusz Guzik
bbf3607b86 amd64: tidy up memset to have rax set earlier for small sizes 2018-10-21 10:46:00 +00:00
Konstantin Belousov
2dec2b4a34 amd64: flush L1 data cache on syscall return with an error.
The knob allows to select the flushing mode or turn it off/on.  The
idea, as well as the list of the ignored syscall errors, were taken
from https://www.openwall.com/lists/kernel-hardening/2018/10/11/10 .

I was not able to measure statistically significant difference between
flush enabled vs disabled using syscall_timing getuid.

Reviewed by:	bwidawsk
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D17536
2018-10-20 23:17:24 +00:00
Mark Johnston
36209a40d1 Add an assertion to pmap_enter().
When modifying an existing managed mapping, we should find a PV entry
for the old mapping.  Verify this.

Before r335784 this would have been implicitly tested by the fact that
we always freed the PV entry for the old mapping.

Reviewed by:	alc, kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D17626
2018-10-20 20:53:35 +00:00
Mateusz Guzik
c88205e76e amd64: relax constraints in curthread and curpcb
This makes the compiler less likely to reload the content from %gs.

The 'P' modifier drops all synteax prefixes and 'n' constraint treats
input as a known at compilation time immediate integer.

Example reloading victim was spinlock_enter.

Stolen from:	OpenBSD

Reported by:	jtl
Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17615
2018-10-20 17:00:18 +00:00
Konstantin Belousov
3ade944019 Do not flush cache for PCIe config window.
Apparently AMD machines cannot tolerate this. This was uncovered by
r339386, where cache flush started really flushing the requested range.

Introduce pmap_mapdev_pciecfg(), which simply does not flush cache
comparing with pmap_mapdev().  It assumes that the MCFG region was
never accessed through the cacheable mapping, which is most likely
true for machine to boot at all.

Note that i386 does not need the change, since the architecture
handles access per-page due to the KVA shortage, and page remapping
already does not flush the cache.

Reported and tested by:	mjg, Mike Tancsa <mike@sentex.net>
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
Approved by:	re (gjb)
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D17612
2018-10-18 20:49:16 +00:00
Konstantin Belousov
2fd0c8e7ca Provide pmap_large_map() KPI on amd64.
The KPI allows to map very large contigous physical memory regions
into KVA, which are not covered by DMAP.

I see both with QEMU and with some real hardware started shipping, the
regions for NVDIMMs might be very far apart from the normal RAM, and
we expect that at least initial users of NVDIMM could install very
large amount of such memory.  IMO it is not reasonable to extend DMAP
to cover that far-away regions both because it could overflow existing
4T window for DMAP in KVA, and because it costs in page table pages
allocations, for gap and for possibly unused NV RAM.

Also, KPI provides some special functionality for fast cache flushing
based on the knowledge of the NVRAM mapping use.

Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
Approved by:	re (gjb)
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D17070
2018-10-16 17:28:10 +00:00
Konstantin Belousov
9d5d89b209 Add clwb().
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
Approved by:	re (gjb)
MFC after:	3 days
Differential revision:	https://reviews.freebsd.org/D17070
2018-10-16 17:00:42 +00:00
John Baldwin
de679f6efa Reload the LDT selector after an AMD-v #VMEXIT.
cpu_switch() always reloads the LDT, so this can only affect the
hypervisor process itself.  Fix this by explicitly reloading the host
LDT selector after each #VMEXIT.  The stock bhyve process on FreeBSD
never uses a custom LDT, so this change is cosmetic.

Reviewed by:	kib
Tested by:	Mike Tancsa <mike@sentex.net>
Approved by:	re (gjb)
MFC after:	2 weeks
2018-10-15 18:12:25 +00:00
Mateusz Guzik
6816c88458 amd64: partially depessimize cpu_fetch_syscall_args and cpu_set_syscall_retval
Vast majority of syscalls take 6 or less arguments. Move handling of other
cases to a fallback function. Similarly, special casing for _syscall
and __syscall
magic syscalls is moved away.

Return is almost always 0. The change replaces 3 branches with 1 in the common
case. Also the 'frame' variable convinces clang not to reload it on each access.

Reviewed by:	kib
Approved by:	re (gjb)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17542
2018-10-13 21:18:31 +00:00
Eric Joyner
77c1fcec91 ixl/iavf(4): Change ixlv to iavf and update it to use iflib(9)
Finishes the conversion of the 40Gb Intel Ethernet drivers to iflib(9) for
FreeBSD 12.0, and fixes numerous bugs in both ixl(4) and iavf(4).

This commit also re-adds the VF driver to GENERIC since it now compiles and
functions.

The VF driver name was changed from ixlv(4) to iavf(4) because the VF driver is
now intended to be used with future products, not just with Fortville/Fort Park
VFs.

A man page update that documents these drivers is forthcoming in a separate
commit.

Reviewed by:    sbruno@, kbowling@
Tested by:      jeffrey.e.pieper@intel.com
Approved by:	re (gjb@)
Relnotes:       yes
Sponsored by:   Intel Corporation
Differential Revision: https://reviews.freebsd.org/D16429
2018-10-12 22:40:54 +00:00
Mateusz Guzik
3cf1291d2e amd64: employ MEMMOVE in copyin/copyout
See r339205 for justification.

Reviewed by:	kib
Approved by:	re (gjb)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17526
2018-10-12 21:59:09 +00:00
Konstantin Belousov
555225c062 Call initializecpucache() before ifuncs are resolved.
The function tweaks CPU capabilities based on the VM platform and
tunables, which affected selection of the cache flush method before
ifuncs were used, and should affect the cache flush in the same way
after ifunc.

PR:	232081
Reported by:	phk
Analyzed by:	avg
Sponsored by:	The FreeBSD Foundation
Approved by:	re (gjb)
2018-10-12 16:00:21 +00:00
Konstantin Belousov
78a3652794 bhyve: emulate CLFLUSH and CLFLUSHOPT.
Apparently CLFLUSH on mmio can cause VM exit, as reported in the PR.
I do not see that anything useful can be done except emulating page
faults on invalid addresses.

Due to the instruction encoding pecularity, also emulate SFENCE.

PR:	232081
Reported by:	phk
Reviewed by:	araujo, avg, jhb (all: previous version)
Sponsored by:	The FreeBSD Foundation
Approved by:	re (gjb)
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D17482
2018-10-12 15:30:15 +00:00
Mateusz Guzik
3c9a1d0493 amd64: make memmove and memcpy less slow with mov
The reasoning is the same as with the memset change, see r339205

Reviewed by:	kib (previous version)
Approved by:	re (gjb)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17441
2018-10-11 23:37:57 +00:00
Mateusz Guzik
3f102f5881 Provide string functions for use before ifuncs get resolved.
The change is a no-op for architectures which don't ifunc memset,
memcpy nor memmove.

Convert places which need them. Xen bits by royger.

Reviewed by:	kib
Approved by:	re (gjb)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17487
2018-10-11 23:28:04 +00:00
John Baldwin
b843f9be5e Fully restore the GDTR, IDTR, and LDTR after VT-x VM exits.
The VT-x VMCS only stores the base address of the GDTR and IDTR.  As a
result, VM exits use a fixed limit of 0xffff for the host GDTR and
IDTR losing the smaller limits set in when the initial GDT is loaded
on each CPU during boot.  Explicitly save and restore the full GDTR
and IDTR contents around VM entries and exits to restore the correct
limit.

Similarly, explicitly save and restore the LDT selector.  VM exits
always clear the host LDTR as if the LDT was loaded with a NULL
selector and a userspace hypervisor is probably using a NULL selector
anyway, but save and restore the LDT explicitly just to be safe.

PR:		230773
Reported by:	John Levon <levon@movementarian.org>
Reviewed by:	kib
Tested by:	araujo
Approved by:	re (rgrimes)
MFC after:	1 week
2018-10-11 18:27:19 +00:00
Brooks Davis
c7d0908e1c Regenerated assorted syscall related files after:
- r327895: Implement 'domainset'...
 - r329876: Use linux types for linux-specific syscalls

Diff generated with:
	find . -name syscalls.conf | xargs dirname | \
	    xargs -n1 -I DIR make -C DIR sysent

Approved by:	re (kib)
Sponsored by:	DARPA, AFRL
2018-10-09 20:42:17 +00:00
Michael Tuexen
6b45121a6d Address the warning regarding duplicate option 'GEOM_PART_GPT' when
configuring kernels for i386, amd64, and arm64.
The 'GEOM_PART_GPT' option was added to the DEFAULTS configuration
in r337967.

Approved by:		re (kib@)
Reviewed by:		ler@
Differential Revision:	https://reviews.freebsd.org/D17458
Sponsored by:		Netflix, Inc.
2018-10-07 15:54:13 +00:00
Mateusz Guzik
97bb9a0818 amd64: make memset less slow with mov
rep stos has a high startup time even on modern microarchitectures like
Skylake. Intel optimization manuals discuss how for small sizes it is
beneficial to go for streaming stores. Since those cannot be used without
extra penalty in the kernel I investigated performance impact of just
regular movs.

The patch below implements a very simple scheme: a 32-byte loop followed
by filling in the remainder of at most 31 bytes. It has a 256 breaking
point on which it falls back to rep stos. It provides a significant win
over the current primitive on several machines I tested (both Intel and
AMD). A 64-byte loop did not provide any benefit even for multiple of 64
sizes.

See the review for benchmark data.

Reviewed by:	kib
Approved by:	re (gjb)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17398
2018-10-05 19:25:09 +00:00
Mateusz Guzik
9657b80ce7 amd64: hide non-erms jump label under non-erms copyin/copyout
This change is a no-op in terms of semantics, but has a side effect
of removing a perfectly useless nop sled for CPUs with ERMS.

Approved by:	re (gjb)
Sponsored by:   The FreeBSD Foundation
2018-10-04 20:01:48 +00:00
Mark Johnston
cb4961abc1 Apply r339046 to i386.
Belatedly add a comment to the amd64 pmap explaining why we initialize
the kernel pmap's resident page count.

Reviewed by:	alc, kib
Approved by:	re (gjb)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17377
2018-10-01 18:48:33 +00:00
Mark Johnston
c6c770d041 Count bootstrap data as resident in the kernel pmap.
Such data may later be unmapped.  This occurs, for example, when a
loader-provided microcode update file is discarded.

Reviewed by:	alc, kib
Approved by:	re (gjb)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17340
2018-10-01 14:47:49 +00:00
Konstantin Belousov
f76bec2a28 Revert part of the r338891 which reordered local invalidation and IPI.
For PCID case, there is a dependency between pm_gen zeroing and
reading pm_active for IPI target selection, to ensure that the
invalidation is not missed.

Reported and tested by:	mjg
Sponsored by:	The FreeBSD Foundation
Approved by:	re (gjb)
2018-09-28 14:08:20 +00:00
Mateusz Guzik
825eeb55f4 amd64: fix return value of copyinstr after r338970
The function stopped swapping rdi and rsi, but the error handling
code was not updated with the new register name.

Approved by:	re (implicit)
Sponsored by:	The FreeBSD Foundation
2018-09-27 20:48:07 +00:00
John Baldwin
83382d027f Don't clear DR6 for debug exceptions from userland.
This reverts part of r333368.  The attempt to clear DR6 was occuring
too soon as trapsignal() does not pause to let the debugger notice the
SIGTRAP and query DR6.  The signal exchange does not occur until much
later during ast().  As a result, GDB was no longer recognizing
hardware breakpoints and watchpoints on x86.

In addition, any userland programs that want to inspect DR6 in a
SIGTRAP handler don't have a way to do this if we clear DR6 in the
exception handler.

Instead of relying on the kernel to clear DR6, debuggers will have to
explicitly clear it after a trace trap (which they needed to do on
older kernels anyway).

Reviewed by:	kib
Approved by:	re (delphij)
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D17319
2018-09-27 17:33:59 +00:00
Mateusz Guzik
5910b87605 amd64: macroify and mostly depessimize copyinstr
See r338968 for details.

Reviewed by:	kib
Approved by:	re (gjb)
Differential Revision:	https://reviews.freebsd.org/D17288
2018-09-27 15:53:36 +00:00
Mateusz Guzik
3d95cc51bb amd64: mostly depessimize copystr
- remove a forward branch in the common case
- replace xchg + lodsb/stosb loop with simple movs

A simple test on Intel(R) Core(TM) i7-4600U CPU @ 2.10GH copying
/foo/bar/baz in a loop goes from 295715863 ops/s to 465807408.

Further changes are pending.

Reviewed by:	kib
Approved by:	re (gjb)
Differential Revision:	https://reviews.freebsd.org/D17281
2018-09-27 15:27:53 +00:00
Mateusz Guzik
0e59ecce47 amd64: clean up copyin/copyout
- move the PSL.AC comment to the fault handler
- stop testing for zero-sized ops. after several minutes of package
building there were no copyin calls with zero bytes and very few
copyout. the semantic of returning 0 in this case is preserved
- shorten exit paths by clearing %eax earlier
- replace xchg with 3 movs. this is what compilers do. a naive
benchmark on EPYC suggests about 1% increase in thoughput thanks to
this change.
- remove the useless movb %cl,%al from copyout. it looks like a
leftover from many years ago

Reviewed by:	kib
Approved by:	re (gjb)
Differential Revision:	https://reviews.freebsd.org/D17286
2018-09-27 15:24:16 +00:00
Mateusz Guzik
a8e3f99ec1 amd64: implement memcmp in assembly
Both the in-kernel C variant and libc asm variant have very poor performance.
The former compiles to a single byte comparison loop, which breaks down even
for small sizes. The latter uses rep cmpsq/b which turn out to have very poor
throughput and are slower than a hand-coded 32-byte comparison loop.

Depending on size this is about 3-4 times faster than the current routines.

Reviewed by:	kib
Approved by:	re (gjb)
Differential Revision:	https://reviews.freebsd.org/D17328
2018-09-27 14:05:44 +00:00
Andrew Turner
27d2645787 Handle a guest executing a vm instruction by trapping and raising an
undefined instruction exception. Previously we would exit the guest,
however an unprivileged user could execute these.

Found with:	syzkaller
Reviewed by:	araujo, tychon (previous version)
Approved by:	re (kib)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D17192
2018-09-27 11:16:19 +00:00
Konstantin Belousov
05e1cca97a Fix some uses of dmaplimit.
dmaplimit is the first byte after the end of DMAP.

Reported by:	"Johnson, Archna" <Archna.Johnson@netapp.com>
Reviewed by:	alc, markj
Approved by:	re (gjb)
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D17318
2018-09-25 20:07:58 +00:00
Konstantin Belousov
cbe100dfca Fix an issue in r338862.
For pmap_invalidate_all_pcid(), only reset pm_gen for non-kernel
pmaps, as it was done before the conversion to ifuncs.  The reset is
useless but innocent for kernel_pmap. Coverity reported that cpuid is
used uninitialized in this case.

Reported by:	cem
Reviewed by:	alc, cem, markj
CID:	 1395807
Sponsored by:	The FreeBSD Foundation
Approved by:	re (gjb)
Differential revision:	https://reviews.freebsd.org/D17314
2018-09-25 18:24:25 +00:00
Konstantin Belousov
108ff63e8a Further reorganize pmap_invalidate TLB code.
Split calculation of mask for shootdown IPI and local
invalidation. Reorder IPI before local.

Suggested by:	alc
Reviewed by:	alc, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Approved by:	re (rgrimes)
Differential revision:	https://reviews.freebsd.org/D17277
2018-09-22 17:04:39 +00:00
Mark Johnston
6cdde6fda2 Use the GNU as-compatible .endm instead of .endmacro.
Approved by:	re (gjb)
2018-09-21 20:20:03 +00:00
Konstantin Belousov
d995b5b1ea Convert x86 TLB top-level invalidation functions to ifuncs.
Note that shootdown IPI handlers are already per-mode.

Suggested by:	alc
Reviewed by:	alc, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Approved by:	re (gjb)
Differential revision:	https://reviews.freebsd.org/D17184
2018-09-21 17:53:06 +00:00
Mateusz Guzik
68c7542edf amd64: even up copyin/copyout with memcpy + other cleanup
- _fault handlers for both primitives are identical, provide just one
- change the copying scheme to match memcpy (in particular jump
avoidance for the most common case of multiply of 8)
- stop re-reading pcb address on exit, just store it locally (in r9)

Reviewed by:	kib
Approved by:	re (gjb)
Differential Revision:	https://reviews.freebsd.org/D17265
2018-09-21 15:00:46 +00:00
Mateusz Guzik
bb32268086 amd64: check for small size in memmove, memcpy and memset
If the size is 15 bytes or less avoid spinning up rep just to copy the 8
bytes. In my tests on EPYC and old Intel microarchs without ERMS (like
Westmere) it provided a nice win over the current version (e.g. for EPYC
memset with 15 bytes of size goes from 59712651 ops/s to 70600095) all
while almost not pessimizing the other cases.

Data collected during package building shows that < 16 sizes are pretty
common.

Verified with the glibc test suite.

Approved by:	re (kib)
2018-09-21 12:27:36 +00:00
Mateusz Guzik
5254f065ab amd64: macroify copyin/copyout and provide erms variants, follow up
Fix a fat-fingered typo with a "funny" side-effect: when doing copyin on a
cpu without ERMS and with size being a multiply of 8 a page fault would be
triggered resulting in EFAULT.

Pointy hat: mjg
Approved by:	re (implicit)
2018-09-20 20:32:08 +00:00
Mateusz Guzik
4bf0035fd1 amd64: macroify copyin/copyout and provide erms variants
Reviewed by:	kib
Approved by:	re (gjb)
Differential Revision:	https://reviews.freebsd.org/D17257
2018-09-20 18:30:17 +00:00
Mateusz Guzik
a286a3099c amd64: move fusufault after all users
A lot of function have the following check:
        cmpq    %rax,%rdi                       /* verify address is valid */
        ja      fusufault

The label is present earlier in kernel .text, which means this is a jump
backwards. Absent any information in branch predictor, the cpu predicts it
as taken. Since it is almost never taken in practice, this results in a
completely avoidable misprediction.

Move it past all consumers, so that it is predicted as not taken.

Approved by:	re (kib)
2018-09-20 13:29:43 +00:00
Konstantin Belousov
d12c446550 Convert x86 cache invalidation functions to ifuncs.
This simplifies the runtime logic and reduces the number of
runtime-constant branches.

Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
Approved by:	re (gjb)
Differential revision:	https://reviews.freebsd.org/D16736
2018-09-19 19:35:02 +00:00
Konstantin Belousov
215aa93033 amd64 pmap: remove tautological assert.
pm_pcid is unsigned.

Reviewed by:	cem, markj
CID:	1395727
Noted by:	cem
Sponsored by:	The FreeBSD Foundation
Approved by:	re (gjb)
MFC after:	3 days
Differential revision:	https://reviews.freebsd.org/D17235
2018-09-19 15:39:16 +00:00
Konstantin Belousov
3c022be2ca Use ifunc to resolve context switching mode on amd64.
Patch removes all checks for pti/pcid/invpcid from the context switch
path. I verified this by looking at the generated code, compiling with
the in-tree clang.  The invpcid_works1 trick required inline attribute
for pmap_activate_sw_pcid_pti() to work.

Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
Approved by:	re (gjb)
Differential revision:	https://reviews.freebsd.org/D17181
2018-09-17 15:52:19 +00:00
Mateusz Guzik
d6943c5804 amd64: tidy up kernel memmove, take 2
There is no need to use %rax for temporary values and avoiding doing
so shortens the func.
Handle the explicit 'check for tail' depessimisization for backwards copying.

This reduces the diff against userspace.

Tested with the glibc test suite.

Approved by:	re (kib)
2018-09-17 15:51:49 +00:00
Konstantin Belousov
09a6ada991 Calculate PTI, PCID and INVPCID modes earlier, before ifuncs are resolved.
This will be used in following conversion of pmap_activate_sw().

Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
Approved by:	re (gjb)
Differential revision:	https://reviews.freebsd.org/D17181
2018-09-17 15:34:19 +00:00
Konstantin Belousov
76ed0c542f Make the PTI violation check to follow style of the SMAP check.
No functional changes.

Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (rgrimes)
Differential revision:	https://reviews.freebsd.org/D17181
2018-09-17 14:59:05 +00:00
Mateusz Guzik
9d1b868da0 Revert amd64: tidy up kernel memmove
There is a braino in the non-erms variant which breaks the
functionality.

Will be fixed at a later time with a different patch.

Reported by:	Manfred Antar
Approved by:	re (implicit)
2018-09-16 21:46:27 +00:00
Mateusz Guzik
17f67f63b9 amd64: tidy up kernel memmove
There is no need to use %rax for temporary values and avoiding doing
so shortens the func.
Handle the explicit 'check for tail' depessimisization for backwards copying.

This reduces the diff against userspace.

Approved by:	re (kib)
2018-09-16 19:28:27 +00:00
Konstantin Belousov
bd6c14afa7 Remove unneeded new line from the panic string.
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (rgrimes)
Differential revision:	https://reviews.freebsd.org/D17181
2018-09-16 18:36:42 +00:00
Mateusz Guzik
c51b7ab9e3 amd64: implement pagezero_erms
Intel docs claim such a memset (rep stosb + 4096 bytes) is
special-cased by microarchs. They also switched Linux to use
it for this purpose.

Approved by:	re (gjb)
2018-09-14 15:29:35 +00:00
Mateusz Guzik
13ea074dc3 amd64: implement ERMS-based memmove, memcpy and memset
Reviewed by:	kib
Approved by:	re (gjb)
Differential Revision:	https://reviews.freebsd.org/D17124
2018-09-13 14:53:51 +00:00
Mateusz Guzik
e382dd47aa amd64: enable options NUMA in GENERIC and MINIMAL
Reviewed by:	gallatin, cem, scottl
Approved by:	re (kib)
Relnotes:	yes
Sponsored by:	Dell EMC Isilon, Netflix
Differential Revision:	https://reviews.freebsd.org/D17059
2018-09-11 23:54:31 +00:00
Mateusz Guzik
12360b3079 amd64: depessimize copyinstr_smap
The stac/clac combo around each byte copy is causing a measurable
slowdown in benchmarks. Do it only before and after all data is
copied. While here reorder the code to avoid a forward branch in
the common case.

Note the copying loop (originating from copyinstr) is avoidably slow
and will be fixed later.

Reviewed by:	kib
Approved by:	re (gjb)
Differential Revision:	https://reviews.freebsd.org/D17063
2018-09-06 19:42:40 +00:00
Konstantin Belousov
20df4f456d amd64: Properly re-merge r334537 into SMAP-ified copyin(9) and copyout(9).
Also this fixes the eflags.ac leak from copyin_smap() when the copied
data length is multiple of eight bytes.

Sponsored by:	The FreeBSD Foundation
Approved by:	re (gjb)
2018-09-04 19:27:53 +00:00
Konstantin Belousov
e21c5abc2a amd64: For non-PTI mode, do not initialize PCPU kcr3 to KPML4phys.
Non-PTI mode does not switch kcr3, which means that kcr3 is almost
always stale.  This is important for the NMI handler, which reloads
%cr3 with PCPU(kcr3) if the value is different from PMAP_NO_CR3.

The end result is that curpmap in NMI handler does not match the page
table loaded into hardware.  The manifestation was copyin(9) looping
forever when a usermode access page fault cannot be resolved by
vm_fault() updating a different page table.

Reported by:	mmacy
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
Approved by:	re (gjb)
2018-09-04 19:26:54 +00:00
Konstantin Belousov
50cd0be78f Catch exceptions during EFI RT calls on amd64.
This appeared to be required to have EFI RT support and EFI RTC
enabled by default, because there are too many reports of faulting
calls on many different machines.  The knob is added to leave the
exceptions unhandled to allow to debug the actual bugs.

Reviewed by:	kevans
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:    re (rgrimes)
Differential revision:	https://reviews.freebsd.org/D16972
2018-09-02 21:37:05 +00:00
Konstantin Belousov
1565fb29a7 Add amd64 mdthread fields needed for the upcoming EFI RT exception
handling.

This is split into a separate commit from the main change to make it
easier to handle possible revert after upcoming KBI freeze.

Reviewed by:	kevans
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:    re (rgrimes)
Differential revision:	https://reviews.freebsd.org/D16972
2018-09-02 21:16:43 +00:00
Konstantin Belousov
9eb958988a Swap order of dererencing PCPU curpmap and checking for usermode in
trap_pfault() KPTI violation check.

EFI RT may set curpmap to NULL for the duration of the call for some
machines (PCID but no INVPCID).  Since apparently EFI RT code must be
ready for exceptions from the calls, avoid dereferencing curpmap until
we know that this call does not come from usermode.

Reviewed by:	kevans
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:    re (rgrimes)
Differential revision:	https://reviews.freebsd.org/D16972
2018-09-02 20:07:36 +00:00
Konstantin Belousov
d4be3789fe Normalize use of semicolon with EFI_TIME_LOCK macros.
Reviewed by:	kevans
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:    re (rgrimes)
Differential revision:	https://reviews.freebsd.org/D16972
2018-09-02 19:48:41 +00:00
Konstantin Belousov
f0165b1ca6 Remove {max/min}_offset() macros, use vm_map_{max/min}() inlines.
Exposing max_offset and min_offset defines in public headers is
causing clashes with variable names, for example when building QEMU.

Based on the submission by:	royger
Reviewed by:	alc, markj (previous version)
Sponsored by:	The FreeBSD Foundation (kib)
MFC after:	1 week
Approved by:	re (marius)
Differential revision:	https://reviews.freebsd.org/D16881
2018-08-29 12:24:19 +00:00
Konstantin Belousov
d367236183 Several bug fixes and robustness improvements for the AP boot page
table allocation.

At the time that mp_bootaddress() is called, phys_avail[] array does
not reflect some memory reservations already done, like kernel
placement.  Recent changes to DMAP protection which make kernel text
read-only in DMAP revealed this, where on some machines AP boot page
tables selection appears to intersect with the kernel itself.

Fix this by checking the addresses selected using the same algorithm
as bootaddr_rwx().  Also, try to chomp pages for the page table not
only at the start of the contiguous range, but also at the end.  This
should improve robustness when the only suitable range is already
consumed by the kernel.

Reported and tested by: Michael Gmelin <freebsd@grem.de>
Reviewed by:    jhb
MFC after:      1 week
Sponsored by:   The FreeBSD Foundation
Approved by:    re (gjb)
Differential revision:  https://reviews.freebsd.org/D16907
2018-08-28 18:47:02 +00:00
Alan Cox
49bfa624ac Eliminate the arena parameter to kmem_free(). Implicitly this corrects an
error in the function hypercall_memfree(), where the wrong arena was being
passed to kmem_free().

Introduce a per-page flag, VPO_KMEM_EXEC, to mark physical pages that are
mapped in kmem with execute permissions.  Use this flag to determine which
arena the kmem virtual addresses are returned to.

Eliminate UMA_SLAB_KRWX.  The introduction of VPO_KMEM_EXEC makes it
redundant.

Update the nearby comment for UMA_SLAB_KERNEL.

Reviewed by:	kib, markj
Discussed with:	jeff
Approved by:	re (marius)
Differential Revision:	https://reviews.freebsd.org/D16845
2018-08-25 19:38:08 +00:00
Konstantin Belousov
60b7423434 Unify amd64 and i386 vmspace0 pmap activation.
Add pmap_activate_boot() for i386, move the invocation on APs from MD
init_secondary() to x86 init_secondary_tail().

Suggested by:	alc
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
Approved by:	re (marius)
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D16893
2018-08-25 15:21:28 +00:00
Warner Losh
592ffb2175 Revert drm2 removal.
Revert r338177, r338176, r338175, r338174, r338172

After long consultations with re@, core members and mmacy, revert
these changes. Followup changes will be made to mark them as
deprecated and prent a message about where to find the up-to-date
driver.  Followup commits will be made to make this clear in the
installer. Followup commits to reduce POLA in ways we're still
exploring.

It's anticipated that after the freeze, this will be removed in
13-current (with the residual of the drm2 code copied to
sys/arm/dev/drm2 for the TEGRA port's use w/o the intel or
radeon drivers).

Due to the impending freeze, there was no formal core vote for
this. I've been talking to different core members all day, as well as
Matt Macey and Glen Barber. Nobody is completely happy, all are
grudgingly going along with this. Work is in progress to mitigate
the negative effects as much as possible.

Requested by: re@ (gjb, rgrimes)
2018-08-24 00:02:00 +00:00
Mark Johnston
36716fe2e6 Prepare the kernel linker to handle PC-relative ifunc relocations.
The boot-time ifunc resolver assumes that it only needs to apply
IRELATIVE relocations to PLT entries.  With an upcoming optimization,
this assumption no longer holds, so add the support required to handle
PC-relative relocations targeting GNU_IFUNC symbols.
- Provide a custom symbol lookup routine that can be used in early boot.
  The default lookup routine uses kobj, which is not functional at that
  point.
- Apply all existing relocations during boot rather than filtering
  IRELATIVE relocations.
- Ensure that we continue to apply ifunc relocations in a second pass
  when loading a kernel module.

Reviewed by:	kib
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D16749
2018-08-22 20:44:30 +00:00
Konstantin Belousov
614a9ce31a Skip PMAP_PCID_KERN + 1 PCPU pcid_next value on APs as well.
r337838 did it for BSP.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2018-08-22 14:58:52 +00:00
Matt Macy
d157fbd5b4 Remove legacy drm and drm2 from tree
As discussed on the MLs drm2 conflicts with the ports' version and there
is no upstream for most if not all of drm. Both have been merged in to
a single port.

Users on powerpc, 32-bit hardware, or with GPUs predating Radeon
and i915 will need to install the graphics/drm-legacy-kmod. All
other users should be able to use one of the LinuxKPI-based ports:
graphics/drm-stable-kmod, graphics/drm-next-kmod, graphics/drm-devel-kmod.

MFC: never
Approved by: core@
2018-08-22 01:50:12 +00:00
Alan Cox
83a90bffd8 Eliminate kmem_malloc()'s unused arena parameter. (The arena parameter
became unused in FreeBSD 12.x as a side-effect of the NUMA-related
changes.)

Reviewed by:	kib, markj
Discussed with:	jeff, re@
Differential Revision:	https://reviews.freebsd.org/D16825
2018-08-21 16:43:46 +00:00
Konstantin Belousov
a997bcc015 Update comment about ABI of flush_l1s_sw to match the reality.
CPUID instruction clobbers %rbx and %rdx.

Sponsored by:	The FreeBSD Foundation
MFC after:	13 days
2018-08-20 19:09:39 +00:00
Konstantin Belousov
b0568ddbec Always initialize PCPU kcr3 for vmspace0 pmap.
If an exception or NMI occurs before CPU switched to a pmap different
from vmspace0, PCPU kcr3 is left zero for pti config, which causes
triple-fault in the handler.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2018-08-20 19:07:57 +00:00
John Baldwin
a800b45c18 Merge amd64 and i386 <machine/intr_machdep.h> headers.
Reviewed by:	kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D16803
2018-08-20 12:31:39 +00:00
Konstantin Belousov
c1141fba00 Update L1TF workaround to sustain L1D pollution from NMI.
Current mitigation for L1TF in bhyve flushes L1D either by an explicit
WRMSR command, or by software reading enough uninteresting data to
fully populate all lines of L1D.  If NMI occurs after either of
methods is completed, but before VM entry, L1D becomes polluted with
the cache lines touched by NMI handlers.  There is no interesting data
which NMI accesses, but something sensitive might be co-located on the
same cache line, and then L1TF exposes that to a rogue guest.

Use VM entry MSR load list to ensure atomicity of L1D cache and VM
entry if updated microcode was loaded.  If only software flush method
is available, try to help the bhyve sw flusher by also flushing L1D on
NMI exit to kernel mode.

Suggested by and discussed with: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D16790
2018-08-19 18:47:16 +00:00
John Baldwin
8cd385fda0 Make 'device crypto' lines more consistent.
- In configurations with a pseudo devices section, move 'device crypto'
  into that section.
- Use a consistent comment.  Note that other things common in kernel
  configs such as GELI also require 'device crypto', not just IPSEC.

Reviewed by:	rgrimes, cem, imp
Differential Revision:	https://reviews.freebsd.org/D16775
2018-08-18 20:32:08 +00:00
Warner Losh
62ee5bbd73 GPT is standard in x86 and arm64 land. Add it to DEFAULTS with the
others.

Differential Revision: https://reviews.freebsd.org/D16740
2018-08-17 14:47:21 +00:00
Konstantin Belousov
54564eda77 Fix early EFIRT on PCID machines after r337773.
Ensure that the valid PCID state is created for proc0 pmap, since it
might be used by efirt enter() before first context switch on the BSP.

Sponsored by:	The FreeBSD Foundation
MFC after:	6 days
2018-08-15 12:48:49 +00:00
Konstantin Belousov
c30578feeb Provide part of the mitigation for L1TF-VMM.
On the guest entry in bhyve, flush L1 data cache, using either L1D
flush command MSR if available, or by reading enough uninteresting
data to fill whole cache.

Flush is automatically enabled on CPUs which do not report RDCL_NO,
and can be disabled with the hw.vmm.l1d_flush tunable/kenv.

Security:	CVE-2018-3646
Reviewed by:	emaste. jhb, Tony Luck <tony.luck@intel.com>
Sponsored by:	The FreeBSD Foundation
2018-08-14 17:29:41 +00:00
Konstantin Belousov
9840c7373c Reserve page at the physical address zero on amd64.
We always zero the invalidated PTE/PDE for superpage, which means that
L1TF CPU vulnerability (CVE-2018-3620) can be only used for reading
from the page at zero.

Note that both i386 and amd64 exclude the page from phys_avail[]
array, so this change is redundant, but I think that phys_avail[] on
UEFI-boot does not need to do that.  Eventually the blacklisting
should be made conditional on CPUs which report that they are not
vulnerable to L1TF.

Reviewed by:	emaste. jhb
Sponsored by:	The FreeBSD Foundation
2018-08-14 17:14:33 +00:00
Konstantin Belousov
8fba5348fc amd64: ensure that curproc->p_vmspace pmap always matches PCPU
curpmap.

When performing context switch on a machine without PCID, if current
%cr3 equals to the new pmap %cr3, which is typical for kernel_pmap
vs. kernel process, I overlooked to update PCPU curpmap value.  Remove
check for %cr3 not equal to pm_cr3 for doing the update.  It is
believed that this case cannot happen at all, due to other changes in
this revision.

Also, do not set the very first curpmap to kernel_pmap, it should be
vmspace0 pmap instead to match curproc.

Move the common code to activate the initial pmap both on BSP and APs
into pmap_activate_boot() helper.

Reported by: eadler, ambrisko
Discussed with: kevans
Reviewed by:	alc, markj (previous version)
Tested by: ambrisko (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D16618
2018-08-14 16:37:14 +00:00
Konstantin Belousov
ef52dc71eb Fix typo.
Noted by:	alc
MFC after:	3 days
2018-08-14 16:27:17 +00:00
Mark Johnston
97edfc1b45 Implement kernel support for early loading of Intel microcode updates.
Updates in the format described in section 9.11 of the Intel SDM can
now be applied as one of the first steps in booting the kernel.  Updates
that are loaded this way are automatically re-applied upon exit from
ACPI sleep states, in contrast with the existing cpucontrol(8)-based
method.  For the time being only Intel updates are supported.

Microcode update files are passed to the kernel via loader(8).  The
file type must be "cpu_microcode" in order for the file to be recognized
as a candidate microcode update.  Updates for multiple CPU types may be
concatenated together into a single file, in which case the kernel
will select and apply a matching update.  Memory used to store the
update file will be freed back to the system once the update is applied,
so this approach will not consume more memory than required.

Reviewed by:	kib
MFC after:	6 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D16370
2018-08-13 17:13:09 +00:00
Konstantin Belousov
cb0eecdf92 Futex support functions in linux.ko and linux32.ko on amd64 should be
aware of SMAP.

Reported and tested by:	Johannes Lundberg <johalun0@gmail.com>, wulf
Sponsored by:	The FreeBSD Foundation
2018-08-07 18:29:10 +00:00
Kyle Evans
3395e43a04 efirt: Don't enter EFI context early, convert addrs to KVA instead
efi_enter here was needed because efi_runtime dereference causes a fault
outside of EFI context, due to runtime table living in runtime service
space. This may cause problems early in boot, though, so instead access it
by converting paddr to KVA for access.

While here, remove the other direct PHYS_TO_DMAP calls and the explicit DMAP
requirement from efidev.

Reviewed by:	kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D16591
2018-08-04 21:41:10 +00:00
Konstantin Belousov
54c531cacd Add END()s for amd64 linux futex support routines.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-08-04 13:57:50 +00:00
Konstantin Belousov
35efb3b1de Fix typo in copyinstr_smap, resulting in mis-handling of too long strings.
Reported and tested by:	pho
PR:	230286
Sponsored by:	The FreeBSD Foundation
2018-08-03 15:35:29 +00:00
Konstantin Belousov
e45b89d23d Add pmap_is_valid_memattr(9).
Discussed with:	alc
Sponsored by:	The FreeBSD Foundation, Mellanox Technologies
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D15583
2018-08-01 18:45:51 +00:00
Mark Johnston
8a5efe3601 Make sure that ENTRY() and END() refer to the same symbol.
X-MFC with:	r336876
2018-08-01 15:50:42 +00:00
Marcelo Araujo
be963beee6 - Add the ability to run bhyve(8) within a jail(8).
This patch adds a new sysctl(8) knob "security.jail.vmm_allowed",
by default this option is disable.

Submitted by:	Shawn Webb <shawn.webb____hardenedbsd.org>
Reviewed by:	jamie@ and myself.
Relnotes:	Yes.
Sponsored by:	HardenedBSD and G2, Inc.
Differential Revision:	https://reviews.freebsd.org/D16057
2018-08-01 00:39:21 +00:00
Mark Johnston
40fd44953c COMPAT_LINUX32 has not depended on COMPAT_43 in some time.
MFC after:	3 days
2018-07-31 21:40:13 +00:00
Kyle Evans
164138e7d8 amd64/GENERIC: Enable EFIRT by default
As noted in UDPATING, the new loader tunable efi.rt_disabled may be used to
disable EFIRT at runtime. It should have no effect if you are not booted via
UEFI boot.

MFC after:	6 weeks
2018-07-30 17:54:18 +00:00
Konstantin Belousov
8e36389535 Remove unneeded CLDs instructions in the SMAP-ed version of several
functions from support.S.

I believe they re-appeared due to me mis-merging my r327820 into the
topic branch.

Sponsored by:	The FreeBSD Foundation
2018-07-30 16:54:51 +00:00
Konstantin Belousov
b3a7db3b06 Use SMAP on amd64.
Ifuncs selectors dispatch copyin(9) family to the suitable variant, to
set rflags.AC around userspace access.  Rflags.AC bit is cleared in
all kernel entry points unconditionally even on machines not
supporting SMAP.

Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D13838
2018-07-29 20:47:00 +00:00
Warner Losh
67d33338c0 Rename VM_FREELIST_ISADMA to VM_FREELIST_LOWMEM.
There's no differene between VM_FREELIST_ISADMA and VM_FREELIST_LOWMEM
except for the default boundary (16MB on x86 and 256MB on MIPS, but
they are otherwise the same). We don't need both for any system we
support (there were some really old ARC systems that did have ISA/EISA
bus, but we never ran on them and they are too old to ever grow
support for).

Differential Review: https://reviews.freebsd.org/D16290
2018-07-27 18:34:20 +00:00
Mark Johnston
6c85795a25 Fix handling of KVA in kmem_bootstrap_free().
Do not use vm_map_remove() to release KVA back to the system.  Because
kernel map entries do not have an associated VM object, with r336030
the vm_map_remove() call will not update the kernel page tables.  Avoid
relying on the vm_map layer and instead update the pmap and release KVA
to the kernel arena directly in kmem_bootstrap_free().

Because the pmap updates will generally result in superpage demotions,
modify pmap_init() to insert PTPs shadowed by superpage mappings into
the kernel pmap's radix tree.

While here, port r329171 to i386.

Reported by:	alc
Reviewed by:	alc, kib
X-MFC with:	r336505
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D16426
2018-07-27 15:46:34 +00:00
Konstantin Belousov
45ed991d96 On amd64, enable workarounds for several Ryzen erratas as described in
the AMD document 55449 'Revision Guide for AMD Family 17h Models
00h-0Fh Processors' rev 1.12.

The errata numbers are mentioned near each action.

It seems that newer BIOSes already include required chicken bits
settings, so the magic MSR updates are only needed when BIOS cannot be
updated.  On the other hand, MWAIT avoidance seems to be important.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-07-27 15:31:20 +00:00
Konstantin Belousov
41bed185c1 Extend ranges of the critical sections to ensure that context switch
code never sees FPU pcb flags not consistent with the hardware state.

This is uncovered by the eager FPU switch mode.

Analyzed, reviewed and tested by:	gleb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-07-24 19:22:52 +00:00
Mark Johnston
483f692ea6 Have preload_delete_name() free pages backing preloaded data.
On i386 and amd64, add a vm_phys segment for physical memory used to
store the kernel binary and other preloaded data.  This makes it
possible to free such memory back to the system once it is no longer
needed, e.g., when a preloaded kernel module is unloaded.  Previously,
it would have remained unused.

Reviewed by:	kib, royger
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D16330
2018-07-19 20:00:28 +00:00
Roger Pau Monné
b0663c33c2 xen: implement early init helper for PVHv2
In order to setup an initial environment and jump into the generic
hammer_time initialization function. Some of the code is shared with
PVHv1, while other code is PVHv2 specific.

This allows booting FreeBSD as a PVHv2 DomU and Dom0.

Sponsored by:	Citrix Systems R&D
2018-07-19 08:44:52 +00:00
Roger Pau Monné
f2577f25c1 xen: add PVHv2 entry point
The PVHv2 entry point is fairly similar to the multiboot1 one. The
kernel is started in protected mode with paging disabled. More
information about the exact BSP state can be found in the pvh.markdown
document on the Xen tree.

This entry point is going to be joined with the native entry point at
hammer_time, and in order to do so the BSP needs to be bootstrapped
into long mode with the same set of page tables as used on bare metal.

Sponsored by:	Citrix Systems R&D
2018-07-19 07:39:35 +00:00
Konstantin Belousov
53dec71d39 Expand x86 struct pcpus to UMA_PCPU_ALLOC_SIZE AKA PAGE_SIZE.
This restores counters(9) operation.
Revert r336024. Improve assert of pcpu size on x86.

Reviewed by:	mmacy
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D16163
2018-07-06 19:50:44 +00:00
Konstantin Belousov
fb0a281196 Revert to recommit with the proper message. 2018-07-06 19:50:25 +00:00
Konstantin Belousov
1614716655 Save a call to pmap_remove() if entry cannot have any pages mapped.
Due to the way rtld creates mappings for the shared objects, each dso
causes unmap of at least three guard map entries.  For instance, in
the buildworld load, this change reduces the amount of pmap_remove()
calls by 1/5.

Profiled by:	alc
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D16148
2018-07-06 19:48:47 +00:00
Hans Petter Selasky
a7a7f5b472 Make sure kernel modules built by default are portable between UP and
SMP systems by extending defined(SMP) to include defined(KLD_MODULE).

This is a regression issue after r335873 .

Discussed with:		mmacy@
Sponsored by:		Mellanox Technologies
2018-07-06 10:13:42 +00:00
Matt Macy
428194fed2 counter(9): unbreak amd64 following r336020
Apply temporary fix to counter until daylight hours.
The fact that the assembly for counter_u64_add relied on the sizeof(struct pcpu) was
the basis for the otherwise arbitrary offset never came up in D15933.
critical_{enter,exit} is now inline so the only real added overhead is the
added (mostly false) conditional branch in exit.
2018-07-06 10:10:00 +00:00
Matt Macy
ab3059a8e7 Back pcpu zone with domain correct pages
- Change pcpu zone consumers to use a stride size of PAGE_SIZE.
  (defined as UMA_PCPU_ALLOC_SIZE to make future identification easier)

- Allocate page from the correct domain for a given cpu.

- Don't initialize pc_domain to non-zero value if NUMA is not defined
  There are some misconceptions surrounding this field. It is the
  _VM_ NUMA domain and should only ever correspond to valid domain
  values as understood by the VM.

The former slab size of sizeof(struct pcpu) was somewhat arbitrary.
The new value is PAGE_SIZE because that's the smallest granularity
which the VM can allocate a slab for a given domain. If you have
fewer than PAGE_SIZE/8 counters on your system there will be some
memory wasted, but this is obviously something where you want the
cache line to be coming from the correct domain.

Reviewed by: jeff
Sponsored by: Limelight Networks
Differential Revision:  https://reviews.freebsd.org/D15933
2018-07-06 02:06:03 +00:00
Konstantin Belousov
945a6b310b Extend r335969 to superpages.
It is possible that a fictitious unmanaged userspace mapping of
superpage is created on x86, e.g. by pmap_object_init_pt(), with the
physical address outside the vm_page_array[] coverage.

Noted and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D16085
2018-07-05 17:28:06 +00:00
Konstantin Belousov
a0ef97f6fa Revert r335999 to re-commit with the correct error message. 2018-07-05 17:26:13 +00:00
Konstantin Belousov
c59dfa63bf In x86 pmap_extract_and_hold(), there is no need to recalculate the
physical address, which is readily available after sucessfull
vm_page_pa_tryrelock().

Noted and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D16085
2018-07-05 16:38:54 +00:00
Konstantin Belousov
81dac87135 In x86 pmap_extract_and_hold(), there is no need to recalculate the
physical address, which is readily available after sucessfull
vm_page_pa_tryrelock().

Noted and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D16085
2018-07-05 16:27:34 +00:00
Alan Cox
c11c3d64e0 As of r335784, if pmap_enter() replaces a managed mapping by an unmanaged
mapping, then it leaks the unlinked PV entry.  This change eliminates that
leak, freeing the PV entry.

Reviewed by:	kib, markj
X-MFC with:	r335784
Differential Revision:	https://reviews.freebsd.org/D16130
2018-07-05 02:04:18 +00:00
Konstantin Belousov
84a15fe70d In x86 pmap_extract_and_hold()s, handle the case of PHYS_TO_VM_PAGE()
returning NULL.

vm_fault_quick_hold_pages() can be legitimately called on userspace
mappings backed by fictitious pages created by unmanaged device and sg
pagers.

Note that other architectures pmap_extract_and_hold() might need
similar fix, but I postponed the examination.

Reported by:	bde
Discussed with:	alc
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D16085
2018-07-04 21:21:59 +00:00
John Baldwin
79ba91952d Use 'e' instead of 'i' constraints with 64-bit atomic operations on amd64.
The ADD, AND, OR, and SUB instructions take at most a 32-bit
sign-extended immediate operand.  64-bit constants that do not fit into
that constraint need to be loaded into a register.  The 'i' constraint
tells the compiler it can pass any integer constant to the assembler,
whereas the 'e' constrain only permits constants that fit into a 32-bit
sign-extended value.  This fixes using
atomic_add/clear/set/subtract_long/64 with constants that do not fit into
a 32-bit sign-extended immediate.

Reported by:	several folks
Tested by:	Pete Wright <pete@nomadlogic.org>
MFC after:	2 weeks
2018-07-03 22:03:28 +00:00
Matt Macy
f4b3640475 inline atomics and allow tied modules to inline locks
- inline atomics in modules on i386 and amd64 (they were always
  inline on other arches)
- allow modules to opt in to inlining locks by specifying
  MODULE_TIED=1 in the makefile

Reviewed by: kib
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16079
2018-07-02 19:48:38 +00:00
Mark Johnston
1253de1eb6 Invalidate the mapping before updating its physical address.
Doing so ensures that all threads sharing the pmap have a consistent
view of the mapping.  This fixes the problem described in the commit
log messages for r329254 without the overhead of an extra fault in the
common case.  Once other pmap_enter() implementations are similarly
modified, the workaround added in r329254 can be removed, reducing the
overhead of CoW faults.

With this change we can reuse the PV entry from the old mapping,
potentially avoiding a call to reclaim_pv_chunk().  Otherwise, there is
nothing preventing the old PV entry from being reclaimed.  In rare
cases this could result in the PTE's page table page being freed,
leading to a use-after-free of the page when the updated PTE is written
following the allocation of the PV entry for the new mapping.

Reported and tested by:	pho
Reviewed by:	alc, kib
MFC after:	3 weeks
Differential Revision:	https://reviews.freebsd.org/D16005
2018-06-28 21:40:31 +00:00
Konstantin Belousov
7f12ebe583 Do not leave stray qword on top of stack for interrupts and exceptions
without error code.  Doing so it mis-aligned the stack.

Since the only consumer of the SSE instructions with the alignment
requirements is AES-NI module, and since the FPU context cannot be
accessed in interrupts, the only situation where the alignment matter
are the compat32 syscalls, as reported in the PR.

PR:	229222
Reported and tested by:	 dewayne@heuristicsystems.com.au
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-06-25 11:29:04 +00:00
Mark Johnston
a8be239d69 Re-count available PV entries after reclaiming a PV chunk.
The call to reclaim_pv_chunk() in reserve_pv_entries() may free a
PV chunk with free entries belonging to the current pmap.  In this
case we must account for the free entries that were reclaimed, or
reserve_pv_entries() may return without having reserved the requested
number of entries.

Reviewed by:	alc, kib
Tested by:	pho (previous version)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D15911
2018-06-23 10:41:52 +00:00
Chuck Tuffli
3575504976 Fix the Linux kernel version number calculation
The Linux compatibility code was converting the version number (e.g.
2.6.32) in two different ways and then comparing the results.

The linux_map_osrel() function converted MAJOR.MINOR.PATCH similar to
what FreeBSD does natively. I.e. where major=v0, minor=v1, and patch=v2
    v = v0 * 1000000 + v1 * 1000 + v2;

The LINUX_KERNVER() macro, on the other hand, converted the value with
bit shifts. I.e. where major=a, minor=b, and patch=c
    v = (((a) << 16) + ((b) << 8) + (c))

The Linux kernel uses the later format via the KERNEL_VERSION() macro in
include/generated/uapi/linux/version.h

Fix is to use the LINUX_KERNVER() macro in linux_map_osrel() as well as
in the .trans_osrel functions.

PR: 229209
Reviewed by: emaste, cem, imp (mentor)
Approved by: imp (mentor)
Differential Revision: https://reviews.freebsd.org/D15952
2018-06-22 00:02:03 +00:00
Matt Macy
92689b3f02 remove ixl iwarp and ixlv from the build until they are in a working state 2018-06-19 02:48:53 +00:00
Eric Joyner
1031d839aa ixl(4): Update to use iflib
Update the driver to use iflib in order to bring performance,
maintainability, and (hopefully) stability benefits to the driver.

The driver currently isn't completely ported; features that are missing:

- VF driver (ixlv)
- SR-IOV host support
- RDMA support

The plan is to have these re-added to the driver before the next FreeBSD release.

Reviewed by:	gallatin@
Contributions by: gallatin@, mmacy@, krzysztof.galazka@intel.com
Tested by:	jeffrey.e.pieper@intel.com
MFC after:	1 month
Sponsored by:	Intel Corporation
Differential Revision:	https://reviews.freebsd.org/D15577
2018-06-18 20:12:54 +00:00
Ed Maste
931e2a1a6e linuxulator: do not include legacy syscalls on arm64
Existing linuxulator platforms (i386, amd64) support legacy syscalls,
such as non-*at ones like open, but arm64 and other new platforms do
not.

Wrap these in #ifdef LINUX_LEGACY_SYSCALLS, #defined in the MD linux.h
files.  We may need finer grained control in the future but this is
sufficient for now.

Reviewed by:	andrew
Sponsored by:	Turing Robotic Industries
Differential Revision:	https://reviews.freebsd.org/D15237
2018-06-15 14:41:51 +00:00
Konstantin Belousov
459ccd3c5f linuxolator/amd64: Don't mangle %r10 on return from syscall for EJUSTRETURN.
This fixes the %r10 content for rt_sigreturn.

Submitted by:	Yanko Yankulov <yanko.yankulov@gmail.com>
MFC after:	1 week
2018-06-14 12:35:57 +00:00
Konstantin Belousov
5803d744c7 Reorganize code flow in fpudna()/npxdna() to highlight the critical
section scope.  Sprinkle __predict_false() for conditions known to
never occur or occur only on rare platforms.

Sponsored by:	The FreeBSD Foundation
2018-06-14 11:09:51 +00:00
Konstantin Belousov
fa7fad8ab9 Remove printf() in #NM handler.
Give up and remove the almost useless informational message reporting
that device not available exception occured while our state tracking
indicates the current CPU has FPU context loaded for the current
thread.

It seems that this is recurring bug with some VM monitors.

Sponsored by:	The FreeBSD Foundation
2018-06-14 10:33:26 +00:00
Konstantin Belousov
d1a07e31e5 Enable eager FPU context switch by default on amd64.
With compilers making increasing use of vector instructions the
performance benefit of lazily switching FPU state is no longer a
desirable tradeoff.  Linux switched to eager FPU context switch some
time ago, and the idea was floated on the FreeBSD-current mailing list
some years ago[1].

Enable eager FPU context switch by default on amd64, with a tunable/sysctl
available to turn it back off.

[1] https://lists.freebsd.org/pipermail/freebsd-current/2015-March/055198.html

Reviewed by:	jhb
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2018-06-13 17:55:09 +00:00
Jonathan T. Looney
0766f278d8 Make UMA and malloc(9) return non-executable memory in most cases.
Most kernel memory that is allocated after boot does not need to be
executable.  There are a few exceptions.  For example, kernel modules
do need executable memory, but they don't use UMA or malloc(9).  The
BPF JIT compiler also needs executable memory and did use malloc(9)
until r317072.

(Note that a side effect of r316767 was that the "small allocation"
path in UMA on amd64 already returned non-executable memory.  This
meant that some calls to malloc(9) or the UMA zone(9) allocator could
return executable memory, while others could return non-executable
memory.  This change makes the behavior consistent.)

This change makes malloc(9) return non-executable memory unless the new
M_EXEC flag is specified.  After this change, the UMA zone(9) allocator
will always return non-executable memory, and a KASSERT will catch
attempts to use the M_EXEC flag to allocate executable memory using
uma_zalloc() or its variants.

Allocations that do need executable memory have various choices.  They
may use the M_EXEC flag to malloc(9), or they may use a different VM
interfact to obtain executable pages.

Now that malloc(9) again allows executable allocations, this change also
reverts most of r317072.

PR:		228927
Reviewed by:	alc, kib, markj, jhb (previous version)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D15691
2018-06-13 17:04:41 +00:00
Marcelo Araujo
ebc3c37c6f Add SPDX tags to vmm(4).
MFC after:	4 weeks.
Sponsored by:	iXsystems Inc.
2018-06-13 07:02:58 +00:00
Jung-uk Kim
6362b1a6b1 Fix number of auxargs entries to copy out for 32-bit Linuxulator.
PR:		228790
2018-06-12 22:54:48 +00:00
Konstantin Belousov
b45e10c3f4 Fix braino in r334799. Maxmem is in pages.
Reported by:	ae, pho
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-06-11 15:28:20 +00:00
Bruce Evans
3cd246d9a9 Untangle configuration ifdefs a little. On x86, msi is optional on pci,
and also on apic in common and i386 files (except for xen it is optional
only on xenhvm), but it was not ifdefed except on apic in common and i386
files.

This is all that is left from an attempt to build a (sub-)minimal kernel
without any devices.  The isa "option" is still used without ifdefs in many
standard files even on amd64.  ISAPNP is not optional on at least i386.
ATPIC is not optional on i386 (it is used mainly for Xspuriousint).  But
pci is now supposed to be optional on x86.
2018-06-10 14:49:13 +00:00
Mark Johnston
f090f67503 Tell the compiler that rdtscp clobbers %ecx. 2018-06-09 18:31:19 +00:00
Tycho Nightingale
4d20e87b7e Don't bother looking for non-executable pages when a process is
excluded from PTI.

Reviewed by:	kib
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D15708
2018-06-08 20:35:58 +00:00
Matt Macy
eb7c901995 hwpmc: simplify calling convention for hwpmc interrupt handling
pmc_process_interrupt takes 5 arguments when only 3 are needed.
cpu is always available in curcpu and inuserspace can always be
derived from the passed trapframe.

While facially a reasonable cleanup this change was motivated
by the need to workaround a compiler bug.

core2_intr(cpu, tf) ->
  pmc_process_interrupt(cpu, ring, pmc, tf, inuserspace) ->
    pmc_add_sample(cpu, ring, pm, tf, inuserspace)

In the process of optimizing the tail call the tf pointer was getting
clobbered:

(kgdb) up
    at /storage/mmacy/devel/freebsd/sys/dev/hwpmc/hwpmc_mod.c:4709
4709                                pmc_save_kernel_callchain(ps->ps_pc,
(kgdb) up
1205                    error = pmc_process_interrupt(cpu, PMC_HR, pm, tf,

resulting in a crash in pmc_save_kernel_callchain.
2018-06-08 04:58:03 +00:00
Mateusz Guzik
dfa5753e09 amd64: remove now unused bzero, bcmp and bcopy. move pagecopy higher up. 2018-06-08 04:18:42 +00:00
Mateusz Guzik
c9ca1a70cc amd64: fix a retarded bug in memset
memset fills the target buffer from a byte-sized value passed in as the
second argument.

The fully-sized (8 bytes) register containing it is named %rsi. Lower 4 bytes
can be referred to as %esi and finally the lowest byte is %sil.

Vast majority of all the callers just zero the target buffer and set it up by
doing xor %esi,%esi which has a side-effect of zeroing the upper parts of
the register as well. Some others do a word-sized move to %esi which has the
same result.

However, there are callers which only fill %sil. This does *not* clear up
the rest of the register.

The value of %rsi is multiplied by $0x0101010101010101 to create a 8-byte sized
pattern for 8-byte stores.

Prior to the patch, the func just blindly took %rsi assuming the unwanted bytes
are zeroed out. Since this is not the case for the callers which only play with
%sil (the rest of the register can have absolutely anything), the resulting
pattern can be garbage.

This has potential for funny bugs. One side effect (which was not amusing)
after enabling it instead of bzero was that the kernel was hanging on boot
as a xen domU.

Reported by:	Trond Endrestøl <Trond.Endrestol fagskolen.gjovik.no>
Pointy hat: me
2018-06-08 00:47:24 +00:00
Konstantin Belousov
943defc3a0 Account for dmap limit when selecting the pages for the bootstrap
pagetables.

physmap[] can be inconsistent with the physical memory limit due to
buggy bios, or to the hw.physmem tunable. Since bootstrap pagetables
are initialized by accesses through the DMAP, we must ensure that DMAP
really cover the selected pages. This is only relevant when machine
has less than 4G RAM and buggy BIOS, which is the combination on Acer
Chromebook 720.

The call to mp_bootaddress() is moved later to have Maxmem initialized.

An alternative could be to always cover 4G for DMAP, but this change
seems to be simpler.

Reported and tested by:	grembo
Reviewed by:	royger
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D15675
2018-06-07 17:04:34 +00:00
Matt Macy
155046394a cpufunc: add rdtscp for x86 2018-06-07 00:54:11 +00:00
Matt Macy
07d80fd8dc hwpmc: ABI fixes
- increase pmc cpuid field from 8 to 12 bits
- add cpuid version string to initialize entry in the log
  so that filter can identify which counter index an
  event name maps to
- GC unused config flags
- make fixed counter assignment more robust as well as the
  changes needed to be properly identified for filter
2018-06-04 02:05:48 +00:00
Mateusz Guzik
d0a22279db Remove an unused argument to turnstile_unpend.
PR:	228694
Submitted by:	Julian Pszczołowski <julian.pszczolowski@gmail.com>
2018-06-02 22:37:53 +00:00
Mateusz Guzik
15825d5b78 amd64: add a mild depessimization to rep mov/stos users
Currently all the primitives are waiting for a rewrite, tidy them up in the
meantime.

Vast majority of cases pass sizes which are multiple of 8. Which means the
following rep stosb/movb has nothing to do. Turns out testing first if there
is anything to do is a big win across the board (cpus with and without ERMS,
Intel and AMD) while not pessimizing the case where there is work to do.

Sample results for zeroing 64 bytes (ops/second):
Ryzen Threadripper 1950X		91433212 -> 147265741
Intel(R) Xeon(R) CPU X5675 @ 3.07GHz	90714044 -> 121992888

bzero and bcopy are on their way out and were not modified. Nothing in the
tree uses them.
2018-06-02 20:14:43 +00:00
Bruce Evans
c507c512b9 Finish COMPAT_AOUT support for amd64. It wasn't in any amd64 or MI
file in /sys/conf, so was unavailable in configurations that don't use
modules, and was not testable or notable in NOTES.  Its normal
configuration (not using a module) is still silently deprecated in
aout(4) by not mentioning it there.

Update i386 NOTES for COMPAT_AOUT.  It is not i386-only, or even very MD.
Sort its entry better.

Finish gzip configuration (but not support) for amd64.  gzip is really
gzipped aout.  It is currently broken even for i386 (a call to vm fails).
amd64 has always attempted to configure and test it, but it depends on
COMPAT_AOUT (as noted).  The bug that it depends on unconfigured files
was not detected since it is configured as a device.  All other optional
image activators are configured properly using an option.
2018-06-02 06:40:15 +00:00
Bruce Evans
49c871278a Fix high resolution kernel profiling just enough to not crash at boot
time, especially for SMP.  If configured, it turns itself on at boot
time for calibration, so is fragile even if never otherwise used.

Both types of kernel profiling were supposed to use a global spinlock
in the SMP case.  If hi-res profiling is configured (but not necessarily
used), this was supposed to be optimized by only using it when
necessary, and slightly more efficiently, in asm.  But it was not done
at all for mcount entry where it is necessary.  This caused crashes
in the SMP case when either type of profiling was enabled.  For mcount
exit, it only caused wrong times.  The times were wrongest with an
i8254 timer since using that requires exclusive access to the hardware.
The i8254 timer was too slow to use here 20 years ago and is much less
usable now, but it is the default for the SMP case since TSCs weren't
invariant when SMP was new.  Do the locking in all hi-res SMP cases for
simplicity.

Calibration uses special asms, and the clobber lists in these were sort
of inverted.  They contained the arg and return registers which are not
clobbered, but on amd64 they didn't contain the residue of the call-used
registers which may be clobbered (%r10 and %r11).  This usually caused
hangs at boot time.  This usually affected even the UP case.
2018-06-02 05:48:44 +00:00
Bruce Evans
dbe3061729 Fix recent breakages of kernel profiling, mostly on i386 (high resolution
kernel profiling remains broken).

memmove() was broken using ALTENTRY().  ALTENTRY() is only different from
ENTRY() in the profiling case, and its use in that case was sort of
backwards.  The backwardness magically turned memmove() into memcpy()
instead of completely breaking it.  Only the high resolution parts of
profiling itself were broken.  Use ordinary ENTRY() for memmove().
Turn bcopy() into a tail call to memmove() to reduce complications.
This gives slightly different pessimizations and profiling lossage.
The pessimizations are minimized by not using a frame pointer() for
bcopy().

Calls to profiling functions from exception trampolines were not
relocated.  This caused crashes on the first exception.  Fix this using
function pointers.

Addresses of exception handlers in trampolines were not relocated.  This
caused unknown offsets in the profiling data.  Relocate by abusing
setidt_disp as for pmc although this is slower than necessary and
requires namespace pollution.  pmc seems to be missing some relocations.
Stack traces and lots of other things in debuggers need similar relocations.

Most user addresses were misclassified as unknown kernel addresses and
then ignored.  Treat all unknown addresses as user. Now only user
addresses in the kernel text range are significantly misclassified (as
known kernel addresses).

The ibrs functions didn't preserve enough registers.  This is the only
recent breakage on amd64.  Although these functions are written in
asm, in the profiling case they call profiling functions which are
mostly for the C ABI, so they only have to save call-used registers.
They also have to save arg and return registers in some cases and
actually save them in all cases to reduce complications.  They end up
saving all registers except %ecx on i386 and %r10 and %r11 on amd64.
Saving these is only needed for 1 caller on each of amd64 and i386.
Save them there.  This is slightly simpler.

Remove saving %ecx in handle_ibrs_exit on i386.  Both handle_ibrs_entry
and handle_ibrs_exit use %ecx, but only the latter needed to or did
save it.  But saving it there doesn't work for the profiling case.

amd64 has more automatic saving of the most common scratch registers
%rax, %rcx and %rdx (its complications for %r10 are from unusual use
of %r10 by SYSCALL).  Thus profiling of handle_ibrs_exit_rs() was not
broken, and I didn't simplify the saving by moving the saving of these
registers from it to the caller.
2018-06-02 04:25:09 +00:00
Matt Macy
e92a1350b5 hwpmc: remove unused pre-table driven bits for intel
Intel now provides comprehensive tables for all performance counters
and the various valid configuration permutations as text .json files.
Libpmc has been converted to use these and hwpmc_core has been greatly
simplified by moving to passthrough of the table values.

The one gotcha is that said tables don't support pentium pro and and pentium
IV. There's very few users of hwpmc on _amd64_ kernels on new hardware. It is
unlikely that anyone is doing low level optimization on 15 year old Intel
hardware. Nonetheless, if someone feels strongly enough to populate the
corresponding tables for p4 and ppro I will reinstate the files in to the
build.

Code for the K8 counters and !x86 architectures remains unchanged.
2018-05-31 22:41:07 +00:00
Dimitry Andric
b451efbedc Resolve conflicts between macros in fenv.h and ieeefp.h
This is a follow-up to r321483, which disabled -Wmacro-redefined for
some lib/msun tests.

If an application included both fenv.h and ieeefp.h, several macros such
as __fldcw(), __fldenv() were defined in both headers, with slightly
different arguments, leading to conflicts.

Fix this by putting all the common macros in the machine-specific
versions of ieeefp.h.  Where needed, update the arguments in places
where the macros are invoked.

This also slightly reduces the differences between the amd64 and i386
versions of ieeefp.h.

Reviewed by:	kib
MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D15633
2018-05-31 20:22:47 +00:00
Mateusz Guzik
64415b8b22 amd64: switch pagecopy from non-temporal stores to rep movsq
The copied data is accessed in part soon after and it results with additional
cache misses during a -j 1 buildkernel WITHOUT_CTF=yes KERNFAST=1, as measured
with pmc stat.

before:
       256165411  cache-references	#	0.003 refs/inst
        15105408  cache-misses		#	5.897%
           20.70  real			#	99.67% cpu
           13.24  user			#	63.94% cpu
            7.40  sys			#	35.73% cpu

after:
       256764469  cache-references	#	0.003 refs/inst
        11913551  cache-misses		#	4.640%
           20.70  real			#	99.67% cpu
           13.19  user			#	63.73% cpu
            7.44  sys			#	35.95% cpu

Note the real time did not change, but traffic to RAM was reduced (multiple
measurements performed with switching the implementation at runtime).
Since nobody else is using non-temporal for this and there is no apparent
benefit at least these days, don't use them either.

Side note is that pagecopy arguments should probably get reversed to not
have to flip them around in the primitive.

Discussed with:		jeff
2018-05-31 09:56:02 +00:00
Brooks Davis
cbf7e0cba7 Correct pointer subtraction in KASSERT().
The assertion would never fire without truly spectacular future
programming errors.

Reported by:	Coverity
CID:		1391370
Sponsored by:	DARPA, AFRL
2018-05-29 20:03:24 +00:00
Andriy Gapon
279be68bfd re-synchronize TSC-s on SMP systems after resume, if necessary
The TSC-s are checked and synchronized only if they were good
originally.  That is, invariant, synchronized, etc.

This is necessary on an AMD-based system where after a wakeup from STR I
see that BSP clock differs from AP clocks by a count that roughly
corresponds to one second.  The APs are in sync with each other.  Not
sure if this is a hardware quirk or a firmware bug.

This is what I see after a resume with this change:
    SMP: passed TSC synchronization test after adjustment
    acpi_timer0: restoring timecounter, ACPI-fast -> TSC-low

Reviewed by:	kib
MFC after:	3 weeks
Differential Revision: https://reviews.freebsd.org/D15551
2018-05-25 07:33:20 +00:00
Brooks Davis
5f77b8a88b Avoid two suword() calls per auxarg entry.
Instead, construct an auxargs array and copy it out all at once.

Use an array of Elf_Auxinfo rather than pairs of Elf_Addr * to represent
the array. This is the correct type where pairs of words just happend
to work. To reduce the size of the diff, AUXARGS_ENTRY is altered to act
on this array rather than introducing a new macro.

Return errors on copyout() and suword() failures and handle them in the
caller.

Incidentally fixes AT_RANDOM and AT_EXECFN in 32-bit linux on amd64
which incorrectly used AUXARG_ENTRY instead of AUXARGS_ENTRY_32
(now removed due to the use of proper types).

Reviewed by:	kib
Comments from:	emaste, jhb
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D15485
2018-05-24 16:25:18 +00:00
Matt Macy
14d13423dd take NUMA out 2018-05-24 04:31:53 +00:00
Matt Macy
e98bbcf9ca libpmcstat: compile in events based on json description 2018-05-24 04:30:06 +00:00
Konstantin Belousov
8936419a6c x86: stop unconditionally clearing PSL_T on the trace trap.
We certainly should clear PSL_T when calling the SIGTRAP signal
handler, which is already done by all x86 sendsig(9) ABI code.  On the
other hand, there is no obvious reason why PSL_T needs to be cleared
when returning from the signal handler.  For instance, Linux allows
userspace to set PSL_T and keep tracing enabled for the desired
period.  There are userspace programs which would use PSL_T if we make
it possible, for instance sbcl.

Remember if PSL_T was set by PT_STEP or PT_SETSTEP by mean of TDB_STEP
flag, and only clear it when the flag is set.

Discussed with:	Ali Mashtizadeh
Reviewed by:	jhb (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D15054
2018-05-23 21:39:29 +00:00
Konstantin Belousov
61bc50d032 Style.
Wording and reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
Differential revision:	https://reviews.freebsd.org/D15054
2018-05-23 21:25:49 +00:00
Konstantin Belousov
14f7050dba Enable IBRS when entering an interrupt handler from usermode.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-05-22 13:25:15 +00:00
John Baldwin
9e2154ff1c Cleanups related to debug exceptions on x86.
- Add constants for fields in DR6 and the reserved fields in DR7.  Use
  these constants instead of magic numbers in most places that use DR6
  and DR7.
- Refer to T_TRCTRAP as "debug exception" rather than a "trace trap"
  as it is not just for trace exceptions.
- Always read DR6 for debug exceptions and only clear TF in the flags
  register for user exceptions where DR6.BS is set.
- Clear DR6 before returning from a debug exception handler as
  recommended by the SDM dating all the way back to the 386.  This
  allows debuggers to determine the cause of each exception.  For
  kernel traps, clear DR6 in the T_TRCTRAP case and pass DR6 by value
  to other parts of the handler (namely, user_dbreg_trap()).  For user
  traps, wait until after trapsignal to clear DR6 so that userland
  debuggers can read DR6 via PT_GETDBREGS while the thread is stopped
  in trapsignal().

Reviewed by:	kib, rgrimes
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D15189
2018-05-22 00:45:00 +00:00
Konstantin Belousov
3621ba1ede Add Intel Spec Store Bypass Disable control.
Speculative Store Bypass (SSB) is a speculative execution side channel
vulnerability identified by Jann Horn of Google Project Zero (GPZ) and
Ken Johnson of the Microsoft Security Response Center (MSRC)
https://bugs.chromium.org/p/project-zero/issues/detail?id=1528.
Updated Intel microcode introduces a MSR bit to disable SSB as a
mitigation for the vulnerability.

Introduce a sysctl hw.spec_store_bypass_disable to provide global
control over the SSBD bit, akin to the existing sysctl that controls
IBRS. The sysctl can be set to one of three values:
0: off
1: on
2: auto

Future work will enable applications to control SSBD on a per-process
basis (when it is not enabled globally).

SSBD bit detection and control was verified with prerelease microcode.

Security:	CVE-2018-3639
Tested by:	emaste (previous version, without updated microcode)
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2018-05-21 21:08:19 +00:00
Konstantin Belousov
2320153fcc Preserve other bits in IA32_SPEC_CTL MSR when changing the IBRS and
STIBP states.

Tested by:	emaste (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2018-05-21 21:05:55 +00:00
Konstantin Belousov
5988464ec4 Fix grammar.
Submitted by:	alc
MFC after:	1 week
2018-05-21 19:15:05 +00:00
Konstantin Belousov
0a4b04a616 Add missed barrier for pm_gen/pm_active interaction.
When we issue shootdown IPIs, we first assign zero to pm_gens to
indicate the need to flush on the next context switch in case our IPI
misses the context, next we read pm_active. On context switch we set
our bit in pm_active, then we read pm_gen. It is crucial that both
threads see the memory in the program order, otherwise invalidation
thread might read pm_active bit as zero and the context switching
thread might read pm_gen as zero.

IA32 allows CPU for both reads to see zero. We must use the barriers
between write and read. The pm_active bit set is already locked, so
only the invalidation functions need it.

I never saw it in real life, or at least I do not have a good
reproduction case. I found this during code inspection when hunting
for the Xen TLB issue reported by cperciva.

Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D15506
2018-05-21 18:41:16 +00:00
Mateusz Guzik
edacda736b amd64: annotate pti with __read_frequently 2018-05-21 05:20:23 +00:00
Mark Johnston
892bdccca0 Enable kernel dump features in GENERIC for most platforms.
This turns on support for kernel dump encryption and compression, and
netdump. arm and mips platforms are omitted for now, since they are more
constrained and don't benefit as much from these features.

Reviewed by:	cem, manu, rgrimes
Tested by:	manu (arm64)
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D15465
2018-05-19 19:53:23 +00:00
Matt Macy
f5ad6b4b00 pmap: silence warnings 2018-05-19 05:58:05 +00:00
Ed Maste
3dc3b1235a amd64 GENERIC: correct whitespace on smartpqi entry 2018-05-18 17:51:42 +00:00
Antoine Brodin
147d12a7d3 vmmdev: return EFAULT when trying to read beyond VM system memory max address
Currently, when using dd(1) to take a VM memory image, the capture never ends,
reading zeroes when it's beyond VM system memory max address.
Return EFAULT when trying to read beyond VM system memory max address.

Reviewed by:	imp, grehan, anish
Approved by:	grehan
Differential Revision:	https://reviews.freebsd.org/D15156
2018-05-15 17:20:58 +00:00
John Baldwin
0b3e6e4c50 Make the common interrupt entry point labels local labels.
Kernel debuggers depend on symbol names to find stack frames with a
trapframe rather than a normal stack frame.  The labels used for the
shared interrupt entry point for the PTI and non-PTI cases did not
match the existing patterns confusing debuggers.  Add the '.L' prefix
to mark these symbols as local so they are not visible in the symbol
table.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	Chelsio Communications
2018-05-14 17:27:53 +00:00
Konstantin Belousov
8b4fc8b11c Make fpusave() and fpurestore() on amd64 ifuncs.
From now on, linking amd64 kernel requires either lld or newer ld.bfd.

Reviewed by:	jhb (as part of the large patch)
Discussed with:	emaste
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D13838
2018-05-10 15:01:43 +00:00
Mateusz Guzik
20ca271fdd amd64: depessimize bcmp for small buffers
Adapt assembly generated by clang for memcmp and use it for <= 64 sized
compares (which are the vast majority).

Sample result of doing stats on Broadwell (% of samples):
before: 4.0 kernel     bcmp                 cache_lookup
after : 0.7 kernel     bcmp                 cache_lookup

The routine is most definitely still not optimal. Anyone interested in
spending time improving it is welcome to take over.

Reviewed by:	kib
2018-05-09 15:16:25 +00:00
Konstantin Belousov
55c9d75e6b Avoid calls to bzero() before ireloc.
Evaluate cpu_stdext_feature early to have moved link_elf_ireloc() see
correct flags, most important is SMAP.

Tested by:	mjg
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D15367
2018-05-09 14:39:24 +00:00
Konstantin Belousov
71d1bbce91 Remove PG_U from the rest of the kernel pmap ptes.
Supposedly, they PG_U bits there were set to easier making some kernel
page accessible to userspace in-place.  Since it was not used for the
whole existence of the amd64 pmap.c and current design of the shared
pages prefers double-mapping over the in-place access, remove PG_U
both from the direct map and KVA slots.

Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-05-09 12:09:08 +00:00
Konstantin Belousov
5aaa5bc3d6 Remove PG_U from the recursive pte for kernel pmap' PML4 page.
This PML4 page is never used for the userspace process, so there is no
security implications.  But the configuration trips SMAP check, which
should be corrected.

Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-05-09 12:03:40 +00:00
Konstantin Belousov
053641bb1c Prepare DB# handler for deferred trigger of watchpoints.
Since pop %ss/mov %ss instructions defer all interrupts and exceptions
for the next instruction, it is possible that the userspace watchpoint
trap executes on the first instruction of the kernel entry for
syscall/bpt.

In this case, DB# should be treated similarly to NMI: on amd64 we must
always load GSBASE even if the trap comes from kernel mode, and load
the kernel page table root into %cr3.  Moreover, the trap must
use the dedicated stack, because we are still on the user stack when
trapped on syscall entry.

For i386, we must reload %cr3.  The syscall instruction is not configured,
so there is no issue with executing on user stack when trapping.

Due to some CPU erratas it is not always possible to detect that the
userspace watchpoint triggered by inspecting %dr6.  In trap(), compare the
trap %rip with the known unsafe entry points and if matched pretend that
the watchpoint did not fire at all.

Thank you to the MSRC Incident Response Team, and in particular Greg
Lenti and Nate Warfield, for coordinating the response to this issue
across multiple vendors.

Thanks to Computer Recycling at The Working Center of Kitchener for
making hardware available to allow us to test the patch on additional
CPU families.

Reviewed by:	jhb
Discussed with:	Matthew Dillon
Tested by:	emaste
Sponsored by:	The FreeBSD Foundation
Security:	CVE-2018-8897
Security:	FreeBSD-SA-18:06.debugreg
2018-05-08 17:00:34 +00:00
Mateusz Guzik
a9456603f2 amd64: stop asserting params != NULL in the syscall path
The parameter is effectively controllable by userspace. It does not matter
what it is set to as it is being passed to copyin - worst case the operation
will just fail.

While here stop computing it unless it is going to be used.

Noted by:	dillon@backplane.com
2018-05-07 21:32:08 +00:00
Mateusz Guzik
bed34b0b04 amd64: fix up memset added in r333324
There was a missing trick expanding the passed pattern to a full word
by multiplication. As a side effect non-zero patterns would be
incorrectly laid down.

This stems from the use of rep stosq which is word-sized, while the passed
argument is byte-sized.

I initially repurposed memcpy into memset without taking this into account.
All but non-bzero testing was performed with a variant utilizing ERMS, i.e.
using only stosb which happens to not into the problem whatsoever. So my bad
twice.

Thanks to Oliver Pinter for noting the problem and providing a testcase.
2018-05-07 20:54:42 +00:00
Mateusz Guzik
f185a3dc33 amd64: tweak the memmove comment regarding authorship
To make it clear the mentioned author did not write memmove.
2018-05-07 17:37:07 +00:00
Mateusz Guzik
6a909b9680 amd64: replace libkern's memset and memmove with assembly variants
memmove is repurposed bcopy (arguments swapped, return value added)
The libkern variant is a wrapper around bcopy, so this is a big
improvement.

memset is repurposed memcpy. The librkern variant is doing fishy stuff,
including branching on 0 and calling bzero.

Both functions are rather crude and subject to partial depessimization.

This is a soft prerequisite to adding variants utilizing the
'Enhanced REP MOVSB/STOSB' bit and let the kernel patch at runtime.
2018-05-07 15:07:28 +00:00
Mateusz Guzik
ac7edb45e1 amd64: syscall path bcopy -> memcpy 2018-05-04 22:41:12 +00:00
Mateusz Guzik
f0648bcc04 amd64: get rid of the pessimized bcopy in syscall arg copy
The code was unnecessarily conditionally copying either 5 or 6 args.
It can blindly copy 6, which also means the size is known at compilation
time and the operation can be depessimized.

Note the entire syscall handling code is rather slow.

Tested on Skylake, sample result for getppid (calls/s):
without pti: 7310106 -> 10653569
with pti: 3304843 -> 4148306

Some syscalls (like read) did not note any difference, other have typically
very modest wins.
2018-05-04 04:05:07 +00:00
Konstantin Belousov
7035cf14ee Implement support for ifuncs in the kernel linker.
Required MD bits are only provided for x86.

Reviewed by:	jhb (previous version, as part of the larger patch)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D13838
2018-05-03 21:37:46 +00:00
Konstantin Belousov
9ea6332090 Style.
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
Differential revision:	https://reviews.freebsd.org/D13838
2018-05-03 10:17:37 +00:00
Peter Grehan
adb947a67a Use PCI power-mgmt to reset a device if FLR fails.
A large number of devices don't support PCIe FLR, in particular
graphics adapters. Use PCI power management to perform the
reset if FLR fails or isn't available, by cycling the device
through the D3 state.

This has been tested by a number of users with Nvidia and AMD GPUs.

Submitted and tested by: Matt Macy
Reviewed by:	jhb, imp, rgrimes
MFC after:	3 weeks
Differential Revision:	https://reviews.freebsd.org/D15268
2018-05-02 17:41:00 +00:00
Mark Johnston
20f85b1ddd Print the dump progress indicator after calling dump_start().
Dumpers may wish to print messages from an initialization hook; this
change ensures that such messages aren't mixed with output from the
generic dump code.

MFC after:	1 week
2018-05-01 17:32:43 +00:00
Conrad Meyer
538184fa2f amd64/mp_machdep.c: Fix GCC build after r333059
GCC warns about the potentially confusing use of the binary AND ('&')
operator with a left operand containing an addition expression.  (The
confusion would be around the operator precedence between the + and & infix
operators.)  The warning is converted into an error with -Werror.

No functional change.

This construct was actually introduced in r328083, but r333059 (re)moved the
closing parentheses.

For reference, see http://en.cppreference.com/w/c/language/operator_precedence .
2018-04-28 17:55:28 +00:00
Tycho Nightingale
27275f8a52 Expand the checks for UCR3 == PMAP_NO_CR3 to enable processes to be
excluded from PTI.

Reviewed by:	kib
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D15100
2018-04-27 12:44:20 +00:00
Sean Bruno
14ec0f3a3b move smartpqi(4) controller out of NOTES and into sys/amd64/NOTES to
appease LINT

Submitted by:	rpokala
Reported by:	npn
2018-04-26 22:43:25 +00:00
Sean Bruno
1e66f787c8 martpqi(4):
- Microsemi SCSI driver for PQI controllers.
- Found on newer model HP servers.
- Restrict to AMD64 only as per developer request.

The driver provides support for the new generation of PQI controllers
from Microsemi. This driver is the first SCSI driver to implement the PQI
queuing model and it will replace the aacraid driver for Adaptec Series 9
controllers.  HARDWARE Controllers supported by the driver include:

    HPE Gen10 Smart Array Controller Family
    OEM Controllers based on the Microsemi Chipset.

Submitted by:   deepak.ukey@microsemi.com
Relnotes:       yes
Sponsored by:   Microsemi
Differential Revision:   https://reviews.freebsd.org/D14514
2018-04-26 16:59:06 +00:00
Tycho Nightingale
19c5cea336 If a trap is encountered upon executing iretq from within doreti() the
hardware will ensure the stack pointer is aligned to a 16-byte
boundary before saving the fault state on the stack.

In the PTI case, handle this potential alignment adjustment by copying
both frames independently while unwinding the stack in between.

Reviewed by:	kib
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D15183
2018-04-25 14:21:13 +00:00
Mark Johnston
5cd29d0f3c Improve VM page queue scalability.
Currently both the page lock and a page queue lock must be held in
order to enqueue, dequeue or requeue a page in a given page queue.
The queue locks are a scalability bottleneck in many workloads. This
change reduces page queue lock contention by batching queue operations.
To detangle the page and page queue locks, per-CPU batch queues are
used to reference pages with pending queue operations. The requested
operation is encoded in the page's aflags field with the page lock
held, after which the page is enqueued for a deferred batch operation.
Page queue scans are similarly optimized to minimize the amount of
work performed with a page queue lock held.

Reviewed by:	kib, jeff (previous versions)
Tested by:	pho
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14893
2018-04-24 21:15:54 +00:00
Konstantin Belousov
b7941dc91e Correct undesirable interaction between caching of %cr4 in bhyve and
invltlb_glob().

Reviewed by:	grehan, jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D15138
2018-04-24 13:44:19 +00:00
John Baldwin
73c8686e91 Simplify the code to allocate stack for auxv, argv[], and environment vectors.
Remove auxarg_size as it was only used once right after a confusing
assignment in each of the variants of exec_copyout_strings().

Reviewed by:	emaste
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D15123
2018-04-19 16:00:34 +00:00
Andriy Gapon
f3f6ecb450 set kdb_why to "trap" when calling kdb_trap from trap_fatal
This will allow to hook a ddb script to "kdb.enter.trap" event.
Previously there was no specific name for this event, so it could only
be handled by either "kdb.enter.unknown" or "kdb.enter.default" hooks.
Both are very unspecific.

Having a specific event is useful because the fatal trap condition is
very similar to panic but it has an additional property that the current
stack frame is the frame where the trap occurred.  So, both a register
dump and a stack bottom dump have additional information that can help
analyze the problem.

I have added the event only on architectures that have trap_fatal()
function defined.  I haven't looked at other architectures.  Their
maintainers can add support for the event later.

Sample script:
kdb.enter.trap=bt; show reg; x/aS $rsp,20; x/agx $rsp,20

Reviewed by:	kib, jhb, markj
MFC after:	11 days
Sponsored by:	Panzura
Differential Revision: https://reviews.freebsd.org/D15093
2018-04-19 05:06:56 +00:00
Andriy Gapon
6d83b2e971 don't check for kdb reentry in trap_fatal(), it's impossible
trap() checks for it earlier and calls kdb_reentry().

Discussed with:	jhb
MFC after:	12 days
Sponsored by:	Panzura
2018-04-18 15:44:54 +00:00
Brooks Davis
9c11d8d483 Remove the unused fuwintr() and suiwintr() functions.
Half of implementations always failed (returned (-1)) and they were
previously used in only one place.

Reviewed by:	kib, andrew
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D15102
2018-04-17 18:04:28 +00:00
Konstantin Belousov
23084818ff Set PG_G global mapping bit on the trampoline ptes.
Trampoline mappings are better treated as global since they are valid
in all address spaces, even for PTI.  pmap_invalidate_range() must work
on global mappings for pti since kernel_pmap invalidations are really
same as for non-PTI.

Reviewed by:	alc, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
Differential revision:	https://reviews.freebsd.org/D15052
2018-04-14 17:33:16 +00:00
Tycho Nightingale
6ac73777ea Add SDT probes to vmexit on Intel.
Submitted by:	domagoj.stolfa_gmail.com
Reviewed by:	grehan, tychon
Sponsored by:	DARPA/AFRL
Differential Revision:	https://reviews.freebsd.org/D14656
2018-04-13 17:23:05 +00:00
Konstantin Belousov
7c5d1690e9 Fix PSL_T inheritance on exec for x86.
The miscellaneous x86 sysent->sv_setregs() implementations tried to
migrate PSL_T from the previous program to the new executed one, but
they evaluated regs->tf_eflags after the whole regs structure was
bzeroed.  Make this functional by saving PSL_T value before zeroing.

Note that if the debugger is not attached, executing the first
instruction in the new program with PSL_T set results in SIGTRAP, and
since all intercepted signals are reset to default dispostion on
exec(2), this means that non-debugged process gets killed immediately
if PSL_T is inherited.  In particular, since suid images drop
P_TRACED, attempt to set PSL_T for execution of such program would
kill the process.

Another issue with userspace PSL_T handling is that it is reset by
trap().  It is reasonable to clear PSL_T when entering SIGTRAP
handler, to allow the signal to be handled without recursion or
delivery of blocked fault.  But it is not reasonable to return back to
the normal flow with PSL_T cleared.  This is too late to change, I
think.

Discussed with:	bde, Ali Mashtizadeh
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
Differential revision:	https://reviews.freebsd.org/D14995
2018-04-12 20:43:39 +00:00
Konstantin Belousov
b7dbf1132e Optimize context switch for PTI on PCID pmap.
In pti-enabled pmap, the PCID allocation scheme assigns temporal id
for the kernel page table, and user page table twin PCID is
calculating by setting high bit in the kernel PCID.  So the kernel AS
is mapped with per-vmspace PCID, and we must completely shut down all
mappings in KVA when switching contexts, so that newly switched thread
would see all changes in KVA occured while it was not executing.
After all, KVA is same between all threads.

Currently the pti context switch for the user part of the page table
gets its TLB entries flushed too. It is excessive. The same PCID
flushing algorithm that is used for non-pti pmap, correctly works for
the UVA mappings.  The only shared TLB entries are the pages from KVA
accessed by the kernel entry trampoline.  All of them are static
except per-thread TSS and LDT. For TSS and LDT, the lifetime of newly
allocated entries is the whole thread life, so it is fine as well. If
not fine, then explicit shutdowns for current pmap of the newly
allocated LDT and TSS pages would be enough.

Also restore the constant value for the pm_pcid for the kernel_pmap.
Before, for PTI pmap, pm_pcid was erronously rolled same as user
pmap's pm_pcid, but it was not used.

Reviewed by:	markj (previous version)
Discussed with:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
Differential revision:	https://reviews.freebsd.org/D14961
2018-04-12 19:59:36 +00:00
Ed Maste
c7fb0e1ddf linuxulator: add else case braces to reduce diffs between archs
Sponsored by:	Turing Robotic Industries Inc.
2018-04-09 19:11:24 +00:00
Ed Maste
b267239d4b linuxulator: deduplicate linux_exec_imgact_try
Previously linuxulator had three identical copies of
linux_exec_imgact_try.  Deduplicate before adding another arch to
linuxulator.

Sponsored by:	Turing Robotic Industries Inc
Differential Revision:	https://reviews.freebsd.org/D14856
2018-04-09 17:24:01 +00:00
Rodney W. Grimes
01d822d33b Add the ability to control the CPU topology of created VMs
from userland without the need to use sysctls, it allows the old
sysctls to continue to function, but deprecates them at
FreeBSD_version 1200060 (Relnotes for deprecate).

The command line of bhyve is maintained in a backwards compatible way.
The API of libvmmapi is maintained in a backwards compatible way.
The sysctl's are maintained in a backwards compatible way.

Added command option looks like:
bhyve -c [[cpus=]n][,sockets=n][,cores=n][,threads=n][,maxcpus=n]
The optional parts can be specified in any order, but only a single
integer invokes the backwards compatible parse.  [,maxcpus=n] is
hidden by #ifdef until kernel support is added, though the api
is put in place.

bhyvectl --get-cpu-topology option added.

Reviewed by:	grehan (maintainer, earlier version),
Reviewed by:	bcr (manpages)
Approved by:	bde (mentor), phk (mentor)
Tested by:	Oleg Ginzburg <olevole@olevole.ru> (cbsd)
MFC after:	1 week
Relnotes:	Y
Differential Revision:	https://reviews.freebsd.org/D9930
2018-04-08 19:24:49 +00:00
Brooks Davis
1a449272a3 Fix LINT (and static COMPAT_LINUX32) after r332122. 2018-04-08 17:10:32 +00:00
Konstantin Belousov
e55d32b7b3 Handle Skylake-X errata SKZ63.
SKZ63 Processor May Hang When Executing Code In an HLE Transaction
Region

Problem: Under certain conditions, if the processor acquires an HLE
(Hardware Lock Elision) lock via the XACQUIRE instruction in the Host
Physical Address range between 40000000H and 403FFFFFH, it may hang
with an internal timeout error (MCACOD 0400H) logged into
IA32_MCi_STATUS.

Move the pages from the range into the blacklist.  Add a tunable to
not waste 4M if local DoS is not the issue.

Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D15001
2018-04-07 17:06:13 +00:00
John Baldwin
fc276d92ae Add a way to temporarily suspend and resume virtual CPUs.
This is used as part of implementing run control in bhyve's debug
server.  The hypervisor now maintains a set of "debugged" CPUs.
Attempting to run a debugged CPU will fail to execute any guest
instructions and will instead report a VM_EXITCODE_DEBUG exit to
the userland hypervisor.  Virtual CPUs are placed into the debugged
state via vm_suspend_cpu() (implemented via a new VM_SUSPEND_CPU ioctl).
Virtual CPUs can be resumed via vm_resume_cpu() (VM_RESUME_CPU ioctl).

The debug server suspends virtual CPUs when it wishes them to stop
executing in the guest (for example, when a debugger attaches to the
server).  The debug server can choose to resume only a subset of CPUs
(for example, when single stepping) or it can choose to resume all
CPUs.  The debug server must explicitly mark a CPU as resumed via
vm_resume_cpu() before the virtual CPU will successfully execute any
guest instructions.

Reviewed by:	avg, grehan
Tested on:	Intel (jhb), AMD (avg)
Differential Revision:	https://reviews.freebsd.org/D14466
2018-04-06 22:03:43 +00:00
Brooks Davis
6469bdcdb6 Move most of the contents of opt_compat.h to opt_global.h.
opt_compat.h is mentioned in nearly 180 files. In-progress network
driver compabibility improvements may add over 100 more so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.

Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c.  A fake _COMPAT_LINUX option ensure opt_compat.h
is created on all architectures.

Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.

Reviewed by:	kib, cem, jhb, jtl
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14941
2018-04-06 17:35:35 +00:00
Jonathan T. Looney
6a740d0bcf Pat the watchdog less while producing a coredump. Prior to this change,
we patted the watchdog approximately once per 4KB page of memory.  After
this change, we pat the watchdog approximately once per 128MB of memory.
On a sample machine, this translated to patting the watchdog approximately
every 5.4 seconds, which "seems reasonable". We can choose a different
value in the future, if warranted.

This has extensive field experience. It is a performance improvement, and
has not caused any known problems.

Reviewed by:	imp, kib
Sponsored by:	Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D14988
2018-04-06 17:06:22 +00:00
Roger Pau Monné
e0f92f5c77 x86: fix trampoline memory allocation after r332073
Add the missing breaks in the for loops, in order to exit the loop
when a suitable entry is found.

Also switch amd64 native_start_all_aps to use PHYS_TO_DMAP in order to
find the virtual address of the boot_trampoline and the initial page
tables.

Reported and tested by:	pho
Sponsored by:		Citrix Systems R&D
2018-04-06 16:22:14 +00:00
Roger Pau Monné
444c6d6f03 remove GiB/MiB macros from param.h
And instead define them in the files where they are used.

Requested by: bde
2018-04-06 11:20:06 +00:00
Roger Pau Monné
9dba82a442 x86: improve reservation of AP trampoline memory
So that it doesn't rely on physmap[1] containing an address below
1MiB. Instead scan the full physmap and search for a suitable address
to place the trampoline code (below 1MiB) and the initial memory pages
(below 4GiB).

Sponsored by:		Citrix Systems R&D
Reviewed by:		kib
Differential Revision:	https://reviews.freebsd.org/D14878
2018-04-05 14:39:51 +00:00
Konstantin Belousov
2d7e563c39 Fix ERESTART for lcall $7,$0 syscalls.
The lcall trampoline enters kernel by int $0x80, which sets up invalid
length of the instruction for %rip rewind.

Reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-04-05 11:03:21 +00:00
Konstantin Belousov
f407f5fb88 Make the INTO instruction operational in 32bit mode.
Having the IDT entry specify ring 0 DPL caused delivery of #GP instead
of #OF.

The instruction is not valid in 64bit mode, which probably explains
why the IDT entry for #OF was initially set this way.  It is
interesting to note that the BOUND instruction works with the IDT #BR
entry DPL 0, most likely CPU considers #BR from BOUND as generated by
a machine, not user.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-04-05 11:03:05 +00:00
Andriy Gapon
3da25bdb02 fix i386 build with CPU_ELAN (LINT for instance) after r331878
x86/cpu_machdep.c now needs to include elan_mmcr.h when CPU_ELAN is set.
While here, also remove the now unneeded inclusion of isareg.h in i386
and amd64 vm_machdep.c.

Reported by:	lwhsu
MFC after:	14 days
X-MFC with:	r331878
2018-04-03 17:16:06 +00:00
Andriy Gapon
8428d0f154 unify amd64 and i386 cpu_reset() in x86/cpu_machdep.c
Because I didn't see any reason not too.
I've been making some changes to the code and couldn't help but notice
that the i386 and am64 code was nearly identical.

MFC after:	17 days
2018-04-02 13:45:23 +00:00
Andriy Gapon
ace498d81e x86 cpu_reset: if failed to switch to BSP proceed to cpu_reset_real
If cpu_reset() is called on an AP and if it somehow fails to wake the
BSP, then it's better to attempt the reset on the AP than just sit there
spinning on an unusable and undebuggable system.

MFC after:	16 days
2018-04-02 08:06:18 +00:00
Andriy Gapon
5d29acd810 x86 cpu_reset_proxy: no need to stop_cpus() the original processor
The processor is "parked" in a spin-loop already and that's sufficient
for the reset.  There is nothing that stop_cpus() would add here, only
extra complexity and fragility.
The original processor does not need to enable interrupts now, in fact,
it must not do that.

MFC after:	2 weeks
2018-04-02 07:45:13 +00:00
Kenneth D. Merry
ef270ab1b6 Bring in the Broadcom/Emulex Fibre Channel driver, ocs_fc(4).
The ocs_fc(4) driver supports the following hardware:

Emulex 16/8G FC GEN 5 HBAS
	LPe15004 FC Host Bus Adapters
	LPe160XX FC Host Bus Adapters

Emulex 32/16G FC GEN 6 HBAS
	LPe3100X FC Host Bus Adapters
	LPe3200X FC Host Bus Adapters

The driver supports target and initiator mode, and also supports FC-Tape.

Note that the driver only currently works on little endian platforms.  It
is only included in the module build for amd64 and i386, and in GENERIC
on amd64 only.

Submitted by:	Ram Kishore Vegesna <ram.vegesna@broadcom.com>
Reviewed by:	mav
MFC after:	5 days
Relnotes:	yes
Sponsored by:	Broadcom
Differential Revision:	https://reviews.freebsd.org/D11423
2018-03-30 15:28:25 +00:00
Jeff Roberson
27a3c9d710 Restore r331606 with a bugfix to setup cpuset_domain[] earlier on all
platforms.  Original commit message as follows:

Only use CPUs in the domain the device is attached to for default
assignment.  Device drivers are able to override the default assignment
if they bind directly.  There are severe performance penalties for
handling interrupts on remote CPUs and this should only be done in
very controlled circumstances.

Reviewed by:    jhb, kib
Tested by:      pho
Sponsored by:   Netflix, Dell/EMC Isilon
Differential Revision:  https://reviews.freebsd.org/D14838
2018-03-28 18:47:35 +00:00
John Baldwin
dbb4ba297b Fix kernel builds without options DDB after r331650.
Reported by:	cy
2018-03-28 16:24:56 +00:00
John Baldwin
d41e41f9f0 Remove very old and unused signal information codes.
These have been supplanted by the MI signal information codes in
<sys/signal.h> since 7.0.  The FPE_*_TRAP ones were deprecated even
earlier in 1999.

PR:		226579 (exp-run)
Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D14637
2018-03-27 20:57:51 +00:00
Jeff Roberson
261c408744 Backout r331606 until I can identify why it does not boot on some
machines.
2018-03-27 10:20:50 +00:00
Jeff Roberson
a48de40bcc Only use CPUs in the domain the device is attached to for default
assignment.  Device drivers are able to override the default assignment
if they bind directly.  There are severe performance penalties for
handling interrupts on remote CPUs and this should only be done in
very controlled circumstances.

Reviewed by:	jhb, kib
Tested by:	pho (earlier version)
Sponsored by:	Netflix, Dell/EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14838
2018-03-27 03:37:04 +00:00
Konstantin Belousov
a37d4032ed Improve the lcall $7,$0 syscall emulation on amd64.
Current code, which copies the potential syscall arguments into the
current frame, puts an arbitrary limit on the number of syscall
arguments.  Apparently, mmap(2) and lseek(2) (?) require larger
number.  But there is an issue that stack is only need to be mapped to
contain the number of arguments required by the syscall, so copying
arbitrary large number of words from the stack is not completely safe.

Use different approach to convert lcall frame into int $0x80 frame in
place, by doing the retl in kernel.  This also allows to stop proceed
vfork case specially, and stop making assumptions about %cs at the
syscall time.

Also, improve comments with the formulations provided by bde.

Reviewed and tested by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-03-24 12:57:58 +00:00
Jonathan T. Looney
e24e568336 Make the TCP blackbox code committed in r331347 be an optional feature
controlled by the TCP_BLACKBOX option.

Enable this as part of amd64 GENERIC. For now, leave it disabled on
other platforms.

Sponsored by:	Netflix, Inc.
2018-03-24 12:48:10 +00:00
Ed Maste
f8268d4d97 Remove redundant cast from Linuxulator SYSINITs 2018-03-23 20:32:54 +00:00
Ed Maste
ad448975e6 Fixup return style(9) in amd64 linux*_sysvec.c
Sponsored by:	Turing Robotic Industries Inc.
2018-03-23 17:28:04 +00:00
Ed Maste
c0aa0e2c27 Sort headers in MD Linuxulator files
Bring #includes closer to style(9) and reduce differences between the
(three) MD versions of linux_machdep.c and linux_sysvec.c.

Sponsored by:	Turing Robotic Industries Inc.
2018-03-23 17:16:36 +00:00
Konstantin Belousov
54f30ad961 Fixes for ptrace(PT_GETXSTATE_INFO) related to the padding in struct
ptrace_xstate_info).

struct ptrace_xstate_info has 64bit member but ends up with 32bit
one. As result, on amd64 there is a 32bit padding at the end, but not
on i386.

We must clear the padding before doing the copyout. For compat32 case,
we must copyout the structure which does not have the padding at the
end.  The later fixes 32bit gdb display of the YMM registers when
running on amd64 kernel.

Reported by:	Vlad Tsyrklevich
Reviewed by:	brooks (previous version)
Sponsored by:	The FreeBSD Foundation
admbugs:	765
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D14794
2018-03-22 20:44:27 +00:00
Kyle Evans
ad456dd9fa Re-work efidev ordering to fix efirt preloaded by loader on amd64
On amd64, efi_enter calls fpu_kern_enter(). This may not be called until
fpuinitstate has been invoked, resulting in a kernel panic with
efirt_load="YES" in loader.conf(5).

Move fpuinitstate a little earlier in SI_SUB_DRIVERS so that we can squeeze
efirt between it and efirtc at SI_SUB_DRIVERS, SI_ORDER_ANY. efidev must be
after efirt and doesn't really need to be at SI_SUB_DEVFS, so drop it at
SI_SUB_DRIVER, SI_ORDER_ANY.

The not immediately obvious dependency of fpuinitstate by efirt has been
noted in both places.

Discussed with:	kib, andrew
Reported by:	Jakob Alvermark <jakob@alvermark.net>
X-MFC-With:	r330868
2018-03-22 18:24:00 +00:00
Ed Maste
1ac2776bbb Share Linux errno table with libsysdecode
Requested by:	jhb
Reviewed by:	jhb
Sponsored by:	Turing Robotic Industries Inc.
2018-03-22 12:58:49 +00:00
Konstantin Belousov
8fbcc3343f Move the CR0.WP manipulation KPI to x86.
This should allow to avoid some #ifdefs in the common x86/ code.

Requested by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-03-20 20:20:49 +00:00
Ed Maste
b7d779b3e5 Make linuxulator fn declaration match definition
I accidentally swapped 'linux_fixup_elf' to 'linux_elf_fixup' in amd64's
declaration (only),  while bringing this change over from git and
encountering a conflict.
2018-03-20 19:28:52 +00:00
Ed Maste
fc2a8776a2 Rename assym.s to assym.inc
assym is only to be included by other .s files, and should never
actually be assembled by itself.

Reviewed by:	imp, bdrewery (earlier)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D14180
2018-03-20 17:58:51 +00:00
Konstantin Belousov
9cffc92c62 Disable write protection around patching of XSAVE instruction in the
context switch code.

Some BIOSes give control to the OS with CR0.WP already set, making the
kernel text read-only before cpu_startup().

Reported by:	Peter Lei <peter.lei@ieee.org>
Reviewed by:	jtl
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D14768
2018-03-20 17:47:29 +00:00
Konstantin Belousov
2337dc6430 Provide KPI for handling of rw/ro kernel text.
This is a pure syntax patch to create an interface to enable and later
restore write access to the kernel text and other read-only mapped
regions.  It is in line with e.g. vm_fault_disable_pagefaults() by
allowing the nesting.

Discussed with:	Peter Lei <peter.lei@ieee.org>
Reviewed by:	jtl
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D14768
2018-03-20 17:43:50 +00:00
Ed Maste
dc85846736 Rename linuxulator functions with linux_ prefix
It's preferable to have a consistent prefix.  This also reduces
differences between the three linux*_sysvec.c files.

Sponsored by:	Turing Robotic Industries Inc.
2018-03-19 21:26:32 +00:00
Ed Maste
9bec2ea66e linux*_sysvec.c: rationalize whitespace and comments
There's a fair amount of duplication between MD linuxulator files.
Make indentation and comments consistent between the three versions of
linux_sysvec.c to reduce diffs when comparing them.

Sponsored by:	Turing Robotic Industries Inc.
2018-03-19 15:11:10 +00:00
Ed Maste
6e481f83f7 Share a single bsd-linux errno table across MD consumers
Three copies of the linuxulator linux_sysvec.c contained identical
BSD to Linux errno translation tables, and future work to support other
architectures will also use the same table.  Move the table to a common
file to be used by all.  Make it 'const int' to place it in .rodata.

(Some existing Linux architectures use MD errno values, but x86 and Arm
share the generic set.)

This change should introduce no functional change; a followup will add
missing errno values.

MFC after:	3 weeks
Sponsored by:	Turing Robotic Industries Inc.
Differential Revision:	https://reviews.freebsd.org/D14665
2018-03-16 14:46:38 +00:00
Ed Maste
7b194b3d3b Remove stray ; at end of linux_vdso_deinstall() 2018-03-14 13:20:36 +00:00
Kyle Evans
63ee68c220 EFIRT: SetVirtualAddressMap with 1:1 mapping after exiting boot services
This fixes a problem encountered on the Lenovo Thinkpad X220/Yoga 11e where
runtime services would try to inexplicably jump to other parts of memory
where it shouldn't be when attempting to enumerate EFI vars, causing a
panic.

The virtual mapping is enabled by default and can be disabled by setting
efi_disable_vmap in loader.conf(5).

Reviewed by:	kib (earlier version)
MFC after:	3 weeks
Differential Revision:	https://reviews.freebsd.org/D14677
2018-03-13 17:10:52 +00:00
Ed Maste
a95659f75f Use C99 boolean type for translate_osrel
Migrate to modern types before creating MD Linuxolator bits for new
architectures.

Reviewed by:	cem
Sponsored by:	Turing Robotic Industries Inc.
Differential Revision:	https://reviews.freebsd.org/D14676
2018-03-13 16:40:29 +00:00
Ed Maste
4ba257591b Apply some style(9) to Linuxulator linux_sysvec.c comments 2018-03-13 00:40:05 +00:00
Ian Lepore
c7053bbe54 Revert r330780, it was improperly tested and results in taking a spin
mutex before acquiring sleep mutexes.

Reported by:	kib@
2018-03-11 20:13:15 +00:00
Ian Lepore
86051be993 Eliminate atrtc_time_lock, and use atrtc_lock for efirtc locking. 2018-03-11 19:22:58 +00:00
Tycho Nightingale
490768e24a Fix a lock recursion introduced in r327065.
Reported by:	kmacy
Reviewed by:	grehan, jhb
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14548
2018-03-07 18:03:22 +00:00
Jonathan T. Looney
beb2406556 amd64: Protect the kernel text, data, and BSS by setting the RW/NX bits
correctly for the data contained on each memory page.

There are several components to this change:
 * Add a variable to indicate the start of the R/W portion of the
   initial memory.
 * Stop detecting NX bit support for each AP.  Instead, use the value
   from the BSP and, if supported, activate the feature on the other
   APs just before loading the correct page table.  (Functionally, we
   already assume that the BSP and all APs had the same support or
   lack of support for the NX bit.)
 * Set the RW and NX bits correctly for the kernel text, data, and
   BSS (subject to some caveats below).
 * Ensure DDB can write to memory when necessary (such as to set a
   breakpoint).
 * Ensure GDB can write to memory when necessary (such as to set a
   breakpoint).  For this purpose, add new MD functions gdb_begin_write()
   and gdb_end_write() which the GDB support code can call before and
   after writing to memory.

This change is not comprehensive:
 * It doesn't do anything to protect modules.
 * It doesn't do anything for kernel memory allocated after the kernel
   starts running.
 * In order to avoid excessive memory inefficiency, it may let multiple
   types of data share a 2M page, and assigns the most permissions
   needed for data on that page.

Reviewed by:	jhb, kib
Discussed with:	emaste
MFC after:	2 weeks
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D14282
2018-03-06 14:28:37 +00:00
Jonathan T. Looney
99e159dcf6 We shouldn't need to execute code in the recursive page table mappings;
therefore, it should be safe to set the NX bit on the PML4E for the
recursive page table mappings.  According to the Intel docs, the effect
of the NX bit should propogate to any page reached through a PML4E which
has the NX bit set.

Reviewed by:	kib, markj
MFC after:	2 weeks
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D14333
2018-03-05 15:12:35 +00:00
Jonathan T. Looney
66ce8430aa Prior to r329071, pmap_bootstrap() used pmap_kmem_choose() to round the
first available virtual address to a 2MB boundary. After r329071,
create_pagetables() rounds firstaddr up to a 2MB boundary. This ensures
the kernel is mapped in super-pages, which is the point of the logic
in pmap_kmem_choose(). Therefore, it is no longer necessary for
pmap_bootstrap() to round up to the 2MB boundary again.

As pmap_bootstrap() was the only user of pmap_kmem_choose(), we can
delete pmap_kmem_choose().

Reviewed by:	kib
MFC after:	2 weeks
X-MFC-with:	r329071
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D14355
2018-03-05 15:10:17 +00:00
Anish Gupta
9363000dfe Move the new AMD-Vi IVHD [ACPI_IVRS_HARDWARE_NEW]definitions added in r329360 in contrib ACPI to local files till ACPI code adds new definitions reported by jkim.
Rename ACPI_IVRS_HARDWARE_NEW to ACPI_IVRS_HARDWARE_EFRSUP, since new definitions add Extended Feature Register support.  Use IvrsType to distinguish three types of IVHD - 0x10(legacy), 0x11 and 0x40(with EFR). IVHD 0x40 is also called mixed type since it supports HID device entries.
Fix 2 coverity bugs reported by cem.

Reported by:jkim, cem
Approved by:grehan
Differential Revision://reviews.freebsd.org/D14501
2018-03-05 02:28:25 +00:00
Konstantin Belousov
8c8ee2ee1c Unify bulk free operations in several pmaps.
Submitted by:	Yoshihiro Ota
Reviewed by:	markj
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D13485
2018-03-04 20:53:20 +00:00
Andriy Gapon
ae0eceab56 db_nextframe/amd64: catch up with r328083 to recognize fast_syscall_common
Since that change the system call stack traces look like this:
  ...
  sys___sysctl() at sys___sysctl+0x5f/frame 0xfffffe0028e13ac0
  amd64_syscall() at amd64_syscall+0x79b/frame 0xfffffe0028e13bf0
  fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe0028e13bf0
So, db_nextframe() stopped recognizing the system call frame.
This commit should fix that.

Reviewed by:	kib
MFC after:	4 days
2018-03-03 15:10:37 +00:00
Ravi Pokala
24f93aa05f imcsmb(4): Intel integrated Memory Controller (iMC) SMBus controller driver
imcsmb(4) provides smbus(4) support for the SMBus controller functionality
in the integrated Memory Controllers (iMCs) embedded in Intel Sandybridge-
Xeon, Ivybridge-Xeon, Haswell-Xeon, and Broadwell-Xeon CPUs. Each CPU
implements one or more iMCs, depending on the number of cores; each iMC
implements two SMBus controllers (iMC-SMBs).

*** IMPORTANT NOTE ***
Because motherboard firmware or the BMC might try to use the iMC-SMBs for
monitoring DIMM temperatures and/or managing an NVDIMM, the driver might
need to temporarily disable those functions, or take a hardware interlock,
before using the iMC-SMBs. Details on how to do this may vary from board to
board, and the procedure may be proprietary. It is strongly suggested that
anyone wishing to use this driver contact their motherboard vendor, and
modify the driver as described in the manual page and in the driver itself.
(For what it's worth, the driver as-is has been tested on various SuperMicro
motherboards.)

Reviewed by:	avg, jhb
MFC after:	1 week
Relnotes:	yes
Sponsored by:	Panasas
Differential Revision:	https://reviews.freebsd.org/D14447
Discussed with:	avg, ian, jhb
Tested by:	allanjude (previous version), Panasas
2018-03-03 01:53:51 +00:00
Ed Maste
023b850b62 Rationalize license text on Linuxolator files
Many licenses on Linuxolator files contained small variations from the
standard FreeBSD license text.  To avoid license proliferation switch to
the standard 2-clause FreeBSD license for those files where I have
permission from each of the listed copyright holders.  Additional files
still waiting on permission from others are listed in review D14210.

Approved by:    dchagin, rdivacky, sos
MFC after:	1 week
MFC with:	r329370
Sponsored by:	The FreeBSD Foundation
2018-03-01 13:52:18 +00:00
John Baldwin
5f8754c077 Add a new variant of the GLA2GPA ioctl for use by the debug server.
Unlike the existing GLA2GPA ioctl, GLA2GPA_NOFAULT does not modify
the guest.  In particular, it does not inject any faults or modify
PTEs in the guest when performing an address space translation.

This is used by bhyve's debug server to read and write memory for
the remote debugger.

Reviewed by:	grehan
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D14075
2018-02-26 19:19:05 +00:00
Patrick Kelsey
18a7530938 Greatly reduce the number of #ifdefs supporting the TCP_RFC7413 kernel option.
The conditional compilation support is now centralized in
tcp_fastopen.h and tcp_var.h. This doesn't provide the minimum
theoretical code/data footprint when TCP_RFC7413 is disabled, but
nearly all the TFO code should wind up being removed by the optimizer,
the additional footprint in the syncache entries is a single pointer,
and the additional overhead in the tcpcb is at the end of the
structure.

This enables the TCP_RFC7413 kernel option by default in amd64 and
arm64 GENERIC.

Reviewed by:	hiren
MFC after:	1 month
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D14048
2018-02-26 03:03:41 +00:00
Jung-uk Kim
0ef8c0cb57 Partially revert r197863 to reduce diff against i386.
When I wrote the patch, I wanted to remove SYSINIT() usage from amd64 code.
There is no reason to keep the divergence any more because iwasaki merged
most amd64 suspend/resume code to i386 with r235622.  Note this also fixed
an enge case reported by royger. [1]

Suggested by:		jhb
Reviewed by:		royger
Tested by:		royger [1]
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D14400 [1]
2018-02-24 01:24:57 +00:00
Conrad Meyer
849ce31a82 Remove unused error return from API that cannot fail
No implementation of fpu_kern_enter() can fail, and it was causing needless
error checking boilerplate and confusion. Change the return code to void to
match reality.

(This trivial change took nine days to land because of the commit hook on
sys/dev/random.  Please consider removing the hook or otherwise lowering the
bar -- secteam never seems to have free time to review patches.)

Reported by:	Lachlan McIlroy <Lachlan.McIlroy AT isilon.com>
Reviewed by:	delphij
Approved by:	secteam (delphij)
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14380
2018-02-23 20:15:19 +00:00
Ed Maste
716cfaab96 Use linux types for linux-specific syscalls
Sponsored by:	Turing Robotic Industries Inc.
Differential Revision:	https://reviews.freebsd.org/D14065
2018-02-23 19:09:27 +00:00
Ed Maste
315fbaeca2 Correct pseudo misspelling in sys/ comments
contrib code and #define in intel_ata.h unchanged.
2018-02-23 18:15:50 +00:00
Ed Maste
a0409b6f36 Remove accidental vim droppings
Reported by:	cy
2018-02-22 03:37:01 +00:00
Ed Maste
eae594f7d5 Correct proper nouns in the Linuxulator
- Capitalize Linux
- Spell FreeBSD out in full
- Address some style(9) on changed lines

Sponsored by:	Turing Robotic Industries Inc.
2018-02-22 02:24:17 +00:00
John Baldwin
4f8666989a Add two new ioctls to bhyve for batch register fetch/store operations.
These are a convenience for bhyve's debug server to use a single
ioctl for 'g' and 'G' rather than a loop of individual get/set
ioctl requests.

Reviewed by:	grehan
MFC after:	2 months
Differential Revision:	https://reviews.freebsd.org/D14074
2018-02-22 00:39:25 +00:00
Konstantin Belousov
2c0f13aa59 vm_wait() rework.
Make vm_wait() take the vm_object argument which specifies the domain
set to wait for the min condition pass.  If there is no object
associated with the wait, use curthread' policy domainset.  The
mechanics of the wait in vm_wait() and vm_wait_domain() is supplied by
the new helper vm_wait_doms(), which directly takes the bitmask of the
domains to wait for passing min condition.

Eliminate pagedaemon_wait().  vm_domain_clear() handles the same
operations.

Eliminate VM_WAIT and VM_WAITPFAULT macros, the direct functions calls
are enough.

Eliminate several control state variables from vm_domain, unneeded
after the vm_wait() conversion.

Scetched and reviewed by:	jeff
Tested by:	pho
Sponsored by:	The FreeBSD Foundation, Mellanox Technologies
Differential revision:	https://reviews.freebsd.org/D14384
2018-02-20 10:13:13 +00:00
Ed Maste
0ba1b36553 Rationalize license text on Linuxolator files
Many licenses on Linuxolator files contained small variations from the
standard FreeBSD license text.  To avoid license proliferation switch to
the standard 2-clause FreeBSD license for those files where I have
permission from each of the listed copyright holders.  Additional files
waiting on permission from others are listed in review D14210.

Approved by:	kan, marcel, sos, rdivacky
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2018-02-16 15:00:14 +00:00
Konstantin Belousov
13cad9af82 Use local symbol for offset.
Small global symbols confuse ddb which matches them against small
unrelated displacements and makes the disassembly ugly.

Reported by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-02-16 13:32:46 +00:00
Andriy Gapon
7b394c1066 move vintr_intercept_enabled under INVARIANTS
The function is not used outside of INVARIANTS since r328622.

MFC after:	1 week
2018-02-16 07:02:14 +00:00
Anish Gupta
0b37d3d90e This change fixes duplicate detection of same IOMMU/AMD-Vi device for Ryzen with EFR support.
IVRS can have entry of type legacy and non-legacy present at same time for same AMD-Vi device. ivhd driver will ignore legacy if new IVHD type is present as specified in AMD-Vi specification. Earlier both of IVHD entries used and two ivhd devices were created.
Add support for new IVHD type 0x11 and 0x40 in ACPI. Create new struct of type acpi_ivrs_hardware_new for these new type of IVHDs. Legacy type 0x10 will continue to use acpi_ivrs_hardware.

Reviewed by:	avg
Approved by:	grehan
Differential Revision:https://reviews.freebsd.org/D13160
2018-02-16 05:17:00 +00:00
Jung-uk Kim
ea4fe1da62 Change size of padding to reflect reality. No functional change.
Discussed with:		kib
2018-02-15 20:42:38 +00:00
Conrad Meyer
5bd0149714 x86 pmap: Make memory mapped via pmap_qenter() non-executable
The idea is, the pmap_qenter() API is now defined to not produce executable
mappings.  If you need executable mappings, use another API.

Add pg_nx flag in pmap_qenter on x86 to make kernel pages non-executable.

Other architectures that support execute-specific permissons on page table
entries should subsequently be updated to match.

Submitted by:	Darrick Lew <darrick.freebsd AT gmail.com>
Reviewed by:	markj
Discussed with:	alc, jhb, kib
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14062
2018-02-14 23:35:47 +00:00
Ed Maste
83ab0f9f33 amd64/pmap: Move Foundation copyright to the 2-clause section
Sponsored by:	The FreeBSD Foundation
2018-02-13 19:19:26 +00:00
Hans Petter Selasky
33ec1ccbae Import the mthca kernel side infiniband driver from Linux 4.9 and fix
compilation under FreeBSD. The mthca driver was temporarily removed as
part of the Linux 4.9 RoCE/infinband upgrade.

Top commit in Linux source tree:
69973b830859bc6529a7a0468ba0d80ee5117826

Sponsored by:	Mellanox Technologies
2018-02-13 17:04:34 +00:00
Jeff Roberson
e958ad4cf3 Make v_wire_count a per-cpu counter(9) counter. This eliminates a
significant source of cache line contention from vm_page_alloc().  Use
accessors and vm_page_unwire_noq() so that the mechanism can be easily
changed in the future.

Reviewed by:	markj
Discussed with:	kib, glebius
Tested by:	pho (earlier version)
Sponsored by:	Netflix, Dell/EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14273
2018-02-12 22:53:00 +00:00
Jonathan T. Looney
48fca66157 Mark the pages used for the initial page-table entries as wired. This
makes them consistent with the way other page-table pages are allocated.
It also provides the rest of the VM system a good clue that these pages
are used.

Reviewed by:	alc, kib, markj
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D14269
2018-02-12 17:27:50 +00:00
Warner Losh
982e7bdafc We don't support gcc < 4.2.1, so varargs.h now is just #error
always. Unifdef for versions prior to 4.2.1 and remove now-unused
header files.

Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D14323
2018-02-12 14:48:14 +00:00
Tycho Nightingale
58a6aaf7ec Provide further mitigation against CVE-2017-5715 by flushing the
return stack buffer (RSB) upon returning from the guest.

This was inspired by this linux commit:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/arch/x86/kvm?id=117cc7a908c83697b0b737d15ae1eb5943afe35b

Reviewed by:	grehan
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14272
2018-02-12 14:45:27 +00:00
Jonathan T. Looney
31ba4c7b5b On bootup, the amd64 pmap initialization code creates page-table
mappings for the pages used for the kernel and some initial allocations
used for the page table. It maps the kernel and the blocks used for
these initial allocations using 2MB pages.

However, if the kernel does not end on a 2MB boundary, it still maps the
last portion using a 2MB page, but reports that the unused 4K blocks
within this 2MB allocation are free physical blocks. This means that
these same physical blocks could also be mapped elsewhere - for example,
into a user process. Given the proximity to the kernel text and data
area, it seems wise to avoid allowing someone to write data to physical
blocks also mapped into these virtual addresses.

(Note that this isn't a security vulnerability: the direct map makes
most/all memory on the system mapped into kernel space. And, nothing
in the kernel should be trying to access these pages, as the virtual
addresses are unused. It simply seems wise to avoid reusing these
physical blocks while they are mapped to virtual addresses so close
to the kernel text and data area.)

Consequently, let's reserve the physical blocks covered by the
page-table mappings for these initial allocations.

Reviewed by:	kib, markj
MFC after:	2 weeks
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D14268
2018-02-09 17:46:33 +00:00
Mark Johnston
ab7c09f121 Use vm_page_unwire_noq() instead of directly modifying page wire counts.
No functional change intended.

Reviewed by:	alc, kib (previous revision)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D14266
2018-02-08 19:28:51 +00:00
Jeff Roberson
e2068d0bcd Use per-domain locks for vm page queue free. Move paging control from
global to per-domain state.  Protect reservations with the free lock
from the domain that they belong to.  Refactor to make vm domains more
of a first class object.

Reviewed by:    markj, kib, gallatin
Tested by:      pho
Sponsored by:   Netflix, Dell/EMC Isilon
Differential Revision:  https://reviews.freebsd.org/D14000
2018-02-06 22:10:07 +00:00
Ed Maste
8a3b44cfc2 Additional linuxolator whitespace cleanup, missed in r328890 2018-02-05 18:39:06 +00:00
Ed Maste
132f90c660 Linuxolator whitespace cleanup
A version of each of the MD files by necessity exists for each CPU
architecture supported by the Linuxolator.  Clean these up so that new
architectures do not inherit whitespace issues.

Clean up shared Linuxolator files while here.

Sponsored by:	Turing Robotic Industries Inc.
2018-02-05 17:29:12 +00:00
Konstantin Belousov
f7f14d9dea When switching IBRS on, also enable STIBP (Single Thread Indirect
Branch Predictors) mitigation.

DOcument 336996-001 promises that CPUs which implement IBRS but not
STIBP silently ignore setting of the bit instead of trapping.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-01-31 16:56:02 +00:00
Konstantin Belousov
319117fd57 IBRS support, AKA Spectre hardware mitigation.
It is coded according to the Intel document 336996-001, reading of the
patches posted on lkml, and some additional consultations with Intel.

For existing processors, you need a microcode update which adds IBRS
CPU features, and to manually enable it by setting the tunable/sysctl
hw.ibrs_disable to 0.  Current status can be checked in sysctl
hw.ibrs_active.  The mitigation might be inactive if the CPU feature
is not patched in, or if CPU reports that IBRS use is not required, by
IA32_ARCH_CAP_IBRS_ALL bit.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D14029
2018-01-31 14:36:27 +00:00
Andriy Gapon
6a8b7aa424 vmm/svm: post LAPIC interrupts using event injection, not virtual interrupts
The virtual interrupt method uses V_IRQ, V_INTR_PRIO, and V_INTR_VECTOR
fields of VMCB to inject a virtual interrupt into a guest VM.  This
method has many advantages over the direct event injection as it
offloads all decisions of whether and when the interrupt can be
delivered to the guest.  But with a purely software emulated vAPIC the
advantage is also a problem.  The problem is that the hypervisor does
not have any precise control over when the interrupt is actually
delivered to the guest (or a notification about that).  Because of that
the hypervisor cannot update the interrupt vector in IRR and ISR in the
same way as real hardware would.  The hypervisor becomes aware that the
interrupt is being serviced only upon the first VMEXIT after the
interrupt is delivered.  This creates a window between the actual
interrupt delivery and the update of IRR and ISR.  That means that IRR
and ISR might not be correctly set up to the point of the
end-of-interrupt signal.

The described deviation has been observed to cause an interrupt loss in
the following scenario.  vCPU0 posts an inter-processor interrupt to
vCPU1.  The interrupt is injected as a virtual interrupt by the
hypervisor.  The interrupt is delivered to a guest and an interrupt
handler is invoked.  The handler performs a requested action and
acknowledges the request by modifying a global variable.  So far, there
is no VMEXIT and the hypervisor is unaware of the events.  Then, vCPU0
notices the acknowledgment and sends another IPI with the same vector.
The IPI gets collapsed into the previous IPI in the IRR of vCPU1.  Only
after that a VMEXIT of vCPU1 occurs.  At that time the vector is cleared
in the IRR and is set in the ISR.  vCPU1 has vAPIC state as if the
second IPI has never been sent.
The scenario is impossible on the real hardware because IRR and ISR are
updated just before the interrupt handler gets started.

I saw several possibilities of fixing the problem.  One is to intercept
the virtual interrupt delivery to update IRR and ISR at the right
moment.  The other is to deliver the LAPIC interrupts using the event
injection, same as legacy interrupts.  I opted to use the latter
approach for several reasons.  It's equivalent to what VMM/Intel does
(in !VMX case).  It appears to be what VirtualBox and KVM do.  The code
is already there (to support legacy interrupts).

Another possibility was to use a special intermediate state for a vector
after it is injected using a virtual interrupt and before it is known
whether it was accepted or is still pending.
That approach was implemented in https://reviews.freebsd.org/D13828
That method is more complex and does not have any clear advantage.

Please see sections 15.20 and 15.21.4 of "AMD64 Architecture
Programmer's Manual Volume 2: System Programming" (publication 24593,
revision 3.29) for comparison between event injection and virtual
interrupt injection.

PR:		215972
Reported by:	ajschot@hotmail.com, grehan
Tested by:	anish, grehan,  Nils Beyer <nbe@renzel.net>
Reviewed by:	anish, grehan
MFC after:	2 weeks
Differential Revision: https://reviews.freebsd.org/D13780
2018-01-31 11:14:26 +00:00
John Baldwin
05d56d83b6 Ensure 'name' is not NULL before passing to strcmp().
This avoids a nested page fault when obtaining a stack trace in DDB if
the address from the first frame does not resolve to a known symbol.

MFC after:	1 week
Sponsored by:	Chelsio Communications
2018-01-30 23:29:27 +00:00
Bryan Drewery
595109196a Don't use an .OBJDIR for 'make sysent'.
Reported by:	emaste, jhb
Sponsored by:	Dell EMC
2018-01-29 19:14:15 +00:00
Warner Losh
d6b6639713 Add ISA PNP tables to ISA drivers. Fix a few incidental comments.
ACPI ISA PBP tables not tagged, there's bigger issues with them.
2018-01-29 00:22:30 +00:00
Konstantin Belousov
c8f9c1f3d9 Use PCID to optimize PTI.
Use PCID to avoid complete TLB shootdown when switching between user
and kernel mode with PTI enabled.

I use the model close to what I read about KAISER, user-mode PCID has
1:1 correspondence to the kernel-mode PCID, by setting bit 11 in PCID.
Full kernel-mode TLB shootdown is performed on context switches, since
KVA TLB invalidation only works in the current pmap. User-mode part of
TLB is flushed on the pmap activations as well.

Similarly, IPI TLB shootdowns must handle both kernel and user address
spaces for each address.  Note that machines which implement PCID but
do not have INVPCID instructions, cause the usual complications in the
IPI handlers, due to the need to switch to the target PCID temporary.
This is racy, but because for PCID/no-INVPCID we disable the
interrupts in pmap_activate_sw(), IPI handler cannot see inconsistent
state of CPU PCID vs PCPU pmap/kcr3/ucr3 pointers.

On the other hand, on kernel/user switches, CR3_PCID_SAVE bit is set
and we do not clear TLB.

I can imagine alternative use of PCID, where there is only one PCID
allocated for the kernel pmap. Then, there is no need to shootdown
kernel TLB entries on context switch. But copyout(3) would need to
either use method similar to proc_rwmem() to access the userspace
data, or (in reverse) provide a temporal mapping for the kernel buffer
into user mode PCID and use trampoline for copy.

Reviewed by:	markj (previous version)
Tested by:	pho
Discussed with:	alc (some aspects)
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
Differential revision:	https://reviews.freebsd.org/D13985
2018-01-27 11:49:37 +00:00
Edward Tomasz Napierala
28f3d8b2c2 Add SPDX identifiers to linux_ptrace.c and cfumass.c.
MFC after:	2 weeks
2018-01-24 17:04:01 +00:00
Ed Maste
7eb2159f6a Use BSD-2-Clause-FreeBSD license on linux_support.s
These files previously had a 3-clause license and 'THE REGENTS' text.
Switch to standard 2-clause text with kib's approval, and add the SPDX
tag.

Approved by:	kib
2018-01-23 20:35:43 +00:00
Pedro F. Giffuni
ac2fffa4b7 Revert r327828, r327949, r327953, r328016-r328026, r328041:
Uses of mallocarray(9).

The use of mallocarray(9) has rocketed the required swap to build FreeBSD.
This is likely caused by the allocation size attributes which put extra pressure
on the compiler.

Given that most of these checks are superfluous we have to choose better
where to use mallocarray(9). We still have more uses of mallocarray(9) but
hopefully this is enough to bring swap usage to a reasonable level.

Reported by:	wosch
PR:		225197
2018-01-21 15:42:36 +00:00
Konstantin Belousov
c398c14664 Use correct symbol name in r328202.
Sponsored by:	The FreeBSD Foundation
MFC after:	11 days
2018-01-20 18:05:14 +00:00
Konstantin Belousov
3a5e472e17 Use predefined symbol for the CR3.PCID mask.
Sponsored by:	The FreeBSD Foundation
MFC after:	11 days
2018-01-20 17:46:09 +00:00
Roger Pau Monné
50a53194f6 xen: fix IDT setup after PTI
On amd64 the IDT handler was not set correctly when using PTI.

While there also fix the selectors to SEL_KPL.

Obtained from:	kib
MFC with:	r328083
2018-01-20 14:59:37 +00:00
Konstantin Belousov
b4dfc9d7ad PTI: Trap if we returned to userspace with kernel (full) page table
still active.

Map userspace portion of VA in the PTI kernel-mode page table as
non-executable. This way, if we ever miss reloading ucr3 into %cr3 on
the return to usermode, the process traps instead of executing in
potentially vulnerable setup.  Catch the condition of such trap and
verify user-mode %cr3, which is saved by page fault handler.

I peek this trick in some article about Linux implementation.

Reviewed by:	alc, markj (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	12 days
DIfferential revision:	https://reviews.freebsd.org/D13956
2018-01-19 22:10:29 +00:00
Nathan Whitehorn
9a8196ce19 Remove SFBUF_OPTIONAL_DIRECT_MAP and such hacks, replacing them across the
kernel by PHYS_TO_DMAP() as previously present on amd64, arm64, riscv, and
powerpc64. This introduces a new MI macro (PMAP_HAS_DMAP) that can be
evaluated at runtime to determine if the architecture has a direct map;
if it does not (or does) unconditionally and PMAP_HAS_DMAP is either 0 or
1, the compiler can remove the conditional logic.

As part of this, implement PHYS_TO_DMAP() on sparc64 and mips64, which had
similar things but spelled differently. 32-bit MIPS has a partial direct-map
that maps poorly to this concept and is unchanged.

Reviewed by:		kib
Suggestions from:	marius, alc, kib
Runtime tested on:	amd64, powerpc64, powerpc, mips64
2018-01-19 17:46:31 +00:00
Ed Maste
b3327f62f0 Enable KPTI by default on amd64 for non-AMD CPUs
Kernel Page Table Isolation (KPTI) was introduced in r328083 as a
mitigation for the 'Meltdown' vulnerability.  AMD CPUs are not affected,
per https://www.amd.com/en/corporate/speculative-execution:

    We believe AMD processors are not susceptible due to our use of
    privilege level protections within paging architecture and no
    mitigation is required.

Thus default KPTI to off for AMD CPUs, and to on for others.  This may
be refined later as we obtain more specific information on the sets of
CPUs that are and are not affected.

Submitted by:	Mitchell Horne
Reviewed by:	cem
Relnotes:	Yes
Security:	CVE-2017-5754
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D13971
2018-01-19 15:42:34 +00:00
John Baldwin
68fd3b0ef5 Use a dedicated per-CPU stack for machine check exceptions.
Similar to NMIs, machine check exceptions can fire at any time and are
not masked by IF.  This means that machine checks can fire when the
kstack is too deep to hold a trap frame, or at critical sections in
trap handlers when a user %gs is used with a kernel %cs.  Use the same
strategy used for NMIs of using a dedicated per-CPU stack configured
in IST 3.  Store the CPU's pcpu pointer at the stop of the stack so
that the machine check handler can reliably find the proper value for
%gs (also borrowed from NMIs).

This should also fix a similar issue with PTI with a MC# occurring
while the CPU is executing on the trampoline stack.

While here, bypass trap() entirely and just call mca_intr().  This
avoids a bogus call to kdb_reenter() (there's no reason to try to
reenter kdb if a MC# is raised).

Reviewed by:	kib
Tested by:	avg (on AMD without PTI)
Differential Revision:	https://reviews.freebsd.org/D13962
2018-01-18 23:50:21 +00:00
John Baldwin
f36b1fe0bd Remove two no-longer-used labels from the NMI interrupt handler.
Reviewed by:	kib
2018-01-18 22:13:53 +00:00
John Baldwin
7f513d17b2 Adjust branch target in NMI handler for the !PTI case.
In the !PTI case the NMI handler jumped past the instructions that set
%rdi to point to the current PCB, but the target instructions assumed %rdi
were set.

Reviewed by:	kib
Tested by:	pho
2018-01-18 20:12:12 +00:00
Konstantin Belousov
3705dda7e4 Move the kernphys declaration to machine/md_var.h.
Apparently machinde/cpu.h is supposed to contain MD implementations of
MI interfaces.  Also, remove kernphys declaration from machdep.c,
since it is already provided by md_var.h.

Requested and reviewed by:	bde
MFC after:	13 days
2018-01-18 15:15:35 +00:00
Konstantin Belousov
ac97ccbab5 Fix compilation with gcc.
etext is already declared in machine/cpu.h, move kernphys declaration
there too.

Based on the patch by:	bde
MFC after:	13 days
2018-01-18 11:21:03 +00:00
Konstantin Belousov
406bc0da95 Fix compilation with gas.
Submitted by:	bde
MFC after:	13 days
2018-01-18 11:19:58 +00:00
Konstantin Belousov
0d6c61ec30 Remove the 'last' argument from the pmap_pti_free_page().
It is in fact unused.

Noted and reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	13 days
2018-01-18 11:01:41 +00:00
John Baldwin
65eefbe422 Save and restore guest debug registers.
Currently most of the debug registers are not saved and restored
during VM transitions allowing guest and host debug register values to
leak into the opposite context.  One result is that hardware
watchpoints do not work reliably within a guest under VT-x.

Due to differences in SVM and VT-x, slightly different approaches are
used.

For VT-x:

- Enable debug register save/restore for VM entry/exit in the VMCS for
  DR7 and MSR_DEBUGCTL.
- Explicitly save DR0-3,6 of the guest.
- Explicitly save DR0-3,6-7, MSR_DEBUGCTL, and the trap flag from
  %rflags for the host.  Note that because DR6 is "software" managed
  and not stored in the VMCS a kernel debugger which single steps
  through VM entry could corrupt the guest DR6 (since a single step
  trap taken after loading the guest DR6 could alter the DR6
  register).  To avoid this, explicitly disable single-stepping via
  the trace flag before loading the guest DR6.  A determined debugger
  could still defeat this by setting a breakpoint after the guest DR6
  was loaded and then single-stepping.

For SVM:
- Enable debug register caching in the VMCB for DR6/DR7.
- Explicitly save DR0-3 of the guest.
- Explicitly save DR0-3,6-7, and MSR_DEBUGCTL for the host.  Since SVM
  saves the guest DR6 in the VMCB, the race with single-stepping
  described for VT-x does not exist.

For both platforms, expose all of the guest DRx values via --get-drX
and --set-drX flags to bhyvectl.

Discussed with:	avg, grehan
Tested by:	avg (SVM), myself (VT-x)
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D13229
2018-01-17 23:11:25 +00:00
Mark Johnston
9cb26f73ea Annotate a couple of changes from r328083.
Reviewed by:	kib
X-MFC with:	r328083
2018-01-17 21:52:12 +00:00
Konstantin Belousov
bd50262f70 PTI for amd64.
The implementation of the Kernel Page Table Isolation (KPTI) for
amd64, first version. It provides a workaround for the 'meltdown'
vulnerability.  PTI is turned off by default for now, enable with the
loader tunable vm.pmap.pti=1.

The pmap page table is split into kernel-mode table and user-mode
table. Kernel-mode table is identical to the non-PTI table, while
usermode table is obtained from kernel table by leaving userspace
mappings intact, but only leaving the following parts of the kernel
mapped:

    kernel text (but not modules text)
    PCPU
    GDT/IDT/user LDT/task structures
    IST stacks for NMI and doublefault handlers.

Kernel switches to user page table before returning to usermode, and
restores full kernel page table on the entry. Initial kernel-mode
stack for PTI trampoline is allocated in PCPU, it is only 16
qwords.  Kernel entry trampoline switches page tables. then the
hardware trap frame is copied to the normal kstack, and execution
continues.

IST stacks are kept mapped and no trampoline is needed for
NMI/doublefault, but of course page table switch is performed.

On return to usermode, the trampoline is used again, iret frame is
copied to the trampoline stack, page tables are switched and iretq is
executed.  The case of iretq faulting due to the invalid usermode
context is tricky, since the frame for fault is appended to the
trampoline frame.  Besides copying the fault frame and original
(corrupted) frame to kstack, the fault frame must be patched to make
it look as if the fault occured on the kstack, see the comment in
doret_iret detection code in trap().

Currently kernel pages which are mapped during trampoline operation
are identical for all pmaps.  They are registered using
pmap_pti_add_kva().  Besides initial registrations done during boot,
LDT and non-common TSS segments are registered if user requested their
use.  In principle, they can be installed into kernel page table per
pmap with some work.  Similarly, PCPU can be hidden from userspace
mapping using trampoline PCPU page, but again I do not see much
benefits besides complexity.

PDPE pages for the kernel half of the user page tables are
pre-allocated during boot because we need to know pml4 entries which
are copied to the top-level paging structure page, in advance on a new
pmap creation.  I enforce this to avoid iterating over the all
existing pmaps if a new PDPE page is needed for PTI kernel mappings.
The iteration is a known problematic operation on i386.

The need to flush hidden kernel translations on the switch to user
mode make global tables (PG_G) meaningless and even harming, so PG_G
use is disabled for PTI case.  Our existing use of PCID is
incompatible with PTI and is automatically disabled if PTI is
enabled.  PCID can be forced on only for developer's benefit.

MCE is known to be broken, it requires IST stack to operate completely
correctly even for non-PTI case, and absolutely needs dedicated IST
stack because MCE delivery while trampoline did not switched from PTI
stack is fatal.  The fix is pending.

Reviewed by:	markj (partially)
Tested by:	pho (previous version)
Discussed with:	jeff, jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2018-01-17 11:44:21 +00:00
Konstantin Belousov
94b011c4bc Amd64 user_ldt_deref() is not used outside sys_machdep.c. Mark it as
static.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-01-17 11:21:03 +00:00
Pedro F. Giffuni
74641f0bc6 x86: make some use of mallocarray(9).
Focus on code where we are doing multiplications within malloc(9). None of
these ire likely to overflow, however the change is still useful as some
static checkers can benefit from the allocation attributes we use for
mallocarray.

This initial sweep only covers malloc(9) calls with M_NOWAIT. No good
reason but I started doing the changes before r327796 and at that time it
was convenient to make sure the sorrounding code could handle NULL values.

X-Differential revision: https://reviews.freebsd.org/D13837
2018-01-15 21:08:22 +00:00
Tycho Nightingale
91fe5fe7e7 Provide some mitigation against CVE-2017-5715 by clearing registers
upon returning from the guest which aren't immediately clobbered by
the host.  This eradicates any remaining guest contents limiting their
usefulness in an exploit gadget.

This was inspired by this linux commit:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5b6c02f38315b720c593c6079364855d276886aa

Reviewed by:	grehan, rgrimes
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D13573
2018-01-15 18:37:03 +00:00
Konstantin Belousov
5f7b9ff2e3 Add STAC and CLAC instructions wrappers.
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D13838
2018-01-14 12:39:50 +00:00
Jeff Roberson
b6715dab8f Move VM_NUMA_ALLOC and DEVICE_NUMA under the single global config option NUMA.
Sponsored by:	Netflix, Dell/EMC Isilon
Discussed with:	jhb
2018-01-14 03:36:03 +00:00
Jeff Roberson
ab3185d15e Implement NUMA support in uma(9) and malloc(9). Allocations from specific
domains can be done by the _domain() API variants.  UMA also supports a
first-touch policy via the NUMA zone flag.

The slab layer is now segregated by VM domains and is precise.  It handles
iteration for round-robin directly.  The per-cpu cache layer remains
a mix of domains according to where memory is allocated and freed.  Well
behaved clients can achieve perfect locality with no performance penalty.

The direct domain allocation functions have to visit the slab layer and
so require per-zone locks which come at some expense.

Reviewed by:	Attilio (a slightly older version)
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
2018-01-12 23:25:05 +00:00
Konstantin Belousov
c751f90c0c Fix grammar.
Submitted by:	alc
MFC after:	3 days
2018-01-11 16:50:03 +00:00
Konstantin Belousov
6da5c56ae5 Remove redundand CLD instructions.
We already clear %RFLAGS.DF on the kernel entry due to the compiler's
ABI requirements.

Suggested by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2018-01-11 13:22:13 +00:00
Konstantin Belousov
4975c202ac Do not clear %RFLAGS.DF on fast syscall entry.
Hardware already did it for us due to the mask loaded into the
MSR_SF_MASK msr register.

Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D13838
2018-01-11 12:54:33 +00:00
Konstantin Belousov
0f7c159f6b Move the hardware setup for fast syscalls into a common function.
Discussed with:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-01-11 12:40:43 +00:00
Konstantin Belousov
4275e16fa9 Rename COMMON_TSS_RSP0 to TSS_RSP0.
The symbol is just an offset in the hardware TSS structure, it is not
limited to the common_tss instance.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2018-01-11 12:28:08 +00:00
Konstantin Belousov
3ee6e65875 Update comment explaining the check, to reality.
Discussed with:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2018-01-11 12:07:24 +00:00
Conrad Meyer
e6fcf7898d x86: Document purpose of _safe variants of {rd,wr}msr()
Sponsored by:	Dell EMC Isilon
2018-01-10 22:41:00 +00:00
Andriy Gapon
091da2dfa5 vmm/svm: contigmalloc of the whole svm_softc is excessive
This is a followup to r307903.

struct svm_softc takes more than 200 kilobytes while what we really need
is 3 contiguous pages for I/O permission map and 2 contiguous pages for
MSR permission map.  Other physically mapped structures have a size of
a single page, so a proper alignment is sufficient for their correct
mapping.

Thus, only the permission maps are allocated with contigmalloc now,
the softc is allocated with a regular malloc.

Additionally, this commit adds a check that malloc returns memory with the
expected page alignment and that contigmalloc does not fail.
Unfortunately, at present svm_vminit() is expected to always succeed and
there is no way to report an error.
So, a contigmalloc failure leads to a panic.
We should probably fix this.

MFC after:	2 weeks
2018-01-09 14:22:18 +00:00
Konstantin Belousov
0530a9360f Make it possible to re-evaluate cpu_features.
Add cpuctl(4) ioctl CPUCTL_EVAL_CPU_FEATURES which forces re-read of
cpu_features, cpu_features2, cpu_stdext_features, and
std_stdext_features2.

The intent is to allow the kernel to see the changes in the CPU
features after micocode update.  Of course, the update is not atomic
across variables and not synchronized with readers.  See the man page
warning as well.

Reviewed by:	imp (previous version), jilles
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D13770
2018-01-05 21:06:19 +00:00
Andriy Gapon
5f3c7d6580 Fix a couple of comments in AMD Virtual Machine Control Block structure
MFC after:	1 week
2018-01-05 19:15:24 +00:00
Konstantin Belousov
84874cc151 Avoid re-check of usermode condition.
It does not change anything in the behavior of trap_pfault(), while
eliminating obfuscation of jumping to the code which checks for the
condition reversed of the goto cause.  Also avoid force initialize the
rv variable, since it is now only accessed after storing vm_fault()
return value.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D13725
2018-01-01 20:47:03 +00:00
Konstantin Belousov
1865d6b851 Remove MP SAFE marks and stray register name in comments.
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2017-12-31 17:07:59 +00:00
Colin Percival
31a55efdc5 Use the TSLOG framework to record entry/exit timestamps for hammer_time.
The entry must be logged "manually" using TSRAW rather than TSENTER
since PCPU data structures have not yet been initialized and thus
curthread cannot be accessed; &thread0 is what will become curthread
later in hammer_time.

Other MD initialization code should be similarly instrumented in order
to gain visibility into the time spent before entering mi_startup; this
will require some care and testing from people with access to such
hardware.
2017-12-31 09:22:07 +00:00
Eitan Adler
caa7e52f3f kernel: Fix several typos and minor errors
- duplicate words
- typos
- references to old versions of FreeBSD

Reviewed by:	imp, benno
2017-12-27 03:23:21 +00:00
Tycho Nightingale
9e33a61693 Recognize a pending virtual interrupt while emulating the halt instruction.
Reviewed by:	grehan, rgrimes
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D13573
2017-12-21 18:30:11 +00:00
Konstantin Belousov
30d4f9e888 Add atomic_load(9) and atomic_store(9) operations.
They provide relaxed-ordered atomic access semantic.  Due to the
FreeBSD memory model, the operations are syntaxical wrappers around
the volatile accesses.  The volatile qualifier is used to ensure that
the access not optimized out and in turn depends on the volatile
semantic as implemented by supported compilers.

The motivation for adding the operation is to help people coming from
other systems or knowing the C11/C++ standards where atomics have
special type and require use of the special access operations.  It is
still the case that FreeBSD requires plain load and stores of aligned
integer types to be atomic.

Suggested by:	jhb
Reviewed by:	alc, jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D13534
2017-12-19 09:59:20 +00:00
Mark Johnston
5bab623438 Pass the trap frame to fasttrap hooks.
The DTrace fasttrap entry points expect a struct reg containing the
register values of the calling thread. Perform the conversion in
fasttrap rather than in the trap handler: this reduces the number of
ifdefs and avoids wasting stack space for traps that don't involve
DTrace.

MFC after:	2 weeks
2017-12-11 19:21:39 +00:00
Bruce Evans
fb3cc1c37d Move instantiation of msgbufp from 9 MD files to subr_prf.c.
This variable should be pure MI except possibly for reading it in MD
dump routines.  Its initialization was pure MD in 4.4BSD, but FreeBSD
changed this in r36441 in 1998.  There were many imperfections in
r36441.  This commit fixes only a small one, to simplify fixing the
others 1 arch at a time.  (r47678 added support for
special/early/multiple message buffer initialization which I want in
a more general form, but this was too fragile to use because hacking
on the msgbufp global corrupted it, and was only used for 5 hours in
-current...)
2017-12-07 07:55:38 +00:00
Andriy Gapon
a7437a3e9d amd-vi: set iommu msi configuration using pci_enable_msi method
This is better than directly changing PCI configuration space of the
device because it makes the PCI bus aware of the configuration.
Also, the change allows to drop a bunch of code that duplicated
pci_enable_msi() functionality.

I wonder if it's possible to further simplify the code by using
pci_alloc_msi().
2017-12-04 17:10:52 +00:00
Andriy Gapon
df92c28d6a vmm/amd: add ivhd device with a higher order
ivhd should attach after the root PCI bus and, thus, after the ACPI
Host-PCI bridge off which the bus hangs.  This is because ivhd changes
PCI configuration of a PCI IOMMU device that is located on the root bus.
If the bus attaches after ivhd it clears the MSI portion of the
configuration.  As a result IOMMU event interrupts would never be
delivered.

For regular ACPI devices the order is calculated as
    ACPI_DEV_BASE_ORDER + level * 10
where level is a depth of the device in the ACPI namespace.
I expect the depth of the Host-PCI bridge to be two or three,
so ACPI_DEV_BASE_ORDER + 10 * 10 should be a sufficiently safe order
for ivhd.

This should fix the setup of the AMD-Vi event interrupt when vmm is
preloaded (as opposed to kldload-ed).
2017-12-04 17:08:03 +00:00
Andriy Gapon
8f09494d1e amd-vi: clear event interrupt and overflow bits upon handling the interrupt
This ensures that we can receive further event interrupts.
See the description of the bits in the specification for
MMIO Offset 2020h IOMMU Status Register.
The bits are defined as set-by-hardware write-1-to-clear, same as all
the bits in the status register.

Discussed with:	anish
2017-12-04 17:02:53 +00:00
Scott Long
c15269ccb8 It's time to retire AHC_REG_PRETTY_PRINT and AHD_REG_PRETTY_PRINT from
the standard kernels.  They are still available as custom compile
options.
2017-11-29 23:41:49 +00:00
Brooks Davis
5cd667e65f Disable vim syntax highlighting.
Vim's default pick doesn't understand that ';' is a comment character
and the result looks horrible.

Reviewed by:	emaste
2017-11-28 18:23:17 +00:00
Konstantin Belousov
dde5602786 Fix index calculation for the page table pages for efirt 1:1 map.
Stop issuing pre-assigned number to enumerate all page table pages,
the assignment is incorrect.  Instead automatically calculate the next
unused index. This index in fact does not serve any purpose except to
be unique to satisfy vm_page_grab() interface, we do not look up the
page by the index later.

Reported and tested by:	emaste
Reviewed by:	andrew
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
PR:	223906
Differential revision:	https://reviews.freebsd.org/D13273
2017-11-28 09:34:43 +00:00
Fedor Uporov
cd76ee1ee3 Remap ENOATTR to ENODATA in the linuxulator.
In the linux ENOADATA is frequently #defined as ENOATTR.
The change is required for an xattrs support implementation.

MFC after: 1 week
Discussed with: netchild
Approved by: pfg

Differential Revision: https://reviews.freebsd.org/D13221
2017-11-27 17:03:11 +00:00
Pedro F. Giffuni
c49761dd57 sys/amd64: further adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
2017-11-27 15:03:07 +00:00
Ed Schouten
ee13ffbe03 Use TO_PTR() to convert integers to pointers.
For FreeBSD/arm64's cloudabi32 support, I'm going to need a TO_PTR() in
this place. Also use it for all of the other source files, so that the
difference remains as minimal as possible.

MFC after:	2 weeks
2017-11-26 14:45:56 +00:00
Hans Petter Selasky
8a53e1340f Merge ^/head r326132 through r326161. 2017-11-24 12:13:27 +00:00
Andriy Gapon
a3bbbd5e40 amd-vi: a small whitespace cleanup
Reviewed by:	anish
2017-11-24 11:37:41 +00:00
Andriy Gapon
685c54fc6a amd-vi: use correct type for pci_rid, start_dev_rid, end_dev_rid sysctls
Previously, the values could look confusing because of unrelated bits from
adjacent memory.

Reviewed by:	anish
2017-11-24 11:36:35 +00:00
Andriy Gapon
eb6c9c128c amd-vi: small improvements to event printing
Ensure that an opening bracket always has a matching closing one.
Ensure that there is always a new-line at the end of a report line.
Also, add a space before the printed event flag.

Reviewed by:	anish
2017-11-24 11:35:43 +00:00
Andriy Gapon
dee38cdc2a amd-vi: print some additional details for INVALID_DEVICE_REQUEST event
Namely, the type of the hardware event and whether the transaction
was a translation request.

Reviewed by:	anish
2017-11-24 11:34:46 +00:00
Andriy Gapon
53d580f984 amd-vi: fix up r326152, the new width requires a wider type
This is my brain-o from extending the width at the last moment.
2017-11-24 11:25:06 +00:00
Andriy Gapon
5a041f2183 amd-vi: fix and extend definition of Command and Event Status Register (0x2020)
The defined bits are the lower bits, not the higher ones.

Also, the specification has been extended to define bits 0:18 and they
all could potentially be interesting to us, so extend the width of the
field accordingly.

Reviewed by:	anish
2017-11-24 11:20:10 +00:00
Andriy Gapon
8523ad24ba vmm/amd: improve iteration over IVHD (type 10h) entries in IVRS table
Many 8-byte entries have zero at byte 4, so the second 4-byte part is
skipped as a 4-byte padding entry.  But not all 8-byte entries have that
property and they get misinterpreted.

A real example:
    48 00 00 00 ff 01 00 01
This an 8-byte ACPI_IVRS_TYPE_SPECIAL entry for IOAPIC with ID 255 (bogus).
It is reported as:
    ivhd0: Unknown dev entry:0xff
Fortunately, it was completely harmless.

Also, bail out early if we encounter an entry of a variable length type.
We do not have proper handling for those yet.

Reviewed by:	anish
2017-11-24 11:10:36 +00:00
Ed Schouten
814629dd64 Don't let cpu_set_syscall_retval() clobber exec_setregs().
Upon successful completion, the execve() system call invokes
exec_setregs() to initialize the registers of the initial thread of the
newly executed process. What is weird is that when execve() returns, it
still goes through the normal system call return path, clobbering the
registers with the system call's return value (td->td_retval).

Though this doesn't seem to be problematic for x86 most of the times (as
the value of eax/rax doesn't matter upon startup), this can be pretty
frustrating for architectures where function argument and return
registers overlap (e.g., ARM). On these systems, exec_setregs() also
needs to initialize td_retval.

Even worse are architectures where cpu_set_syscall_retval() sets
registers to values not derived from td_retval. On these architectures,
there is no way cpu_set_syscall_retval() can set registers to the way it
wants them to be upon the start of execution.

To get rid of this madness, let sys_execve() return EJUSTRETURN. This
will cause cpu_set_syscall_retval() to leave registers intact. This
makes process execution easier to understand. It also eliminates the
difference between execution of the initial process and successive ones.
The initial call to sys_execve() is not performed through a system call
context.

Reviewed by:	kib, jhibbits
Differential Revision:	https://reviews.freebsd.org/D13180
2017-11-24 07:35:08 +00:00
Hans Petter Selasky
82725ba9bf Merge ^/head r325999 through r326131. 2017-11-23 14:28:14 +00:00
Konstantin Belousov
383f241dce Remove lint support from system headers and MD x86 headers.
Reviewed by:	dim, jhb
Discussed with:	imp
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D13156
2017-11-23 11:40:16 +00:00
Pedro F. Giffuni
51369649b0 sys: further adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
2017-11-20 19:43:44 +00:00
Hans Petter Selasky
937d37fc6c Merge ^/head r325842 through r325998. 2017-11-19 12:36:03 +00:00
Pedro F. Giffuni
df57947f08 spdx: initial adoption of licensing ID tags.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.

Initially, only tag files that use BSD 4-Clause "Original" license.

RelNotes:	yes
Differential Revision:	https://reviews.freebsd.org/D13133
2017-11-18 14:26:50 +00:00
Hans Petter Selasky
55b1c6e7e4 Merge ^/head r325663 through r325841. 2017-11-15 11:28:11 +00:00
Hans Petter Selasky
8dee9a7a44 Remove no longer supported mthca driver.
Sponsored by:	Mellanox Technologies
2017-11-13 10:59:38 +00:00
Mateusz Guzik
ca0227933e amd64: stop nesting preemption counter in spinlock_enter
Discussed with:	jhb
2017-11-12 03:13:01 +00:00
Jeff Roberson
8d6fbbb867 Replace manyinstances of VM_WAIT with blocking page allocation flags
similar to the kernel memory allocator.

This simplifies NUMA allocation because the domain will be known at wait
time and races between failure and sleeping are eliminated.  This also
reduces boilerplate code and simplifies callers.

A wait primitive is supplied for uma zones for similar reasons.  This
eliminates some non-specific VM_WAIT calls in favor of more explicit
sleeps that may be satisfied without new pages.

Reviewed by:	alc, kib, markj
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
2017-11-08 02:39:37 +00:00
Konstantin Belousov
b535ed2898 Zero the structure instead of the pointer to it.
Reported by:	Don Morris <Don.Morris@dell.com>
MFC after:	4 days
2017-11-05 20:03:57 +00:00
Konstantin Belousov
5b9a3721e6 x86: Do not emit unused TD_TID symbols.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-11-04 10:51:52 +00:00
Konstantin Belousov
ad4e4ae591 Restore an optimization that was temporary disabled by r324665.
In reclaim_pv_chunk(), rotate the pv chunks list so that next
invocations of the reclaim do not scan the same pv chunks that could
not be freed.  Only do the rotation when there is no parallel scan,
tracked by active_reclaims counter.

To rotate, move all chunks that are before current iteration marker,
after another marker that is inserted at the list tail on start of the
reclaim.

Reviewed by:	alc
Tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-11-01 18:06:44 +00:00
Konstantin Belousov
aa788cc387 Consistently ensure that we do not load MXCSR with reserved bits set.
Some callers of fpusetregs()/npxsetregs(), most importantly
set_fpcontext(), clear reserved bits.  But some did not.  Do the
clearing in fpusetregs() and remove now redundand operation from
set_fpcontext().

Reported by:	Maxime Villard <max@m00nbsd.net>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-11-01 10:32:44 +00:00
Peter Grehan
9d210a4a18 Emulate the "OR reg, r/m" instruction (opcode 0BH).
This is	needed for the HDA emulation with FreeBSD guests.

Reviewed by:	marcelo
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D12832
2017-11-01 03:26:53 +00:00
Tijl Coosemans
f236378b54 Set the return address for stack entry points to zero.
Stack unwinders treat zero as a stop condition.  The value on the stack can
be non-zero because thread stacks may be arbitrary memory provided via
pthread_attr_setstack(3) or may be recycled from previous threads.

Reference:
https://lists.freebsd.org/pipermail/freebsd-current/2017-August/066855.html
https://lists.freebsd.org/pipermail/freebsd-current/2017-October/067254.html

Discussed with:	kib
MFC after:	1 week
2017-10-31 11:51:34 +00:00
Ian Lepore
5deb1573e8 Improve the performance of the hpet timer in bhyve guests by making the
timer frequency a power of two.  This changes the frequency from 10 to
16.7 MHz (2 ^ 24 HZ).  Using a power of two avoids roundoff errors when
doing arithmetic in sbintime_t units.

Testing shows this can fix erratic ntpd behavior in guests using the
hpet timer (which is the default for multicore guests).

Reported by:	bsam@
2017-10-29 20:50:03 +00:00
Eitan Adler
a2aef24aa3 Update several more URLs
- Primarily http -> https
- Primarily FreeBSD project URLs
2017-10-29 08:17:03 +00:00
John Baldwin
6db55a0f3a Rework pass through changes in r305485 to be safer.
Specifically, devices that do not support PCI-e FLR and were not
gracefully shutdown by the guest OS could continue to issue DMA
requests after the VM was terminated.  The changes in r305485 meant
that those DMA requests were completed against the host's memory which
could result in random memory corruption.  Instead, leave ppt devices
that are not attached to a VM disabled in the IOMMU and only restore
the devices to the host domain if the ppt(4) driver is detached from a
device.

As an added safety belt, disable busmastering for a pass-through device
when before adding it to the host domain during ppt(4) detach.

PR:		222937
Tested by:	Harry Schmalzbauer <freebsd@omnilan.de>
Reviewed by:	grehan
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D12661
2017-10-27 14:57:14 +00:00
Mark Johnston
5fca1d90c1 Fix the VM_NRESERVLEVEL == 0 build.
Add VM_NRESERVLEVEL guards in the pmaps that implement transparent
superpage promotion using reservations.

Reviewed by:	alc, kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D12764
2017-10-23 15:34:05 +00:00
Mateusz Guzik
9e68989764 Make the sleepq chain hash size configurable per-arch and bump on amd64.
While here cache-align chains.

This shortens longest found chain during poudriere -j 80 from 32 to 16.

Pushing this higher up will probably require allocation on boot.
2017-10-22 20:43:50 +00:00
Bjoern A. Zeeb
8e94025b41 With r181803 on 2008-08-17 23:27:27Z the first VIMAGE commit went into
HEAD.  Enable VIMAGE in GENERIC kernels and some others (where GENERIC does
not exist) on HEAD.

Disable building LINT-VIMAGE with VIMAGE being default.

This should give it a lot more exposure in the run-up to 12 to help
us evaluate whether to keep it on by default or not.
We are also hoping to get better performance testing.
The feature can be disabled using nooptions.

Requested by:		many
Reviewed by:		kristof, emaste, hiren
X-MFC after:		never
Relnotes:		yes
Differential Revision:	https://reviews.freebsd.org/D12639
2017-10-20 21:40:59 +00:00
Mateusz Guzik
e66167764a amd64: plug missed dt_lock in cpu_fork 2017-10-20 18:58:11 +00:00
Mateusz Guzik
a5db8ade37 amd64: __exclusive_cache_line pv_chunks_mutex and pv_list_locks
Note that pv_list_locks is an array and currently it fits 2 locks per line.
Resizing it and/or putting more locks in different lines requires several tests.

MFC after:	1 week
2017-10-20 03:38:58 +00:00
Mateusz Guzik
d95498d44f amd64: avoid acquiring dt lock if possible (which is the common case)
Discussed with:	kib
MFC after:	1 week
2017-10-20 03:30:02 +00:00
Mark Johnston
46fcd1af63 Move kernel dump offset tracking into MI code.
All of the kernel dump implementations keep track of the current offset
("dumplo") within the dump device. However, except for textdumps, they
all write the dump sequentially, so we can reduce code duplication by
having the MI code keep track of the current offset. The new
dump_append() API can be used to write at the current offset.

This is needed to implement support for kernel dump compression in the
MI kernel dump code.

Also simplify dump_encrypted_write() somewhat: use dump_write() instead
of duplicating its bounds checks, and get rid of the redundant offset
tracking.

Reviewed by:	cem
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D11722
2017-10-18 15:38:05 +00:00
Konstantin Belousov
ca1f624517 Fix the pv_chunks pc_lru tailq handling in reclaim_pv_chunk().
For processing, reclaim_pv_chunk() removes the pv_chunk from the lru
list, which makes pc_lru linkage invalid.  Then the pmap lock is
released, which allows for other thread to free the last pv entry
allocated from the chunk and call free_pv_chunk(), which tries to
modify the invalid linkage.

Similarly, the chunk is inserted into the private tailq new_tail
temporary.  Again, free_pv_chunk() might be run and corrupt the
linkage for the new_tail after the pmap lock is dropped.

This is a consequence of r299788 elimination of pvh_global_lock, which
allowed for reclaim to run in parallel with other pmap calls which
free pv chunks.

As a fix, do not remove the chunk from pc_lru queue, use a marker to
remember the position in the queue iteration.  We can safely operate
on the chunks after the chunk's pmap is locked, we fetched the chunk
after the marker, and we checked that chunk pmap is same as we have
locked, because chunk removal from pc_lru requires both pv_chunk_mutex
and the pmap mutex owned.

Note that the fix lost an optimization which was present in the
previous algorithm.  Namely, new_tail requeueing rotated the pv chunks
list so that reclaim didn't scan the same pv chunks that couldn't be
freed (because they contained a wired and/or superpage mapping) on
every invocation.  An additional change is planned which would improve
this.

Reported and tested by:	pho
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-16 15:16:24 +00:00
Konstantin Belousov
1df04cc069 Change amd64_get_ldt() to return 'EOF' when the LDT is not yet
allocated, when requested range of descriptors does not fit into
currently allocated LDT, or trim the return if the range fits
partially.  Before, the function returned EINVAL.

Reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-09 16:20:39 +00:00
Mateusz Guzik
801eec865f amd64: remove unused variable from pmap_delayed_invl_genp
Reported by:	gcc
MFC after:	1 week
2017-10-05 18:51:48 +00:00
Konstantin Belousov
a6d4b1dc48 Ensure that after sucessfull i386_set_ldt() call, other threads can
use LDT segments immediately.

If the i386_set_ldt() call created a first LDT descriptor (and
consequently created the LDT) for our address space, LDTR is currently
loaded only on the CPU executing the syscall.  Other CPUs executing
threads sharing the address space, would only load LDTR after context
switch.

Uncomment set_user_ldt_rv() and call it on all CPUs.  Remove critical
section inside set_user_ldt(), it is not needed in the context of call
from smp_rendezvous().

Set md_ldt after md_ldt_sd is initialized using the same code sequence
as in user_ldt_free().  Do the whole initialization in a critical
section, to not race with the context switching while we set LDT.

Discussed with:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-05 13:12:59 +00:00
Konstantin Belousov
78d58cb6bc Avoid a race betweem freeing LDT and context switches.
cpu_switch.S uses curproc->p_md.md_ldt value as the flag indicating
presence of the process LDT.  The flag is checked and then ldt segment
descriptor is copied into the CPU' GDT slot.

Disallow context switches around clearing of the curproc LDT state by
performing the cleanup in critical section.  Ensure that the md_ldt
flag is cleared before md_ldt_sd descriptor content is destroyed by
inserting fence between the operations.

We depend on the x86 memory model strong ordering guarantees, in
particular, that cpu_switch.S observes the writes to md_ldt and
md_ldt_sd in the expected order.

Discussed with:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-05 12:50:03 +00:00
Konstantin Belousov
287c718f32 Improve amd64_get_ldt().
Provide consistent snapshot of the requested descriptors by preventing
other threads from modifying LDT while we fetch the data, lock dt_lock
around the read.  Copy the data into intermediate buffer, which is
copied out after the lock is dropped.

Use guaranteed atomic (aligned volatile) reads of the descriptors to
use same-size atomic as CPU update to set A bit in the descriptor type
field.

Improve overflow checking for the descriptors range calculations and
remove unneeded casts.

Reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-05 12:29:34 +00:00
Konstantin Belousov
8fc26d9612 Minor style fix.
Requested by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-05 12:19:55 +00:00
Konstantin Belousov
a58679a93b Complete r323772 on amd64.
Compilers are allowed to combine plain reads into group operations,
e.g. 64bit element copies of one array into another can be
legitimately optimized back to a memcpy() call, which r323772 tried to
prevent.

Qualify accesses to LDT descriptors with volatile dereference to
ensure that each write indeed occurs.  After that, our usual claim of
native-size aligned writes being atomic applies.

This is equivalent to atomic_store(memory_order_relaxed) C11 accesses,
but our machine/atomic.h does not provide corresponding primitive.

Noted and reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-05 12:16:45 +00:00
Konstantin Belousov
98af67c78e Use ANSI C declaration for amd64_get_ldt().
Reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-05 12:07:38 +00:00
Konstantin Belousov
83d55c8ac2 Correct format specifiers in the debug code.
Requested by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-05 12:01:39 +00:00
Konstantin Belousov
687a5be47a Remove useless comments.
Requested by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-05 11:56:04 +00:00
Konstantin Belousov
a1fc6a8c49 On amd64, mark the set_user_ldt() function as static.
On i386, the function is used from the context switch code and needs
to be accessible externally.  Amd64 MD context switch does not lock an
LDT spinlock and inlines switching in assembly.

Discussed with:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-05 11:50:01 +00:00
Konstantin Belousov
37afe7dfd2 Reduce default max_ldt_segment value to 512.
This makes the LDT to use only one page with default settings,
avoiding the need to find contigous 2 pages in KVA.  It seems that
most users are fine even with 512 segments.

Discussed with:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-05 11:36:55 +00:00
Konstantin Belousov
843d5752f5 Update comment to note that we skip LDT reload for kthreads as well.
Noted by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2017-10-05 11:34:51 +00:00
Konstantin Belousov
9674d76346 Hide kernel stuff from userspace.
Sponsored by:	Mellanox Technologies
2017-10-02 08:37:43 +00:00
Andrew Turner
0e73a61997 To prepare for adding EFI runtime services support on arm64 move the
machine independent parts of the existing code to a new file that can be
shared between amd64 and arm64.

Reviewed by:	kib (previous version), imp
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D12434
2017-10-01 19:52:47 +00:00
Konstantin Belousov
3cabd93e26 Do not do torn writes to active LDTs.
Care must be taken when updating the active LDT, since parallel
threads might try to load a segment descriptor which is currently
updated. Since the results are undefined, this cannot be ignored by
claiming to be an application race.

Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D12413
2017-09-19 17:57:04 +00:00
Ilya Bakulin
a9bfc8d2ae Add MMCCAM-enabled kernel config for IMX6, reduce debug noice in MMCCAM kernels
CAM_DEBUG_TRACE results in way too much debug output than needed now.
When debugging, it's always possible to turn on trace level using camcontrol.

Approved by:	imp (mentor)
Differential Revision:	https://reviews.freebsd.org/D12110
2017-09-13 10:56:02 +00:00
Conrad Meyer
907f50fe04 Add smn(4) driver for AMD System Management Network
AMD Family 17h CPUs have an internal network used to communicate between
the host CPU and the PSP and SMU coprocessors.  It exposes a simple
32-bit register space.

Reviewed by:	avg (no +1), mjoras, truckman
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D12217
2017-09-05 15:13:41 +00:00
Josh Paetzel
9d0ec2a920 Revert r323087
This needs more thinking out and consensus, and the commit message
was wrong AND there was a typo in the commit.

pointyhat:	jpaetzel
2017-09-01 17:03:48 +00:00
Josh Paetzel
0be04b100c Take options IPSEC out of GENERIC
PR:	220170
Submitted by:	delphij
Reviewed by:	ae, glebius
MFC after:	2 weeks
Differential Revision:	D11806
2017-09-01 15:54:53 +00:00
Josh Paetzel
3b65550eec Allow kldload tcpmd5
PR:	220170
MFC after:	2 weeks
2017-08-31 20:16:28 +00:00
Alexander Motin
ed9652da5f Add NTB driver for PLX/Avago/Broadcom PCIe switches.
This driver supports both NTB-to-NTB and NTB-to-Root Port modes (though
the second with predictable complications on hot-plug and reboot events).
I tested it with PEX 8717 and PEX 8733 chips, but expect it should work
with many other compatible ones too.  It supports up to two NT bridges
per chip, each of which can have up to 2 64-bit or 4 32-bit memory windows,
6 or 12 scratchpad registers and 16 doorbells.  There are also 4 DMA engines
in those chips, but they are not yet supported.

While there, rename Intel NTB driver from generic ntb_hw(4) to more specific
ntb_hw_intel(4), so now it is on par with this new ntb_hw_plx(4) driver and
alike to Linux naming.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2017-08-30 21:16:32 +00:00
Conrad Meyer
2744a0b69b Drop CACHE_LINE_SIZE to 64 bytes on x86
The actual cache line size has always been 64 bytes.

The 128 number arose as an optimization for Core 2 era Intel processors.  By
default (configurable in BIOS), these CPUs would prefetch adjacent cache
lines unintelligently.  Newer CPUs prefetch more intelligently.

The latest Core 2 era CPU was introduced in September 2008 (Xeon 7400
series, "Dunnington").  If you are still using one of these CPUs, especially
in a multi-socket configuration, consider locating the "adjacent cache line
prefetch" option in BIOS and disabling it.

Reported by:	mjg
Reviewed by:	np
Discussed with:	jhb
Sponsored by:	Dell EMC Isilon
2017-08-28 22:28:41 +00:00
Ryan Libby
0d4e7ec5f3 amd64: drop q suffix from rd[fg]sbase for gas compatibility
Reviewed by:	kib
Approved by:	markj (mentor)
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D12133
2017-08-26 23:13:18 +00:00
Konstantin Belousov
9fc847133e Save KGSBASE in pcb before overriding it with the guest value.
Reported by:	lwhsu, mjoras
Discussed with:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	18 days
2017-08-24 10:49:53 +00:00
Konstantin Belousov
761fb3ef29 Ensure that fs/gs bases are stored in pcb before copying the pcb for
new process or thread.

Reported and tested by:	ae, dhw
Sponsored by:	The FreeBSD Foundation
MFC after:	20 days
2017-08-22 18:15:47 +00:00
Konstantin Belousov
3e902b3d76 Make WRFSBASE and WRGSBASE instructions functional.
Right now, we enable the CR4.FSGSBASE bit on CPUs which support the
facility (Ivy and later), to allow usermode to read fs and gs bases
without syscalls. This bit also controls the write access to bases
from userspace, but WRFSBASE and WRGSBASE instructions currently
cannot be used, because return path from both exceptions or interrupts
overrides bases with the values from pcb.

Supporting the instructions is useful because this means that usermode
can implement green-threads completely in userspace without issuing
syscalls to change all of the machine context.

Support is implemented by saving the fs base and user gs base when
PCB_FULL_IRET flag is set. The flag is set on the context switch,
which potentially causes clobber of the bases due to activation of
another context, and when explicit modification of the user context by
a syscall or exception handler is performed. In particular, the patch
moves setting of the flag before syscalls change context.

The changes to doreti_exit and PUSH_FRAME to clear PCB_FULL_IRET on
entry from userspace can be considered a bug fixes on its own.

Reviewed by:	jhb (previous version)
Tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
Differential revision:	https://reviews.freebsd.org/D12023
2017-08-21 17:38:02 +00:00
Konstantin Belousov
9ed84d55c1 Simplify the code.
Noted by:	Oliver Pinter
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-08-20 11:18:16 +00:00
Konstantin Belousov
43b7b1f29b Simplify amd64 trap().
- Use more relevant name 'signo' instead of 'i' for the local variable
  which contains a signal number to send for the current exception.
- Eliminate two labels 'userout' and 'out' which point to the very end
  of the trap() function.  Instead use return directly.
- Re-indent the prot_fault_translation block by reducing if() nesting.
- Some more monor style changes.

Requested and reviewed by:	bde
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-08-20 09:52:25 +00:00
Konstantin Belousov
4031ebef84 Trim excessive 'extern' and remove unused declaration.
Reviewed by:	bde
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-08-20 09:42:09 +00:00
Konstantin Belousov
dad2e0e420 Use ANSI C declaration for trap_pfault(). Style.
Reviewed by:	bde
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-08-20 09:39:10 +00:00
Ruslan Bukin
5651294282 Fix module unload when SGX support is not present in CPU.
Sponsored by:	DARPA, AFRL
2017-08-18 14:47:06 +00:00
Mark Johnston
01938d3666 Rename mkdumpheader() and group EKCD functions in kern_shutdown.c.
This helps simplify the code in kern_shutdown.c and reduces the number
of globally visible functions.

No functional change intended.

Reviewed by:	cem, def
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D11603
2017-08-18 04:04:09 +00:00
Mark Johnston
50ef60dabe Factor out duplicated kernel dump code into dump_{start,finish}().
dump_start() and dump_finish() are responsible for writing kernel dump
headers, optionally writing the key when encryption is enabled, and
initializing the initial offset into the dump device.

Also remove the unused dump_pad(), and make some functions static now that
they're only called from kern_shutdown.c.

No functional change intended.

Reviewed by:	cem, def
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D11584
2017-08-18 03:52:35 +00:00
Conrad Meyer
dc6a82801d x86: Add dynamic interrupt rebalancing
Add an option to dynamically rebalance interrupts across cores
(hw.intrbalance); off by default.

The goal is to minimize preemption. By placing interrupt sources on distinct
CPUs, ithreads get preferentially scheduled on distinct CPUs.  Overall
preemption is reduced and latency is reduced. In our workflow it reduced
"fighting" between two high-frequency interrupt sources.  Reduced latency
was proven by, e.g., SPEC2008.

Submitted by:	jeff@ (earlier version)
Reviewed by:	kib@
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D10435
2017-08-16 18:48:53 +00:00
Ruslan Bukin
7dea76609b Rename macro DEBUG to SGX_DEBUG.
This fixes LINT kernel build.

Reported by:	lwhsu
Sponsored by:	DARPA, AFRL
2017-08-16 13:44:46 +00:00
Ruslan Bukin
2164af29a0 Add support for Intel Software Guard Extensions (Intel SGX).
Intel SGX allows to manage isolated compartments "Enclaves" in user VA
space. Enclaves memory is part of processor reserved memory (PRM) and
always encrypted. This allows to protect user application code and data
from upper privilege levels including OS kernel.

This includes SGX driver and optional linux ioctl compatibility layer.
Intel SGX SDK for FreeBSD is also available.

Note this requires support from hardware (available since late Intel
Skylake CPUs).

Many thanks to Robert Watson for support and Konstantin Belousov
for code review.

Project wiki: https://wiki.freebsd.org/Intel_SGX.

Reviewed by:	kib
Relnotes:	yes
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D11113
2017-08-16 10:38:06 +00:00
Konstantin Belousov
baec5778ed Print whole machine state on double fault.
It is quite useful when double fault is not caused by a stack overflow.

Tested by:	pho (as part of the larger patch)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-08-14 11:23:07 +00:00
Konstantin Belousov
0fd7ea1f21 Add {rd,wr}{fs,gs}base C wrappers for instructions.
Tested by:	pho (as part of the larger patch)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-08-14 11:20:54 +00:00
Konstantin Belousov
7bf0049e48 Style.
Tested by:	pho (as part of the larger patch)
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2017-08-14 11:20:10 +00:00
Jung-uk Kim
b5669d0aa8 Split identify_cpu() into two functions for amd64 as we do for i386. This
reduces diff between amd64 and i386.  Also, it fixes a regression introduced
in r322076, i.e., identify_hypervisor() failed to identify some hypervisors.
This function assumes cpu_feature2 is already initialized.

Reported by:	dexuan
Tested by:	dexuan
2017-08-09 18:09:09 +00:00
Warner Losh
9057f54d74 Fail to open efirt device when no EFI on system.
libefivar expects opening /dev/efi to indicate if the we can make efi
runtime calls. With a null routine, it was always succeeding leading
efi_variables_supported() to return the wrong value. Only succeed if
we have an efi_runtime table. Also, while I'm hear, out of an
abundance of caution, add a likely redundant check to make sure
efi_systbl is not NULL before dereferencing it. I know it can't be
NULL if efi_cfgtbl is non-NULL, but the compiler doesn't.
2017-08-08 20:44:16 +00:00
Konstantin Belousov
fe04f5e9d0 Avoid DI recursion when reclaim_pv_chunk() is called from
pmap_advise() or pmap_remove().

Reported and tested by:	pho (previous version)
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-08-07 17:29:54 +00:00
Konstantin Belousov
1a47eac0f5 Explain why delayed invalidation is not required in pmap_protect() and
pmap_remove_pages().

Submitted by:	alc
MFC after:	1 week
2017-08-07 17:23:10 +00:00
Jung-uk Kim
0105034487 Detect hypervisors early. We used to set lower hz on hypervisors by default
but it was broken since r273800 (and r278522, its MFC to stable/10) because
identify_cpu() is called too late, i.e., after init_param1().

MFC after:	3 days
2017-08-05 06:56:46 +00:00
Conrad Meyer
6f240e18b5 x86: Tag some intrinsics with __pure2
Some C wrappers for x86 instructions do not touch global memory and only act
on their arguments; they can be marked __pure2, aka __const__.  Without this
annotation, Clang 3.9.1 is not intelligent enough on its own to grok that
these functions are __const__.

Submitted by:	Anton Rang <anton.rang AT isilon.com>
Sponsored by:	Dell EMC Isilon
2017-08-03 22:28:30 +00:00
Ed Schouten
c852847584 Keep top page on CloudABI to work around AMD Ryzen stability issues.
Similar to r321899, reduce sv_maxuser by one page inside of CloudABI.
This ensures that the stack, the vDSO and any allocations cannot touch
the top page of user virtual memory.

Considering that CloudABI userspace is completely oblivious to virtual
memory layout, don't bother making this conditional based on the CPU of
the running system.

Reviewed by:	kib, truckman
Differential Revision:	https://reviews.freebsd.org/D11808
2017-08-02 13:08:10 +00:00
Mateusz Guzik
fd1d4c8159 amd64: annotate the syscall return address check with __predict_false
before:
   0xffffffff80b03ebb <+2059>:	mov    0x460(%r14),%rax
   0xffffffff80b03ec2 <+2066>:	mov    0x98(%rax),%rax
   0xffffffff80b03ec9 <+2073>:	shr    $0x2f,%rax
   0xffffffff80b03ecd <+2077>:	je     0xffffffff80b03edd <amd64_syscall+2093>
   0xffffffff80b03ecf <+2079>:	mov    0x3f8(%r14),%rax
   0xffffffff80b03ed6 <+2086>:	orl    $0x1,0xc8(%rax)
   0xffffffff80b03edd <+2093>:	add    $0xf8,%rsp

after:
   0xffffffff80b03ebb <+2059>:	mov    0x460(%r14),%rax
   0xffffffff80b03ec2 <+2066>:	mov    0x98(%rax),%rax
   0xffffffff80b03ec9 <+2073>:	shr    $0x2f,%rax
   0xffffffff80b03ecd <+2077>:	jne    0xffffffff80b03eef <amd64_syscall+2111>
   0xffffffff80b03ecf <+2079>:	add    $0xf8,%rsp

Reviewed by:	kib
MFC after:	1 week
2017-08-02 11:25:38 +00:00
Konstantin Belousov
6632a4330f Do not call trapsignal() after handling usermode fault or interrupt,
when a signal is not intended to be sent.

The variable holding the signal number to send is left uninitialized,
which sometimes triggers invalid signal checks.

For NMI, a return to usermode without ast processing is done.  On the
other hand, for spurious dtrace probe interrupt it is usermode which
triggered the interrupt, so handle it through userret() as any other
fault.

Reported by:	Nils Beyer <nbe@renzel.net>
PR:	221151
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-08-02 10:12:10 +00:00
Don Lewis
cd155b5603 Lower the amd64 shared page, which contains the signal trampoline,
from the top of user memory to one page lower on machines with the
Ryzen (AMD Family 17h) CPU.  This pushes ps_strings and the stack
down by one page as well.  On Ryzen there is some sort of interaction
between code running at the top of user memory address space and
interrupts that can cause FreeBSD to either hang or silently reset.
This sounds similar to the problem found with DragonFly BSD that
was fixed with this commit:
  https://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/b48dd28447fc8ef62fbc963accd301557fd9ac20
but our signal trampoline location was already lower than the address
that DragonFly moved their signal trampoline to.  It also does not
appear to be related to SMT as described here:
  https://www.phoronix.com/forums/forum/hardware/processors-memory/955368-some-ryzen-linux-users-are-facing-issues-with-heavy-compilation-loads?p=955498#post955498

  "Hi, Matt Dillon here. Yes, I did find what I believe to be a
   hardware issue with Ryzen related to concurrent operations. In a
   nutshell, for any given hyperthread pair, if one hyperthread is
   in a cpu-bound loop of any kind (can be in user mode), and the
   other hyperthread is returning from an interrupt via IRETQ, the
   hyperthread issuing the IRETQ can stall indefinitely until the
   other hyperthread with the cpu-bound loop pauses (aka HLT until
   next interrupt). After this situation occurs, the system appears
   to destabilize. The situation does not occur if the cpu-bound
   loop is on a different core than the core doing the IRETQ. The
   %rip the IRETQ returns to (e.g. userland %rip address) matters a
   *LOT*. The problem occurs more often with high %rip addresses
   such as near the top of the user stack, which is where DragonFly's
   signal trampoline traditionally resides. So a user program taking
   a signal on one thread while another thread is cpu-bound can cause
   this behavior. Changing the location of the signal trampoline
   makes it more difficult to reproduce the problem. I have not
   been because the able to completely mitigate it. When a cpu-thread
   stalls in this manner it appears to stall INSIDE the microcode
   for IRETQ. It doesn't make it to the return pc, and the cpu thread
   cannot take any IPIs or other hardware interrupts while in this
   state."
since the system instability has been observed on FreeBSD with SMT
disabled.  Interrupts to appear to play a factor since running a
signal-intensive process on the first CPU core, which handles most
of the interrupts on my machine, is far more likely to trigger the
problem than running such a process on any other core.

Also lower sv_maxuser to prevent a malicious user from using mmap()
to load and execute code in the top page of user memory that was made
available when the shared page was moved down.

Make the same changes to the 64-bit Linux emulator.

PR:		219399
Reported by:	nbe@renzel.net
Reviewed by:	kib
Reviewed by:	dchagin (previous version)
Tested by:	nbe@renzel.net (earlier version)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D11780
2017-08-02 01:43:35 +00:00
Mark Johnston
2375aaa8e9 Batch updates to v_wire_count when freeing page table pages on x86.
The removed release stores are not needed since stores are totally
ordered on i386 and amd64.

Reviewed by:	alc, kib (previous revision)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D11790
2017-08-01 05:26:30 +00:00
Konstantin Belousov
9adf30b0c3 Remove unused symbols.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-07-30 21:52:22 +00:00
Dmitry Chagin
c151945c86 Avoid using [LINUX_]SHAREDPAGE constant directly in the vdso code.
This is needed for https://reviews.freebsd.org/D11780.

Reported by:	kib@
2017-07-30 21:24:20 +00:00
Alan Cox
782e896088 Add support for pmap_enter(..., psind=1) to the amd64 pmap. In other words,
add support for explicitly requesting that pmap_enter() create a 2MB page
mapping.  (Essentially, this feature allows the machine-independent layer to
create superpage mappings preemptively, and not wait for automatic promotion
to occur.)

Export pmap_ps_enabled() to the machine-independent layer.

Add a flag to pmap_pv_insert_pde() that specifies whether it should fail or
reclaim a PV entry when one is not available.

Refactor pmap_enter_pde() into two functions, one by the same name, that is
a general-purpose function for creating PDE PG_PS mappings, and another,
pmap_enter_2mpage(), that is used to prefault 2MB read- and/or execute-only
mappings for execve(2), mmap(2), and shmat(2).

Submitted by:	Yufeng Zhou <yz70@rice.edu> (an earlier version)
Reviewed by:	kib, markj
Tested by:	pho
MFC after:	10 days
Differential Revision:	https://reviews.freebsd.org/D11556
2017-07-23 06:33:58 +00:00
Ryan Libby
b1a987bb34 __pcpu: gcc -Wredundant-decls
Pollution from counter.h made __pcpu visible in amd64/pmap.c.  Delete
the existing extern decl of __pcpu in amd64/pmap.c and avoid referring
to that symbol, instead accessing the pcpu region via PCPU_SET macros.
Also delete an unused extern decl of __pcpu from mp_x86.c.

Reviewed by:	kib
Approved by:	markj (mentor)
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D11666
2017-07-21 17:11:36 +00:00
Ryan Libby
5e6f40bdef efi: restrict visibility of EFIABI_ATTR-declared functions
In-tree gcc (4.2) doesn't understand __attribute__((ms_abi))
(EFIABI_ATTR).  Avoid declaring functions with that attribute when the
compiler is detected to be gcc < 4.4.

Reviewed by:	kib, imp (previous version)
Approved by:	markj (mentor)
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D11636
2017-07-20 06:47:06 +00:00
Alan Cox
5818f05d39 Style-only change: Consistently use the variable name "pdpg" throughout
this file.  Previously, half of the pointers to a vm_page being used as
a page directory page were named "pdpg" and the rest were named "mpde".

Discussed with:	kib
MFC after:	1 week
2017-07-15 16:42:55 +00:00
Alan Cox
68a870f3bc Extract the innermost loop of pmap_remove() out into its own function,
pmap_remove_ptes().  (This new function will also be used by an upcoming
change to pmap_enter() that adds support for psind == 1 mappings.)

Submitted by:	Yufeng Zhou <yz70@rice.edu> (an earlier version)
Reviewed by:	kib, markj
MFC after:	1 week
2017-07-15 01:49:54 +00:00
Konstantin Belousov
60686c3703 Fix size argument to vm_pager_allocate(), it is in bytes, not in pages.
It is believed to be only cosmetic.

Noted by:	andrew
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-07-13 08:23:37 +00:00
Konstantin Belousov
e766a6bb01 Revert r320936 to recommit with the correct log message. 2017-07-13 08:23:12 +00:00
Konstantin Belousov
89f91fa960 It is believed to be only cosmetic.
Noted by:	andrew
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-07-13 08:19:50 +00:00