Commit Graph

2415 Commits

Author SHA1 Message Date
bde
48738e7c73 Oops, the last minute reduction in the clobber list for i386
MCOUNT_OVERHEAD() in r334522 was too agressive.  Only mcount exit
preserves %eax and %edx.
2018-06-02 09:59:27 +00:00
bde
0409ad53fa Fix high resolution kernel profiling just enough to not crash at boot
time, especially for SMP.  If configured, it turns itself on at boot
time for calibration, so is fragile even if never otherwise used.

Both types of kernel profiling were supposed to use a global spinlock
in the SMP case.  If hi-res profiling is configured (but not necessarily
used), this was supposed to be optimized by only using it when
necessary, and slightly more efficiently, in asm.  But it was not done
at all for mcount entry where it is necessary.  This caused crashes
in the SMP case when either type of profiling was enabled.  For mcount
exit, it only caused wrong times.  The times were wrongest with an
i8254 timer since using that requires exclusive access to the hardware.
The i8254 timer was too slow to use here 20 years ago and is much less
usable now, but it is the default for the SMP case since TSCs weren't
invariant when SMP was new.  Do the locking in all hi-res SMP cases for
simplicity.

Calibration uses special asms, and the clobber lists in these were sort
of inverted.  They contained the arg and return registers which are not
clobbered, but on amd64 they didn't contain the residue of the call-used
registers which may be clobbered (%r10 and %r11).  This usually caused
hangs at boot time.  This usually affected even the UP case.
2018-06-02 05:48:44 +00:00
bde
d4e99d50f0 Fix recent breakages of kernel profiling, mostly on i386 (high resolution
kernel profiling remains broken).

memmove() was broken using ALTENTRY().  ALTENTRY() is only different from
ENTRY() in the profiling case, and its use in that case was sort of
backwards.  The backwardness magically turned memmove() into memcpy()
instead of completely breaking it.  Only the high resolution parts of
profiling itself were broken.  Use ordinary ENTRY() for memmove().
Turn bcopy() into a tail call to memmove() to reduce complications.
This gives slightly different pessimizations and profiling lossage.
The pessimizations are minimized by not using a frame pointer() for
bcopy().

Calls to profiling functions from exception trampolines were not
relocated.  This caused crashes on the first exception.  Fix this using
function pointers.

Addresses of exception handlers in trampolines were not relocated.  This
caused unknown offsets in the profiling data.  Relocate by abusing
setidt_disp as for pmc although this is slower than necessary and
requires namespace pollution.  pmc seems to be missing some relocations.
Stack traces and lots of other things in debuggers need similar relocations.

Most user addresses were misclassified as unknown kernel addresses and
then ignored.  Treat all unknown addresses as user. Now only user
addresses in the kernel text range are significantly misclassified (as
known kernel addresses).

The ibrs functions didn't preserve enough registers.  This is the only
recent breakage on amd64.  Although these functions are written in
asm, in the profiling case they call profiling functions which are
mostly for the C ABI, so they only have to save call-used registers.
They also have to save arg and return registers in some cases and
actually save them in all cases to reduce complications.  They end up
saving all registers except %ecx on i386 and %r10 and %r11 on amd64.
Saving these is only needed for 1 caller on each of amd64 and i386.
Save them there.  This is slightly simpler.

Remove saving %ecx in handle_ibrs_exit on i386.  Both handle_ibrs_entry
and handle_ibrs_exit use %ecx, but only the latter needed to or did
save it.  But saving it there doesn't work for the profiling case.

amd64 has more automatic saving of the most common scratch registers
%rax, %rcx and %rdx (its complications for %r10 are from unusual use
of %r10 by SYSCALL).  Thus profiling of handle_ibrs_exit_rs() was not
broken, and I didn't simplify the saving by moving the saving of these
registers from it to the caller.
2018-06-02 04:25:09 +00:00
mmacy
2f6bd2cd39 hwpmc: remove unused pre-table driven bits for intel
Intel now provides comprehensive tables for all performance counters
and the various valid configuration permutations as text .json files.
Libpmc has been converted to use these and hwpmc_core has been greatly
simplified by moving to passthrough of the table values.

The one gotcha is that said tables don't support pentium pro and and pentium
IV. There's very few users of hwpmc on _amd64_ kernels on new hardware. It is
unlikely that anyone is doing low level optimization on 15 year old Intel
hardware. Nonetheless, if someone feels strongly enough to populate the
corresponding tables for p4 and ppro I will reinstate the files in to the
build.

Code for the K8 counters and !x86 architectures remains unchanged.
2018-05-31 22:41:07 +00:00
dim
a34aa33cd6 Resolve conflicts between macros in fenv.h and ieeefp.h
This is a follow-up to r321483, which disabled -Wmacro-redefined for
some lib/msun tests.

If an application included both fenv.h and ieeefp.h, several macros such
as __fldcw(), __fldenv() were defined in both headers, with slightly
different arguments, leading to conflicts.

Fix this by putting all the common macros in the machine-specific
versions of ieeefp.h.  Where needed, update the arguments in places
where the macros are invoked.

This also slightly reduces the differences between the amd64 and i386
versions of ieeefp.h.

Reviewed by:	kib
MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D15633
2018-05-31 20:22:47 +00:00
hselasky
831d81975f Implement atomic_add_64() and atomic_subtract_64() for the i386 target.
While at it add missing _acq_ and _rel_ variants for 64-bit atomic
operations under i386.

Reviewed by:	kib @
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2018-05-29 11:59:02 +00:00
kib
b8cb7e293a Optimize i386 pmap_extract_and_hold().
In particular, stop using pmap_pte() to read non-promoted pte while
walking the page table.  pmap_pte() needs to shoot down the kernel
mapping globally which causes IPI broadcast.  Since
pmap_extract_and_hold() is used for slow copyin(9), it is very
significant hit for the 4/4 kernels.

Instead, create single purpose per-processor page frame and use it to
locally map page table page inside the critical section, to avoid
reuse of the frame by other thread if context switched.

Measurement demostrated very significant improvements in any load that
utilizes copyin/copyout.

Found and benchmarked by:	bde
Sponsored by:	The FreeBSD Foundation
2018-05-25 16:29:22 +00:00
avg
546f863d51 re-synchronize TSC-s on SMP systems after resume, if necessary
The TSC-s are checked and synchronized only if they were good
originally.  That is, invariant, synchronized, etc.

This is necessary on an AMD-based system where after a wakeup from STR I
see that BSP clock differs from AP clocks by a count that roughly
corresponds to one second.  The APs are in sync with each other.  Not
sure if this is a hardware quirk or a firmware bug.

This is what I see after a resume with this change:
    SMP: passed TSC synchronization test after adjustment
    acpi_timer0: restoring timecounter, ACPI-fast -> TSC-low

Reviewed by:	kib
MFC after:	3 weeks
Differential Revision: https://reviews.freebsd.org/D15551
2018-05-25 07:33:20 +00:00
kib
4bdf909413 Support IBRS for i386.
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D15522
2018-05-23 16:31:46 +00:00
kib
b10c76de35 Use local unique labels inside most often used macros.
Discussed with:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-05-22 13:45:40 +00:00
kib
684e2d2368 Fix double-load of %cr3 and double-copy of the stack frame for the
kernel entry from userspace vm86.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-05-22 13:30:56 +00:00
mmacy
cb38945503 fix i386 builds after r334005 and r334009
r334005: add pc_ibpb_set as it is now referenced by common code
(although presumably not needed on i386 since it has been there
since the first spectre mitigation work on amd64)

r334009: there is no amd64 rflags -> i386 eflags
2018-05-22 05:09:33 +00:00
jhb
31acfe0f07 Cleanups related to debug exceptions on x86.
- Add constants for fields in DR6 and the reserved fields in DR7.  Use
  these constants instead of magic numbers in most places that use DR6
  and DR7.
- Refer to T_TRCTRAP as "debug exception" rather than a "trace trap"
  as it is not just for trace exceptions.
- Always read DR6 for debug exceptions and only clear TF in the flags
  register for user exceptions where DR6.BS is set.
- Clear DR6 before returning from a debug exception handler as
  recommended by the SDM dating all the way back to the 386.  This
  allows debuggers to determine the cause of each exception.  For
  kernel traps, clear DR6 in the T_TRCTRAP case and pass DR6 by value
  to other parts of the handler (namely, user_dbreg_trap()).  For user
  traps, wait until after trapsignal to clear DR6 so that userland
  debuggers can read DR6 via PT_GETDBREGS while the thread is stopped
  in trapsignal().

Reviewed by:	kib, rgrimes
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D15189
2018-05-22 00:45:00 +00:00
kib
088f960e02 Fix PMC_IN_TRAP_HANDLER() for i386 after the 4/4 split.
Sponsored by:	The FreeBSD Foundation
2018-05-13 20:10:02 +00:00
kib
99b8ab0632 Kernel entry from vm86 mode, where PCB_VM86CALL pcb flag is not set,
is executed on the right stack already.  No copy from the entry stack
to the kstack must be performed for vm86 bios call code to function.

To access the pcb flags on kernel entry, unconditionally switch to
kernel address space if vm86 mode is detected.

This fixes very early vm86 bios calls, typically done when boot is
performed by boot2 without loader, and kernel falls back to BIOS calls
to get SMAP.

Reported by:	bde
Sponsored by:	The FreeBSD Foundation
2018-05-12 11:06:59 +00:00
kib
137b66125c Create a macro for PIC code which loads %cr3 from tramp_idleptd.
Sponsored by:	The FreeBSD Foundation
2018-05-12 10:51:50 +00:00
kib
009a13f92b Remove dead declaration.
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2018-05-11 20:47:45 +00:00
imp
fbbe5571b3 Remove unused bcopyb.
Differential Revision: https://reviews.freebsd.org/D15374
2018-05-10 02:31:54 +00:00
kib
10723fd24d Fix move of the frame to the normal stack for interrupts occuring from
the vm86 mode.

Submitted by:	jhb
2018-04-26 21:07:45 +00:00
lwhsu
64431be5f0 Fix i386 build after r332970 by adding IS_BSP() definition.
Approved by:	kib
2018-04-25 07:51:41 +00:00
kib
a42a93a102 Fix futexes on i386 after the 4/4G split.
Use proper method to access userspace.  For now, only the slow copyout
path is implemented.

Reported and tested by:	tijl (previous version)
Sponsored by:	The FreeBSD Foundation
2018-04-24 12:50:21 +00:00
kib
82b83ca7ef Use symbolic constant, explaining the operation.
Sponsored by:	The FreeBSD Foundation
2018-04-19 18:08:46 +00:00
kib
e3089a0318 i386 4/4G split.
The change makes the user and kernel address spaces on i386
independent, giving each almost the full 4G of usable virtual addresses
except for one PDE at top used for trampoline and per-CPU trampoline
stacks, and system structures that must be always mapped, namely IDT,
GDT, common TSS and LDT, and process-private TSS and LDT if allocated.

By using 1:1 mapping for the kernel text and data, it appeared
possible to eliminate assembler part of the locore.S which bootstraps
initial page table and KPTmap.  The code is rewritten in C and moved
into the pmap_cold(). The comment in vmparam.h explains the KVA
layout.

There is no PCID mechanism available in protected mode, so each
kernel/user switch forth and back completely flushes the TLB, except
for the trampoline PTD region. The TLB invalidations for userspace
becomes trivial, because IPI handlers switch page tables. On the other
hand, context switches no longer need to reload %cr3.

copyout(9) was rewritten to use vm_fault_quick_hold().  An issue for
new copyout(9) is compatibility with wiring user buffers around sysctl
handlers. This explains two kind of locks for copyout ptes and
accounting of the vslock() calls.  The vm_fault_quick_hold() AKA slow
path, is only tried after the 'fast path' failed, which temporary
changes mapping to the userspace and copies the data to/from small
per-cpu buffer in the trampoline.  If a page fault occurs during the
copy, it is short-circuit by exception.s to not even reach C code.

The change was motivated by the need to implement the Meltdown
mitigation, but instead of KPTI the full split is done.  The i386
architecture already shows the sizing problems, in particular, it is
impossible to link clang and lld with debugging.  I expect that the
issues due to the virtual address space limits would only exaggerate
and the split gives more liveness to the platform.

Tested by: pho
Discussed with:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
Differential revision:	https://reviews.freebsd.org/D14633
2018-04-13 20:30:49 +00:00
royger
5f1547e410 x86: improve reservation of AP trampoline memory
So that it doesn't rely on physmap[1] containing an address below
1MiB. Instead scan the full physmap and search for a suitable address
to place the trampoline code (below 1MiB) and the initial memory pages
(below 4GiB).

Sponsored by:		Citrix Systems R&D
Reviewed by:		kib
Differential Revision:	https://reviews.freebsd.org/D14878
2018-04-05 14:39:51 +00:00
jeff
bfe01083f9 Restore r331606 with a bugfix to setup cpuset_domain[] earlier on all
platforms.  Original commit message as follows:

Only use CPUs in the domain the device is attached to for default
assignment.  Device drivers are able to override the default assignment
if they bind directly.  There are severe performance penalties for
handling interrupts on remote CPUs and this should only be done in
very controlled circumstances.

Reviewed by:    jhb, kib
Tested by:      pho
Sponsored by:   Netflix, Dell/EMC Isilon
Differential Revision:  https://reviews.freebsd.org/D14838
2018-03-28 18:47:35 +00:00
jeff
124dca372e Backout r331606 until I can identify why it does not boot on some
machines.
2018-03-27 10:20:50 +00:00
jeff
d1125a4e0d Only use CPUs in the domain the device is attached to for default
assignment.  Device drivers are able to override the default assignment
if they bind directly.  There are severe performance penalties for
handling interrupts on remote CPUs and this should only be done in
very controlled circumstances.

Reviewed by:	jhb, kib
Tested by:	pho (earlier version)
Sponsored by:	Netflix, Dell/EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14838
2018-03-27 03:37:04 +00:00
brooks
1db1eebcab Remove obsolete pcaudioio.h.
Nothing uses the #define's values or the types.  (Some NTP code does use
an audio_info_t, but it is in #ifdef'd support for Solaris and is not
this audio_info_t).

Sponsored by:	DARPA, AFRL
2018-03-11 16:17:53 +00:00
jtl
8e9b6569cb amd64: Protect the kernel text, data, and BSS by setting the RW/NX bits
correctly for the data contained on each memory page.

There are several components to this change:
 * Add a variable to indicate the start of the R/W portion of the
   initial memory.
 * Stop detecting NX bit support for each AP.  Instead, use the value
   from the BSP and, if supported, activate the feature on the other
   APs just before loading the correct page table.  (Functionally, we
   already assume that the BSP and all APs had the same support or
   lack of support for the NX bit.)
 * Set the RW and NX bits correctly for the kernel text, data, and
   BSS (subject to some caveats below).
 * Ensure DDB can write to memory when necessary (such as to set a
   breakpoint).
 * Ensure GDB can write to memory when necessary (such as to set a
   breakpoint).  For this purpose, add new MD functions gdb_begin_write()
   and gdb_end_write() which the GDB support code can call before and
   after writing to memory.

This change is not comprehensive:
 * It doesn't do anything to protect modules.
 * It doesn't do anything for kernel memory allocated after the kernel
   starts running.
 * In order to avoid excessive memory inefficiency, it may let multiple
   types of data share a 2M page, and assigns the most permissions
   needed for data on that page.

Reviewed by:	jhb, kib
Discussed with:	emaste
MFC after:	2 weeks
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D14282
2018-03-06 14:28:37 +00:00
cem
807035047f Remove unused error return from API that cannot fail
No implementation of fpu_kern_enter() can fail, and it was causing needless
error checking boilerplate and confusion. Change the return code to void to
match reality.

(This trivial change took nine days to land because of the commit hook on
sys/dev/random.  Please consider removing the hook or otherwise lowering the
bar -- secteam never seems to have free time to review patches.)

Reported by:	Lachlan McIlroy <Lachlan.McIlroy AT isilon.com>
Reviewed by:	delphij
Approved by:	secteam (delphij)
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14380
2018-02-23 20:15:19 +00:00
imp
c78af2dc5f We don't support gcc < 4.2.1, so varargs.h now is just #error
always. Unifdef for versions prior to 4.2.1 and remove now-unused
header files.

Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D14323
2018-02-12 14:48:14 +00:00
nwhitehorn
49a1f46412 Define PHYS_TO_DMAP() and DMAP_TO_PHYS() as panics on the architectures
(i386 and arm) that never implement them. This allows the removal of
#ifdef PHYS_TO_DMAP on code otherwise protected by a runtime check on
PMAP_HAS_DMAP. It also fixes the build on ARM and i386 after I forgot an
#ifdef in r328168.

Reported by:	Milan Obuch
Pointy hat to:	me
2018-01-19 22:17:13 +00:00
nwhitehorn
e79f2b9178 Remove SFBUF_OPTIONAL_DIRECT_MAP and such hacks, replacing them across the
kernel by PHYS_TO_DMAP() as previously present on amd64, arm64, riscv, and
powerpc64. This introduces a new MI macro (PMAP_HAS_DMAP) that can be
evaluated at runtime to determine if the architecture has a direct map;
if it does not (or does) unconditionally and PMAP_HAS_DMAP is either 0 or
1, the compiler can remove the conditional logic.

As part of this, implement PHYS_TO_DMAP() on sparc64 and mips64, which had
similar things but spelled differently. 32-bit MIPS has a partial direct-map
that maps poorly to this concept and is unchanged.

Reviewed by:		kib
Suggestions from:	marius, alc, kib
Runtime tested on:	amd64, powerpc64, powerpc, mips64
2018-01-19 17:46:31 +00:00
kib
3d18a9d66f Add atomic_load(9) and atomic_store(9) operations.
They provide relaxed-ordered atomic access semantic.  Due to the
FreeBSD memory model, the operations are syntaxical wrappers around
the volatile accesses.  The volatile qualifier is used to ensure that
the access not optimized out and in turn depends on the volatile
semantic as implemented by supported compilers.

The motivation for adding the operation is to help people coming from
other systems or knowing the C11/C++ standards where atomics have
special type and require use of the special access operations.  It is
still the case that FreeBSD requires plain load and stores of aligned
integer types to be atomic.

Suggested by:	jhb
Reviewed by:	alc, jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D13534
2017-12-19 09:59:20 +00:00
cem
c89be21d55 i386: Bump KSTACK_PAGES default to match amd64
Logically, extend r286288 to cover all threads, by default.

The world has largely moved on from i386.  Most FreeBSD users and developers
test on amd64 hardware.  For better or worse, we have written a non-trivial
amount of kernel code that relies on stacks larger than 8 kB, and it "just
works" on amd64, so there has been little incentive to shrink it.

amd64 had its KSTACK_PAGES bumped to 4 back in Peter's initial AMD64 commit,
r114349, in 2003.  Since that time, i386 has limped along on a stack half
the size.  We've even observed the stack overflows years ago, but neglected
to fix the issue; see the 20121223 and 20150728 entries in UPDATING.

If anyone is concerned with this change, I suggest they configure their
AMD64 kernels with KSTACK_PAGES 2 and fix the fallout there first.  Eugene
has identified a list of high stack usage functions in the first PR below.

PR:		219476, 224218
Reported by:	eugen@, Shreesh Holla <hshreesh AT yahoo.com>
Relnotes:	maybe
Sponsored by:	Dell EMC Isilon
2017-12-11 04:32:37 +00:00
pfg
712696a24c sys/i386: further adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
2017-11-27 15:08:52 +00:00
hselasky
cd19f399a1 Implement atomic_fetchadd_64() for i386. This function is needed by the
atomic64 header file in the LinuxKPI for i386.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2017-11-24 12:10:42 +00:00
kib
873f304292 Remove lint support from system headers and MD x86 headers.
Reviewed by:	dim, jhb
Discussed with:	imp
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D13156
2017-11-23 11:40:16 +00:00
pfg
4736ccfd9c sys: further adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
2017-11-20 19:43:44 +00:00
pfg
9da7bdde06 spdx: initial adoption of licensing ID tags.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.

Initially, only tag files that use BSD 4-Clause "Original" license.

RelNotes:	yes
Differential Revision:	https://reviews.freebsd.org/D13133
2017-11-18 14:26:50 +00:00
kib
b53cf0d5b7 Remove i386 XBOX support.
It is for console presented at 2001 and featuring Pentium III
processor.  Even if any of them are still alive and run FreeBSD, we do
not have any sign of life from their users.  While removing another
dozens of #ifdefs from the i386 sources reduces the aversion from
looking at the code and improves the platform vitality.

Reviewed by:	cem, pfg, rink (XBOX support author)
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D13016
2017-11-16 14:27:02 +00:00
kib
b941255a34 Reset the fs and gs bases on exec(2).
The values from the old address space do not make sense for the new
program.  In particular, gsbase might be the TLS base for the old
program but the new program has no TLS now.

amd64 already handles this correctly.

Reported and reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-09 15:39:43 +00:00
kib
071c00f495 Restore a part of r323722.
Do not return from interrupt using the POP_FRAME;iret instruction
sequence, always jump to doreti.

The user segments selectors saved on the stack might become invalid
because userspace manipulated LDT in a parallel thread.  trap() is
aware of such issue, but it is only prepared to handle it at iret and
segment registers load operations in doreti path.

Also remove POP_FRAME macro because it is no longer used.

Reviewed by:	bde, jhb (as part of r323722)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-09-28 08:46:15 +00:00
kib
81faa225ff Revert r323722. A better fix will be committed shortly, as well as
some still useful bits of the reverted revision.

The problem with the committed fix is that there are still issues with
returning from NMI, when NMI interrupted kernel in a moment where the
kernel segments selectors were still not loaded into registers.  If
this happens, the NMI return would loose the userspace selectors
because r323722 does not reload segment registers on return to kernel
mode.

Fixing the problem is complicated.  Since an alternative approach to
handle the original bug exists, it makes sence to stop adding more
complexity.

Discussed with:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-09-28 08:38:24 +00:00
kib
981650a056 Fix handling of the segment registers on i386.
Suppose that userspace is executing with the non-standard segment
descriptors.  Then, until exception or interrupt handler executed
SET_KERNEL_SEGS, kernel is still executing with user %ds, %es and %fs.
If an interrupt occurs in this window, the interrupt handler is
executed unsafely, relying on usability of the usermode registers.  If
the interrupt results in the context switch on return, the
contamination of the kernel state spreads to the thread we switched
to.  As result, kernel data accesses might fault or, if only the base
is changed, completely messed up.

More, if the user segment was allocated in LDT, another thread might
mark the descriptor as invalid before doreti code tried to reload
them.  In this case kernel panics.

The issue exists for all exception entry points which use trap gate,
and thus do not automatically disable interrupts on entry, and for
lcall_handler.

Fix is two-fold: first, we need to disable interrupts for all kernel
entries, changing the IDT descriptor types from trap gate to interrupt
gate.  Interrupts are re-enabled not earlier than the kernel segments
are loaded into the segment registers.  Second, we only load the
segment registers from the trap frame when returning to usermode.  For
the later, all interrupt return paths must happen through the doreti
common code.

There is no way to disable interrupts on call gate, which is the
supposed mode of servicing for lcall $7,$0 syscalls.  Change the LDT
descriptor 0 into a code segment type and point it to the userspace
trampoline which redirects the syscall to int $0x80.

All the measures make the segment register handling similar to that of
amd64.  We do not apply amd64 optimizations of not reloading segment
registers on return from the syscall.

Reported by:	Maxime Villard <max@m00nbsd.net>
Tested by:	pho (the non-lcall part)
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D12402
2017-09-18 20:22:42 +00:00
cem
6659d8ab68 Drop CACHE_LINE_SIZE to 64 bytes on x86
The actual cache line size has always been 64 bytes.

The 128 number arose as an optimization for Core 2 era Intel processors.  By
default (configurable in BIOS), these CPUs would prefetch adjacent cache
lines unintelligently.  Newer CPUs prefetch more intelligently.

The latest Core 2 era CPU was introduced in September 2008 (Xeon 7400
series, "Dunnington").  If you are still using one of these CPUs, especially
in a multi-socket configuration, consider locating the "adjacent cache line
prefetch" option in BIOS and disabling it.

Reported by:	mjg
Reviewed by:	np
Discussed with:	jhb
Sponsored by:	Dell EMC Isilon
2017-08-28 22:28:41 +00:00
cem
5a79b729aa x86: Add dynamic interrupt rebalancing
Add an option to dynamically rebalance interrupts across cores
(hw.intrbalance); off by default.

The goal is to minimize preemption. By placing interrupt sources on distinct
CPUs, ithreads get preferentially scheduled on distinct CPUs.  Overall
preemption is reduced and latency is reduced. In our workflow it reduced
"fighting" between two high-frequency interrupt sources.  Reduced latency
was proven by, e.g., SPEC2008.

Submitted by:	jeff@ (earlier version)
Reviewed by:	kib@
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D10435
2017-08-16 18:48:53 +00:00
jkim
d871b34dbc Split identify_cpu() into two functions for amd64 as we do for i386. This
reduces diff between amd64 and i386.  Also, it fixes a regression introduced
in r322076, i.e., identify_hypervisor() failed to identify some hypervisors.
This function assumes cpu_feature2 is already initialized.

Reported by:	dexuan
Tested by:	dexuan
2017-08-09 18:09:09 +00:00
cem
d4a7946bbd x86: Tag some intrinsics with __pure2
Some C wrappers for x86 instructions do not touch global memory and only act
on their arguments; they can be marked __pure2, aka __const__.  Without this
annotation, Clang 3.9.1 is not intelligent enough on its own to grok that
these functions are __const__.

Submitted by:	Anton Rang <anton.rang AT isilon.com>
Sponsored by:	Dell EMC Isilon
2017-08-03 22:28:30 +00:00
jah
d1caaa9300 Clean up MD pollution of bus_dma.h:
--Remove special-case handling of sparc64 bus_dmamap* functions.
  Replace with a more generic mechanism that allows MD busdma
  implementations to generate inline mapping functions by
  defining WANT_INLINE_DMAMAP in <machine/bus_dma.h>.  This
  is currently useful for sparc64, x86, and arm64, which all
  implement non-load dmamap operations as simple wrappers
  around map objects which may be bus- or device-specific.

--Remove NULL-checked bus_dmamap macros.  Implement the
  equivalent NULL checks in the inlined x86 implementation.
  For non-x86 platforms, these checks are a minor pessimization
  as those platforms do not currently allow NULL maps.  NULL
  maps were originally allowed on arm64, which appears to have
  been the motivation behind adding arm[64]-specific barriers
  to bus_dma.h, but that support was removed in r299463.

--Simplify the internal interface used by the bus_dmamap_load*
  variants and move it to bus_dma_internal.h

--Fix some drivers that directly include sys/bus_dma.h
  despite the recommendations of bus_dma(9)

Reviewed by:	kib (previous revision), marius
Differential Revision:	https://reviews.freebsd.org/D10729
2017-07-01 05:35:29 +00:00