Commit Graph

13650 Commits

Author SHA1 Message Date
avg
2bd01ebfdb align i386 cpu_reset() with amd64 version
Maybe this code could be moved to x86.

MFC after:	1 week
2018-03-30 11:25:30 +00:00
jeff
bfe01083f9 Restore r331606 with a bugfix to setup cpuset_domain[] earlier on all
platforms.  Original commit message as follows:

Only use CPUs in the domain the device is attached to for default
assignment.  Device drivers are able to override the default assignment
if they bind directly.  There are severe performance penalties for
handling interrupts on remote CPUs and this should only be done in
very controlled circumstances.

Reviewed by:    jhb, kib
Tested by:      pho
Sponsored by:   Netflix, Dell/EMC Isilon
Differential Revision:  https://reviews.freebsd.org/D14838
2018-03-28 18:47:35 +00:00
jhb
7c78547f5c Fix kernel builds without options DDB after r331650.
Reported by:	cy
2018-03-28 16:24:56 +00:00
jhb
24fa2df20b Remove very old and unused signal information codes.
These have been supplanted by the MI signal information codes in
<sys/signal.h> since 7.0.  The FPE_*_TRAP ones were deprecated even
earlier in 1999.

PR:		226579 (exp-run)
Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D14637
2018-03-27 20:57:51 +00:00
jeff
124dca372e Backout r331606 until I can identify why it does not boot on some
machines.
2018-03-27 10:20:50 +00:00
jeff
d1125a4e0d Only use CPUs in the domain the device is attached to for default
assignment.  Device drivers are able to override the default assignment
if they bind directly.  There are severe performance penalties for
handling interrupts on remote CPUs and this should only be done in
very controlled circumstances.

Reviewed by:	jhb, kib
Tested by:	pho (earlier version)
Sponsored by:	Netflix, Dell/EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14838
2018-03-27 03:37:04 +00:00
emaste
fa8b1b66d0 Remove redundant cast from Linuxulator SYSINITs 2018-03-23 20:32:54 +00:00
emaste
9f106a66bc Sort headers in MD Linuxulator files
Bring #includes closer to style(9) and reduce differences between the
(three) MD versions of linux_machdep.c and linux_sysvec.c.

Sponsored by:	Turing Robotic Industries Inc.
2018-03-23 17:16:36 +00:00
kib
e114084144 There is no need to disable interrupts around npxsave call.
i386 was changed to only require critical section around the thread
FPU state manipulations, and vm86_bioscall callers already enter
critical section for other reasons.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-03-23 15:46:53 +00:00
kib
97713e367c Update comment to match current field names.
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2018-03-23 15:44:31 +00:00
emaste
289be8516c Share Linux errno table with libsysdecode
Requested by:	jhb
Reviewed by:	jhb
Sponsored by:	Turing Robotic Industries Inc.
2018-03-22 12:58:49 +00:00
emaste
444301c25e Fix kernel memory disclosure in ibcs2_getdents
ibcs2_getdents() copies a dirent structure to userland.  The ibcs2
dirent structure contains a 2 byte pad element.  This element is never
initialized, but copied to userland none-the-less.

Note that ibcs2 has not built on HEAD since r302095.

Submitted by:	Domagoj Stolfa <ds815@cam.ac.uk>
Reported by:	Ilja Van Sprundel <ivansprundel@ioactive.com>
MFC after:	3 days
Security:	Kernel memory disclosure (803)
2018-03-21 23:26:42 +00:00
emaste
af3be5b4fb Add ) missing from r330297
Sponsored by:	The FreeBSD Foundation
2018-03-21 23:17:26 +00:00
emaste
6d7d087d6c Restore close quote lost in r331254 2018-03-20 21:04:47 +00:00
emaste
6fe54a5343 Rename assym.s to assym.inc
assym is only to be included by other .s files, and should never
actually be assembled by itself.

Reviewed by:	imp, bdrewery (earlier)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D14180
2018-03-20 17:58:51 +00:00
emaste
3df5d233f9 Rationalize license text on Linuxolator files
i386 linux.h missed in r330239.

Approved by:	sos
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2018-03-20 02:50:11 +00:00
emaste
58e1b5421f Rename linuxulator functions with linux_ prefix
It's preferable to have a consistent prefix.  This also reduces
differences between the three linux*_sysvec.c files.

Sponsored by:	Turing Robotic Industries Inc.
2018-03-19 21:26:32 +00:00
emaste
47fc89046a linux*_sysvec.c: rationalize whitespace and comments
There's a fair amount of duplication between MD linuxulator files.
Make indentation and comments consistent between the three versions of
linux_sysvec.c to reduce diffs when comparing them.

Sponsored by:	Turing Robotic Industries Inc.
2018-03-19 15:11:10 +00:00
emaste
566c3d41cc Share a single bsd-linux errno table across MD consumers
Three copies of the linuxulator linux_sysvec.c contained identical
BSD to Linux errno translation tables, and future work to support other
architectures will also use the same table.  Move the table to a common
file to be used by all.  Make it 'const int' to place it in .rodata.

(Some existing Linux architectures use MD errno values, but x86 and Arm
share the generic set.)

This change should introduce no functional change; a followup will add
missing errno values.

MFC after:	3 weeks
Sponsored by:	Turing Robotic Industries Inc.
Differential Revision:	https://reviews.freebsd.org/D14665
2018-03-16 14:46:38 +00:00
emaste
d90e216823 ANSIfy i386/vm86.c 2018-03-16 12:12:41 +00:00
emaste
9975d0e7b5 Remove stray ; at end of linux_vdso_deinstall() 2018-03-14 13:20:36 +00:00
emaste
6851f84d1f Use C99 boolean type for translate_osrel
Migrate to modern types before creating MD Linuxolator bits for new
architectures.

Reviewed by:	cem
Sponsored by:	Turing Robotic Industries Inc.
Differential Revision:	https://reviews.freebsd.org/D14676
2018-03-13 16:40:29 +00:00
emaste
bc4d21ce60 Apply some style(9) to Linuxulator linux_sysvec.c comments 2018-03-13 00:40:05 +00:00
emaste
c303c68d6a imgact_linux.c: use standard indentation
Sponsored by:	Turing Robotic Industries Inc.
2018-03-12 23:28:25 +00:00
emaste
ac81b47ca6 Linuxulator: apply style(9) to return
Sponsored by:	Turing Robotic Industries Inc.
2018-03-12 15:35:24 +00:00
brooks
1db1eebcab Remove obsolete pcaudioio.h.
Nothing uses the #define's values or the types.  (Some NTP code does use
an audio_info_t, but it is in #ifdef'd support for Solaris and is not
this audio_info_t).

Sponsored by:	DARPA, AFRL
2018-03-11 16:17:53 +00:00
eadler
9b1ea58f58 sys: Fix a few potential infoleaks in cloudabi
While there is no immediate leak, if the structure changes underneath
us, there might be in the future.

Submitted by:	Domagoj Stolfa <domagoj.stolfa@gmail.com>
MFC After:	1 month
Sponsored by:	DARPA/AFRL
2018-03-07 14:44:32 +00:00
jtl
8e9b6569cb amd64: Protect the kernel text, data, and BSS by setting the RW/NX bits
correctly for the data contained on each memory page.

There are several components to this change:
 * Add a variable to indicate the start of the R/W portion of the
   initial memory.
 * Stop detecting NX bit support for each AP.  Instead, use the value
   from the BSP and, if supported, activate the feature on the other
   APs just before loading the correct page table.  (Functionally, we
   already assume that the BSP and all APs had the same support or
   lack of support for the NX bit.)
 * Set the RW and NX bits correctly for the kernel text, data, and
   BSS (subject to some caveats below).
 * Ensure DDB can write to memory when necessary (such as to set a
   breakpoint).
 * Ensure GDB can write to memory when necessary (such as to set a
   breakpoint).  For this purpose, add new MD functions gdb_begin_write()
   and gdb_end_write() which the GDB support code can call before and
   after writing to memory.

This change is not comprehensive:
 * It doesn't do anything to protect modules.
 * It doesn't do anything for kernel memory allocated after the kernel
   starts running.
 * In order to avoid excessive memory inefficiency, it may let multiple
   types of data share a 2M page, and assigns the most permissions
   needed for data on that page.

Reviewed by:	jhb, kib
Discussed with:	emaste
MFC after:	2 weeks
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D14282
2018-03-06 14:28:37 +00:00
kib
d27cb27779 Unify bulk free operations in several pmaps.
Submitted by:	Yoshihiro Ota
Reviewed by:	markj
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D13485
2018-03-04 20:53:20 +00:00
rpokala
03d8fd318a imcsmb(4): Intel integrated Memory Controller (iMC) SMBus controller driver
imcsmb(4) provides smbus(4) support for the SMBus controller functionality
in the integrated Memory Controllers (iMCs) embedded in Intel Sandybridge-
Xeon, Ivybridge-Xeon, Haswell-Xeon, and Broadwell-Xeon CPUs. Each CPU
implements one or more iMCs, depending on the number of cores; each iMC
implements two SMBus controllers (iMC-SMBs).

*** IMPORTANT NOTE ***
Because motherboard firmware or the BMC might try to use the iMC-SMBs for
monitoring DIMM temperatures and/or managing an NVDIMM, the driver might
need to temporarily disable those functions, or take a hardware interlock,
before using the iMC-SMBs. Details on how to do this may vary from board to
board, and the procedure may be proprietary. It is strongly suggested that
anyone wishing to use this driver contact their motherboard vendor, and
modify the driver as described in the manual page and in the driver itself.
(For what it's worth, the driver as-is has been tested on various SuperMicro
motherboards.)

Reviewed by:	avg, jhb
MFC after:	1 week
Relnotes:	yes
Sponsored by:	Panasas
Differential Revision:	https://reviews.freebsd.org/D14447
Discussed with:	avg, ian, jhb
Tested by:	allanjude (previous version), Panasas
2018-03-03 01:53:51 +00:00
brooks
a0e10aee51 Rename kernel-only members of semid_ds and msgid_ds.
This deliberately breaks the API in preperation for future syscall
revisions which will remove these nonstandard members.

In an exp-run a single port (devel/qemu-user-static) was found to
use them which it did becuase it emulates system calls.  This has
been fixed in the ports tree.

PR:		224443 (exp-run)
Reviewed by:	kib, jhb (previous version)
Exp-run by:	antoine
Sponsored by:	DARPA, AFRP
Differential Revision:	https://reviews.freebsd.org/D14490
2018-03-02 22:10:48 +00:00
cem
807035047f Remove unused error return from API that cannot fail
No implementation of fpu_kern_enter() can fail, and it was causing needless
error checking boilerplate and confusion. Change the return code to void to
match reality.

(This trivial change took nine days to land because of the commit hook on
sys/dev/random.  Please consider removing the hook or otherwise lowering the
bar -- secteam never seems to have free time to review patches.)

Reported by:	Lachlan McIlroy <Lachlan.McIlroy AT isilon.com>
Reviewed by:	delphij
Approved by:	secteam (delphij)
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14380
2018-02-23 20:15:19 +00:00
emaste
5122d84e64 Correct pseudo misspelling in sys/ comments
contrib code and #define in intel_ata.h unchanged.
2018-02-23 18:15:50 +00:00
emaste
f4add23ff5 Correct proper nouns in the Linuxulator
- Capitalize Linux
- Spell FreeBSD out in full
- Address some style(9) on changed lines

Sponsored by:	Turing Robotic Industries Inc.
2018-02-22 02:24:17 +00:00
kib
ee3d0fb8ef vm_wait() rework.
Make vm_wait() take the vm_object argument which specifies the domain
set to wait for the min condition pass.  If there is no object
associated with the wait, use curthread' policy domainset.  The
mechanics of the wait in vm_wait() and vm_wait_domain() is supplied by
the new helper vm_wait_doms(), which directly takes the bitmask of the
domains to wait for passing min condition.

Eliminate pagedaemon_wait().  vm_domain_clear() handles the same
operations.

Eliminate VM_WAIT and VM_WAITPFAULT macros, the direct functions calls
are enough.

Eliminate several control state variables from vm_domain, unneeded
after the vm_wait() conversion.

Scetched and reviewed by:	jeff
Tested by:	pho
Sponsored by:	The FreeBSD Foundation, Mellanox Technologies
Differential revision:	https://reviews.freebsd.org/D14384
2018-02-20 10:13:13 +00:00
emaste
624a2a708b Rationalize license text on Linuxolator files
Many licenses on Linuxolator files contained small variations from the
standard FreeBSD license text.  To avoid license proliferation switch to
the standard 2-clause FreeBSD license for those files where I have
permission from each of the listed copyright holders.  Additional files
waiting on permission from others are listed in review D14210.

Approved by:	kan, marcel, sos, rdivacky
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2018-02-16 15:00:14 +00:00
cem
47a3cad5ce x86 pmap: Make memory mapped via pmap_qenter() non-executable
The idea is, the pmap_qenter() API is now defined to not produce executable
mappings.  If you need executable mappings, use another API.

Add pg_nx flag in pmap_qenter on x86 to make kernel pages non-executable.

Other architectures that support execute-specific permissons on page table
entries should subsequently be updated to match.

Submitted by:	Darrick Lew <darrick.freebsd AT gmail.com>
Reviewed by:	markj
Discussed with:	alc, jhb, kib
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14062
2018-02-14 23:35:47 +00:00
hselasky
4287266c4f Import the mthca kernel side infiniband driver from Linux 4.9 and fix
compilation under FreeBSD. The mthca driver was temporarily removed as
part of the Linux 4.9 RoCE/infinband upgrade.

Top commit in Linux source tree:
69973b830859bc6529a7a0468ba0d80ee5117826

Sponsored by:	Mellanox Technologies
2018-02-13 17:04:34 +00:00
jeff
ba27b5187b Make v_wire_count a per-cpu counter(9) counter. This eliminates a
significant source of cache line contention from vm_page_alloc().  Use
accessors and vm_page_unwire_noq() so that the mechanism can be easily
changed in the future.

Reviewed by:	markj
Discussed with:	kib, glebius
Tested by:	pho (earlier version)
Sponsored by:	Netflix, Dell/EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14273
2018-02-12 22:53:00 +00:00
imp
c78af2dc5f We don't support gcc < 4.2.1, so varargs.h now is just #error
always. Unifdef for versions prior to 4.2.1 and remove now-unused
header files.

Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D14323
2018-02-12 14:48:14 +00:00
markj
7bc81b6db1 Use vm_page_unwire_noq() instead of directly modifying page wire counts.
No functional change intended.

Reviewed by:	alc, kib (previous revision)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D14266
2018-02-08 19:28:51 +00:00
jeff
e67ec0d694 Use per-domain locks for vm page queue free. Move paging control from
global to per-domain state.  Protect reservations with the free lock
from the domain that they belong to.  Refactor to make vm domains more
of a first class object.

Reviewed by:    markj, kib, gallatin
Tested by:      pho
Sponsored by:   Netflix, Dell/EMC Isilon
Differential Revision:  https://reviews.freebsd.org/D14000
2018-02-06 22:10:07 +00:00
kib
fba8d52923 Move signal trampolines out of locore.s into separate source file.
Similar to other arches, the move makes the subject of locore.s only
the kernel startup.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-02-06 00:02:30 +00:00
emaste
0eb6366bc8 Additional linuxolator whitespace cleanup, missed in r328890 2018-02-05 18:39:06 +00:00
emaste
5c9ea56c98 Linuxolator whitespace cleanup
A version of each of the MD files by necessity exists for each CPU
architecture supported by the Linuxolator.  Clean these up so that new
architectures do not inherit whitespace issues.

Clean up shared Linuxolator files while here.

Sponsored by:	Turing Robotic Industries Inc.
2018-02-05 17:29:12 +00:00
kib
01b52fdebb IBRS support, AKA Spectre hardware mitigation.
It is coded according to the Intel document 336996-001, reading of the
patches posted on lkml, and some additional consultations with Intel.

For existing processors, you need a microcode update which adds IBRS
CPU features, and to manually enable it by setting the tunable/sysctl
hw.ibrs_disable to 0.  Current status can be checked in sysctl
hw.ibrs_active.  The mitigation might be inactive if the CPU feature
is not patched in, or if CPU reports that IBRS use is not required, by
IA32_ARCH_CAP_IBRS_ALL bit.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D14029
2018-01-31 14:36:27 +00:00
bdrewery
59f27534cc Don't use an .OBJDIR for 'make sysent'.
Reported by:	emaste, jhb
Sponsored by:	Dell EMC
2018-01-29 19:14:15 +00:00
imp
1912ffb2e5 Add ISA PNP tables to ISA drivers. Fix a few incidental comments.
ACPI ISA PBP tables not tagged, there's bigger issues with them.
2018-01-29 00:22:30 +00:00
kib
545e25ea75 Use PCID to optimize PTI.
Use PCID to avoid complete TLB shootdown when switching between user
and kernel mode with PTI enabled.

I use the model close to what I read about KAISER, user-mode PCID has
1:1 correspondence to the kernel-mode PCID, by setting bit 11 in PCID.
Full kernel-mode TLB shootdown is performed on context switches, since
KVA TLB invalidation only works in the current pmap. User-mode part of
TLB is flushed on the pmap activations as well.

Similarly, IPI TLB shootdowns must handle both kernel and user address
spaces for each address.  Note that machines which implement PCID but
do not have INVPCID instructions, cause the usual complications in the
IPI handlers, due to the need to switch to the target PCID temporary.
This is racy, but because for PCID/no-INVPCID we disable the
interrupts in pmap_activate_sw(), IPI handler cannot see inconsistent
state of CPU PCID vs PCPU pmap/kcr3/ucr3 pointers.

On the other hand, on kernel/user switches, CR3_PCID_SAVE bit is set
and we do not clear TLB.

I can imagine alternative use of PCID, where there is only one PCID
allocated for the kernel pmap. Then, there is no need to shootdown
kernel TLB entries on context switch. But copyout(3) would need to
either use method similar to proc_rwmem() to access the userspace
data, or (in reverse) provide a temporal mapping for the kernel buffer
into user mode PCID and use trampoline for copy.

Reviewed by:	markj (previous version)
Tested by:	pho
Discussed with:	alc (some aspects)
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
Differential revision:	https://reviews.freebsd.org/D13985
2018-01-27 11:49:37 +00:00
emaste
ae11f64597 Use BSD-2-Clause-FreeBSD license on linux_support.s
These files previously had a 3-clause license and 'THE REGENTS' text.
Switch to standard 2-clause text with kib's approval, and add the SPDX
tag.

Approved by:	kib
2018-01-23 20:35:43 +00:00
pfg
ced875130d Revert r327828, r327949, r327953, r328016-r328026, r328041:
Uses of mallocarray(9).

The use of mallocarray(9) has rocketed the required swap to build FreeBSD.
This is likely caused by the allocation size attributes which put extra pressure
on the compiler.

Given that most of these checks are superfluous we have to choose better
where to use mallocarray(9). We still have more uses of mallocarray(9) but
hopefully this is enough to bring swap usage to a reasonable level.

Reported by:	wosch
PR:		225197
2018-01-21 15:42:36 +00:00
royger
774ffaf6a8 xen: fix IDT setup after PTI
On amd64 the IDT handler was not set correctly when using PTI.

While there also fix the selectors to SEL_KPL.

Obtained from:	kib
MFC with:	r328083
2018-01-20 14:59:37 +00:00
nwhitehorn
49a1f46412 Define PHYS_TO_DMAP() and DMAP_TO_PHYS() as panics on the architectures
(i386 and arm) that never implement them. This allows the removal of
#ifdef PHYS_TO_DMAP on code otherwise protected by a runtime check on
PMAP_HAS_DMAP. It also fixes the build on ARM and i386 after I forgot an
#ifdef in r328168.

Reported by:	Milan Obuch
Pointy hat to:	me
2018-01-19 22:17:13 +00:00
nwhitehorn
e79f2b9178 Remove SFBUF_OPTIONAL_DIRECT_MAP and such hacks, replacing them across the
kernel by PHYS_TO_DMAP() as previously present on amd64, arm64, riscv, and
powerpc64. This introduces a new MI macro (PMAP_HAS_DMAP) that can be
evaluated at runtime to determine if the architecture has a direct map;
if it does not (or does) unconditionally and PMAP_HAS_DMAP is either 0 or
1, the compiler can remove the conditional logic.

As part of this, implement PHYS_TO_DMAP() on sparc64 and mips64, which had
similar things but spelled differently. 32-bit MIPS has a partial direct-map
that maps poorly to this concept and is unchanged.

Reviewed by:		kib
Suggestions from:	marius, alc, kib
Runtime tested on:	amd64, powerpc64, powerpc, mips64
2018-01-19 17:46:31 +00:00
jhb
0aaf564f94 Use long for the last argument to VOP_PATHCONF rather than a register_t.
pathconf(2) and fpathconf(2) both return a long.  The kern_[f]pathconf()
functions now accept a pointer to a long value rather than modifying
td_retval directly.  Instead, the system calls explicitly store the
returned long value in td_retval[0].

Requested by:	bde
Reviewed by:	kib
Sponsored by:	Chelsio Communications
2018-01-17 22:36:58 +00:00
kib
c35d24e497 PTI for amd64.
The implementation of the Kernel Page Table Isolation (KPTI) for
amd64, first version. It provides a workaround for the 'meltdown'
vulnerability.  PTI is turned off by default for now, enable with the
loader tunable vm.pmap.pti=1.

The pmap page table is split into kernel-mode table and user-mode
table. Kernel-mode table is identical to the non-PTI table, while
usermode table is obtained from kernel table by leaving userspace
mappings intact, but only leaving the following parts of the kernel
mapped:

    kernel text (but not modules text)
    PCPU
    GDT/IDT/user LDT/task structures
    IST stacks for NMI and doublefault handlers.

Kernel switches to user page table before returning to usermode, and
restores full kernel page table on the entry. Initial kernel-mode
stack for PTI trampoline is allocated in PCPU, it is only 16
qwords.  Kernel entry trampoline switches page tables. then the
hardware trap frame is copied to the normal kstack, and execution
continues.

IST stacks are kept mapped and no trampoline is needed for
NMI/doublefault, but of course page table switch is performed.

On return to usermode, the trampoline is used again, iret frame is
copied to the trampoline stack, page tables are switched and iretq is
executed.  The case of iretq faulting due to the invalid usermode
context is tricky, since the frame for fault is appended to the
trampoline frame.  Besides copying the fault frame and original
(corrupted) frame to kstack, the fault frame must be patched to make
it look as if the fault occured on the kstack, see the comment in
doret_iret detection code in trap().

Currently kernel pages which are mapped during trampoline operation
are identical for all pmaps.  They are registered using
pmap_pti_add_kva().  Besides initial registrations done during boot,
LDT and non-common TSS segments are registered if user requested their
use.  In principle, they can be installed into kernel page table per
pmap with some work.  Similarly, PCPU can be hidden from userspace
mapping using trampoline PCPU page, but again I do not see much
benefits besides complexity.

PDPE pages for the kernel half of the user page tables are
pre-allocated during boot because we need to know pml4 entries which
are copied to the top-level paging structure page, in advance on a new
pmap creation.  I enforce this to avoid iterating over the all
existing pmaps if a new PDPE page is needed for PTI kernel mappings.
The iteration is a known problematic operation on i386.

The need to flush hidden kernel translations on the switch to user
mode make global tables (PG_G) meaningless and even harming, so PG_G
use is disabled for PTI case.  Our existing use of PCID is
incompatible with PTI and is automatically disabled if PTI is
enabled.  PCID can be forced on only for developer's benefit.

MCE is known to be broken, it requires IST stack to operate completely
correctly even for non-PTI case, and absolutely needs dedicated IST
stack because MCE delivery while trampoline did not switched from PTI
stack is fatal.  The fix is pending.

Reviewed by:	markj (partially)
Tested by:	pho (previous version)
Discussed with:	jeff, jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2018-01-17 11:44:21 +00:00
pfg
a7c6776f59 x86: make some use of mallocarray(9).
Focus on code where we are doing multiplications within malloc(9). None of
these ire likely to overflow, however the change is still useful as some
static checkers can benefit from the allocation attributes we use for
mallocarray.

This initial sweep only covers malloc(9) calls with M_NOWAIT. No good
reason but I started doing the changes before r327796 and at that time it
was convenient to make sure the sorrounding code could handle NULL values.

X-Differential revision: https://reviews.freebsd.org/D13837
2018-01-15 21:08:22 +00:00
emaste
48fbddaae8 Enable VIMAGE in i386 GENERIC (revert r327840)
We've switched back to ld.bfd on i386 for now.

PR:		225077
Sponsored by:	The FreeBSD Foundation
2018-01-14 16:04:51 +00:00
jeff
f375b4dd66 Implement NUMA support in uma(9) and malloc(9). Allocations from specific
domains can be done by the _domain() API variants.  UMA also supports a
first-touch policy via the NUMA zone flag.

The slab layer is now segregated by VM domains and is precise.  It handles
iteration for round-robin directly.  The per-cpu cache layer remains
a mix of domains according to where memory is allocated and freed.  Well
behaved clients can achieve perfect locality with no performance penalty.

The direct domain allocation functions have to visit the slab layer and
so require per-zone locks which come at some expense.

Reviewed by:	Attilio (a slightly older version)
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
2018-01-12 23:25:05 +00:00
emaste
3ba4847ec6 Temporarily disable VIMAGE on i386
An lld-linked i386 kernel hangs on boot, after em(4) starts.  This seems
similar to the issue that prompted us to disable VIMAGE on arm64 in
r326179.  Disable VIMAGE on i386 for now while we investigate.

PR:		225077
Sponsored by:	The FreeBSD Foundation
2018-01-11 19:08:43 +00:00
kib
9a35b0521e Fix grammar.
Submitted by:	alc
MFC after:	3 days
2018-01-11 16:50:03 +00:00
kib
3a13615ee3 Remove redundand CLD instructions.
We already clear %RFLAGS.DF on the kernel entry due to the compiler's
ABI requirements.

Suggested by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2018-01-11 13:22:13 +00:00
kib
7429338dc0 Update comment explaining the check, to reality.
Discussed with:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2018-01-11 12:07:24 +00:00
cem
8899c9cd84 x86: Document purpose of _safe variants of {rd,wr}msr()
Sponsored by:	Dell EMC Isilon
2018-01-10 22:41:00 +00:00
imp
6d4bb5fab7 inittodr(0) actually sets the time, so there's no need to call
tc_setclock(). It's redundant. Tweak UPDATING based on code review of
past releases.

Relnotes: yes (for the removal of pmtimer)
2018-01-10 17:25:08 +00:00
imp
10d1a3f2a4 Move prof_machdep.c to it's more traditional place under i386/i386. 2018-01-10 16:51:55 +00:00
imp
07f8cf61c1 Retire pmtimer driver. Move time fixing into apm driver. Move
Iwasaki-san's copyright over. Remove FIXME code that couldn't possibly
work. Call tc_settime() with our estimate of the delta we've been
alseep (the one we print) to adjust the time. Not sure what to do
about callouts, so keep the small #ifdef in place there.

Differential Revision: https://reviews.freebsd.org/D13823
2018-01-10 14:59:19 +00:00
imp
568d742088 Remove ccbque.h from i386/isa.
inline ccbque.h into scsi_low.h. The file isn't MD, so shouldn't live
in i386/isa. It's only used by scsi_low, so move it there so no new
clients accidentally grow. scsi_low may not even still work, and the
locking here is still SPL based. CAM should do the right thing, but
I've received no reports of these cards still working. At least it
compiles still and there's one fewer files in sys/i386/isa. While I'm
here, ansify and de-splize. CCB_MWANTED appears to be a clear-only
flag, but I've not changed that.

Differential Review: https://reviews.freebsd.org/D13672
2018-01-09 16:11:33 +00:00
kib
e478c0d2cf Avoid re-check of usermode condition.
It does not change anything in the behavior of trap_pfault(), while
eliminating obfuscation of jumping to the code which checks for the
condition reversed of the goto cause.  Also avoid force initialize the
rv variable, since it is now only accessed after storing vm_fault()
return value.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D13725
2018-01-01 20:47:03 +00:00
kib
e75b16052d Move i386/isa/elink.[hc] to dev/ep.
The ep(4) driver is the only consumer of the two functions from
elink.c.  I removed the standalone module as well, and most likely,
the module metadata is not needed anywhere, but this is for later
cleanup.

Discussed with:	imp, jhb
Sponsored by:	The FreeBSD Foundation
2017-12-30 11:42:49 +00:00
kib
5df88ba9c0 Move i386/isa/npx.c to i386i386/npx.c.
The i386 FPU (AKA npx) code does not depend on ISA devices at all,
after the support for IRQ13 FPU exceptions was removed.  Put the file
into the expected place in the kernel source tree.

Discussed with:	jhb
Sponsored by:	The FreeBSD Foundation
2017-12-30 11:33:04 +00:00
imp
7634b7d88a Remove two stray references to wl driver, removed some time ago. 2017-12-30 07:59:32 +00:00
jhb
cc9a800848 Remove a few more dangling references to ie(4). 2017-12-28 21:35:53 +00:00
pfg
cab3e1fc44 sys/i386/isa/elink*: sync with (older) NetBSD.
Make it easier to identify the point where we started diverging from
NetBSD. Newer versions of these files have been updated to a new license
so this information may become useful some day.

Obtained from:	NetBSD
2017-12-28 14:14:35 +00:00
eadler
421a929b1e kernel: Fix several typos and minor errors
- duplicate words
- typos
- references to old versions of FreeBSD

Reviewed by:	imp, benno
2017-12-27 03:23:21 +00:00
kib
3d18a9d66f Add atomic_load(9) and atomic_store(9) operations.
They provide relaxed-ordered atomic access semantic.  Due to the
FreeBSD memory model, the operations are syntaxical wrappers around
the volatile accesses.  The volatile qualifier is used to ensure that
the access not optimized out and in turn depends on the volatile
semantic as implemented by supported compilers.

The motivation for adding the operation is to help people coming from
other systems or knowing the C11/C++ standards where atomics have
special type and require use of the special access operations.  It is
still the case that FreeBSD requires plain load and stores of aligned
integer types to be atomic.

Suggested by:	jhb
Reviewed by:	alc, jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D13534
2017-12-19 09:59:20 +00:00
bde
5c5c139b6a Also forgotten in the previous that removed the permanent double mapping
of low physical memory:

Update the comment about leaving the permanent mapping in place.  This
also improves the wording of the comment.  PTD 0 is still left alone
because it is fairly important that it was unmapped earlier, and the
comment now describes the unmapping of the other low PTDs that the code
actually does.

Reviewed by:	kib
2017-12-18 14:29:48 +00:00
bde
994bacdf8f Remove the permanent double mapping of low physical memory and replace
it by a transient double mapping for the one instruction in ACPI wakeup
where it is needed (and for many surrounding instructions in ACPI resume).
Invalidate the TLB as soon as convenient after undoing the transient
mapping.  ACPI resume already has the strict ordering needed for this.

This fixes the non-trapping of null pointers and other garbage pointers
below NBPDR (except transiently).  NBPDR is quite large (4MB, or 2MB for
PAE).

This fixes spurious traps at the first instruction in VM86 bioscalls.
The traps are for transiently missing read permission in the first
VM86 page (physical page 0) which was just written to at KERNBASE in
the kernel.  The mechanism is unknown (it is not simply PG_G).

locore uses a similar but larger transient double mapping and needs
it for 2 instructions instead of 1.  Unmap the first PDE in it after
the 2 instructions to detect most garbage pointers while bootstrapping.
pmap_bootstrap() finishes the unmapping.

Remove the avoidance of the double mapping for a recently fixed special
case.  ACPI resume could use this avoidance (made non-special) to avoid
any problems with the transient double mapping, but no such problems
are known.

Update comments in locore.  Many were for old versions of FreeBSD which
tried to map low memory r/o except for special cases, or might have
allowed access to low memory via physical offsets.  Now all kernel
maps are r/w, and removal of of the double map disallows use of physical
offsets again.
2017-12-18 13:53:22 +00:00
bde
6031fc5935 Fix the undersupported option KERNLOAD, part 2: fix crashes in locore
when KERNLOAD is smaller than NBPDR (not the default) and PG_G is
enabled (the default if the CPU supports it).  This case has relatively
minor problems with coherency of the permanent double mapping, but the
fix in r167869 to improve coherency creates page tables with 3 different
errors so never worked.

The permanent double mapping is fundamentally broken and will be removed
soon.  It fundamentally breaks trapping for null pointers and requires
complications to avoid cache coherency bugs.  It is currently used for
only a single instruction in ACPI resume,

Many fixes VM86 and/or ACPI and/or the double map were attempted near
r1200000.  r167869 attempted to fix cache coherency bugs in an unusual
case, but the bugs were unreachable because older errors in page tables
caused a crash first.

This commit just makes r167869 work as intended.  Part 1 of these fixes
fixed the other errors, but also stopped mapping the PDE for KERNBASE
as a large page, so double mapping of this PDE only causes the same
problems as when KERNLOAD is the default.  Except for the problem of
trapping null pointers, r167869 could be used to fix these problems,
but it is inactive in usual cases.  The only known other problem is
that incoherent permissions for page 0 cause spurious traps in VM86
BIOS calls.

Reviewed by:	kib
2017-12-18 11:57:05 +00:00
bde
622efbbef8 Fix the undersupported option KERNLOAD, part 1: fix crashes in locore
when KERNLOAD is not a multiple of NBPDR (not the default) and PSE is
enabled (the default if the CPU supports it).  Addresses in PDEs must
be a multiple of NBPDR in the PSE case, but were not so in the crashing
case.

KERNLOAD defaults to NBPDR.  NBPDR is 4 MB for !PAE and 2 MB for PAE.
The default can be changed by editing i386/include/vmparam.h or using
makeoptions.  It can be changed to less than NBPDR to save real and
virtual memory at a small cost in time, or to more than NBPDR to waste
real and virtual memory.  It must be larger than 1 MB and a multiple of
PAGE_SIZE.  When it is less than NBPDR, it is necessarily not a multiple
of NBPDR.  This case has much larger bugs which will be fixed in part 2.

The fix is to only use PSE for physical addresses above <KERNLOAD
rounded _up_ to an NBPDR boundary>.  When the rounding is non-null,
this leaves part of the kernel not using large pages.  Rounding down
would avoid this pessimization, but would break setting of PAT bits
on i/o pages if it goes below 1MB.  Since rounding down always goes
below 1MB when KERNLOAD < NBPDR and the KERNLOAD > NBPDR case is not
useful, never round down.

Fix related style bugs (e.g., wrong literal values for NBPDR in comments).

Reviewed by:	kib
2017-12-18 09:32:56 +00:00
bde
4e663070d6 Minor cleanups found while fixing a bug involving double mapping of low
memory:

Load the kernel eflags less magically, as in locore.  The magic increased
when I removed eflags from the pcb in r305899.

Remove a jump to low memory that became garbage when the i386 version was
mostly replaced by the amd64 version in r235622.

The amd64 version is very similar.  It still loads the flags magically,
but is not missing comments about using the special page table.

Reviewed by:	kib
2017-12-15 03:05:14 +00:00
markj
b0b9b4fcf4 Pass the trap frame to fasttrap hooks.
The DTrace fasttrap entry points expect a struct reg containing the
register values of the calling thread. Perform the conversion in
fasttrap rather than in the trap handler: this reduces the number of
ifdefs and avoids wasting stack space for traps that don't involve
DTrace.

MFC after:	2 weeks
2017-12-11 19:21:39 +00:00
cem
c89be21d55 i386: Bump KSTACK_PAGES default to match amd64
Logically, extend r286288 to cover all threads, by default.

The world has largely moved on from i386.  Most FreeBSD users and developers
test on amd64 hardware.  For better or worse, we have written a non-trivial
amount of kernel code that relies on stacks larger than 8 kB, and it "just
works" on amd64, so there has been little incentive to shrink it.

amd64 had its KSTACK_PAGES bumped to 4 back in Peter's initial AMD64 commit,
r114349, in 2003.  Since that time, i386 has limped along on a stack half
the size.  We've even observed the stack overflows years ago, but neglected
to fix the issue; see the 20121223 and 20150728 entries in UPDATING.

If anyone is concerned with this change, I suggest they configure their
AMD64 kernels with KSTACK_PAGES 2 and fix the fallout there first.  Eugene
has identified a list of high stack usage functions in the first PR below.

PR:		219476, 224218
Reported by:	eugen@, Shreesh Holla <hshreesh AT yahoo.com>
Relnotes:	maybe
Sponsored by:	Dell EMC Isilon
2017-12-11 04:32:37 +00:00
bde
d81660feeb Move instantiation of msgbufp from 9 MD files to subr_prf.c.
This variable should be pure MI except possibly for reading it in MD
dump routines.  Its initialization was pure MD in 4.4BSD, but FreeBSD
changed this in r36441 in 1998.  There were many imperfections in
r36441.  This commit fixes only a small one, to simplify fixing the
others 1 arch at a time.  (r47678 added support for
special/early/multiple message buffer initialization which I want in
a more general form, but this was too fragile to use because hacking
on the msgbufp global corrupted it, and was only used for 5 hours in
-current...)
2017-12-07 07:55:38 +00:00
pfg
b0f7aa75d4 SPDX: use the Beerware identifier. 2017-11-30 20:33:45 +00:00
scottl
49fb5dd79b It's time to retire AHC_REG_PRETTY_PRINT and AHD_REG_PRETTY_PRINT from
the standard kernels.  They are still available as custom compile
options.
2017-11-29 23:41:49 +00:00
brooks
c6fbed1a3a Disable vim syntax highlighting.
Vim's default pick doesn't understand that ';' is a comment character
and the result looks horrible.

Reviewed by:	emaste
2017-11-28 18:23:17 +00:00
fsu
24e4690114 Remap ENOATTR to ENODATA in the linuxulator.
In the linux ENOADATA is frequently #defined as ENOATTR.
The change is required for an xattrs support implementation.

MFC after: 1 week
Discussed with: netchild
Approved by: pfg

Differential Revision: https://reviews.freebsd.org/D13221
2017-11-27 17:03:11 +00:00
pfg
712696a24c sys/i386: further adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
2017-11-27 15:08:52 +00:00
ed
d13a2ec254 Use TO_PTR() to convert integers to pointers.
For FreeBSD/arm64's cloudabi32 support, I'm going to need a TO_PTR() in
this place. Also use it for all of the other source files, so that the
difference remains as minimal as possible.

MFC after:	2 weeks
2017-11-26 14:45:56 +00:00
hselasky
091ce9badd Merge ^/head r326132 through r326161. 2017-11-24 12:13:27 +00:00
hselasky
7b5126003a Merge ^/head r325999 through r326131. 2017-11-23 14:28:14 +00:00
kib
873f304292 Remove lint support from system headers and MD x86 headers.
Reviewed by:	dim, jhb
Discussed with:	imp
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D13156
2017-11-23 11:40:16 +00:00
pfg
4736ccfd9c sys: further adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
2017-11-20 19:43:44 +00:00
hselasky
c6f05b2594 Merge ^/head r325842 through r325998. 2017-11-19 12:36:03 +00:00
pfg
9da7bdde06 spdx: initial adoption of licensing ID tags.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.

Initially, only tag files that use BSD 4-Clause "Original" license.

RelNotes:	yes
Differential Revision:	https://reviews.freebsd.org/D13133
2017-11-18 14:26:50 +00:00
kib
b53cf0d5b7 Remove i386 XBOX support.
It is for console presented at 2001 and featuring Pentium III
processor.  Even if any of them are still alive and run FreeBSD, we do
not have any sign of life from their users.  While removing another
dozens of #ifdefs from the i386 sources reduces the aversion from
looking at the code and improves the platform vitality.

Reviewed by:	cem, pfg, rink (XBOX support author)
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D13016
2017-11-16 14:27:02 +00:00
hselasky
b909dfecb7 Remove no longer supported mthca driver.
Sponsored by:	Mellanox Technologies
2017-11-13 10:59:38 +00:00
kib
a43e8bfb3a Remove useless DEBUG printfs in i386 sendsig() implementations.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-11-08 13:05:14 +00:00
kib
8113f89b71 x86: Do not emit unused TD_TID symbols.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-11-04 10:51:52 +00:00
kib
a46f239e34 Eliminate unused load.
Based on github pull request:	#117
Submitted by:	Wuyang-Chung@github
MFC after:	1 week
2017-11-04 10:50:47 +00:00
kib
a6dcbd1557 Consistently ensure that we do not load MXCSR with reserved bits set.
Some callers of fpusetregs()/npxsetregs(), most importantly
set_fpcontext(), clear reserved bits.  But some did not.  Do the
clearing in fpusetregs() and remove now redundand operation from
set_fpcontext().

Reported by:	Maxime Villard <max@m00nbsd.net>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-11-01 10:32:44 +00:00
tijl
173e2ebded Set the return address for stack entry points to zero.
Stack unwinders treat zero as a stop condition.  The value on the stack can
be non-zero because thread stacks may be arbitrary memory provided via
pthread_attr_setstack(3) or may be recycled from previous threads.

Reference:
https://lists.freebsd.org/pipermail/freebsd-current/2017-August/066855.html
https://lists.freebsd.org/pipermail/freebsd-current/2017-October/067254.html

Discussed with:	kib
MFC after:	1 week
2017-10-31 11:51:34 +00:00
eadler
45275e3a26 Update several more URLs
- Primarily http -> https
- Primarily FreeBSD project URLs
2017-10-29 08:17:03 +00:00
markj
1588800df4 Fix the VM_NRESERVLEVEL == 0 build.
Add VM_NRESERVLEVEL guards in the pmaps that implement transparent
superpage promotion using reservations.

Reviewed by:	alc, kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D12764
2017-10-23 15:34:05 +00:00
bz
48b1992757 With r181803 on 2008-08-17 23:27:27Z the first VIMAGE commit went into
HEAD.  Enable VIMAGE in GENERIC kernels and some others (where GENERIC does
not exist) on HEAD.

Disable building LINT-VIMAGE with VIMAGE being default.

This should give it a lot more exposure in the run-up to 12 to help
us evaluate whether to keep it on by default or not.
We are also hoping to get better performance testing.
The feature can be disabled using nooptions.

Requested by:		many
Reviewed by:		kristof, emaste, hiren
X-MFC after:		never
Relnotes:		yes
Differential Revision:	https://reviews.freebsd.org/D12639
2017-10-20 21:40:59 +00:00
markj
c87fb69add Move kernel dump offset tracking into MI code.
All of the kernel dump implementations keep track of the current offset
("dumplo") within the dump device. However, except for textdumps, they
all write the dump sequentially, so we can reduce code duplication by
having the MI code keep track of the current offset. The new
dump_append() API can be used to write at the current offset.

This is needed to implement support for kernel dump compression in the
MI kernel dump code.

Also simplify dump_encrypted_write() somewhat: use dump_write() instead
of duplicating its bounds checks, and get rid of the redundant offset
tracking.

Reviewed by:	cem
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D11722
2017-10-18 15:38:05 +00:00
kib
8844e99855 Change i386_get_ldt() to return 'EOF' when the requested range of
descriptors does not fit into currently allocated LDT, or trim the
return if the range fits partially.  Before, the function returned
EINVAL.

Fix two bugs in r324366: use capped num counter for malloc size, and
do not leak allocated buffer on EINVAL (by handling EINVAL case as
normal, see above).

Reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-09 16:19:26 +00:00
kib
3851626613 Improvements to set_user_ldt().
Remove mtx_owned() checks from set_user_ldt().  Split the function
into _locked() version which requires the dt_lock spinlock owned, and
make set_user_ldt() a wrapper.  Add a comment in swtch.s noting that
the call to the new set_user_ldt() cannot recurse on dt_lock.

Remove #ifdef SMP block, the addend is always zero on UP.

Fix type of set_user_ldt_rv(), making it match the type used for
smb_rendezvous() callback, and remove the cast.  Use curproc.

Reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-09 16:07:27 +00:00
kib
b941255a34 Reset the fs and gs bases on exec(2).
The values from the old address space do not make sense for the new
program.  In particular, gsbase might be the TLS base for the old
program but the new program has no TLS now.

amd64 already handles this correctly.

Reported and reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-09 15:39:43 +00:00
kib
b9de6b3f87 More style.
Reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-09 15:24:18 +00:00
kib
52a8a4f490 Improve i386_get_ldt().
Provide consistent snapshot of the requested descriptors by preventing
other threads from modifying LDT while we fetch the data, lock dt_lock
around the read.  Copy the data into intermediate buffer, which is
copied out after the lock is dropped.

Comparing with the amd64 version, the read is done byte by byte, since
there is no atomic 64bit read (cmpxchg8b method is too heavy comparing
with the avoided issues).

Improve overflow checking for the descriptors range calculations and
remove unneeded casts.  Use unsigned types for sizes.

Allow zero num argument to i386_get_ldt() and i386_set_ldt().  This
case is handled naturally by the code flow.

Reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-06 14:29:53 +00:00
kib
e300c500de Remove unneeded cast.
Reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-06 10:17:50 +00:00
kib
e0c7ec41ee Style.
Reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-06 10:16:57 +00:00
kib
c2dbc3a78e Use ANSI C declarations.
Reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-05 19:11:25 +00:00
kib
a81dddf1d3 Correct format specifiers in the debug code. Style.
Requested by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-05 18:58:28 +00:00
kib
b85dee047c Style.
Reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-05 18:42:13 +00:00
kib
50cb59e230 A different fix for the issue from r323722.
Split the handlers for pop of invalid selectors from the trap frame
into usermode and kernel variants.  Usermode handler is kept as is, it
restores the already loaded parts of the trap frame and jumps to set
up a signal delivery to the user process.

New kernel part of the handler emulates IRET treatment of the segments
which would violate access right.  It loads NUL selector in the
segment register which load causes the fault, and then continues the
return to interrupted kernel code.  Since invalid selectors in the
segment registers in the kernel mode can only exist while kernel still
enters or exits from userspace, we only zero invalid userspace
selectors.  If userspace tries to use the segment register, it gets a
signal, as if the processor segment descriptor cache was reloaded.

Reported by:	Maxime Villard <max@m00nbsd.net>
Suggested and reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-09-28 09:01:28 +00:00
kib
071c00f495 Restore a part of r323722.
Do not return from interrupt using the POP_FRAME;iret instruction
sequence, always jump to doreti.

The user segments selectors saved on the stack might become invalid
because userspace manipulated LDT in a parallel thread.  trap() is
aware of such issue, but it is only prepared to handle it at iret and
segment registers load operations in doreti path.

Also remove POP_FRAME macro because it is no longer used.

Reviewed by:	bde, jhb (as part of r323722)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-09-28 08:46:15 +00:00
kib
81faa225ff Revert r323722. A better fix will be committed shortly, as well as
some still useful bits of the reverted revision.

The problem with the committed fix is that there are still issues with
returning from NMI, when NMI interrupted kernel in a moment where the
kernel segments selectors were still not loaded into registers.  If
this happens, the NMI return would loose the userspace selectors
because r323722 does not reload segment registers on return to kernel
mode.

Fixing the problem is complicated.  Since an alternative approach to
handle the original bug exists, it makes sence to stop adding more
complexity.

Discussed with:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-09-28 08:38:24 +00:00
jpaetzel
b35131985b Fix indentation for r323068
PR:	220170
Reported by:	lidl
MFC after:	3 days
Pointyhat to:	jpaetzel
2017-09-19 20:40:05 +00:00
kib
466bfe2553 Do not do torn writes to active LDTs.
Care must be taken when updating the active LDT, since parallel
threads might try to load a segment descriptor which is currently
updated. Since the results are undefined, this cannot be ignored by
claiming to be an application race.

Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D12413
2017-09-19 17:57:04 +00:00
kib
981650a056 Fix handling of the segment registers on i386.
Suppose that userspace is executing with the non-standard segment
descriptors.  Then, until exception or interrupt handler executed
SET_KERNEL_SEGS, kernel is still executing with user %ds, %es and %fs.
If an interrupt occurs in this window, the interrupt handler is
executed unsafely, relying on usability of the usermode registers.  If
the interrupt results in the context switch on return, the
contamination of the kernel state spreads to the thread we switched
to.  As result, kernel data accesses might fault or, if only the base
is changed, completely messed up.

More, if the user segment was allocated in LDT, another thread might
mark the descriptor as invalid before doreti code tried to reload
them.  In this case kernel panics.

The issue exists for all exception entry points which use trap gate,
and thus do not automatically disable interrupts on entry, and for
lcall_handler.

Fix is two-fold: first, we need to disable interrupts for all kernel
entries, changing the IDT descriptor types from trap gate to interrupt
gate.  Interrupts are re-enabled not earlier than the kernel segments
are loaded into the segment registers.  Second, we only load the
segment registers from the trap frame when returning to usermode.  For
the later, all interrupt return paths must happen through the doreti
common code.

There is no way to disable interrupts on call gate, which is the
supposed mode of servicing for lcall $7,$0 syscalls.  Change the LDT
descriptor 0 into a code segment type and point it to the userspace
trampoline which redirects the syscall to int $0x80.

All the measures make the segment register handling similar to that of
amd64.  We do not apply amd64 optimizations of not reloading segment
registers on return from the syscall.

Reported by:	Maxime Villard <max@m00nbsd.net>
Tested by:	pho (the non-lcall part)
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D12402
2017-09-18 20:22:42 +00:00
jpaetzel
bfd734f77c Revert r323087
This needs more thinking out and consensus, and the commit message
was wrong AND there was a typo in the commit.

pointyhat:	jpaetzel
2017-09-01 17:03:48 +00:00
jpaetzel
612bb8539d Take options IPSEC out of GENERIC
PR:	220170
Submitted by:	delphij
Reviewed by:	ae, glebius
MFC after:	2 weeks
Differential Revision:	D11806
2017-09-01 15:54:53 +00:00
jpaetzel
f7739d7e09 Allow kldload tcpmd5
PR:	220170
MFC after:	2 weeks
2017-08-31 20:16:28 +00:00
mav
5849e8f575 Add NTB driver for PLX/Avago/Broadcom PCIe switches.
This driver supports both NTB-to-NTB and NTB-to-Root Port modes (though
the second with predictable complications on hot-plug and reboot events).
I tested it with PEX 8717 and PEX 8733 chips, but expect it should work
with many other compatible ones too.  It supports up to two NT bridges
per chip, each of which can have up to 2 64-bit or 4 32-bit memory windows,
6 or 12 scratchpad registers and 16 doorbells.  There are also 4 DMA engines
in those chips, but they are not yet supported.

While there, rename Intel NTB driver from generic ntb_hw(4) to more specific
ntb_hw_intel(4), so now it is on par with this new ntb_hw_plx(4) driver and
alike to Linux naming.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2017-08-30 21:16:32 +00:00
cem
6659d8ab68 Drop CACHE_LINE_SIZE to 64 bytes on x86
The actual cache line size has always been 64 bytes.

The 128 number arose as an optimization for Core 2 era Intel processors.  By
default (configurable in BIOS), these CPUs would prefetch adjacent cache
lines unintelligently.  Newer CPUs prefetch more intelligently.

The latest Core 2 era CPU was introduced in September 2008 (Xeon 7400
series, "Dunnington").  If you are still using one of these CPUs, especially
in a multi-socket configuration, consider locating the "adjacent cache line
prefetch" option in BIOS and disabling it.

Reported by:	mjg
Reviewed by:	np
Discussed with:	jhb
Sponsored by:	Dell EMC Isilon
2017-08-28 22:28:41 +00:00
kib
36c08d0250 MFamd64 r322720, r322723:
Simplify i386 trap().

- Use more relevant name 'signo' instead of 'i' for the local variable
  which contains a signal number to send for the current exception.
- Eliminate two labels 'userout' and 'out' which point to the very end
  of the trap() function.  Instead use return directly.
- Re-indent the prot_fault_translation block by reducing if() nesting.
- Some more monor style changes.

Reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-08-26 18:12:25 +00:00
kib
27dbd32f45 Remove unused code.
The machdep.uprintf_signal sysctl replaced it in more convenient way,
not requiring recompilation to use and providing more information on
fault.

Reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-08-26 18:09:27 +00:00
kib
f6d447048d MFamd64 r322718:
Use ANSI C declaration for trap_pfault().  Style.

Reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-08-26 18:06:29 +00:00
kib
0a239887e3 MFamd64 r322719:
Trim excessive 'extern'.

Reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-08-26 18:04:29 +00:00
kib
03221989ce Use the known valid segment when accessing memory in #UD handler.
Make sure that %eflags.D flag is cleared for hook.
Improve comments.

When #UD dtrace code checks for a registered hook before checking that
the exception was raised from kernel mode, we might run with the user
%ds, trapping on access.  Exception entry from userspace automatically
load valid %ss, which we can use there instead.

Noted and reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2017-08-19 21:00:02 +00:00
kib
6dd0b9e76c When checking that #UD comes from kernel mode, check that the
exception did not happen in vm86 mode.  A vm86 userland process could
have a %cs that matches GSEL_KPL, while dtrace cannot hook it.

Submitted by:	Maxime Villard <max@m00nbsd.net>
MFC after:	3 days
2017-08-18 17:11:15 +00:00
markj
ce8e2801bf Rename mkdumpheader() and group EKCD functions in kern_shutdown.c.
This helps simplify the code in kern_shutdown.c and reduces the number
of globally visible functions.

No functional change intended.

Reviewed by:	cem, def
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D11603
2017-08-18 04:04:09 +00:00
markj
f6dd3eb223 Factor out duplicated kernel dump code into dump_{start,finish}().
dump_start() and dump_finish() are responsible for writing kernel dump
headers, optionally writing the key when encryption is enabled, and
initializing the initial offset into the dump device.

Also remove the unused dump_pad(), and make some functions static now that
they're only called from kern_shutdown.c.

No functional change intended.

Reviewed by:	cem, def
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D11584
2017-08-18 03:52:35 +00:00
cem
5a79b729aa x86: Add dynamic interrupt rebalancing
Add an option to dynamically rebalance interrupts across cores
(hw.intrbalance); off by default.

The goal is to minimize preemption. By placing interrupt sources on distinct
CPUs, ithreads get preferentially scheduled on distinct CPUs.  Overall
preemption is reduced and latency is reduced. In our workflow it reduced
"fighting" between two high-frequency interrupt sources.  Reduced latency
was proven by, e.g., SPEC2008.

Submitted by:	jeff@ (earlier version)
Reviewed by:	kib@
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D10435
2017-08-16 18:48:53 +00:00
jkim
d871b34dbc Split identify_cpu() into two functions for amd64 as we do for i386. This
reduces diff between amd64 and i386.  Also, it fixes a regression introduced
in r322076, i.e., identify_hypervisor() failed to identify some hypervisors.
This function assumes cpu_feature2 is already initialized.

Reported by:	dexuan
Tested by:	dexuan
2017-08-09 18:09:09 +00:00
jkim
0ac013123c Detect hypervisors early. We used to set lower hz on hypervisors by default
but it was broken since r273800 (and r278522, its MFC to stable/10) because
identify_cpu() is called too late, i.e., after init_param1().

MFC after:	3 days
2017-08-05 06:56:46 +00:00
cem
d4a7946bbd x86: Tag some intrinsics with __pure2
Some C wrappers for x86 instructions do not touch global memory and only act
on their arguments; they can be marked __pure2, aka __const__.  Without this
annotation, Clang 3.9.1 is not intelligent enough on its own to grok that
these functions are __const__.

Submitted by:	Anton Rang <anton.rang AT isilon.com>
Sponsored by:	Dell EMC Isilon
2017-08-03 22:28:30 +00:00
kib
38f0d940ba Do not call trapsignal() after handling usermode fault or interrupt,
when a signal is not intended to be sent.

The variable holding the signal number to send is left uninitialized,
which sometimes triggers invalid signal checks.

For NMI, a return to usermode without ast processing is done.  On the
other hand, for spurious dtrace probe interrupt it is usermode which
triggered the interrupt, so handle it through userret() as any other
fault.

Reported by:	Nils Beyer <nbe@renzel.net>
PR:	221151
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-08-02 10:12:10 +00:00
markj
b23f6b9b03 Batch updates to v_wire_count when freeing page table pages on x86.
The removed release stores are not needed since stores are totally
ordered on i386 and amd64.

Reviewed by:	alc, kib (previous revision)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D11790
2017-08-01 05:26:30 +00:00
kib
030abb1986 Remove unused symbols.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-07-30 21:52:22 +00:00
dchagin
93ea6317f4 Avoid using [LINUX_]SHAREDPAGE constant directly in the vdso code.
This is needed for https://reviews.freebsd.org/D11780.

Reported by:	kib@
2017-07-30 21:24:20 +00:00
kib
e83fbfb104 Fix handling of one more possible exception on return to usermode.
If %ss is loaded with a segment pointing to a non-present descriptor
by the IRETD instruction, a kernel-mode #SS exception is generated.
Resulting T_STKFLT trap must be checked against doreti_iret_fault
location and handled, otherwise userspace may panic the kernel.

Note that this is i386 variant of FreeBSD-SA-15:21.amd64, but unlike
amd64, there is no swapgs on i386 and the issue is arguably not
exploitable.

Reported by:	Maxime Villard <max@m00nbsd.net>
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-07-08 11:07:39 +00:00
sbruno
12de4d18c8 Garbage collect kernel option TWA_FLASH_FIRMWARE
Submitted by:	 kevin.bowling0kev009.com
Differential Revision:	https://reviews.freebsd.org/D11387
2017-07-03 19:33:50 +00:00
dchagin
a0f4bac287 Add support for musl consumers to the Linuxulator.
PR:		213809
Submitted by:	Yonas Yanfa
Reported by:	Yonas Yanfa
MFC after:	1 week
Relnotes:	yes
2017-07-03 10:24:49 +00:00
alc
bf2a2bba1b When "force" is specified to pmap_invalidate_cache_range(), the given
start address is not required to be page aligned.  However, the loop
within pmap_invalidate_cache_range() that performs the actual cache
line invalidations requires that the starting address be truncated to
a multiple of the cache line size.  This change corrects an error in
that truncation.

Submitted by:	Brett Gutstein <bgutstein@rice.edu>
Reviewed by:	kib
MFC after:	1 week
2017-07-01 16:42:09 +00:00
jah
d1caaa9300 Clean up MD pollution of bus_dma.h:
--Remove special-case handling of sparc64 bus_dmamap* functions.
  Replace with a more generic mechanism that allows MD busdma
  implementations to generate inline mapping functions by
  defining WANT_INLINE_DMAMAP in <machine/bus_dma.h>.  This
  is currently useful for sparc64, x86, and arm64, which all
  implement non-load dmamap operations as simple wrappers
  around map objects which may be bus- or device-specific.

--Remove NULL-checked bus_dmamap macros.  Implement the
  equivalent NULL checks in the inlined x86 implementation.
  For non-x86 platforms, these checks are a minor pessimization
  as those platforms do not currently allow NULL maps.  NULL
  maps were originally allowed on arm64, which appears to have
  been the motivation behind adding arm[64]-specific barriers
  to bus_dma.h, but that support was removed in r299463.

--Simplify the internal interface used by the bus_dmamap_load*
  variants and move it to bus_dma_internal.h

--Fix some drivers that directly include sys/bus_dma.h
  despite the recommendations of bus_dma(9)

Reviewed by:	kib (previous revision), marius
Differential Revision:	https://reviews.freebsd.org/D10729
2017-07-01 05:35:29 +00:00
kib
c25bc401a9 Fix indent.
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2017-06-24 10:19:06 +00:00
kib
01fe33c656 Correct translations between abridged and full x87 tags.
Reported and tested by:	karnajit wangkhem <karnajitw@gmail.com>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-06-17 11:25:31 +00:00
kib
e2a14c603f Move struct syscall_args syscall arguments parameters container into
struct thread.

For all architectures, the syscall trap handlers have to allocate the
structure on the stack.  The structure takes 88 bytes on 64bit arches
which is not negligible.  Also, it cannot be easily found by other
code, which e.g. caused duplication of some members of the structure
to struct thread already.  The change removes td_dbg_sc_code and
td_dbg_sc_nargs which were directly copied from syscall_args.

The structure is put into the copied on fork part of the struct thread
to make the syscall arguments information correct in the child after
fork.

This move will also allow several more uses shortly.

Reviewed by:	jhb (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
X-Differential revision:	https://reviews.freebsd.org/D11080
2017-06-12 21:03:23 +00:00
kib
7b6fe97487 Make struct syscall_args visible to userspace compilation environment
from machine/proc.h, consistently on all architectures.

Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
X-Differential revision:	https://reviews.freebsd.org/D11080
2017-06-12 20:53:44 +00:00
jhb
5387dbf595 Remove the BSD/OS 2.1 system call gate LDT entry.
An extra copy of the system call gate was added to the default LDT back
in 1996 (r18513 / r18514).  However, the ability to run BSD/OS 2.1
i386 binaries under FreeBSD's native ABI is most likely no longer
needed.

Discussed with:	kib
2017-05-23 22:34:18 +00:00
emaste
1901c3e1f2 Remove register keyword from sys/ and ANSIfy prototypes
A long long time ago the register keyword told the compiler to store
the corresponding variable in a CPU register, but it is not relevant
for any compiler used in the FreeBSD world today.

ANSIfy related prototypes while here.

Reviewed by:	cem, jhb
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D10193
2017-05-17 00:34:34 +00:00
jkim
9445e9d6e1 Use kmem_malloc() instead of malloc(9) for the native amd64 filter.
r316767 broke the BPF JIT compiler for amd64 because malloc()'d space is no
longer executable.

Discussed with:	kib, alc
2017-04-17 22:02:09 +00:00
jkim
22babe5387 Move declarations for a machine-dependent function to the header file. 2017-04-17 21:51:26 +00:00
jkim
98569e5163 Reduce diff with amd64 version. 2017-04-17 21:46:54 +00:00
glebius
21ead51d79 - Remove 'struct vmmeter' from 'struct pcpu', leaving only global vmmeter
in place.  To do per-cpu stats, convert all fields that previously were
  maintained in the vmmeters that sit in pcpus to counter(9).
- Since some vmmeter stats may be touched at very early stages of boot,
  before we have set up UMA and we can do counter_u64_alloc(), provide an
  early counter mechanism:
  o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter.
  o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter,
    so that at early stages of boot, before counters are allocated we already
    point to a counter that can be safely written to.
  o For sparc64 that required a whole dummy pcpu[MAXCPU] array.

Further related changes:
- Don't include vmmeter.h into pcpu.h.
- vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit,
  to match kernel representation.
- struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exclusion.

This is based on benno@'s 4-year old patch:
https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html

Reviewed by:	kib, gallatin, marius, lidl
Differential Revision:	https://reviews.freebsd.org/D10156
2017-04-17 17:34:47 +00:00
glebius
0266d4259e Remove unused assembly symbols pointing to vmmeter. 2017-04-17 17:18:07 +00:00
pkelsey
33064e92a2 Corrected misspelled versions of rendezvous.
The MFC will include a compat definition of smp_no_rendevous_barrier()
that calls smp_no_rendezvous_barrier().

Reviewed by:	gnn, kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D10313
2017-04-09 02:00:03 +00:00
avatar
93f11e526d Trying to be more compatible with Linux if.h definitions:
- renaming l_ifreq::ifru_metric to l_ifreq::ifru_ivalue;
	- adding a definition for ifr_ifindex which points to l_ifreq::ifru_ivalue.

A quick search indicates that Linux already got the above changes since 2.1.14.

Reviewed by:	kib, marcel, dchagin
MFC after:	1 week
2017-04-08 14:41:39 +00:00
markj
d294fb49c0 Adjust the constraint for "src" in atomic_(f)cmpset_8.
"r" is not sufficient to prevent the use of invalid byte-width registers
with at least gcc.

Reported and reviewed by:	bde
X-MFC-With:	r315718
2017-03-27 16:18:19 +00:00
avg
7a52acd8b3 revert r315959 because it causes build problems
The change introduced a dependency between genassym.c and header files
generated from .m files, but that dependency is not specified in the
make files.

Also, the change could be not as useful as I thought it was.

Reported by:	dchagin, Manfred Antar <null@pozo.com>, and many others
2017-03-27 12:34:29 +00:00
bde
254458ab34 Fix printing of negative offsets (typically from frame pointers) again.
I fixed this in 1997, but the fix was over-engineered and fragile and
was broken in 2003 if not before.  i386 parameters were copied to 8
other arches verbatim, mostly after they stopped working on i386, and
mostly without the large comment saying how the values were chosen on
i386.  powerpc has a non-verbatim copy which just changes the uncritical
parameter and seems to add a sign extension bug to it.

Just treat negative offsets as offsets if they are no more negative than
-db_offset_max (default -64K), and remove all the broken parameters.

-64K is not very negative, but it is enough for frame and stack pointer
offsets since kernel stacks are small.

The over-engineering was mainly to go more negative than -64K for the
negative offset format, without affecting printing for more than a
single address.

Addresses in the top 64K of a (full 32-bit or 64-bit) address space
are now printed less well, but there aren't many interesting ones.
For arches that have many interesting ones very near the top (e.g.,
68k has interrupt vectors there), there would be no good limit for
the negative offset format and -64K is a good as anything.
2017-03-26 18:46:35 +00:00
avg
04ec8ce247 specific end of interrupt implementation for AMD Local APIC
The change is more intrusive than I would like because the feature
requires that a vector number is written to a special register.
Thus, now the vector number has to be provided to lapic_eoi().
It was readily available in the IO-APIC and MSI cases, but the IPI
handlers required more work.
Also, we now store the VMM IPI number in a global variable, so that it
is available to the justreturn handler for the same reason.

Reviewed by:	kib
MFC after:	6 weeks
Differential Revision: https://reviews.freebsd.org/D9880
2017-03-25 18:45:09 +00:00
dchagin
7e4bbabbee Implement Linux mincore() system call.
This is necessary for the upcoming drm-next.

Suggested by:	hselasky@
MFC after:	1 month
2017-03-25 15:47:29 +00:00
bde
27ac811b7a Remove buggy adjustment of page tables in db_write_bytes().
Long ago, perhaps only on i386, kernel text was mapped read-only and
it was necessary to change the mapping to read-write to set breakpoints
in kernel text.  Other writes by ddb to kernel text were also allowed.
This write protection is harder to implement with 4MB pages, and was
lost even for 4K pages when 4MB pages were implemented.  So changing
the mapping became useless.  It was actually worse than useless since
it followed followed various null and otherwise garbage pointers to
not change random memory instead of the mapping.  (On i386s, the
pointers became good in pmap_bootstrap(), and on amd64 the pointers
became bad in pmap_bootstrap() if not before.)

Another bug broke detection of following of null pointers on i386,
except early in boot where not detecting this was a feature.  When
I fixed the bug, I accidentally broke the feature and soon got traps
in db_write_bytes().  Setting breakpoints early in ddb was broken.

kib pointed out that a clean way to do the adjustment would be to use
a special [sub]map giving a small window on the bytes to be written.

The trap handler didn't know how to fix up errors for pagefaults
accessing the map itself.  Such errors rarely need fixups, since most
traps for the map are for the first access which is a read.

Reviewed by:	kib
2017-03-24 17:34:55 +00:00
glebius
d44335ea0c Remove Solaris 2.6 syscalls selector.
Discussed with:	kib
2017-03-23 19:54:41 +00:00
ed
63254ceea6 Stop providing the compat_3_brand.
As of r315860, the ELF image activator works fine for CloudABI without it.

Reviewed by:	kib
MFC after:	2 weeks
2017-03-23 14:12:21 +00:00
kib
da63ef1f60 Update r315753 with the proper flag name.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-03-22 22:28:13 +00:00
kib
a22b5a3135 Add a flag BI_BRAND_ONLY_STATIC to specify that the brand only
matches static binaries.

Interpretation of the 'static' there is that the binary must not
specify an interpreter.  In particular, shared objects are matched by
the brand if BI_CAN_EXEC_DYN is also set.

This improves precision of the brand matching, which should eliminate
surprises due to brand ordering.

Revert r315701.

Discussed with and tested by:	ed (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-03-22 22:23:01 +00:00
markj
c4e0ff355e Add support for 8- and 16-bit atomic_(f)cmpset to x86.
Reviewed by:	kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D10068
2017-03-22 17:29:04 +00:00
ed
fc95dfd2e5 Set the interpreter path to /nonexistent.
CloudABI executables are statically linked and don't have an
interpreter. Setting the interpreter path to NULL used to work
previously, but r314851 introduced code that checks the string
unconditionally. Running CloudABI executables now causes a null pointer
dereference.

Looking at the rest of imgact_elf.c, it seems various other codepaths
already leaned on the fact that the interpreter path is set. Let's just
go ahead and pick an obviously incorrect interpreter path to appease
imgact_elf.c.

MFC after:	1 week
2017-03-22 07:05:27 +00:00
dchagin
69ea87350f Implement getrandom() syscall.
Note. GRND_RANDOM option is not supported for now.

MFC after:	1 month
2017-03-18 18:34:29 +00:00
dchagin
82392d7947 To reduce code duplication move socket defines to the MI path.
MFC after:	1 week
2017-03-18 18:23:30 +00:00
bde
02d7df1d66 Don't access the reserved registers %dr4 and %dr5 on i386.
On the original i386, %dr[4-5] were unimplemented but not very clearly
reserved, so debuggers read them to print them.  i386 was still doing
this.

On the original athlon64, %dr[4-5] are documented as reserved but are
aliased to %dr[6-7] unless CR4_DE is set, when accessing them traps.

On 2 of my systems, accessing %dr[4-5] trapped sometimes.  On my Haswell
system, the apparent randomness was because the boot CPU starts with
CR4_DE set while all other CPUs start with CR4_DE clear.  FreeBSD
doesn't support the data breakpoints enabled by CR4_DE and it never
changes this flag, so the flag remains different across CPUs and
the behaviour seemed inconsistent except while booting when the CPU
doesn't change.

The invalid accesses broke:
- read access for printing the registers in ddb "show watches" on CPUs
  with CR4_DE set
- read accesses in fill_dbregs() on CPUs with CR4_DE set.  This didn't
  implement panic(3) since the user case always skipped %dr[4-5].
- write accesses in set_dbregs().  This also didn't affect userland.
  When it didn't trap, the aliasing made it fragile.

Don't print the dummy (zero) values of %dr[4-5] in "show watches" for
i386 or amd64.  Fix style bugs near this printing.

amd64 also has space in the dbregs struct for the reserved %dr[8-15]
and already didn't print the dummy values for these, and never accessed
any of the 10 reserved debug registers.

Remove cpufuncs for making the invalid accesses.  Even amd64 had these.
2017-03-17 13:49:05 +00:00
manu
044a23ec3c Remove i915drm and radeondrm from NOTES and conf.
This unbreak LINT kernel.

Reported by:	lwhsu
2017-03-12 00:52:16 +00:00
dchagin
d7b4f21065 Reduce code duplication between MD Linux code by moving SYSV IPC 64-bit
related struct definitions out into the MI path.

Invert the native ipc structs to the Linux ipc structs convesion logic.
Since 64-bit variant of ipc structs has more precision convert native ipc
structs to the 64-bit Linux ipc structs and then truncate 64-bit values
into the non 64-bit if needed. Unlike Linux, return EOVERFLOW if the
values do not fit.

Fix SYSV IPC for 64-bit Linuxulator which never sets IPC_64 bit.

MFC after:	1 month
2017-03-07 17:07:16 +00:00
mmokhi
d2f0c04c1f Regenerated Linuxulator syscall tables for r314782
Approved by:	dchagin
MFC after:	1 month
2017-03-06 18:20:37 +00:00
mmokhi
4c583e7b8e Add UNIMPLEMENTED() placeholder macro for
the syscalls that are not implemented in Linux kernel itself.
Cleanup DUMMY() macros.

Reviewed by:	dchagin, trasz
Approved by:	dchagin
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D9804
2017-03-06 18:11:38 +00:00
pfg
82bdb13eb8 Revert r314669, r314670:
Bring back the i486 option in GENERIC by default.

The code related to i386 CPU variants configuration has received many
changes in the last years: most of the features are detected automatically,
so there are no performance penalties from keeping the 486 support enabled.

Re-instate the 486 support: while the general configuration could still be
cleaned a bit, there is no advantage in removing it.

Differential Revision:	https://reviews.freebsd.org/D9879
2017-03-06 03:52:15 +00:00
pfg
e0c54413a9 Drop i486 from the default i386 GENERIC kernel configuration.
80486 production was stopped by Intel on September 2007. Dropping the 486
configuration option from the GENERIC kernel improves performance
slightly.

Removing I486_CPU is consistent at this time: we don't support any
processor without a FPU and the PC-98 arch, which frequently involved i486
CPUs, is also gone so we don't test such platforms anymore.

Relnotes:	yes
MFC after:	2 weeks
https://reviews.freebsd.org/D9879
2017-03-04 15:04:17 +00:00
imp
7e6cabd06e Renumber copyright clause 4
Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.

Submitted by:	Jan Schaumann <jschauma@stevens.edu>
Pull Request:	https://github.com/freebsd/freebsd/pull/96
2017-02-28 23:42:47 +00:00
kib
06450fe6c4 Initialize pcb_save for thread0.
Otherwise kernel traps on NULL dereference if fpu_kern(9) is used from the
thread0 context.

Reported by:	cem
Reviewed by:	cem, jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-02-28 22:54:52 +00:00
glebius
745bcd6fba Remove SVR4 (System V Release 4) binary compatibility support.
UNIX System V Release 4 is operating system released in 1988. It ceased
to exist in early 2000-s.
2017-02-28 05:14:42 +00:00
dchagin
948e5bcce0 Regen for r314312 (Linux epoll_pwait).
MFC after:	1 month
2017-02-26 19:59:28 +00:00
dchagin
8a318fc47b Change Linux epoll_pwait syscall definition to match Linux actual one.
MFC after:	1 month
2017-02-26 19:57:18 +00:00
alc
c062fef7f7 Refine the fix from r312954. Specifically, add a new PDE-only flag,
PG_PROMOTED, that indicates whether lingering 4KB page mappings might
need to be flushed on a PDE change that restricts or destroys a 2MB
page mapping.  This flag allows the pmap to avoid range invalidations
that are both unnecessary and costly.

Reviewed by:	kib, markj
MFC after:	6 weeks
Differential Revision:	https://reviews.freebsd.org/D9665
2017-02-26 19:54:02 +00:00
dchagin
396e17522f Implement timerfd family syscalls.
MFC after:	1 month
2017-02-26 09:48:18 +00:00
dchagin
d9e839a5db Regen after r314291 (timerfd definition).
MFC after:	1 month
2017-02-26 09:37:25 +00:00
dchagin
d7d0ab04b1 Change Linuxulator timerfd syscalls definition to match actual Linux one.
MFC after:	1 month
2017-02-26 09:35:44 +00:00
trasz
aaee258648 Fix linux_fstatfs() to return proper value for f_frsize. Without it,
linux df(1) binary from Xenial shows garbage.

Reviewed by:	dchagin
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D9692
2017-02-25 20:32:37 +00:00
mmokhi
bf114a4c4a Add linux_preadv() and linux_pwritev() syscalls to Linuxulator.
Reviewed by:	dchagin
Approved by:	dchagin, trasz (src committers)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D9722
2017-02-24 20:04:02 +00:00
dchagin
8f19e32110 Revert r314217. Commit is not match that I have approved. 2017-02-24 19:47:27 +00:00
mmokhi
0de5ce3724 Add linux_preadv() and linux_pwritev() syscalls to Linuxulator.
Reviewed by:	dchagin
Approved by:	dchagin, trasz (src committers)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D9722
2017-02-24 19:22:17 +00:00
dchagin
ca1a1a9d60 Implement rt_tgsigqueueinfo system call used by glibc for pthread_sigqueue(3).
MFC after:	2 week
2017-02-19 07:38:11 +00:00
kib
6ea6500b7e MFamd64 r313933: microoptimize pmap_protect_pde().
Noted by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-02-19 06:14:33 +00:00
kib
5c280e36bd Merge i386 and amd64 mtrr drivers.
Reviewed by:	royger, jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D9648
2017-02-17 21:08:32 +00:00
royger
bf0e949003 x86: fix MTRR initialization if EARLY_AP_STARTUP is used
MTRR handlers are set in {amd64/i686}_mem_drvinit, which is called at
SI_SUB_DRIVERS, and that's too late when EARLY_AP_STARTUP is set because APs
have already started at this point. {amd64/i686}_mrinit is also called too late
for the BSP, since that happens when the memory device is attached, also after
APs have already started.

Move the position to SI_SUB_CPU, and also initialize the state for the BSP, so
that the APs can correctly get to the same state as the BSP.

Sponsored by:		Citrix Systems R&D
MFC after:		1 week
Reviewed by:		jhb, kib
Differential Revision:	https://reviews.freebsd.org/D9630
2017-02-17 12:47:51 +00:00
imp
36fafdbb83 Remove EISA bus support for add-in cards. Remove related kernel and
compile options. Remove doxygen pointers to now deleted files. Remove
EISA and VME as examples in bus_space.9.

Retained EISA mode code for IO PIC and MPTABLES because that's not
EISA bus, per se, and some people have abused EISA to mean "EISA-like
behavior as opposed to ISA" rather than using it for EISA add-in
cards.

Relnotes: yes
2017-02-16 21:57:35 +00:00
imp
5e19920be6 Remove Micro Channel Architecture support. Of the commonly available
machines, only a few 486 machines that used it, and those haven't had
enough memory to run FreeBSD for quite some time (often limited to
16MB).

Not to be confused with the Machine Check Architecture, which is still
very much alive and used (and untouched by this commit).

No Objection From: arch@
2017-02-15 23:04:25 +00:00
jhb
e80fc50712 Regenerate all the system call tables to drop "created from" lines.
One of the ibcs2 files contains some actual changes (new headers) as
it hasn't been regenerated after older changes to makesyscalls.sh.
2017-02-10 19:45:02 +00:00
dchagin
ef67426db9 Regen after r313284.
MFC after:	2 week
2017-02-05 14:19:19 +00:00
dchagin
1ba976c2fa Update syscall.master to 4.10-rc6. Also fix comments, a typo,
and wrong numbering for a few unimplemented syscalls.

For 32-bit Linuxulator, socketcall() syscall was historically
the entry point for the sockets API. Starting in Linux 4.3, direct
syscalls are provided for the sockets API. Enable it.

The initial version of patch was provided by trasz@ and extended by me.

Submitted by:	trasz
MFC after:	2 week
Differential Revision:	https://reviews.freebsd.org/D9381
2017-02-05 14:17:09 +00:00
kib
5cb41cd56e For i386, remove config options CPU_DISABLE_CMPXCHG, CPU_DISABLE_SSE
and device npx.

This means that FPU is always initialized and handled when available,
and SSE+ register file and exception are handled when available.  This
makes the kernel FPU code much easier to maintain by the cost of
slight bloat for CPUs older than 25 years.

CPU_DISABLE_CMPXCHG outlived its usefulness, see the removed comment
explaining the original purpose.

Suggested by and discussed with:	bde
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
2017-02-03 12:51:40 +00:00
kib
38af3da4f6 Use ANSI definitions for some i386 functions.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-02-02 22:02:10 +00:00
mjg
7055568df5 i386: fixup fcmpset
An incorrect output specifier was used which worked with clang by accident,
but breaks with the in-tree gcc version.

While here plug a whitespace nit.

Reported by:	bde
2017-02-02 01:33:08 +00:00
trasz
69861f6d1f Replace sys_ftruncate() with kern_ftruncate() in various compats.
Reviewed by:	kib@
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D9368
2017-01-30 11:50:54 +00:00
mjg
efb45bc626 i386: add atomic_fcmpset
Tested by:	pho
2017-01-30 02:24:54 +00:00
kib
82ecdbf179 Do not leave stale 4K TLB entries on pde (superpage) removal or
protection change.

On superpage promotion, x86 pmaps do not invalidate existing 4K
entries for the superpage range, because they are compatible with the
promoted 2/4M entry.  But the invalidation on superpage removal or
protection change only did single INVLPG with the base address of the
superpage.  This reliably flushed superpage TLB entry, and 4K entry
for the first page of the superpage, potentially leaving other 4K TLB
entries lingering.  Do the invalidation of the whole superpage range
to correct the problem.

Note that the precise invalidation is done by x86 code for kernel_pmap
only, for user pmaps whole (per-AS) TLB is flushed.  This made the bug
well hidden, because promotions of the kernel mappings require
specific load.

Reported and tested by:	Jonathan Looney <jtl@netflix.com> (previous version)
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-01-29 19:14:48 +00:00
jah
6a2ed46209 Implement get_pcpu() for i386 and use it to replace pcpu_find(curcpu)
in the i386 pmap.

The curcpu macro loads the per-cpu data pointer as its first step,
so the remaining steps of pcpu_find(curcpu) are circular.

get_pcpu() is already implemented for arm, arm64, and risc-v.
My plan is to implement it for the remaining architectures and use
it to replace several instances of pcpu_find(curcpu) in MI code.

Reviewed by:	kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D9370
2017-01-29 16:54:55 +00:00
nyan
edf372a226 Garbage collect the FPU_ERROR_BROKEN option.
It is for pc98 only.
2017-01-28 03:53:53 +00:00
nyan
259480b6de Remove pc98 support completely.
I thank all developers and contributors for pc98.

Relnotes:	yes
2017-01-28 02:22:15 +00:00
kib
d52cc80daf Use SFENCE for ordering CLFLUSHOPT.
SDM states that CLFLUSHOPT instructions can be ordered with other
writes by SFENCE, heavier MFENCE is not required.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2017-01-20 19:08:44 +00:00
ed
be72efbdd4 Catch up with changes to structure member names.
Pointer/length pairs are now always named ${name} and ${name}_len.
2017-01-17 22:05:52 +00:00
jah
07791773f9 Add comment explaining relative order of sched_unpin() and mtx_unlock().
Suggested by:	alc
MFC after:	1 week
2017-01-14 19:35:36 +00:00
jah
63a408be16 For i386 temporary mappings, unpin the thread before releasing
the cmap lock.  Releasing the lock first may result in the thread
being immediately rescheduled and bound to the same CPU, only to
unpin itself upon resuming execution.

Noted by:	skra (in review for armv6 equivalent)
MFC after:	1 week
2017-01-14 09:56:01 +00:00
markj
d8a11d2a36 Coalesce TLB shootdowns of global PTEs in pmap_advise() on x86.
We would previously invalidate such entries individually, resulting in more
IPIs than necessary.

Reviewed by:	alc, kib
MFC after:	3 weeks
Differential Revision:	https://reviews.freebsd.org/D9094
2017-01-10 21:52:48 +00:00
sbruno
efab05d612 Migrate e1000 to the IFLIB framework:
- em(4) igb(4) and lem(4)
- deprecate the igb device from kernel configurations
- create a symbolic link in /boot/kernel from if_em.ko to if_igb.ko

Devices tested:
- 82574L
- I218-LM
- 82546GB
- 82579LM
- I350
- I217

Please report problems to freebsd-net@freebsd.org

Partial review from jhb and suggestions on how to *not* brick folks who
originally would have lost their igbX device.

Submitted by:	mmacy@nextbsd.org
MFC after:	2 weeks
Relnotes:	yes
Sponsored by:	Limelight Networks and Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8299
2017-01-10 03:23:22 +00:00
kib
5def9fa2c2 Do not allocate struct statfs on kernel stack.
Right now size of the structure is 472 bytes on amd64, which is
already large and stack allocations are indesirable.  With the ino64
work, MNAMELEN is increased to 1024, which will make it impossible to have
struct statfs on the stack.

Extracted from:	ino64 work by gleb
Discussed with:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-01-05 17:19:26 +00:00
jah
173cd7bef3 Move the objects used to create temporary mappings for i386 pmap zero and copy
operations to the MD PCPU region.  Change sysmap initialization to only
allocate KVA pages for CPUs that are actually present.  As a minor
optimization, this also prevents false sharing between adjacent sysmap objects
since the pcpu struct is already cacheline-aligned.

While here, move pc_qmap_addr initialization for the BSP into
pmap_bootstrap(), which allows use of pmap_quick* functions during early boot.

Reviewed by:	kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D8833
2016-12-23 15:14:56 +00:00
jhb
203d8b88f7 Enable EARLY_AP_STARTUP on amd64 and i386 kernels by default.
PR:		199321, 203682
MFC after:	2 months
Sponsored by:	Netflix
2016-12-16 21:10:37 +00:00
def
f63c437216 Add support for encrypted kernel crash dumps.
Changes include modifications in kernel crash dump routines, dumpon(8) and
savecore(8). A new tool called decryptcore(8) was added.

A new DIOCSKERNELDUMP I/O control was added to send a kernel crash dump
configuration in the diocskerneldump_arg structure to the kernel.
The old DIOCSKERNELDUMP I/O control was renamed to DIOCSKERNELDUMP_FREEBSD11 for
backward ABI compatibility.

dumpon(8) generates an one-time random symmetric key and encrypts it using
an RSA public key in capability mode. Currently only AES-256-CBC is supported
but EKCD was designed to implement support for other algorithms in the future.
The public key is chosen using the -k flag. The dumpon rc(8) script can do this
automatically during startup using the dumppubkey rc.conf(5) variable.  Once the
keys are calculated dumpon sends them to the kernel via DIOCSKERNELDUMP I/O
control.

When the kernel receives the DIOCSKERNELDUMP I/O control it generates a random
IV and sets up the key schedule for the specified algorithm. Each time the
kernel tries to write a crash dump to the dump device, the IV is replaced by
a SHA-256 hash of the previous value. This is intended to make a possible
differential cryptanalysis harder since it is possible to write multiple crash
dumps without reboot by repeating the following commands:
# sysctl debug.kdb.enter=1
db> call doadump(0)
db> continue
# savecore

A kernel dump key consists of an algorithm identifier, an IV and an encrypted
symmetric key. The kernel dump key size is included in a kernel dump header.
The size is an unsigned 32-bit integer and it is aligned to a block size.
The header structure has 512 bytes to match the block size so it was required to
make a panic string 4 bytes shorter to add a new field to the header structure.
If the kernel dump key size in the header is nonzero it is assumed that the
kernel dump key is placed after the first header on the dump device and the core
dump is encrypted.

Separate functions were implemented to write the kernel dump header and the
kernel dump key as they need to be unencrypted. The dump_write function encrypts
data if the kernel was compiled with the EKCD option. Encrypted kernel textdumps
are not supported due to the way they are constructed which makes it impossible
to use the CBC mode for encryption. It should be also noted that textdumps don't
contain sensitive data by design as a user decides what information should be
dumped.

savecore(8) writes the kernel dump key to a key.# file if its size in the header
is nonzero. # is the number of the current core dump.

decryptcore(8) decrypts the core dump using a private RSA key and the kernel
dump key. This is performed by a child process in capability mode.
If the decryption was not successful the parent process removes a partially
decrypted core dump.

Description on how to encrypt crash dumps was added to the decryptcore(8),
dumpon(8), rc.conf(5) and savecore(8) manual pages.

EKCD was tested on amd64 using bhyve and i386, mipsel and sparc64 using QEMU.
The feature still has to be tested on arm and arm64 as it wasn't possible to run
FreeBSD due to the problems with QEMU emulation and lack of hardware.

Designed by:	def, pjd
Reviewed by:	cem, oshogbo, pjd
Partial review:	delphij, emaste, jhb, kib
Approved by:	pjd (mentor)
Differential Revision:	https://reviews.freebsd.org/D4712
2016-12-10 16:20:39 +00:00
markj
8bb19c4929 Add a COMPAT_FREEBSD11 kernel option.
Use it wherever COMPAT_FREEBSD10 is currently specified.

Reviewed by:	glebius, imp, jhb
Differential Revision:	https://reviews.freebsd.org/D8736
2016-12-09 18:54:12 +00:00
alc
7571ef95c1 Previously, vm_radix_remove() would panic if the radix trie didn't
contain a vm_page_t at the specified index.  However, with this
change, vm_radix_remove() no longer panics.  Instead, it returns NULL
if there is no vm_page_t at the specified index.  Otherwise, it
returns the vm_page_t.  The motivation for this change is that it
simplifies the use of radix tries in the amd64, arm64, and i386 pmap
implementations.  Instead of performing a lookup before every remove,
the pmap can simply perform the remove.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D8708
2016-12-08 04:29:29 +00:00
jhb
3ec06e919d MFamd64: Various fatal page fault fixes.
- If a page fault is triggered due to reserved bits in a PTE, treat it
  as a fatal fault and panic.
- If PG_NX is in use, report whether a fatal page fault is due to an
  instruction fetch or a data access.
- If a fatal page fault is due to reserved bits in a PTE, report that as
  the page fault type rather than a protection violation.

MFC after:	1 month
2016-11-19 01:36:44 +00:00
bdrewery
30f99dbeef Fix improper use of "its".
Sponsored by:	Dell EMC Isilon
2016-11-08 23:59:41 +00:00
kib
9cc20ad665 Handle pmap_enter() over an existing 4/2M page in KVA on i386.
The userspace case was already handled by pmap_allocpte().  For kernel
VA, page table page must exist, and demote cannot fail, so we need to
just call pmap_demote_pde().  Also note that due to the machine AS
layout, promotions in the KVA on i386 are highly unlikely, so this
change is mostly for completeness.

Reviewed by:	alc, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D8323
2016-10-28 11:53:22 +00:00
jhb
95a3814f21 MFamd64: Add bounds checks on addresses used with /dev/mem.
Reject attempts to read from or memory map offsets in /dev/mem that are
beyond the maximum-supported physical address of the current CPU.

Reviewed by:	kib
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D7408
2016-10-27 21:23:14 +00:00
jhb
c3c885eb65 Enable EFER_NXE properly on APs.
EFER_NXE is set in the EFER MSR by initializecpu() and must be set on all
CPUs in the system.  When PG_NX support was added to PAE on i386, the
block to enable EFER_NXE was placed in a section of initializecpu() that
only runs if 'cpu == CPU_686'.  During early boot, locore does an
initial pass to set cpu that sets it to CPU_686 on all CPUs later than
a Pentium.  Later, printcpuinfo() adjusts the 'cpu' variable on
PII and later CPUs to one of CPU_PII, CPU_PIII, or CPU_P4.  However,
printcpuinfo() is called after initializecpu() on the BSP, so the BSP
would enable EFER_NXE and pg_nx.  The APs execute initializecpu() much
later after printcpuinfo() has run.  The end result on a modern CPU was
that cpu was set to CPU_PIII when the APs invoked initializecpu(), so
they did not enable EFER_NXE.  As a result, the APs would fault when
trying to access any pages marked with PG_NX set.

When booting a 2 CPU PAE kernel in bhyve this manifested as a hang before
single user mode.  The attempt to execute /bin/init tried to copy out
the exec strings (argv, etc.) to a non-executable mapping while running
on the AP.  The instruction kept faulting due to invalid bits in the PTE
in an infinite loop.

Fix this by moving the code to enable EFER_NXE out of the switch statement
on 'cpu' and always doing it if 'amd_feature' supports AMDID_NX.

MFC after:	2 weeks
2016-10-26 18:47:47 +00:00
kib
45100446da Follow-up to r307866:
- Make !KDB config buildable.
- Simplify interface to nmi_handle_intr() by evaluating panic_on_nmi
  in one place, namely nmi_call_kdb().  This allows to remove do_panic
  argument from the functions, and to remove i386/amd64 duplication of
  the variable and sysctl definitions.  Note that now NMI causes
  panic(9) instead of trap_fatal() reporting and then panic(9),
  consistently for NMIs delivered while CPU operated in ring 0 and 3.

Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-10-24 20:47:46 +00:00
kib
a04db702cd Handle broadcast NMIs.
On several Intel chipsets, diagnostic NMIs sent from BMC or NMIs
reporting hardware errors are broadcasted to all CPUs.

When kernel is configured to enter kdb on NMI, the outcome is
problematic, because each CPU tries to enter kdb.  All CPUs are
executing NMI handlers, which set the latches disabling the nested NMI
delivery; this means that stop_cpus_hard(), used by kdb_enter() to
stop other cpus by broadcasting IPI_STOP_HARD NMI, cannot work.  One
indication of this is the harmless but annoying diagnostic "timeout
stopping cpus".

Much more harming behaviour is that because all CPUs try to enter kdb,
and if ddb is used as debugger, all CPUs issue prompt on console and
race for the input, not to mention the simultaneous use of the ddb
shared state.

Try to fix this by introducing a pseudo-lock for simultaneous attempts
to handle NMIs.  If one core happens to enter NMI trap handler, other
cores see it and simulate reception of the IPI_STOP_HARD.  More,
generic_stop_cpus() avoids sending IPI_STOP_HARD and avoids waiting
for the acknowledgement, relying on the nmi handler on other cores
suspending and then restarting the CPU.

Since it is impossible to detect at runtime whether some stray NMI is
broadcast or unicast, add a knob for administrator (really developer)
to configure debugging NMI handling mode.

The updated patch was debugged with the help from Andrey Gapon (avg)
and discussed with him.

Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D8249
2016-10-24 16:40:27 +00:00
jkim
229f578eb8 Implement BPF_MOD and BPF_XOR instructions.
These two ALU instructions first appeared on Linux.  Then, libpcap adopted
and made them available since 1.6.2.  Now more platforms including NetBSD
have them in kernel.  So do we.
 --이 줄 이하는 자동으로 제거됩니다--
> Description of fields to fill in above:                     76 columns --|
> PR:                       If and which Problem Report is related.
> Submitted by:             If someone else sent in the change.
> Reported by:              If someone else reported the issue.
> Reviewed by:              If someone else reviewed your modification.
> Approved by:              If you needed approval for this commit.
> Obtained from:            If the change is from a third party.
> MFC after:                N [day[s]|week[s]|month[s]].  Request a reminder email.
> MFH:                      Ports tree branch name.  Request approval for merge.
> Relnotes:                 Set to 'yes' for mention in release notes.
> Security:                 Vulnerability reference (one per line) or description.
> Sponsored by:             If the change was sponsored by an organization.
> Differential Revision:    https://reviews.freebsd.org/D### (*full* phabric URL needed).
> Empty fields above will be automatically removed.

M    share/man/man4/bpf.4
M    sys/amd64/amd64/bpf_jit_machdep.c
M    sys/amd64/amd64/bpf_jit_machdep.h
M    sys/i386/i386/bpf_jit_machdep.c
M    sys/i386/i386/bpf_jit_machdep.h
M    sys/net/bpf_filter.c
2016-10-21 06:55:07 +00:00
jkim
42d876c52e Redude code for conditional jumps. 2016-10-21 06:09:30 +00:00
jkim
b35ec131f8 Fix compiler warnings for user land. 2016-10-21 06:06:54 +00:00
jhb
f689fd5a63 Drop support for using mmap() with /dev/kmem.
Using the device pager with /dev/kmem is not stable since KVA mappings
are transient, but the device pager caches the PA associated with a
given offset forever.  Interestingly, mips' implementation of
memmap() already refused requests for /dev/kmem.

Note that kvm_read/kvm_write do not use mmap, but use read and write on
/dev/kmem, so this should not affect libkvm users.

Reviewed by:	kib
MFC after:	2 months
2016-10-14 20:01:07 +00:00
imp
081e8d8587 Fix building on i386 and arm. But 'public domain' headers on the files
with no creative content. Include "lost" changes from git:
o Use /dev/efi instead of /dev/efidev
o Remove redundant NULL checks.

Submitted by: kib@, dim@, zbb@, emaste@
2016-10-13 06:56:23 +00:00
jtl
62030781cd In the TCP stack, the hhook(9) framework provides hooks for kernel modules
to add actions that run when a TCP frame is sent or received on a TCP
session in the ESTABLISHED state. In the base tree, this functionality is
only used for the h_ertt module, which is used by the cc_cdg, cc_chd, cc_hd,
and cc_vegas congestion control modules.

Presently, we incur overhead to check for hooks each time a TCP frame is
sent or received on an ESTABLISHED TCP session.

This change adds a new compile-time option (TCP_HHOOK) to determine whether
to include the hhook(9) framework for TCP. To retain backwards
compatibility, I added the TCP_HHOOK option to every configuration file that
already defined "options INET". (Therefore, this patch introduces no
functional change. In order to see a functional difference, you need to
compile a custom kernel without the TCP_HHOOK option.) This change will
allow users to easily exclude this functionality from their kernel, should
they wish to do so.

Note that any users who use a custom kernel configuration and use one of the
congestion control modules listed above will need to add the TCP_HHOOK
option to their kernel configuration.

Reviewed by:	rrs, lstewart, hiren (previous version), sjg (makefiles only)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D8185
2016-10-12 02:16:42 +00:00
imp
41d4ddad0f Create /dev/efidev to provide an ioctl interface to
userland.  It supports userland interfaces to UEFI Runtime Services. This is
indended to the the MI portion of EFI RuntimeServices support.

Differential Revision: https://reviews.freebsd.org/D8128
Reviewed by: kib@, wblock@, Ganael Laplanche
2016-10-11 22:24:30 +00:00
kib
559623d89a Re-apply r306516 (by cem):
Reduce the cost of TLB invalidation on x86 by using per-CPU completion flags

Reduce contention during TLB invalidation operations by using a per-CPU
completion flag, rather than a single atomically-updated variable.

On a Westmere system (2 sockets x 4 cores x 1 threads), dtrace measurements
show that smp_tlb_shootdown is about 50% faster with this patch; observations
with VTune show that the percentage of time spent in invlrng_single_page on an
interrupt (actually doing invalidation, rather than synchronization) increases
from 31% with the old mechanism to 71% with the new one.  (Running a basic file
server workload.)

Submitted by:	Anton Rang <rang at acm.org>
Reviewed by:	cem (earlier version)
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8041
2016-10-04 17:01:24 +00:00
hselasky
5e41da7ccd Move the ConnectX-3 and ConnectX-2 driver from sys/ofed into sys/dev/mlx4
like other PCI network drivers. The sys/ofed directory is now mainly
reserved for generic infiniband code, with exception of the mthca driver.

- Add new manual page, mlx4en(4), describing how to configure and load
mlx4en.

- All relevant driver C-files are now prefixed mlx4, mlx4_en and
mlx4_ib respectivly to avoid object filename collisions when compiling
the kernel. This also fixes an issue with proper dependency file
generation for the C-files in question.

- Device mlxen is now device mlx4en and depends on device mlx4, see
mlx4en(4). Only the network device name remains unchanged.

- The mlx4 and mlx4en modules are now built by default on i386 and
amd64 targets. Only building the mlx4ib module depends on
WITH_OFED=YES .

Sponsored by:	Mellanox Technologies
2016-09-30 08:23:06 +00:00
bde
009ff78eec Minor fixes for 160-bit disassembly:
(1) Print the default segment %ss before adresses relative to %bp.
    This is too cluttered for me, but so is printing some other default
    prefixes, and this is a reasonable reminder that %ss is quite
    likely to be different from %ds in 16-bit mode.

    db_disasm still handles prefixes poorly, by trying to discard
    redundant ones.  This loses information, and sometimes the result
    is wrong or misleading.

    Clean up nearby initializations and dead code.

(2) Fix decoding of operand and address size prefixes in 16-bit mode.
    They reverse the default in all modes.

Obtained from:            (1) is partly from r1.4 (2003/11/08) in DFlyBSD (?)
2016-09-25 18:39:24 +00:00
tijl
3f32edbd77 MFamd64: r266901
Allocate a zeroed LDT.

Failing to do this might result in the LDT appearing to run out of free
descriptors because of random junk in the descriptor's 'sd_type' field.

http://lists.freebsd.org/pipermail/freebsd-amd64/2014-May/016088.html

PR:		212639
Submitted by:	wheelcomplex@gmail.com
MFC after:	2 weeks
2016-09-25 18:29:02 +00:00
bde
7d8ccb0c00 Determine the operand/address size of %cs in a new function
db_segsize().

Use db_segsize() to set the default operand/address size for
disassembling.  Allow overriding this with the "alternate" display
format /I.  The API of db_disasm() should be debooleanized to pass a
more general request (amd64 needs overrides to sizes of 16, 32, and
64, but this commit doesn't implement anything for amd64 since much
larger changes are needed to restore the amd64 disassmbler's support
for non-default sizes).

Fix db_print_loc_and_inst() to ask for the normal format and not the
alternate in normal operation.

This is most useful for vm86 mode, but also works for 16-bit protected
mode.

Use db_segsize() to avoid trying to print a garbage stack trace if %cs
is 16 bits.  Print something like the stack trace termination message
for a trap boundary instead.

Document that the alternate format is now useful on i386.
2016-09-25 16:30:29 +00:00
bde
94a237c17c Fix vm86 initialization, part 3 of 2 and a half. (Actually, just fix
early printfs and debugging of vm86 initialization and some other early
initialization in some cases.)  Add an option debug.late_console (with
default 1=off) to move console and kdb initialization back where it was.
Do the same for amd64 although there is no vm86 there.

On my test system, debug.late_console=0 works for the syscons, sio and
uart console drivers on amd64 and i386, and for vt on i386 but not on
amd64.

The early printfs fixed by debug.late_console=0 are:
- on i386, the message about lost memory above 4G
- with -v in otherwise normal use, about 20 printfs for SMAP
- other debugging messages for memory sizing.  Mostly under -v and
  not printed in normal use.

Document in a comment how much earlier the initialization and early
printf()s can be.  That is very early for the console.  Not much more
than curthread is needed.  kdb use obviously needs to be not so early,
since it needs IDT initialization and that is done relatively late
for convenience and historical reasons.
2016-09-25 14:56:24 +00:00
markj
f2a1dd4de8 Regenerate syscall provider argument strings. 2016-09-22 04:50:03 +00:00
bde
74861d13fa Remove all kernel uses of pcb_psl, but keep in in the struct to
preserve the ABI and API for applications.  It was removed in the port
to amd64, but was remained as garbage giving a micro-pessimization and
spurious single-step traps on i386.

pcb_psl was intended to be used just to do a context switch of PSL_I,
but this context switch was null in most or all versions, and
mis-switching of PSL_T was done instead.

Some history:
- in 386BSD-0.0, cpu_switch() ran at splhigh() and splhigh() did too
  much interrupt disabling, so interrupts were hard-disabled across
  cpu_switch() and too many other places
- in 386BSD-0.0-patchkit through FreeBSD-4 and FreeBSD-5 before
  SMPng, splhigh() did soft interrupt masking, and cpu_switch() was
  excessively cautious and did a cli at the start and a sti at the
  end to hard-disable interrupts across the switch
- SMPng replaced the spl's and cli's by spinlocks (just sched_lock?),
  so interrupts were hard-disabled across cpu_switch() and too many
  other places again
- initial attempts to fix this intended to restore some soft
  interrupt disabling, but to support variations in this cpu_switch()
  used pushfl/popfl into pcb_psl to avoid hard-coding the assumption
  that the initial and final states have PSL_I enabled.  But the
  version with soft interrupt disabling wasn't used for long, or was
  never committed, (except I always used my different version of it
  for UP) so the pushfl/popl and pcb_psl to hold them have been doing
  less than nothing for about 14 years.
2016-09-17 14:00:52 +00:00
bde
d6a5db2944 (1) Ifdef the new dr6 variable for KDB.
While here, avoid using the old variable 'code' and remove it
in trap().  ('code' was meant for holding things like %dr6,
but is too small to hold %dr6 on amd64 and was reduced to an
obfuscation of tf_err, with early truncation on amd64.)

Submitted by:	Michael Butler (imb@...)
2016-09-16 04:58:37 +00:00
bde
bf8d177543 Abort single stepping in ddb if the trap is not for single-stepping.
This is not very easy to do, since ddb didn't know when traps are
for single-stepping.  It more or less assumed that traps are either
breakpoints or single-step, but even for x86 this became inadequate
with the release of the i386 in ~1986, and FreeBSD passes it other
trap types for NMIs and panics.

On x86, teach ddb when a trap is for single stepping using the %dr6
register.  Unknown traps are now treated almost the same as breakpoints
instead of as the same as single-steps.  Previously, the classification
of breakpoints was almost correct and everything else was unknown so
had to be treated as a single-step.  Now the classification of single-
steps is precise, the classification of breakpoints is almost correct
(as before) and everything else is unknown and treated like a
breakpoint.

This fixes:
- breakpoints not set by ddb, including the main one in kdb_enter(),
  were treated as single-steps and not stopped on when stepping
  (except for the usual, simple case of a step with residual count 1).
  As special cases, kdb_enter() didn't stop for fatal traps or panics
- similarly for "hardware breakpoints".

Use a new MD macro IS_SSTEP_TRAP(type, code) to code to classify
single-steps.  This is excessively complicated for bug-for-bug and
backwards compatibilty.  Design errors apparently started in Mach
in ~1990 or perhaps in the FreeBSD interface in ~1993.  Common trap
types like single steps should have a unique MI code (like the TRAP*
codes for user SIGTRAP) so that debuggers don't need macros like
IS_SSTEP_TRAP() to decode them.  But 'type' is actually an ambiguous
MD trap number, and code was always 0 (now it is (int)%dr6 on x86).
So it was impossible to determine the trap type from the args.
Global variables had to be used.

There is already a classification macro db_pc_is_single_step(), but
this just gets in the way.  It is only used to recover from bugs in
IS_BREAKPOINT_TRAP().  On some arches, IS_BREAKPOINT_TRAP() just
duplicates the ambiguity in 'type' and misclassifies single-steps as
breakpoints.  It defaults to 'false', which is the opposite of what is
needed for bug-for-bug compatibility.

When this is cleaned up, MI classification bits should be passed in
'code'.  This could be done now for positive-logic bits, since 'code'
was always 0, but some negative logic is needed for compatibility so
a simple MI classificition is not usable yet.

After reading %dr6, clear the single-step bit in it so that the type
of the next debugger trap can be decoded.  This is a little
ddb-specific.  ddb doesn't understand the need to clear this bit and
doing it before calling kdb is easiest.  gdb would need to reverse
this to support hardware breakpoints, but it just doesn't support
them now since gdbstub doesn't support %dr*.

Fix a bug involving %dr6: when emulating a single-step trap for vm86,
set the bit for it in %dr6.  Userland debuggers need this.  ddb now
needs this for vm86 bios calls.  The bit gets copied to 'code' then
cleared again.

Fix related style bugs:
- when clearing bits for hardware breakpoints in %dr6, spell the mask
  as ~0xf on both amd64 and i386 to get the correct number of bits
  using sign extension and not need a comment about using the wrong
  mask on amd64 (amd64 traps for invalid results but clearing the
  reserved top bits didn't trap since they are 0).
- rewrite my old wrong comments about using %dr6 for ddb watchpoints.
2016-09-15 17:24:23 +00:00