Commit Graph

13626 Commits

Author SHA1 Message Date
kib
4adce57d6f Add kernel support for Intel userspace protection keys feature on
Skylake Xeons.

See SDM rev. 68 Vol 3 4.6.2 Protection Keys and the description of the
RDPKRU and WRPKRU instructions.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D18893
2019-02-20 09:51:13 +00:00
imp
2fa70908ad Remove drm from LINT kernels
drm was accidentally left in the LINT kernels.

Pointy hat to: imp
2019-02-19 21:20:50 +00:00
kib
7fc374eaa9 Provide convenience C wrappers for RDPKRU and WRPKRU instructions.
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
Differential revision:	https://reviews.freebsd.org/D18893
2019-02-19 19:17:20 +00:00
kib
53fdbf40d8 Provide userspace versions of do_cpuid() and cpuid_count() on i386.
Some older compilers, when generating PIC code, cannot handle inline
asm that clobbers %ebx (because %ebx is used as the GOT offset
register).  Userspace versions avoid clobbering %ebx by saving it to
stack before executing the CPUID instruction.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-02-14 13:53:11 +00:00
kib
08849e56ba Implement Address Space Layout Randomization (ASLR)
With this change, randomization can be enabled for all non-fixed
mappings.  It means that the base address for the mapping is selected
with a guaranteed amount of entropy (bits). If the mapping was
requested to be superpage aligned, the randomization honours the
superpage attributes.

Although the value of ASLR is diminshing over time as exploit authors
work out simple ASLR bypass techniques, it elimintates the trivial
exploitation of certain vulnerabilities, at least in theory.  This
implementation is relatively small and happens at the correct
architectural level.  Also, it is not expected to introduce
regressions in existing cases when turned off (default for now), or
cause any significant maintaince burden.

The randomization is done on a best-effort basis - that is, the
allocator falls back to a first fit strategy if fragmentation prevents
entropy injection.  It is trivial to implement a strong mode where
failure to guarantee the requested amount of entropy results in
mapping request failure, but I do not consider that to be usable.

I have not fine-tuned the amount of entropy injected right now. It is
only a quantitive change that will not change the implementation.  The
current amount is controlled by aslr_pages_rnd.

To not spoil coalescing optimizations, to reduce the page table
fragmentation inherent to ASLR, and to keep the transient superpage
promotion for the malloced memory, locality clustering is implemented
for anonymous private mappings, which are automatically grouped until
fragmentation kicks in.  The initial location for the anon group range
is, of course, randomized.  This is controlled by vm.cluster_anon,
enabled by default.

The default mode keeps the sbrk area unpopulated by other mappings,
but this can be turned off, which gives much more breathing bits on
architectures with small address space, such as i386.  This is tied
with the question of following an application's hint about the mmap(2)
base address. Testing shows that ignoring the hint does not affect the
function of common applications, but I would expect more demanding
code could break. By default sbrk is preserved and mmap hints are
satisfied, which can be changed by using the
kern.elf{32,64}.aslr.honor_sbrk sysctl.

ASLR is enabled on per-ABI basis, and currently it is only allowed on
FreeBSD native i386 and amd64 (including compat 32bit) ABIs.  Support
for additional architectures will be added after further testing.

Both per-process and per-image controls are implemented:
- procctl(2) adds PROC_ASLR_CTL/PROC_ASLR_STATUS;
- NT_FREEBSD_FCTL_ASLR_DISABLE feature control note bit makes it possible
  to force ASLR off for the given binary.  (A tool to edit the feature
  control note is in development.)
Global controls are:
- kern.elf{32,64}.aslr.enable - for non-fixed mappings done by mmap(2);
- kern.elf{32,64}.aslr.pie_enable - for PIE image activation mappings;
- kern.elf{32,64}.aslr.honor_sbrk - allow to use sbrk area for mmap(2);
- vm.cluster_anon - enables anon mapping clustering.

PR:	208580 (exp runs)
Exp-runs done by:	antoine
Reviewed by:	markj (previous version)
Discussed with:	emaste
Tested by:	pho
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D5603
2019-02-10 17:19:45 +00:00
kib
219b0b3431 Port sysctl kern.elf32.read_exec from amd64 to i386.
Make it more comprehensive on i386, by not setting nx bit for any
mapping, not just adding PF_X to all kernel-loaded ELF segments.  This
is needed for the compatibility with older i386 programs that assume
that read access implies exec, e.g. old X servers with hand-rolled
module loader.

Reported and tested by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-02-07 02:17:34 +00:00
kib
1ff881a6a6 Fix resume on i386 PAE.
It was broken before PAE/no-PAE merge, but since now PAE is the
default, resume is apparently becomes for all machines.

The corrected issues:
- the trampoline page is not mapped executable, so machine faults when
  paging is on;
- MSR.EFER and %cr4 both should be loaded before paging is enabled,
  otherwise paging structures are invalid (cr4.PAE and EFER.NX).
- MSR.EFER and %cr4 should be only loaded if present.  I attempt to handle
  this by not touching the registers if the value is zero.

There are some more bits still not quite correct, e.g. unconditional
access to %cr4 in resumectx.

Reported and debugging help by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-02-07 02:09:34 +00:00
emaste
a9793ab05c Retire SPX_HACK option unused after r342244 2019-02-06 17:21:25 +00:00
kib
5b8ac2d701 Make it possible to override PAE mode on boot.
Initialize the static kenv in pmap_cold() and fetch user opinion on
vm.pmap.pae_mode tunable if hardware is capable.  Note that the static
environment is reinitilized in init386() later when paging is enabled.

Reviewed by:	bde
Discussed with:	kevans
Sponsored by:	The FreeBSD Foundation
MFC after:	2 months
2019-02-05 20:09:31 +00:00
kib
920678e9dc Remove pointless initial value for i386 vm.pmap.pat_works sysctl definition.
The OID is served by external data.

Submitted by:	bde
MFC after:	3 days
2019-02-05 20:02:16 +00:00
dim
5aac66d368 Use NLDT to get number of LDTs on i386
Compiling a GENERIC kernel for i386 with clang 8.0 results in the
following warning:

/usr/src/sys/i386/i386/sys_machdep.c:542:40: error: 'sizeof ((ldt))' will return the size of the pointer, not the array itself [-Werror,-Wsizeof-pointer-div]
        nldt = pldt != NULL ? pldt->ldt_len : nitems(ldt);
                                              ^~~~~~~~~~~
/usr/src/sys/sys/param.h:299:32: note: expanded from macro 'nitems'
#define nitems(x)       (sizeof((x)) / sizeof((x)[0]))
                         ~~~~~~~~~~~ ^

Indeed, 'ldt' is declared as 'union descriptor *', so nitems() is not
the right way to determine the number of LDTs.  Instead, the NLDT define
from sys/x86/include/segments.h should be used.

Reviewed by:	kib
MFC after:	3 days
Differential Revision: https://reviews.freebsd.org/D19074
2019-02-04 18:07:03 +00:00
kib
38b15fd590 i386: Do not ever store to other-CPU counter64 slot.
On CPUs supporting cmpxchg8b, fetch is performed by cmpxchg8b on
corresponding CPU slot, which unconditionally write to the slot.  If
for that slot, the owner CPU increments it, then both CPUs might run
the cmpxchg8b instruction concurrently and this might race and
override the incremental write.  So the counter update would be lost.

Fix it by implementing fetch as IPI and accumulation of result.  It is
acceptable for rare counter64 fetch operation to be more expensive.

Diagnosed and tested by:	Andreas Longwitz <longwitz@incore.de>
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2019-02-03 21:28:58 +00:00
kib
c0d866da71 Disable boot-time memory test on i386 be default.
With the current 24G memory limit for GENERIC, the boot time test
causes quite visible delay, amplified by the default
debug.late_console = 0.

The comment text is copied from the same setting explanation for
amd64.

Suggested by:	bde
Discussed with:	emaste
Sponsored by:	The FreeBSD Foundation
MFC after:	2 months
2019-02-01 21:09:36 +00:00
kib
1b7795b92a Make iflib a loadable module.
iflib is already a module, but it is unconditionally compiled into the
kernel.  There are drivers which do not need iflib(4), and there are
situations where somebody might not want iflib in kernel because of
using the corresponding driver as module.

Reviewed by:	marius
Discussed with:	erj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D19041
2019-01-31 19:05:56 +00:00
kib
06e4052f35 Remove duplicate declarations.
Submitted by:	bde
MFC after:	2 months
2019-01-30 16:29:15 +00:00
kib
af881ec390 i386: Merge PAE and non-PAE pmaps into same kernel.
Effectively all i386 kernels now have two pmaps compiled in: one
managing PAE pagetables, and another non-PAE. The implementation is
selected at cold time depending on the CPU features. The vm_paddr_t is
always 64bit now. As result, nx bit can be used on all capable CPUs.

Option PAE only affects the bus_addr_t: it is still 32bit for non-PAE
configs, for drivers compatibility. Kernel layout, esp. max kernel
address, low memory PDEs and max user address (same as trampoline
start) are now same for PAE and for non-PAE regardless of the type of
page tables used.

Non-PAE kernel (when using PAE pagetables) can handle physical memory
up to 24G now, larger memory requires re-tuning the KVA consumers and
instead the code caps the maximum at 24G. Unfortunately, a lot of
drivers do not use busdma(9) properly so by default even 4G barrier is
not easy. There are two tunables added: hw.above4g_allow and
hw.above24g_allow, the first one is kept enabled for now to evaluate
the status on HEAD, second is only for dev use.

i386 now creates three freelists if there is any memory above 4G, to
allow proper bounce pages allocation. Also, VM_KMEM_SIZE_SCALE changed
from 3 to 1.

The PAE_TABLES kernel config option is retired.

In collaboarion with: pho
Discussed with:	emaste
Reviewed by:	markj
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D18894
2019-01-30 02:07:13 +00:00
avos
fad9eaa301 Garbage collect AH_SUPPORT_AR5416 config option.
It does nothing since r318857.
2019-01-25 13:48:40 +00:00
avos
8f57d8136c Remove IEEE80211_AMPDU_AGE config option.
It is noop since r297774.
2019-01-20 15:17:56 +00:00
fsu
8aa0301c63 Fix errno values returned from DUMMY_XATTR linuxulator calls
Reported by: weiss@uni-mainz.de
Reviewed by: markj
MFC after: 1 day
Differential Revision: https://reviews.freebsd.org/D18812
2019-01-11 07:58:25 +00:00
kib
9b56128030 Fix i386 LINT build after r342769.
It seems that libkern/mcount.c is the only consumer of vm/pmap.h that
does not include machine/atomic.h.  Make it work by bringing
machine/atomic.h when pmap.h is used for kernel non-asm .c file.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-01-04 19:10:46 +00:00
kib
11fd3ebd3f i386: Use atomic 64bit load to read PDE value from PAE pagetables in
pmap_kextract().

pmap_kextract() can race with promotion/demotion on the kernel page
table, in which case current non-atomic 64bit read would see torn
value, breaking pmap_kextract().  pmap_kextract() would correctly
handle either promoted or demoted PDE, but not a mix where one word
is from a different state.

It requires PAE and > 4G memory to reproduce.  We observed this in
real loads, both for intensive use of malloc(9)/free(9) where
vtoslab() returned invalid pointer to the slab, and with the use of
busdma_bounce, where incorrect page was bounced.

In collaboration with:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D18714
2019-01-04 17:33:07 +00:00
kib
8e4e3f4845 x86: Report per-cpu IPI TLB shootdown generation in ddb 'show pcpu' output.
It is useful for inspecting tlb shootdown hangs.  The smp_tlb_generation value
is available using regular ddb data inspection commands.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2019-01-04 17:25:47 +00:00
kib
b562244a11 Fix typo in r342710.
Noted by:	lidl
MFC after:	3 days
2019-01-03 19:35:07 +00:00
markj
7a67214424 Avoid setting PG_U unconditionally in pmap_enter_quick_locked().
This KPI may in principle be used to create kernel mappings, in which
case we certainly should not be setting PG_U.  In any case, PG_U must be
set on all layers in the page tables to grant user mode access, and we
were only setting it on leaf entries.  Thus, this change should have no
functional impact.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2019-01-02 15:36:35 +00:00
kib
6b3a1317f5 More references to pmap_cold().
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
2018-12-31 18:11:04 +00:00
kib
2e788e7f15 Update comments: paging is initialized in pmap_cold().
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
2018-12-31 18:05:48 +00:00
kib
987c6b35ca i386: Fix allocation of the KVA frame for pmap_quick_enter_page().
Due to the typo, it shared the frame with the CMAP1 transient mapping.

In collaboration with:	pho
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation (kib)
2018-12-29 15:49:03 +00:00
mjg
51d134d072 Remove iBCS2, part3: the implementation
Reviewed by:	kib (previous version)
Sponsored by:	The FreeBSD Foundation
2018-12-19 22:02:49 +00:00
mjg
c39e5a0486 Remove iBCS2, part2: general kernel
Reviewed by:	kib (previous version)
Sponsored by:	The FreeBSD Foundation
2018-12-19 21:57:58 +00:00
dim
07b9c9ba27 Merge ^/head r340918 through r341763. 2018-12-09 11:39:45 +00:00
kib
ea29016877 Fix PAE boot.
With the introduction of M_EXEC support for kmem_malloc(), some kernel
mappings start having NX bit set in the paging structures early, for
PAE kernels on machines with NX support, i.e. practically on all
machines.  In particular, AP trampoline and initialization needs to
access pages which translations has NX bit set, before initializecpu()
is called.

Check for CPUID NX feature and enable EFER.NXE before we enable paging
in mp boot trampoline.  This allows the CPU to use the kernel page
table instead of generating page fault due to reserved bit set.

PR:	233819
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-12-08 22:12:57 +00:00
markj
53d55c9af9 Plug memory disclosures via ptrace(2).
On some architectures, the structures returned by PT_GET*REGS were not
fully populated and could contain uninitialized stack memory.  The same
issue existed with the register files in procfs.

Reported by:	Thomas Barabosch, Fraunhofer FKIE
Reviewed by:	kib
MFC after:	3 days
Security:	kernel stack memory disclosure
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18421
2018-12-03 20:54:17 +00:00
kib
df0eed217c Correct the tunable name in the message.
Submitted by:	 Andre Albsmeier <mail@fbsd.e4m.org>
PR:	231577
MFC after:	1 week
2018-12-01 16:43:18 +00:00
vangyzen
60ccc6987d Remove superfluous bzero in getcontext/swapcontext/sendsig
We zero the whole structure; we don't need to zero the __spare__ field again.

Remove trailing whitespace.

MFC after:	2 weeks
Sponsored by:	Dell EMC Isilon
2018-11-26 20:56:05 +00:00
dim
b4fff6918d Merge ^/head r340368 through r340426. 2018-11-14 06:46:44 +00:00
zeising
8f330921b9 Add evdev support to amd64 and i386 kernels
Include evdev support and drivers in the amd64 and i386 GENERIC and MINIMAL
kernels.  Evdev is used by X and wayland to handle input devices, and this
change, together with upcomming changes in ports will make us handle input
devices better in graphical UIs.

Reviewed by:	wulf, bapt, imp
Approved by:	imp
Differential Revision:	https://reviews.freebsd.org/D17912
2018-11-12 21:01:28 +00:00
dim
b45b4d8aa3 Merge ^/head r340126 through r340212. 2018-11-07 18:52:28 +00:00
jhb
ed96335a07 Add a custom implementation of cpu_lock_delay() for x86.
Avoid using DELAY() since it can try to use spin locks on CPUs without
a P-state invariant TSC.  For cpu_lock_delay(), always use the TSC if
it exists (even if it is not P-state invariant) to delay for a
microsecond.  If the TSC does not exist, read from I/O port 0x84 to
delay instead.

PR:		228768
Reported by:	Roger Hammerstein <cheeky.m@live.com>
Reviewed by:	kib
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D17851
2018-11-05 22:54:03 +00:00
jhb
81a93c8824 Add a KPI for the delay while spinning on a spin lock.
Replace a call to DELAY(1) with a new cpu_lock_delay() KPI.  Currently
cpu_lock_delay() is defined to DELAY(1) on all platforms.  However,
platforms with a DELAY() implementation that uses spin locks should
implement a custom cpu_lock_delay() doesn't use locks.

Reviewed by:	kib
MFC after:	3 days
2018-11-05 21:34:17 +00:00
dim
4b4bc3c457 Merge ^/head r339813 through r340125. 2018-11-04 15:49:06 +00:00
jhb
d180d56f38 Don't enter DDB for fatal traps before panic by default.
Add a new 'debugger_on_trap' knob separate from 'debugger_on_panic'
and make the calls to kdb_trap() in MD fatal trap handlers prior to
calling panic() conditional on this new knob instead of
'debugger_on_panic'.  Disable the new knob by default.  Developers who
wish to recover from a fatal fault by adjusting saved register state
and retrying the faulting instruction can still do so by enabling the
new knob.  However, for the more common case this makes the user
experience for panics due to a fatal fault match the user experience
for other panics, e.g. 'c' in DDB will generate a crash dump and
reboot the system rather than being stuck in an infinite loop of fatal
fault messages and DDB prompts.

Reviewed by:	kib, avg
MFC after:	2 months
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D17768
2018-11-01 21:34:17 +00:00
kevans
371f3d1fe1 i386/MINIMAL: VERBOSE_SYSINIT=0 for consistency
MFC after:	never
2018-10-31 22:55:43 +00:00
kevans
17b90b88be Compile in VERBOSE_SYSINIT support by default, remain silent by default
The loader tunable 'debug.verbose_sysinit' may be used to toggle verbosity.
This is added to the debugging section of these kernconfs to be turned off
in stable branches for clarity of intent.

MFC after:	never
2018-10-31 22:38:19 +00:00
markj
f931b753dd Add malloc_domainset(9) and _domainset variants to other allocator KPIs.
Remove malloc_domain(9) and most other _domain KPIs added in r327900.
The new functions allow the caller to specify a general NUMA domain
selection policy, rather than specifically requesting an allocation from
a specific domain.  The latter policy tends to interact poorly with
M_WAITOK, resulting in situations where a caller is blocked indefinitely
because the specified domain is depleted.  Most existing consumers of
the _domain KPIs are converted to instead use a DOMAINSET_PREF() policy,
in which we fall back to other domains to satisfy the allocation
request.

This change also defines a set of DOMAINSET_FIXED() policies, which
only permit allocations from the specified domain.

Discussed with:	gallatin, jeff
Reported and tested by:	pho (previous version)
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17418
2018-10-30 18:26:34 +00:00
dim
6c8ec16fbe Merge ^/head r339015 through r339669. 2018-10-23 21:09:37 +00:00
imp
d44887731c Remove the ncr(4) drive.
This driver has been obsolete since the FreeBSD 4.x. It should have
been removed then since the sym(4) driver had subsumed it. The driver
was commented out of GENERIC in 2000.

RelNotes: Yes
2018-10-22 02:36:18 +00:00
imp
c912dbecce Remove stg(4) driver
stg(4) is marked as gone in 12. Remove it. There are no sightings of
it in the nycbug dmesg database. It was for an obscure SCSI card that
sold mostly in Japan, and was especially popilar among pc98 hackers in
the 4.x time frame. It was also only enabled on i386.

Relnote: Yes
2018-10-22 02:35:50 +00:00
imp
2c200a05c0 Remove nsp(4) driver
nsp(4) is marked as gone in 12. Remove it. There are no sightings of
it in the nycbug dmesg database. It was for an obscure SCSI card that
sold mostly in Japan, and was especially popilar among pc98 hackers in
the 4.x time frame. It was also only enabled on i386.

Relnote: Yes
2018-10-22 02:35:38 +00:00
imp
bd9eaea6c1 Remove ncv(4) driver
ncv(4) is marked as gone in 12. Remove it. There are no sightings of
it in the nycbug dmesg database. It was for an obscure SCSI card that
sold mostly in Japan, and was especially popilar among pc98 hackers in
the 4.x time frame..

Relnote: Yes
2018-10-22 02:35:26 +00:00
imp
a5023bb106 Retire dpt(4)
Marked as gone in 12 and not relevant since the early 90s. No
sightings in nycbug's dmesg database.

Relnotes: yes
2018-10-22 02:35:12 +00:00