Commit Graph

13028 Commits

Author SHA1 Message Date
Konstantin Belousov
349216589d A different fix for the issue from r323722.
Split the handlers for pop of invalid selectors from the trap frame
into usermode and kernel variants.  Usermode handler is kept as is, it
restores the already loaded parts of the trap frame and jumps to set
up a signal delivery to the user process.

New kernel part of the handler emulates IRET treatment of the segments
which would violate access right.  It loads NUL selector in the
segment register which load causes the fault, and then continues the
return to interrupted kernel code.  Since invalid selectors in the
segment registers in the kernel mode can only exist while kernel still
enters or exits from userspace, we only zero invalid userspace
selectors.  If userspace tries to use the segment register, it gets a
signal, as if the processor segment descriptor cache was reloaded.

Reported by:	Maxime Villard <max@m00nbsd.net>
Suggested and reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-09-28 09:01:28 +00:00
Konstantin Belousov
053e8ce538 Restore a part of r323722.
Do not return from interrupt using the POP_FRAME;iret instruction
sequence, always jump to doreti.

The user segments selectors saved on the stack might become invalid
because userspace manipulated LDT in a parallel thread.  trap() is
aware of such issue, but it is only prepared to handle it at iret and
segment registers load operations in doreti path.

Also remove POP_FRAME macro because it is no longer used.

Reviewed by:	bde, jhb (as part of r323722)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-09-28 08:46:15 +00:00
Konstantin Belousov
d3c968bf84 Revert r323722. A better fix will be committed shortly, as well as
some still useful bits of the reverted revision.

The problem with the committed fix is that there are still issues with
returning from NMI, when NMI interrupted kernel in a moment where the
kernel segments selectors were still not loaded into registers.  If
this happens, the NMI return would loose the userspace selectors
because r323722 does not reload segment registers on return to kernel
mode.

Fixing the problem is complicated.  Since an alternative approach to
handle the original bug exists, it makes sence to stop adding more
complexity.

Discussed with:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-09-28 08:38:24 +00:00
Josh Paetzel
c77037f16f Fix indentation for r323068
PR:	220170
Reported by:	lidl
MFC after:	3 days
Pointyhat to:	jpaetzel
2017-09-19 20:40:05 +00:00
Konstantin Belousov
3cabd93e26 Do not do torn writes to active LDTs.
Care must be taken when updating the active LDT, since parallel
threads might try to load a segment descriptor which is currently
updated. Since the results are undefined, this cannot be ignored by
claiming to be an application race.

Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D12413
2017-09-19 17:57:04 +00:00
Konstantin Belousov
5efe338f3d Fix handling of the segment registers on i386.
Suppose that userspace is executing with the non-standard segment
descriptors.  Then, until exception or interrupt handler executed
SET_KERNEL_SEGS, kernel is still executing with user %ds, %es and %fs.
If an interrupt occurs in this window, the interrupt handler is
executed unsafely, relying on usability of the usermode registers.  If
the interrupt results in the context switch on return, the
contamination of the kernel state spreads to the thread we switched
to.  As result, kernel data accesses might fault or, if only the base
is changed, completely messed up.

More, if the user segment was allocated in LDT, another thread might
mark the descriptor as invalid before doreti code tried to reload
them.  In this case kernel panics.

The issue exists for all exception entry points which use trap gate,
and thus do not automatically disable interrupts on entry, and for
lcall_handler.

Fix is two-fold: first, we need to disable interrupts for all kernel
entries, changing the IDT descriptor types from trap gate to interrupt
gate.  Interrupts are re-enabled not earlier than the kernel segments
are loaded into the segment registers.  Second, we only load the
segment registers from the trap frame when returning to usermode.  For
the later, all interrupt return paths must happen through the doreti
common code.

There is no way to disable interrupts on call gate, which is the
supposed mode of servicing for lcall $7,$0 syscalls.  Change the LDT
descriptor 0 into a code segment type and point it to the userspace
trampoline which redirects the syscall to int $0x80.

All the measures make the segment register handling similar to that of
amd64.  We do not apply amd64 optimizations of not reloading segment
registers on return from the syscall.

Reported by:	Maxime Villard <max@m00nbsd.net>
Tested by:	pho (the non-lcall part)
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D12402
2017-09-18 20:22:42 +00:00
Josh Paetzel
9d0ec2a920 Revert r323087
This needs more thinking out and consensus, and the commit message
was wrong AND there was a typo in the commit.

pointyhat:	jpaetzel
2017-09-01 17:03:48 +00:00
Josh Paetzel
0be04b100c Take options IPSEC out of GENERIC
PR:	220170
Submitted by:	delphij
Reviewed by:	ae, glebius
MFC after:	2 weeks
Differential Revision:	D11806
2017-09-01 15:54:53 +00:00
Josh Paetzel
3b65550eec Allow kldload tcpmd5
PR:	220170
MFC after:	2 weeks
2017-08-31 20:16:28 +00:00
Alexander Motin
ed9652da5f Add NTB driver for PLX/Avago/Broadcom PCIe switches.
This driver supports both NTB-to-NTB and NTB-to-Root Port modes (though
the second with predictable complications on hot-plug and reboot events).
I tested it with PEX 8717 and PEX 8733 chips, but expect it should work
with many other compatible ones too.  It supports up to two NT bridges
per chip, each of which can have up to 2 64-bit or 4 32-bit memory windows,
6 or 12 scratchpad registers and 16 doorbells.  There are also 4 DMA engines
in those chips, but they are not yet supported.

While there, rename Intel NTB driver from generic ntb_hw(4) to more specific
ntb_hw_intel(4), so now it is on par with this new ntb_hw_plx(4) driver and
alike to Linux naming.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2017-08-30 21:16:32 +00:00
Conrad Meyer
2744a0b69b Drop CACHE_LINE_SIZE to 64 bytes on x86
The actual cache line size has always been 64 bytes.

The 128 number arose as an optimization for Core 2 era Intel processors.  By
default (configurable in BIOS), these CPUs would prefetch adjacent cache
lines unintelligently.  Newer CPUs prefetch more intelligently.

The latest Core 2 era CPU was introduced in September 2008 (Xeon 7400
series, "Dunnington").  If you are still using one of these CPUs, especially
in a multi-socket configuration, consider locating the "adjacent cache line
prefetch" option in BIOS and disabling it.

Reported by:	mjg
Reviewed by:	np
Discussed with:	jhb
Sponsored by:	Dell EMC Isilon
2017-08-28 22:28:41 +00:00
Konstantin Belousov
5e52cd4324 MFamd64 r322720, r322723:
Simplify i386 trap().

- Use more relevant name 'signo' instead of 'i' for the local variable
  which contains a signal number to send for the current exception.
- Eliminate two labels 'userout' and 'out' which point to the very end
  of the trap() function.  Instead use return directly.
- Re-indent the prot_fault_translation block by reducing if() nesting.
- Some more monor style changes.

Reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-08-26 18:12:25 +00:00
Konstantin Belousov
aefdf821cd Remove unused code.
The machdep.uprintf_signal sysctl replaced it in more convenient way,
not requiring recompilation to use and providing more information on
fault.

Reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-08-26 18:09:27 +00:00
Konstantin Belousov
325b406c49 MFamd64 r322718:
Use ANSI C declaration for trap_pfault().  Style.

Reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-08-26 18:06:29 +00:00
Konstantin Belousov
8ce5520255 MFamd64 r322719:
Trim excessive 'extern'.

Reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-08-26 18:04:29 +00:00
Konstantin Belousov
cef5e2bd91 Use the known valid segment when accessing memory in #UD handler.
Make sure that %eflags.D flag is cleared for hook.
Improve comments.

When #UD dtrace code checks for a registered hook before checking that
the exception was raised from kernel mode, we might run with the user
%ds, trapping on access.  Exception entry from userspace automatically
load valid %ss, which we can use there instead.

Noted and reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2017-08-19 21:00:02 +00:00
Konstantin Belousov
828304899d When checking that #UD comes from kernel mode, check that the
exception did not happen in vm86 mode.  A vm86 userland process could
have a %cs that matches GSEL_KPL, while dtrace cannot hook it.

Submitted by:	Maxime Villard <max@m00nbsd.net>
MFC after:	3 days
2017-08-18 17:11:15 +00:00
Mark Johnston
01938d3666 Rename mkdumpheader() and group EKCD functions in kern_shutdown.c.
This helps simplify the code in kern_shutdown.c and reduces the number
of globally visible functions.

No functional change intended.

Reviewed by:	cem, def
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D11603
2017-08-18 04:04:09 +00:00
Mark Johnston
50ef60dabe Factor out duplicated kernel dump code into dump_{start,finish}().
dump_start() and dump_finish() are responsible for writing kernel dump
headers, optionally writing the key when encryption is enabled, and
initializing the initial offset into the dump device.

Also remove the unused dump_pad(), and make some functions static now that
they're only called from kern_shutdown.c.

No functional change intended.

Reviewed by:	cem, def
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D11584
2017-08-18 03:52:35 +00:00
Conrad Meyer
dc6a82801d x86: Add dynamic interrupt rebalancing
Add an option to dynamically rebalance interrupts across cores
(hw.intrbalance); off by default.

The goal is to minimize preemption. By placing interrupt sources on distinct
CPUs, ithreads get preferentially scheduled on distinct CPUs.  Overall
preemption is reduced and latency is reduced. In our workflow it reduced
"fighting" between two high-frequency interrupt sources.  Reduced latency
was proven by, e.g., SPEC2008.

Submitted by:	jeff@ (earlier version)
Reviewed by:	kib@
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D10435
2017-08-16 18:48:53 +00:00
Jung-uk Kim
b5669d0aa8 Split identify_cpu() into two functions for amd64 as we do for i386. This
reduces diff between amd64 and i386.  Also, it fixes a regression introduced
in r322076, i.e., identify_hypervisor() failed to identify some hypervisors.
This function assumes cpu_feature2 is already initialized.

Reported by:	dexuan
Tested by:	dexuan
2017-08-09 18:09:09 +00:00
Jung-uk Kim
0105034487 Detect hypervisors early. We used to set lower hz on hypervisors by default
but it was broken since r273800 (and r278522, its MFC to stable/10) because
identify_cpu() is called too late, i.e., after init_param1().

MFC after:	3 days
2017-08-05 06:56:46 +00:00
Conrad Meyer
6f240e18b5 x86: Tag some intrinsics with __pure2
Some C wrappers for x86 instructions do not touch global memory and only act
on their arguments; they can be marked __pure2, aka __const__.  Without this
annotation, Clang 3.9.1 is not intelligent enough on its own to grok that
these functions are __const__.

Submitted by:	Anton Rang <anton.rang AT isilon.com>
Sponsored by:	Dell EMC Isilon
2017-08-03 22:28:30 +00:00
Konstantin Belousov
6632a4330f Do not call trapsignal() after handling usermode fault or interrupt,
when a signal is not intended to be sent.

The variable holding the signal number to send is left uninitialized,
which sometimes triggers invalid signal checks.

For NMI, a return to usermode without ast processing is done.  On the
other hand, for spurious dtrace probe interrupt it is usermode which
triggered the interrupt, so handle it through userret() as any other
fault.

Reported by:	Nils Beyer <nbe@renzel.net>
PR:	221151
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-08-02 10:12:10 +00:00
Mark Johnston
2375aaa8e9 Batch updates to v_wire_count when freeing page table pages on x86.
The removed release stores are not needed since stores are totally
ordered on i386 and amd64.

Reviewed by:	alc, kib (previous revision)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D11790
2017-08-01 05:26:30 +00:00
Konstantin Belousov
9adf30b0c3 Remove unused symbols.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-07-30 21:52:22 +00:00
Dmitry Chagin
c151945c86 Avoid using [LINUX_]SHAREDPAGE constant directly in the vdso code.
This is needed for https://reviews.freebsd.org/D11780.

Reported by:	kib@
2017-07-30 21:24:20 +00:00
Konstantin Belousov
f2c18deb0e Fix handling of one more possible exception on return to usermode.
If %ss is loaded with a segment pointing to a non-present descriptor
by the IRETD instruction, a kernel-mode #SS exception is generated.
Resulting T_STKFLT trap must be checked against doreti_iret_fault
location and handled, otherwise userspace may panic the kernel.

Note that this is i386 variant of FreeBSD-SA-15:21.amd64, but unlike
amd64, there is no swapgs on i386 and the issue is arguably not
exploitable.

Reported by:	Maxime Villard <max@m00nbsd.net>
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-07-08 11:07:39 +00:00
Sean Bruno
6ef9566177 Garbage collect kernel option TWA_FLASH_FIRMWARE
Submitted by:	 kevin.bowling0kev009.com
Differential Revision:	https://reviews.freebsd.org/D11387
2017-07-03 19:33:50 +00:00
Dmitry Chagin
a0c59c7afd Add support for musl consumers to the Linuxulator.
PR:		213809
Submitted by:	Yonas Yanfa
Reported by:	Yonas Yanfa
MFC after:	1 week
Relnotes:	yes
2017-07-03 10:24:49 +00:00
Alan Cox
510cdf22a2 When "force" is specified to pmap_invalidate_cache_range(), the given
start address is not required to be page aligned.  However, the loop
within pmap_invalidate_cache_range() that performs the actual cache
line invalidations requires that the starting address be truncated to
a multiple of the cache line size.  This change corrects an error in
that truncation.

Submitted by:	Brett Gutstein <bgutstein@rice.edu>
Reviewed by:	kib
MFC after:	1 week
2017-07-01 16:42:09 +00:00
Jason A. Harmening
eb36b1d0bc Clean up MD pollution of bus_dma.h:
--Remove special-case handling of sparc64 bus_dmamap* functions.
  Replace with a more generic mechanism that allows MD busdma
  implementations to generate inline mapping functions by
  defining WANT_INLINE_DMAMAP in <machine/bus_dma.h>.  This
  is currently useful for sparc64, x86, and arm64, which all
  implement non-load dmamap operations as simple wrappers
  around map objects which may be bus- or device-specific.

--Remove NULL-checked bus_dmamap macros.  Implement the
  equivalent NULL checks in the inlined x86 implementation.
  For non-x86 platforms, these checks are a minor pessimization
  as those platforms do not currently allow NULL maps.  NULL
  maps were originally allowed on arm64, which appears to have
  been the motivation behind adding arm[64]-specific barriers
  to bus_dma.h, but that support was removed in r299463.

--Simplify the internal interface used by the bus_dmamap_load*
  variants and move it to bus_dma_internal.h

--Fix some drivers that directly include sys/bus_dma.h
  despite the recommendations of bus_dma(9)

Reviewed by:	kib (previous revision), marius
Differential Revision:	https://reviews.freebsd.org/D10729
2017-07-01 05:35:29 +00:00
Konstantin Belousov
f7df80f4ed Fix indent.
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2017-06-24 10:19:06 +00:00
Konstantin Belousov
746e20fdb1 Correct translations between abridged and full x87 tags.
Reported and tested by:	karnajit wangkhem <karnajitw@gmail.com>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-06-17 11:25:31 +00:00
Konstantin Belousov
2d88da2f06 Move struct syscall_args syscall arguments parameters container into
struct thread.

For all architectures, the syscall trap handlers have to allocate the
structure on the stack.  The structure takes 88 bytes on 64bit arches
which is not negligible.  Also, it cannot be easily found by other
code, which e.g. caused duplication of some members of the structure
to struct thread already.  The change removes td_dbg_sc_code and
td_dbg_sc_nargs which were directly copied from syscall_args.

The structure is put into the copied on fork part of the struct thread
to make the syscall arguments information correct in the child after
fork.

This move will also allow several more uses shortly.

Reviewed by:	jhb (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
X-Differential revision:	https://reviews.freebsd.org/D11080
2017-06-12 21:03:23 +00:00
Konstantin Belousov
43f41dd393 Make struct syscall_args visible to userspace compilation environment
from machine/proc.h, consistently on all architectures.

Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
X-Differential revision:	https://reviews.freebsd.org/D11080
2017-06-12 20:53:44 +00:00
John Baldwin
9d24f98ca8 Remove the BSD/OS 2.1 system call gate LDT entry.
An extra copy of the system call gate was added to the default LDT back
in 1996 (r18513 / r18514).  However, the ability to run BSD/OS 2.1
i386 binaries under FreeBSD's native ABI is most likely no longer
needed.

Discussed with:	kib
2017-05-23 22:34:18 +00:00
Ed Maste
3e85b721d6 Remove register keyword from sys/ and ANSIfy prototypes
A long long time ago the register keyword told the compiler to store
the corresponding variable in a CPU register, but it is not relevant
for any compiler used in the FreeBSD world today.

ANSIfy related prototypes while here.

Reviewed by:	cem, jhb
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D10193
2017-05-17 00:34:34 +00:00
Jung-uk Kim
af1973281e Use kmem_malloc() instead of malloc(9) for the native amd64 filter.
r316767 broke the BPF JIT compiler for amd64 because malloc()'d space is no
longer executable.

Discussed with:	kib, alc
2017-04-17 22:02:09 +00:00
Jung-uk Kim
e329e330d4 Move declarations for a machine-dependent function to the header file. 2017-04-17 21:51:26 +00:00
Jung-uk Kim
3a0e8b7373 Reduce diff with amd64 version. 2017-04-17 21:46:54 +00:00
Gleb Smirnoff
83c9dea1ba - Remove 'struct vmmeter' from 'struct pcpu', leaving only global vmmeter
in place.  To do per-cpu stats, convert all fields that previously were
  maintained in the vmmeters that sit in pcpus to counter(9).
- Since some vmmeter stats may be touched at very early stages of boot,
  before we have set up UMA and we can do counter_u64_alloc(), provide an
  early counter mechanism:
  o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter.
  o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter,
    so that at early stages of boot, before counters are allocated we already
    point to a counter that can be safely written to.
  o For sparc64 that required a whole dummy pcpu[MAXCPU] array.

Further related changes:
- Don't include vmmeter.h into pcpu.h.
- vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit,
  to match kernel representation.
- struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exclusion.

This is based on benno@'s 4-year old patch:
https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html

Reviewed by:	kib, gallatin, marius, lidl
Differential Revision:	https://reviews.freebsd.org/D10156
2017-04-17 17:34:47 +00:00
Gleb Smirnoff
75c4b0b5ac Remove unused assembly symbols pointing to vmmeter. 2017-04-17 17:18:07 +00:00
Patrick Kelsey
67d955aab4 Corrected misspelled versions of rendezvous.
The MFC will include a compat definition of smp_no_rendevous_barrier()
that calls smp_no_rendezvous_barrier().

Reviewed by:	gnn, kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D10313
2017-04-09 02:00:03 +00:00
Tai-hwa Liang
7ece126ed8 Trying to be more compatible with Linux if.h definitions:
- renaming l_ifreq::ifru_metric to l_ifreq::ifru_ivalue;
	- adding a definition for ifr_ifindex which points to l_ifreq::ifru_ivalue.

A quick search indicates that Linux already got the above changes since 2.1.14.

Reviewed by:	kib, marcel, dchagin
MFC after:	1 week
2017-04-08 14:41:39 +00:00
Mark Johnston
5788c2bde1 Adjust the constraint for "src" in atomic_(f)cmpset_8.
"r" is not sufficient to prevent the use of invalid byte-width registers
with at least gcc.

Reported and reviewed by:	bde
X-MFC-With:	r315718
2017-03-27 16:18:19 +00:00
Andriy Gapon
978f3da16f revert r315959 because it causes build problems
The change introduced a dependency between genassym.c and header files
generated from .m files, but that dependency is not specified in the
make files.

Also, the change could be not as useful as I thought it was.

Reported by:	dchagin, Manfred Antar <null@pozo.com>, and many others
2017-03-27 12:34:29 +00:00
Bruce Evans
f434f3515b Fix printing of negative offsets (typically from frame pointers) again.
I fixed this in 1997, but the fix was over-engineered and fragile and
was broken in 2003 if not before.  i386 parameters were copied to 8
other arches verbatim, mostly after they stopped working on i386, and
mostly without the large comment saying how the values were chosen on
i386.  powerpc has a non-verbatim copy which just changes the uncritical
parameter and seems to add a sign extension bug to it.

Just treat negative offsets as offsets if they are no more negative than
-db_offset_max (default -64K), and remove all the broken parameters.

-64K is not very negative, but it is enough for frame and stack pointer
offsets since kernel stacks are small.

The over-engineering was mainly to go more negative than -64K for the
negative offset format, without affecting printing for more than a
single address.

Addresses in the top 64K of a (full 32-bit or 64-bit) address space
are now printed less well, but there aren't many interesting ones.
For arches that have many interesting ones very near the top (e.g.,
68k has interrupt vectors there), there would be no good limit for
the negative offset format and -64K is a good as anything.
2017-03-26 18:46:35 +00:00
Andriy Gapon
a7b4c009e1 specific end of interrupt implementation for AMD Local APIC
The change is more intrusive than I would like because the feature
requires that a vector number is written to a special register.
Thus, now the vector number has to be provided to lapic_eoi().
It was readily available in the IO-APIC and MSI cases, but the IPI
handlers required more work.
Also, we now store the VMM IPI number in a global variable, so that it
is available to the justreturn handler for the same reason.

Reviewed by:	kib
MFC after:	6 weeks
Differential Revision: https://reviews.freebsd.org/D9880
2017-03-25 18:45:09 +00:00
Dmitry Chagin
6c2a934b79 Implement Linux mincore() system call.
This is necessary for the upcoming drm-next.

Suggested by:	hselasky@
MFC after:	1 month
2017-03-25 15:47:29 +00:00