Commit Graph

8130 Commits

Author SHA1 Message Date
mjg
420c428175 amd64: annotate the syscall return address check with __predict_false
before:
   0xffffffff80b03ebb <+2059>:	mov    0x460(%r14),%rax
   0xffffffff80b03ec2 <+2066>:	mov    0x98(%rax),%rax
   0xffffffff80b03ec9 <+2073>:	shr    $0x2f,%rax
   0xffffffff80b03ecd <+2077>:	je     0xffffffff80b03edd <amd64_syscall+2093>
   0xffffffff80b03ecf <+2079>:	mov    0x3f8(%r14),%rax
   0xffffffff80b03ed6 <+2086>:	orl    $0x1,0xc8(%rax)
   0xffffffff80b03edd <+2093>:	add    $0xf8,%rsp

after:
   0xffffffff80b03ebb <+2059>:	mov    0x460(%r14),%rax
   0xffffffff80b03ec2 <+2066>:	mov    0x98(%rax),%rax
   0xffffffff80b03ec9 <+2073>:	shr    $0x2f,%rax
   0xffffffff80b03ecd <+2077>:	jne    0xffffffff80b03eef <amd64_syscall+2111>
   0xffffffff80b03ecf <+2079>:	add    $0xf8,%rsp

Reviewed by:	kib
MFC after:	1 week
2017-08-02 11:25:38 +00:00
kib
38f0d940ba Do not call trapsignal() after handling usermode fault or interrupt,
when a signal is not intended to be sent.

The variable holding the signal number to send is left uninitialized,
which sometimes triggers invalid signal checks.

For NMI, a return to usermode without ast processing is done.  On the
other hand, for spurious dtrace probe interrupt it is usermode which
triggered the interrupt, so handle it through userret() as any other
fault.

Reported by:	Nils Beyer <nbe@renzel.net>
PR:	221151
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-08-02 10:12:10 +00:00
truckman
43c5cd502a Lower the amd64 shared page, which contains the signal trampoline,
from the top of user memory to one page lower on machines with the
Ryzen (AMD Family 17h) CPU.  This pushes ps_strings and the stack
down by one page as well.  On Ryzen there is some sort of interaction
between code running at the top of user memory address space and
interrupts that can cause FreeBSD to either hang or silently reset.
This sounds similar to the problem found with DragonFly BSD that
was fixed with this commit:
  https://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/b48dd28447fc8ef62fbc963accd301557fd9ac20
but our signal trampoline location was already lower than the address
that DragonFly moved their signal trampoline to.  It also does not
appear to be related to SMT as described here:
  https://www.phoronix.com/forums/forum/hardware/processors-memory/955368-some-ryzen-linux-users-are-facing-issues-with-heavy-compilation-loads?p=955498#post955498

  "Hi, Matt Dillon here. Yes, I did find what I believe to be a
   hardware issue with Ryzen related to concurrent operations. In a
   nutshell, for any given hyperthread pair, if one hyperthread is
   in a cpu-bound loop of any kind (can be in user mode), and the
   other hyperthread is returning from an interrupt via IRETQ, the
   hyperthread issuing the IRETQ can stall indefinitely until the
   other hyperthread with the cpu-bound loop pauses (aka HLT until
   next interrupt). After this situation occurs, the system appears
   to destabilize. The situation does not occur if the cpu-bound
   loop is on a different core than the core doing the IRETQ. The
   %rip the IRETQ returns to (e.g. userland %rip address) matters a
   *LOT*. The problem occurs more often with high %rip addresses
   such as near the top of the user stack, which is where DragonFly's
   signal trampoline traditionally resides. So a user program taking
   a signal on one thread while another thread is cpu-bound can cause
   this behavior. Changing the location of the signal trampoline
   makes it more difficult to reproduce the problem. I have not
   been because the able to completely mitigate it. When a cpu-thread
   stalls in this manner it appears to stall INSIDE the microcode
   for IRETQ. It doesn't make it to the return pc, and the cpu thread
   cannot take any IPIs or other hardware interrupts while in this
   state."
since the system instability has been observed on FreeBSD with SMT
disabled.  Interrupts to appear to play a factor since running a
signal-intensive process on the first CPU core, which handles most
of the interrupts on my machine, is far more likely to trigger the
problem than running such a process on any other core.

Also lower sv_maxuser to prevent a malicious user from using mmap()
to load and execute code in the top page of user memory that was made
available when the shared page was moved down.

Make the same changes to the 64-bit Linux emulator.

PR:		219399
Reported by:	nbe@renzel.net
Reviewed by:	kib
Reviewed by:	dchagin (previous version)
Tested by:	nbe@renzel.net (earlier version)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D11780
2017-08-02 01:43:35 +00:00
markj
b23f6b9b03 Batch updates to v_wire_count when freeing page table pages on x86.
The removed release stores are not needed since stores are totally
ordered on i386 and amd64.

Reviewed by:	alc, kib (previous revision)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D11790
2017-08-01 05:26:30 +00:00
kib
030abb1986 Remove unused symbols.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-07-30 21:52:22 +00:00
dchagin
93ea6317f4 Avoid using [LINUX_]SHAREDPAGE constant directly in the vdso code.
This is needed for https://reviews.freebsd.org/D11780.

Reported by:	kib@
2017-07-30 21:24:20 +00:00
alc
621c14d3f9 Add support for pmap_enter(..., psind=1) to the amd64 pmap. In other words,
add support for explicitly requesting that pmap_enter() create a 2MB page
mapping.  (Essentially, this feature allows the machine-independent layer to
create superpage mappings preemptively, and not wait for automatic promotion
to occur.)

Export pmap_ps_enabled() to the machine-independent layer.

Add a flag to pmap_pv_insert_pde() that specifies whether it should fail or
reclaim a PV entry when one is not available.

Refactor pmap_enter_pde() into two functions, one by the same name, that is
a general-purpose function for creating PDE PG_PS mappings, and another,
pmap_enter_2mpage(), that is used to prefault 2MB read- and/or execute-only
mappings for execve(2), mmap(2), and shmat(2).

Submitted by:	Yufeng Zhou <yz70@rice.edu> (an earlier version)
Reviewed by:	kib, markj
Tested by:	pho
MFC after:	10 days
Differential Revision:	https://reviews.freebsd.org/D11556
2017-07-23 06:33:58 +00:00
rlibby
dfe1112fa8 __pcpu: gcc -Wredundant-decls
Pollution from counter.h made __pcpu visible in amd64/pmap.c.  Delete
the existing extern decl of __pcpu in amd64/pmap.c and avoid referring
to that symbol, instead accessing the pcpu region via PCPU_SET macros.
Also delete an unused extern decl of __pcpu from mp_x86.c.

Reviewed by:	kib
Approved by:	markj (mentor)
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D11666
2017-07-21 17:11:36 +00:00
rlibby
1eb6bbbb6e efi: restrict visibility of EFIABI_ATTR-declared functions
In-tree gcc (4.2) doesn't understand __attribute__((ms_abi))
(EFIABI_ATTR).  Avoid declaring functions with that attribute when the
compiler is detected to be gcc < 4.4.

Reviewed by:	kib, imp (previous version)
Approved by:	markj (mentor)
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D11636
2017-07-20 06:47:06 +00:00
alc
91e58f1e61 Style-only change: Consistently use the variable name "pdpg" throughout
this file.  Previously, half of the pointers to a vm_page being used as
a page directory page were named "pdpg" and the rest were named "mpde".

Discussed with:	kib
MFC after:	1 week
2017-07-15 16:42:55 +00:00
alc
cebebf1a21 Extract the innermost loop of pmap_remove() out into its own function,
pmap_remove_ptes().  (This new function will also be used by an upcoming
change to pmap_enter() that adds support for psind == 1 mappings.)

Submitted by:	Yufeng Zhou <yz70@rice.edu> (an earlier version)
Reviewed by:	kib, markj
MFC after:	1 week
2017-07-15 01:49:54 +00:00
kib
4021fe5d78 Fix size argument to vm_pager_allocate(), it is in bytes, not in pages.
It is believed to be only cosmetic.

Noted by:	andrew
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-07-13 08:23:37 +00:00
kib
7fe4b6ffe3 Revert r320936 to recommit with the correct log message. 2017-07-13 08:23:12 +00:00
kib
c5a4bd67a3 It is believed to be only cosmetic.
Noted by:	andrew
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-07-13 08:19:50 +00:00
ian
242599409b Protect access to the AT realtime clock with its own mutex.
The mutex protecting access to the registered realtime clock should not be
overloaded to protect access to the atrtc hardware, which might not even be
the registered rtc. More importantly, the resettodr mutex needs to be
eliminated to remove locking/sleeping restrictions on clock drivers, and
that can't happen if MD code for amd64 depends on it. This change moves the
protection into what's really being protected: access to the atrtc date and
time registers.

This change also adds protection when the clock is accessed from
xentimer_settime(), which bypasses the resettodr locking.

Differential Revision:	https://reviews.freebsd.org/D11483
2017-07-12 02:42:57 +00:00
imp
a87c7a85be An MMC/SD/SDIO stack using CAM
Implement the MMC/SD/SDIO protocol within a CAM framework. CAM's
flexible queueing will make it easier to write non-storage drivers
than the legacy stack. SDIO drivers from both the kernel and as
userland daemons are possible, though much of that functionality will
come later.

Some of the CAM integration isn't complete (there are sleeps in the
device probe state machine, for example), but those minor issues can
be improved in-tree more easily than out of tree and shouldn't gate
progress on other fronts. Appologies to reviews if specific items
have been overlooked.

Submitted by: Ilya Bakulin
Reviewed by: emaste, imp, mav, adrian, ian
Differential Review: https://reviews.freebsd.org/D4761

merge with first commit, various compile hacks.
2017-07-09 16:57:24 +00:00
rlibby
2160513c84 amd-vi: gcc build errors
amdvi_cmp_wait: gcc complained about a malformed string behind an ifdef.

struct amdvi_dte: widen the type of the first reserved bitfield so that
the packed representation would not cross an alignment boundary for that
type. Apparently that causes in-tree gcc (4.2) to insert padding
(despite packed, resulting in a wrong structure definition), and causes
more modern gcc to emit a warning.

ivrs_hdr_iterate_tbl: delete a misleading check about header length
being less than 0 (the type is unsigned) and replace it with a check
that the length doesn't exceed the table size.

Reviewed by:	anish, grehan
Approved by:	markj (mentor)
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D11485
2017-07-07 06:37:19 +00:00
sbruno
12de4d18c8 Garbage collect kernel option TWA_FLASH_FIRMWARE
Submitted by:	 kevin.bowling0kev009.com
Differential Revision:	https://reviews.freebsd.org/D11387
2017-07-03 19:33:50 +00:00
dchagin
a0f4bac287 Add support for musl consumers to the Linuxulator.
PR:		213809
Submitted by:	Yonas Yanfa
Reported by:	Yonas Yanfa
MFC after:	1 week
Relnotes:	yes
2017-07-03 10:24:49 +00:00
alc
bf2a2bba1b When "force" is specified to pmap_invalidate_cache_range(), the given
start address is not required to be page aligned.  However, the loop
within pmap_invalidate_cache_range() that performs the actual cache
line invalidations requires that the starting address be truncated to
a multiple of the cache line size.  This change corrects an error in
that truncation.

Submitted by:	Brett Gutstein <bgutstein@rice.edu>
Reviewed by:	kib
MFC after:	1 week
2017-07-01 16:42:09 +00:00
jah
d1caaa9300 Clean up MD pollution of bus_dma.h:
--Remove special-case handling of sparc64 bus_dmamap* functions.
  Replace with a more generic mechanism that allows MD busdma
  implementations to generate inline mapping functions by
  defining WANT_INLINE_DMAMAP in <machine/bus_dma.h>.  This
  is currently useful for sparc64, x86, and arm64, which all
  implement non-load dmamap operations as simple wrappers
  around map objects which may be bus- or device-specific.

--Remove NULL-checked bus_dmamap macros.  Implement the
  equivalent NULL checks in the inlined x86 implementation.
  For non-x86 platforms, these checks are a minor pessimization
  as those platforms do not currently allow NULL maps.  NULL
  maps were originally allowed on arm64, which appears to have
  been the motivation behind adding arm[64]-specific barriers
  to bus_dma.h, but that support was removed in r299463.

--Simplify the internal interface used by the bus_dmamap_load*
  variants and move it to bus_dma_internal.h

--Fix some drivers that directly include sys/bus_dma.h
  despite the recommendations of bus_dma(9)

Reviewed by:	kib (previous revision), marius
Differential Revision:	https://reviews.freebsd.org/D10729
2017-07-01 05:35:29 +00:00
kib
3c908eddcb Translate between abridged and full x87 tags for compat32
ptrace(PT_GETFPREGS).

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-06-24 11:38:31 +00:00
kib
e2a14c603f Move struct syscall_args syscall arguments parameters container into
struct thread.

For all architectures, the syscall trap handlers have to allocate the
structure on the stack.  The structure takes 88 bytes on 64bit arches
which is not negligible.  Also, it cannot be easily found by other
code, which e.g. caused duplication of some members of the structure
to struct thread already.  The change removes td_dbg_sc_code and
td_dbg_sc_nargs which were directly copied from syscall_args.

The structure is put into the copied on fork part of the struct thread
to make the syscall arguments information correct in the child after
fork.

This move will also allow several more uses shortly.

Reviewed by:	jhb (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
X-Differential revision:	https://reviews.freebsd.org/D11080
2017-06-12 21:03:23 +00:00
kib
7b6fe97487 Make struct syscall_args visible to userspace compilation environment
from machine/proc.h, consistently on all architectures.

Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
X-Differential revision:	https://reviews.freebsd.org/D11080
2017-06-12 20:53:44 +00:00
alc
56e961bd47 Eliminate duplication of the pmap and pv list unlock operations in
pmap_enter() by implementing a single return path.  Otherwise, the
duplication will only increase with the upcoming support for psind == 1.

Reviewed by:	kib (some time ago)
2017-06-03 17:24:13 +00:00
dchagin
8ab86e050d In r246085 some bits that are MI movied out into headers in compat/linux,
but I missed that when I commited x86_64 Linuxulator. So remove the duplicates.

MFC after:	1 week
2017-05-28 08:46:57 +00:00
trasz
31b7c45db9 Bump default MAXTSIZ (kern.maxtsiz) from 128MB to 32GB. The old limit
prevents one from running eg clang built with debug; the new one is
arbitrary (equal to MAXDSIZ) and... well, should be quite future-proof.

Same fix might be applicable to other 64 bit architectures; I'll ask
their respective maintainers to make sure it won't break anything.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D10758
2017-05-17 08:38:41 +00:00
emaste
1901c3e1f2 Remove register keyword from sys/ and ANSIfy prototypes
A long long time ago the register keyword told the compiler to store
the corresponding variable in a CPU register, but it is not relevant
for any compiler used in the FreeBSD world today.

ANSIfy related prototypes while here.

Reviewed by:	cem, jhb
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D10193
2017-05-17 00:34:34 +00:00
cem
5c7d65801e Correct page frame mask constant used in pmap_change_attr_locked
This was introduced in r290156.  It's present in 11.0, but not any 10.x
release unless someone decided to MFC it.

It affects ordinary pages right above the DMAP limit, which is effectively
system memory rounded up to a 1 GB (3rd level superpage) boundary (or up to
a minimum of 4 GB, on small systems).

Reported by:	vangyzen
Reviewed by:	kib, alc
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D4030
2017-05-16 16:20:22 +00:00
kib
542fd28222 Ensure that resume path on amd64 only accesses page tables for normal
operation after processor is configured to allow all required
features.

In particular, NX must be enabled in EFER, otherwise load of page
table element with nx bit set causes reserved bit page fault.  Since
malloc uses direct mapping for small allocations, in particular for
the suspension pcbs, and DMAP is nx after r316767, this commit tripped
fault on resume path.

Restore complete state of EFER while wakeup code is still executing
with custom page table, before calling resumectx, instead of trying to
guess which features might be needed before resumectx restored EFER on
its own.

Bisected and tested by:	trasz
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2017-05-15 20:52:43 +00:00
sephe
1e56a84cc0 pcicfg: Fix direct calls of pci_cfg{read,write} on systems w/o PCI host bridge.
Reported by:	dexuan@
Reviewed by:	jhb@
MFC after:	1 week
Sponsored by:	Microsoft
Differential Revision: https://reviews.freebsd.org/D10564
2017-05-04 05:28:46 +00:00
anish
6bb9bc6652 Add AMD IOMMU/AMD-Vi support in bhyve for passthrough/direct assignment to VMs. To enable AMD-Vi, set hw.vmm.amdvi.enable=1.
Reviewed by:bcr
Approved by:grehan
Tested by:rgrimes
Differential Revision:https://reviews.freebsd.org/D10049
2017-04-30 02:08:46 +00:00
jkim
9445e9d6e1 Use kmem_malloc() instead of malloc(9) for the native amd64 filter.
r316767 broke the BPF JIT compiler for amd64 because malloc()'d space is no
longer executable.

Discussed with:	kib, alc
2017-04-17 22:02:09 +00:00
jkim
22babe5387 Move declarations for a machine-dependent function to the header file. 2017-04-17 21:51:26 +00:00
glebius
21ead51d79 - Remove 'struct vmmeter' from 'struct pcpu', leaving only global vmmeter
in place.  To do per-cpu stats, convert all fields that previously were
  maintained in the vmmeters that sit in pcpus to counter(9).
- Since some vmmeter stats may be touched at very early stages of boot,
  before we have set up UMA and we can do counter_u64_alloc(), provide an
  early counter mechanism:
  o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter.
  o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter,
    so that at early stages of boot, before counters are allocated we already
    point to a counter that can be safely written to.
  o For sparc64 that required a whole dummy pcpu[MAXCPU] array.

Further related changes:
- Don't include vmmeter.h into pcpu.h.
- vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit,
  to match kernel representation.
- struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exclusion.

This is based on benno@'s 4-year old patch:
https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html

Reviewed by:	kib, gallatin, marius, lidl
Differential Revision:	https://reviews.freebsd.org/D10156
2017-04-17 17:34:47 +00:00
glebius
0266d4259e Remove unused assembly symbols pointing to vmmeter. 2017-04-17 17:18:07 +00:00
glebius
5763443023 All these files need sys/vmmeter.h, but now they got it implicitly
included via sys/pcpu.h.
2017-04-17 17:07:00 +00:00
kib
6eae6fe2b3 Map DMAP as nx.
Demotions preserve PG_NX, so it is enough to set nx bit for initial
lowest-level paging entries.

Suggested and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-04-13 15:49:55 +00:00
pkelsey
33064e92a2 Corrected misspelled versions of rendezvous.
The MFC will include a compat definition of smp_no_rendevous_barrier()
that calls smp_no_rendezvous_barrier().

Reviewed by:	gnn, kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D10313
2017-04-09 02:00:03 +00:00
avatar
93f11e526d Trying to be more compatible with Linux if.h definitions:
- renaming l_ifreq::ifru_metric to l_ifreq::ifru_ivalue;
	- adding a definition for ifr_ifindex which points to l_ifreq::ifru_ivalue.

A quick search indicates that Linux already got the above changes since 2.1.14.

Reviewed by:	kib, marcel, dchagin
MFC after:	1 week
2017-04-08 14:41:39 +00:00
avg
7a52acd8b3 revert r315959 because it causes build problems
The change introduced a dependency between genassym.c and header files
generated from .m files, but that dependency is not specified in the
make files.

Also, the change could be not as useful as I thought it was.

Reported by:	dchagin, Manfred Antar <null@pozo.com>, and many others
2017-03-27 12:34:29 +00:00
bde
254458ab34 Fix printing of negative offsets (typically from frame pointers) again.
I fixed this in 1997, but the fix was over-engineered and fragile and
was broken in 2003 if not before.  i386 parameters were copied to 8
other arches verbatim, mostly after they stopped working on i386, and
mostly without the large comment saying how the values were chosen on
i386.  powerpc has a non-verbatim copy which just changes the uncritical
parameter and seems to add a sign extension bug to it.

Just treat negative offsets as offsets if they are no more negative than
-db_offset_max (default -64K), and remove all the broken parameters.

-64K is not very negative, but it is enough for frame and stack pointer
offsets since kernel stacks are small.

The over-engineering was mainly to go more negative than -64K for the
negative offset format, without affecting printing for more than a
single address.

Addresses in the top 64K of a (full 32-bit or 64-bit) address space
are now printed less well, but there aren't many interesting ones.
For arches that have many interesting ones very near the top (e.g.,
68k has interrupt vectors there), there would be no good limit for
the negative offset format and -64K is a good as anything.
2017-03-26 18:46:35 +00:00
avg
04ec8ce247 specific end of interrupt implementation for AMD Local APIC
The change is more intrusive than I would like because the feature
requires that a vector number is written to a special register.
Thus, now the vector number has to be provided to lapic_eoi().
It was readily available in the IO-APIC and MSI cases, but the IPI
handlers required more work.
Also, we now store the VMM IPI number in a global variable, so that it
is available to the justreturn handler for the same reason.

Reviewed by:	kib
MFC after:	6 weeks
Differential Revision: https://reviews.freebsd.org/D9880
2017-03-25 18:45:09 +00:00
dchagin
7e4bbabbee Implement Linux mincore() system call.
This is necessary for the upcoming drm-next.

Suggested by:	hselasky@
MFC after:	1 month
2017-03-25 15:47:29 +00:00
bde
27ac811b7a Remove buggy adjustment of page tables in db_write_bytes().
Long ago, perhaps only on i386, kernel text was mapped read-only and
it was necessary to change the mapping to read-write to set breakpoints
in kernel text.  Other writes by ddb to kernel text were also allowed.
This write protection is harder to implement with 4MB pages, and was
lost even for 4K pages when 4MB pages were implemented.  So changing
the mapping became useless.  It was actually worse than useless since
it followed followed various null and otherwise garbage pointers to
not change random memory instead of the mapping.  (On i386s, the
pointers became good in pmap_bootstrap(), and on amd64 the pointers
became bad in pmap_bootstrap() if not before.)

Another bug broke detection of following of null pointers on i386,
except early in boot where not detecting this was a feature.  When
I fixed the bug, I accidentally broke the feature and soon got traps
in db_write_bytes().  Setting breakpoints early in ddb was broken.

kib pointed out that a clean way to do the adjustment would be to use
a special [sub]map giving a small window on the bytes to be written.

The trap handler didn't know how to fix up errors for pagefaults
accessing the map itself.  Such errors rarely need fixups, since most
traps for the map are for the first access which is a read.

Reviewed by:	kib
2017-03-24 17:34:55 +00:00
ed
63254ceea6 Stop providing the compat_3_brand.
As of r315860, the ELF image activator works fine for CloudABI without it.

Reviewed by:	kib
MFC after:	2 weeks
2017-03-23 14:12:21 +00:00
kib
da63ef1f60 Update r315753 with the proper flag name.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-03-22 22:28:13 +00:00
kib
a22b5a3135 Add a flag BI_BRAND_ONLY_STATIC to specify that the brand only
matches static binaries.

Interpretation of the 'static' there is that the binary must not
specify an interpreter.  In particular, shared objects are matched by
the brand if BI_CAN_EXEC_DYN is also set.

This improves precision of the brand matching, which should eliminate
surprises due to brand ordering.

Revert r315701.

Discussed with and tested by:	ed (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-03-22 22:23:01 +00:00
markj
c4e0ff355e Add support for 8- and 16-bit atomic_(f)cmpset to x86.
Reviewed by:	kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D10068
2017-03-22 17:29:04 +00:00
ed
fc95dfd2e5 Set the interpreter path to /nonexistent.
CloudABI executables are statically linked and don't have an
interpreter. Setting the interpreter path to NULL used to work
previously, but r314851 introduced code that checks the string
unconditionally. Running CloudABI executables now causes a null pointer
dereference.

Looking at the rest of imgact_elf.c, it seems various other codepaths
already leaned on the fact that the interpreter path is set. Let's just
go ahead and pick an obviously incorrect interpreter path to appease
imgact_elf.c.

MFC after:	1 week
2017-03-22 07:05:27 +00:00
dchagin
69ea87350f Implement getrandom() syscall.
Note. GRND_RANDOM option is not supported for now.

MFC after:	1 month
2017-03-18 18:34:29 +00:00
dchagin
82392d7947 To reduce code duplication move socket defines to the MI path.
MFC after:	1 week
2017-03-18 18:23:30 +00:00
bde
02d7df1d66 Don't access the reserved registers %dr4 and %dr5 on i386.
On the original i386, %dr[4-5] were unimplemented but not very clearly
reserved, so debuggers read them to print them.  i386 was still doing
this.

On the original athlon64, %dr[4-5] are documented as reserved but are
aliased to %dr[6-7] unless CR4_DE is set, when accessing them traps.

On 2 of my systems, accessing %dr[4-5] trapped sometimes.  On my Haswell
system, the apparent randomness was because the boot CPU starts with
CR4_DE set while all other CPUs start with CR4_DE clear.  FreeBSD
doesn't support the data breakpoints enabled by CR4_DE and it never
changes this flag, so the flag remains different across CPUs and
the behaviour seemed inconsistent except while booting when the CPU
doesn't change.

The invalid accesses broke:
- read access for printing the registers in ddb "show watches" on CPUs
  with CR4_DE set
- read accesses in fill_dbregs() on CPUs with CR4_DE set.  This didn't
  implement panic(3) since the user case always skipped %dr[4-5].
- write accesses in set_dbregs().  This also didn't affect userland.
  When it didn't trap, the aliasing made it fragile.

Don't print the dummy (zero) values of %dr[4-5] in "show watches" for
i386 or amd64.  Fix style bugs near this printing.

amd64 also has space in the dbregs struct for the reserved %dr[8-15]
and already didn't print the dummy values for these, and never accessed
any of the 10 reserved debug registers.

Remove cpufuncs for making the invalid accesses.  Even amd64 had these.
2017-03-17 13:49:05 +00:00
grehan
071fd9390c Hide the AMD MONITORX/MWAITX capability.
Otherwise, recent Linux guests will use these instructions, resulting
in #UD exceptions since bhyve doesn't implement MONITOR/MWAIT exits.

This fixes boot-time hangs in recent Linux guests on Ryzen CPUs
(and probably Bulldozer aka AMD FX as well).

Reviewed by:	kib
MFC after:	1 week
2017-03-16 03:21:42 +00:00
manu
044a23ec3c Remove i915drm and radeondrm from NOTES and conf.
This unbreak LINT kernel.

Reported by:	lwhsu
2017-03-12 00:52:16 +00:00
dchagin
d7b4f21065 Reduce code duplication between MD Linux code by moving SYSV IPC 64-bit
related struct definitions out into the MI path.

Invert the native ipc structs to the Linux ipc structs convesion logic.
Since 64-bit variant of ipc structs has more precision convert native ipc
structs to the 64-bit Linux ipc structs and then truncate 64-bit values
into the non 64-bit if needed. Unlike Linux, return EOVERFLOW if the
values do not fit.

Fix SYSV IPC for 64-bit Linuxulator which never sets IPC_64 bit.

MFC after:	1 month
2017-03-07 17:07:16 +00:00
mmokhi
d2f0c04c1f Regenerated Linuxulator syscall tables for r314782
Approved by:	dchagin
MFC after:	1 month
2017-03-06 18:20:37 +00:00
mmokhi
4c583e7b8e Add UNIMPLEMENTED() placeholder macro for
the syscalls that are not implemented in Linux kernel itself.
Cleanup DUMMY() macros.

Reviewed by:	dchagin, trasz
Approved by:	dchagin
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D9804
2017-03-06 18:11:38 +00:00
imp
7e6cabd06e Renumber copyright clause 4
Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.

Submitted by:	Jan Schaumann <jschauma@stevens.edu>
Pull Request:	https://github.com/freebsd/freebsd/pull/96
2017-02-28 23:42:47 +00:00
kib
06450fe6c4 Initialize pcb_save for thread0.
Otherwise kernel traps on NULL dereference if fpu_kern(9) is used from the
thread0 context.

Reported by:	cem
Reviewed by:	cem, jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-02-28 22:54:52 +00:00
glebius
745bcd6fba Remove SVR4 (System V Release 4) binary compatibility support.
UNIX System V Release 4 is operating system released in 1988. It ceased
to exist in early 2000-s.
2017-02-28 05:14:42 +00:00
dchagin
948e5bcce0 Regen for r314312 (Linux epoll_pwait).
MFC after:	1 month
2017-02-26 19:59:28 +00:00
dchagin
8a318fc47b Change Linux epoll_pwait syscall definition to match Linux actual one.
MFC after:	1 month
2017-02-26 19:57:18 +00:00
alc
c062fef7f7 Refine the fix from r312954. Specifically, add a new PDE-only flag,
PG_PROMOTED, that indicates whether lingering 4KB page mappings might
need to be flushed on a PDE change that restricts or destroys a 2MB
page mapping.  This flag allows the pmap to avoid range invalidations
that are both unnecessary and costly.

Reviewed by:	kib, markj
MFC after:	6 weeks
Differential Revision:	https://reviews.freebsd.org/D9665
2017-02-26 19:54:02 +00:00
dchagin
396e17522f Implement timerfd family syscalls.
MFC after:	1 month
2017-02-26 09:48:18 +00:00
dchagin
d9e839a5db Regen after r314291 (timerfd definition).
MFC after:	1 month
2017-02-26 09:37:25 +00:00
dchagin
d7d0ab04b1 Change Linuxulator timerfd syscalls definition to match actual Linux one.
MFC after:	1 month
2017-02-26 09:35:44 +00:00
trasz
aaee258648 Fix linux_fstatfs() to return proper value for f_frsize. Without it,
linux df(1) binary from Xenial shows garbage.

Reviewed by:	dchagin
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D9692
2017-02-25 20:32:37 +00:00
mmokhi
bf114a4c4a Add linux_preadv() and linux_pwritev() syscalls to Linuxulator.
Reviewed by:	dchagin
Approved by:	dchagin, trasz (src committers)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D9722
2017-02-24 20:04:02 +00:00
dchagin
8f19e32110 Revert r314217. Commit is not match that I have approved. 2017-02-24 19:47:27 +00:00
mmokhi
0de5ce3724 Add linux_preadv() and linux_pwritev() syscalls to Linuxulator.
Reviewed by:	dchagin
Approved by:	dchagin, trasz (src committers)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D9722
2017-02-24 19:22:17 +00:00
pfg
077418d939 sys: Replace zero with NULL for pointers.
Found with:	devel/coccinelle
MFC after:	3 weeks
Differential Revision:	https://reviews.freebsd.org/D9694
2017-02-22 02:35:59 +00:00
markj
4f689e04e8 ddb show pte: use pmap of kdb_thread
show pte from the pmap of the process of the current DDB thread, instead
of necessarily the PCPU pmap.

Submitted by:	Ryan Libby <rlibby@gmail.com>
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D9645
2017-02-21 21:06:12 +00:00
trasz
1bc1c21a5b Reimplement linux_arch_prctl() as a wrapper around sysarch(2).
This also adds support for LINUX_ARCH_SET_GS.

Reviewed by:	dchagin
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D9372
2017-02-20 16:13:40 +00:00
alc
881c4340ef In pmap_enter(), set the PG_MANAGED flag on the new PTE in one place,
rather two places, and do so before the pmap lock is acquired.

Submitted by:	Yufeng Zhou <yz70@rice.edu>
Reviewed by:	kib
MFC after:	1 week
2017-02-19 18:00:57 +00:00
dchagin
ca1a1a9d60 Implement rt_tgsigqueueinfo system call used by glibc for pthread_sigqueue(3).
MFC after:	2 week
2017-02-19 07:38:11 +00:00
kib
b22278d483 Microoptimize amd64/pmap.c pmap_protect_pde().
For the loop that dirties vm_pages in case superpage was written to,
check the complete condition before the loop.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-02-19 03:33:20 +00:00
jah
85257d4d56 Bring back r313037, with fixes for mips:
Implement get_pcpu() for amd64/sparc64/mips/powerpc, and use it to
replace pcpu_find(curcpu) in MI code.

Reviewed by:	andreast, kan, lidl
Tested by:	lidl(mips, sparc64), andreast(powerpc)
Differential Revision:	https://reviews.freebsd.org/D9587
2017-02-19 02:03:09 +00:00
kib
5c280e36bd Merge i386 and amd64 mtrr drivers.
Reviewed by:	royger, jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D9648
2017-02-17 21:08:32 +00:00
royger
bf0e949003 x86: fix MTRR initialization if EARLY_AP_STARTUP is used
MTRR handlers are set in {amd64/i686}_mem_drvinit, which is called at
SI_SUB_DRIVERS, and that's too late when EARLY_AP_STARTUP is set because APs
have already started at this point. {amd64/i686}_mrinit is also called too late
for the BSP, since that happens when the memory device is attached, also after
APs have already started.

Move the position to SI_SUB_CPU, and also initialize the state for the BSP, so
that the APs can correctly get to the same state as the BSP.

Sponsored by:		Citrix Systems R&D
MFC after:		1 week
Reviewed by:		jhb, kib
Differential Revision:	https://reviews.freebsd.org/D9630
2017-02-17 12:47:51 +00:00
trasz
db2e4d79bb Implement linux version of ptrace(2). It's nowhere near complete,
but it allows to use 64 bit linux strace(1) on 64 bit linux binaries.

Reviewed by:	dchagin (earlier version)
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D9406
2017-02-16 13:32:15 +00:00
trasz
93717f4b2a Regen after r313769.
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
2017-02-15 14:25:50 +00:00
trasz
f11d19c42a Fix definition of linux64 ptrace syscall.
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
2017-02-15 14:12:39 +00:00
jhb
e80fc50712 Regenerate all the system call tables to drop "created from" lines.
One of the ibcs2 files contains some actual changes (new headers) as
it hasn't been regenerated after older changes to makesyscalls.sh.
2017-02-10 19:45:02 +00:00
erj
1edbd9d1a6 ixl(4): Update to 1.7.12-k
Refresh upstream driver before impending conversion to iflib.

Major new features:

- Support for Fortville-based 25G adapters
- Support for I2C reads/writes

(To prevent getting or sending corrupt data, you should set
dev.ixl.0.debug.disable_fw_link_management=1 when using I2C
[this will disable link!], then set it to 0 when done. The driver implements
the SIOCGI2C ioctl, so ifconfig -v works for reading I2C data,
but there are read_i2c and write_i2c sysctls under the .debug sysctl tree
[the latter being useful for upper page support in QSFP+]).

- Addition of an iWARP client interface (so the future iWARP driver for
  X722 devices can communicate with the base driver).
  - Compiling this option in is enabled by default, with "options IXL_IW" in
    GENERIC.

Differential Revision:	https://reviews.freebsd.org/D9227
Reviewed by:	sbruno
MFC after:	2 weeks
Sponsored by:	Intel Corporation
2017-02-10 01:04:11 +00:00
dchagin
ef67426db9 Regen after r313284.
MFC after:	2 week
2017-02-05 14:19:19 +00:00
dchagin
1ba976c2fa Update syscall.master to 4.10-rc6. Also fix comments, a typo,
and wrong numbering for a few unimplemented syscalls.

For 32-bit Linuxulator, socketcall() syscall was historically
the entry point for the sockets API. Starting in Linux 4.3, direct
syscalls are provided for the sockets API. Enable it.

The initial version of patch was provided by trasz@ and extended by me.

Submitted by:	trasz
MFC after:	2 week
Differential Revision:	https://reviews.freebsd.org/D9381
2017-02-05 14:17:09 +00:00
jah
f5659d40d3 Revert r313037
The switch to get_pcpu() in MI code seems to cause hangs on MIPS.
Back out until we can get a better idea of what's happening there.

Reported by:	kan, lidl
2017-02-04 06:24:49 +00:00
jah
fc31303beb Implement get_pcpu() for the remaining architectures and use it to
replace pcpu_find(curcpu) in MI code.
2017-02-01 03:32:49 +00:00
trasz
69861f6d1f Replace sys_ftruncate() with kern_ftruncate() in various compats.
Reviewed by:	kib@
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D9368
2017-01-30 11:50:54 +00:00
kib
82ecdbf179 Do not leave stale 4K TLB entries on pde (superpage) removal or
protection change.

On superpage promotion, x86 pmaps do not invalidate existing 4K
entries for the superpage range, because they are compatible with the
promoted 2/4M entry.  But the invalidation on superpage removal or
protection change only did single INVLPG with the base address of the
superpage.  This reliably flushed superpage TLB entry, and 4K entry
for the first page of the superpage, potentially leaving other 4K TLB
entries lingering.  Do the invalidation of the whole superpage range
to correct the problem.

Note that the precise invalidation is done by x86 code for kernel_pmap
only, for user pmaps whole (per-AS) TLB is flushed.  This made the bug
well hidden, because promotions of the kernel mappings require
specific load.

Reported and tested by:	Jonathan Looney <jtl@netflix.com> (previous version)
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-01-29 19:14:48 +00:00
bapt
bd0b52fc1f Revert crap accidentally committed 2017-01-28 16:31:23 +00:00
bapt
02ac05d572 Revert r312923 a better approach will be taken later 2017-01-28 16:30:14 +00:00
tijl
dab9980fd3 Apply r210555 to 64 bit linux support:
The interpreter name should no longer be treated as a buffer that can be
overwritten.

PR:		216346
MFC after:	3 days
2017-01-24 16:13:59 +00:00
kib
d52cc80daf Use SFENCE for ordering CLFLUSHOPT.
SDM states that CLFLUSHOPT instructions can be ordered with other
writes by SFENCE, heavier MFENCE is not required.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2017-01-20 19:08:44 +00:00
avg
fa73e5b5c7 vmm_dev: work around a bogus error with gcc 6.3.0
The error is:
vmm_dev.c: In function 'alloc_memseg':
vmm_dev.c:261:11: error: null argument where non-null required (argument 1) [-Werror=nonnull]

Apparently, the gcc is unable to figure out that if a ternary operator
produced a non-NULL value once, then the operator with exactly the same
operands would produce the same value again.

MFC after:	1 week
2017-01-20 13:21:27 +00:00
ed
be72efbdd4 Catch up with changes to structure member names.
Pointer/length pairs are now always named ${name} and ${name}_len.
2017-01-17 22:05:52 +00:00
cem
6d240aee91 Fix a variety of cosmetic typos and misspellings
No functional change.

PR:		216096, 216097, 216098, 216101, 216102, 216106, 216109, 216110
Reported by:	Bulat <bltsrc at mail.ru>
Sponsored by:	Dell EMC Isilon
2017-01-15 18:00:45 +00:00
markj
d8a11d2a36 Coalesce TLB shootdowns of global PTEs in pmap_advise() on x86.
We would previously invalidate such entries individually, resulting in more
IPIs than necessary.

Reviewed by:	alc, kib
MFC after:	3 weeks
Differential Revision:	https://reviews.freebsd.org/D9094
2017-01-10 21:52:48 +00:00
sbruno
efab05d612 Migrate e1000 to the IFLIB framework:
- em(4) igb(4) and lem(4)
- deprecate the igb device from kernel configurations
- create a symbolic link in /boot/kernel from if_em.ko to if_igb.ko

Devices tested:
- 82574L
- I218-LM
- 82546GB
- 82579LM
- I350
- I217

Please report problems to freebsd-net@freebsd.org

Partial review from jhb and suggestions on how to *not* brick folks who
originally would have lost their igbX device.

Submitted by:	mmacy@nextbsd.org
MFC after:	2 weeks
Relnotes:	yes
Sponsored by:	Limelight Networks and Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8299
2017-01-10 03:23:22 +00:00
mjg
50a477c2ee amd64: add atomic_fcmpset
Reviewed by:	kib, jhb
2017-01-03 21:00:24 +00:00
kib
149e8d3d06 Fix typo. Remove spurious blank line.
MFC after:	3 days
2016-12-18 09:32:23 +00:00
jhb
203d8b88f7 Enable EARLY_AP_STARTUP on amd64 and i386 kernels by default.
PR:		199321, 203682
MFC after:	2 months
Sponsored by:	Netflix
2016-12-16 21:10:37 +00:00
kib
64b6cb0328 Provide non-final but valid PCB pointer for thread0 for duration of
hammer_time().  This makes assembler exception handlers not fault
itself when setting PCB flags, and allow normal kernel trap handler to
get control.  The pointer is reset after FPU parameters are obtained.

Set thread0.td_critnest to 1 for duration of hammer_time() as well.
In particular, page faults at that early stage panic immediately
instead of trying to call not yet operational VM to resolve it.

As result, faults during second half of the hammer_time() execution
have a chance to be reported instead of silent machine reboot or hang.

Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-12-14 11:40:31 +00:00
def
f63c437216 Add support for encrypted kernel crash dumps.
Changes include modifications in kernel crash dump routines, dumpon(8) and
savecore(8). A new tool called decryptcore(8) was added.

A new DIOCSKERNELDUMP I/O control was added to send a kernel crash dump
configuration in the diocskerneldump_arg structure to the kernel.
The old DIOCSKERNELDUMP I/O control was renamed to DIOCSKERNELDUMP_FREEBSD11 for
backward ABI compatibility.

dumpon(8) generates an one-time random symmetric key and encrypts it using
an RSA public key in capability mode. Currently only AES-256-CBC is supported
but EKCD was designed to implement support for other algorithms in the future.
The public key is chosen using the -k flag. The dumpon rc(8) script can do this
automatically during startup using the dumppubkey rc.conf(5) variable.  Once the
keys are calculated dumpon sends them to the kernel via DIOCSKERNELDUMP I/O
control.

When the kernel receives the DIOCSKERNELDUMP I/O control it generates a random
IV and sets up the key schedule for the specified algorithm. Each time the
kernel tries to write a crash dump to the dump device, the IV is replaced by
a SHA-256 hash of the previous value. This is intended to make a possible
differential cryptanalysis harder since it is possible to write multiple crash
dumps without reboot by repeating the following commands:
# sysctl debug.kdb.enter=1
db> call doadump(0)
db> continue
# savecore

A kernel dump key consists of an algorithm identifier, an IV and an encrypted
symmetric key. The kernel dump key size is included in a kernel dump header.
The size is an unsigned 32-bit integer and it is aligned to a block size.
The header structure has 512 bytes to match the block size so it was required to
make a panic string 4 bytes shorter to add a new field to the header structure.
If the kernel dump key size in the header is nonzero it is assumed that the
kernel dump key is placed after the first header on the dump device and the core
dump is encrypted.

Separate functions were implemented to write the kernel dump header and the
kernel dump key as they need to be unencrypted. The dump_write function encrypts
data if the kernel was compiled with the EKCD option. Encrypted kernel textdumps
are not supported due to the way they are constructed which makes it impossible
to use the CBC mode for encryption. It should be also noted that textdumps don't
contain sensitive data by design as a user decides what information should be
dumped.

savecore(8) writes the kernel dump key to a key.# file if its size in the header
is nonzero. # is the number of the current core dump.

decryptcore(8) decrypts the core dump using a private RSA key and the kernel
dump key. This is performed by a child process in capability mode.
If the decryption was not successful the parent process removes a partially
decrypted core dump.

Description on how to encrypt crash dumps was added to the decryptcore(8),
dumpon(8), rc.conf(5) and savecore(8) manual pages.

EKCD was tested on amd64 using bhyve and i386, mipsel and sparc64 using QEMU.
The feature still has to be tested on arm and arm64 as it wasn't possible to run
FreeBSD due to the problems with QEMU emulation and lack of hardware.

Designed by:	def, pjd
Reviewed by:	cem, oshogbo, pjd
Partial review:	delphij, emaste, jhb, kib
Approved by:	pjd (mentor)
Differential Revision:	https://reviews.freebsd.org/D4712
2016-12-10 16:20:39 +00:00
imp
3459557545 Permit loading of efirt module even when there's no EFI to call. The
module loading is successful, but attempts to use it will not be
successful. This is similar to what we do (did?) with ACPI on non-ACPI
systems. We succeed if we can't find the necessary information to hook
into EFI, but still fail if we're unable to allocate resources if we
do find EFI.

Not Objected to by: kib@
MFC Afer: 3 days
2016-12-09 23:37:11 +00:00
markj
8bb19c4929 Add a COMPAT_FREEBSD11 kernel option.
Use it wherever COMPAT_FREEBSD10 is currently specified.

Reviewed by:	glebius, imp, jhb
Differential Revision:	https://reviews.freebsd.org/D8736
2016-12-09 18:54:12 +00:00
glebius
f8eae77f98 Treat R_X86_64_PLT32 relocs as R_X86_64_PC32.
If we load a binary that is designed to be a library, it produces
relocatable code via assembler directives in the assembly itself
(rather than compiler options).  This emits R_X86_64_PLT32 relocations,
which are not handled by the kernel linker.

Submitted by:	gallatin
Reviewed by:	kib
2016-12-09 18:07:28 +00:00
alc
7571ef95c1 Previously, vm_radix_remove() would panic if the radix trie didn't
contain a vm_page_t at the specified index.  However, with this
change, vm_radix_remove() no longer panics.  Instead, it returns NULL
if there is no vm_page_t at the specified index.  Otherwise, it
returns the vm_page_t.  The motivation for this change is that it
simplifies the use of radix tries in the amd64, arm64, and i386 pmap
implementations.  Instead of performing a lookup before every remove,
the pmap can simply perform the remove.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D8708
2016-12-08 04:29:29 +00:00
jhb
f987575fac Report page faults due to reserved bits in PTEs as a separate fault type.
Rather than reporting a page fault due to a bad PTE as a protection
violation with the "rsv" flag, treat these faults as a separate type of
fault altogether.

MFC after:	1 month
2016-11-19 01:34:12 +00:00
bdrewery
30f99dbeef Fix improper use of "its".
Sponsored by:	Dell EMC Isilon
2016-11-08 23:59:41 +00:00
cem
7ae132fee1 Add BUF_TRACKING and FULL_BUF_TRACKING buffer debugging
Upstream the BUF_TRACKING and FULL_BUF_TRACKING buffer debugging code.
This can be handy in tracking down what code touched hung bios and bufs
last. The full history is especially useful, but adds enough bloat that
it shouldn't be enabled in release builds.

Function names (or arbitrary string constants) are tracked in a
fixed-size ring in bufs. Bios gain a pointer to the upper buf for
tracking. SCSI CCBs gain a pointer to the upper bio for tracking.

Reviewed by:	markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8366
2016-10-31 23:09:52 +00:00
jhb
1018a82d1b Move declarations of invpcid_works and pmap_pcid_enabled to pmap.h.
Previously these were only declared under #ifdef SMP in <machine/smp.h>.
However, these variables are defind in pmap.c unconditionally, and efirt.c
references them unconditionally.  This fixes non-SMP kernel builds.

Discussed with:	kib
MFC after:	1 week
2016-10-31 18:37:05 +00:00
avg
9127df4438 fix a syntax error in r308039 ...
that I somehow introduced between testing the change
iand committing it.

MFC after:	1 week
X-MFC with:	r307903
2016-10-28 15:57:55 +00:00
avg
6310254212 vmm: another take at maximmum address passed to contigmalloc
Just using vm_paddr_t value with all bits set.
That should work as long as the type is unsigned.

While there, fix a couple of whitespace issues nearby.

MFC after:	1 week
X-MFC with:	r307903
2016-10-28 14:38:01 +00:00
jhb
95a3814f21 MFamd64: Add bounds checks on addresses used with /dev/mem.
Reject attempts to read from or memory map offsets in /dev/mem that are
beyond the maximum-supported physical address of the current CPU.

Reviewed by:	kib
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D7408
2016-10-27 21:23:14 +00:00
glebius
856adf7415 The argument validation in r296956 was not enough to close all possible
overflows in sysarch(2).

Submitted by:	Kun Yang <kun.yang chaitin.com>
Patch by:	kib
Security:	SA-16:15
2016-10-25 17:13:46 +00:00
avg
a2af253a41 fix up r307903, use correct max address definition
MFC after:	1 week
X-MFC with:	r307903
2016-10-25 10:59:21 +00:00
avg
7d3b940604 vmm/svm: iopm_bitmap and msr_bitmap must be contiguous in physical memory
To achieve that the whole svm_softc is allocated with contigmalloc now.
It would be more effient to de-embed those arrays and allocate only them
with contigmalloc.

Previously, if malloc(9) used non-contiguous pages for the arrays, then
random bits in physical pages next to the first page would be used to
determine permissions for I/O port and MSR accesses.  That could result
in a guest dangerously modifying the host hardware configuration.

One example is that sometimes NMI watchdog driver in a Linux guest
would be able to configure a performance counter on a host system.
The counter would generate an interrupt and if hwpmc(4) driver is loaded
on the host, then the interrupt would be delivered as an NMI.

Discussed with:	jhb
Reviewed by:	grehan
MFC after:	2 weeks
Differential Revision: https://reviews.freebsd.org/D8321
2016-10-25 10:34:14 +00:00
kib
45100446da Follow-up to r307866:
- Make !KDB config buildable.
- Simplify interface to nmi_handle_intr() by evaluating panic_on_nmi
  in one place, namely nmi_call_kdb().  This allows to remove do_panic
  argument from the functions, and to remove i386/amd64 duplication of
  the variable and sysctl definitions.  Note that now NMI causes
  panic(9) instead of trap_fatal() reporting and then panic(9),
  consistently for NMIs delivered while CPU operated in ring 0 and 3.

Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-10-24 20:47:46 +00:00
kib
a04db702cd Handle broadcast NMIs.
On several Intel chipsets, diagnostic NMIs sent from BMC or NMIs
reporting hardware errors are broadcasted to all CPUs.

When kernel is configured to enter kdb on NMI, the outcome is
problematic, because each CPU tries to enter kdb.  All CPUs are
executing NMI handlers, which set the latches disabling the nested NMI
delivery; this means that stop_cpus_hard(), used by kdb_enter() to
stop other cpus by broadcasting IPI_STOP_HARD NMI, cannot work.  One
indication of this is the harmless but annoying diagnostic "timeout
stopping cpus".

Much more harming behaviour is that because all CPUs try to enter kdb,
and if ddb is used as debugger, all CPUs issue prompt on console and
race for the input, not to mention the simultaneous use of the ddb
shared state.

Try to fix this by introducing a pseudo-lock for simultaneous attempts
to handle NMIs.  If one core happens to enter NMI trap handler, other
cores see it and simulate reception of the IPI_STOP_HARD.  More,
generic_stop_cpus() avoids sending IPI_STOP_HARD and avoids waiting
for the acknowledgement, relying on the nmi handler on other cores
suspending and then restarting the CPU.

Since it is impossible to detect at runtime whether some stray NMI is
broadcast or unicast, add a knob for administrator (really developer)
to configure debugging NMI handling mode.

The updated patch was debugged with the help from Andrey Gapon (avg)
and discussed with him.

Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D8249
2016-10-24 16:40:27 +00:00
jkim
229f578eb8 Implement BPF_MOD and BPF_XOR instructions.
These two ALU instructions first appeared on Linux.  Then, libpcap adopted
and made them available since 1.6.2.  Now more platforms including NetBSD
have them in kernel.  So do we.
 --이 줄 이하는 자동으로 제거됩니다--
> Description of fields to fill in above:                     76 columns --|
> PR:                       If and which Problem Report is related.
> Submitted by:             If someone else sent in the change.
> Reported by:              If someone else reported the issue.
> Reviewed by:              If someone else reviewed your modification.
> Approved by:              If you needed approval for this commit.
> Obtained from:            If the change is from a third party.
> MFC after:                N [day[s]|week[s]|month[s]].  Request a reminder email.
> MFH:                      Ports tree branch name.  Request approval for merge.
> Relnotes:                 Set to 'yes' for mention in release notes.
> Security:                 Vulnerability reference (one per line) or description.
> Sponsored by:             If the change was sponsored by an organization.
> Differential Revision:    https://reviews.freebsd.org/D### (*full* phabric URL needed).
> Empty fields above will be automatically removed.

M    share/man/man4/bpf.4
M    sys/amd64/amd64/bpf_jit_machdep.c
M    sys/amd64/amd64/bpf_jit_machdep.h
M    sys/i386/i386/bpf_jit_machdep.c
M    sys/i386/i386/bpf_jit_machdep.h
M    sys/net/bpf_filter.c
2016-10-21 06:55:07 +00:00
jkim
42d876c52e Redude code for conditional jumps. 2016-10-21 06:09:30 +00:00
jkim
b35ec131f8 Fix compiler warnings for user land. 2016-10-21 06:06:54 +00:00
stevek
8cc74c4a97 Add sysctl to make amd64 minidump retry count tunable at runtime.
PR:		213462
Submitted by:	RaviPrakash Darbha <rdarbha@juniper.net>
Reviewed by:	cemi, markj
Approved by:	sjg (mentor)
Obtained from:	Juniper Networks
Differential Revision:	https://reviews.freebsd.org/D8254
2016-10-17 22:57:41 +00:00
kib
eb0078e23d Do not try to create /dev/efi device node before devfs is initialized.
Split efirt.ko initialization into early stage where runtime services
KPI environment is created, to be used e.g. for RTC, and the later
devfs node creation stage, per module.

Switch the efi device to use make_dev_s(9) instead of make_dev(9).  At
least, this gracefully handles the duplicated device name issue.

Remove ARGSUSED comment from efidev_ioctl(), all unused arguments are
annotated with __unused attribute.

Reported by:	ambrisko, O. Hartmann <ohartman@zedat.fu-berlin.de>
Reviewed by:	imp
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-10-16 06:07:43 +00:00
jhb
f689fd5a63 Drop support for using mmap() with /dev/kmem.
Using the device pager with /dev/kmem is not stable since KVA mappings
are transient, but the device pager caches the PA associated with a
given offset forever.  Interestingly, mips' implementation of
memmap() already refused requests for /dev/kmem.

Note that kvm_read/kvm_write do not use mmap, but use read and write on
/dev/kmem, so this should not affect libkvm users.

Reviewed by:	kib
MFC after:	2 months
2016-10-14 20:01:07 +00:00
jtl
62030781cd In the TCP stack, the hhook(9) framework provides hooks for kernel modules
to add actions that run when a TCP frame is sent or received on a TCP
session in the ESTABLISHED state. In the base tree, this functionality is
only used for the h_ertt module, which is used by the cc_cdg, cc_chd, cc_hd,
and cc_vegas congestion control modules.

Presently, we incur overhead to check for hooks each time a TCP frame is
sent or received on an ESTABLISHED TCP session.

This change adds a new compile-time option (TCP_HHOOK) to determine whether
to include the hhook(9) framework for TCP. To retain backwards
compatibility, I added the TCP_HHOOK option to every configuration file that
already defined "options INET". (Therefore, this patch introduces no
functional change. In order to see a functional difference, you need to
compile a custom kernel without the TCP_HHOOK option.) This change will
allow users to easily exclude this functionality from their kernel, should
they wish to do so.

Note that any users who use a custom kernel configuration and use one of the
congestion control modules listed above will need to add the TCP_HHOOK
option to their kernel configuration.

Reviewed by:	rrs, lstewart, hiren (previous version), sjg (makefiles only)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D8185
2016-10-12 02:16:42 +00:00
imp
41d4ddad0f Create /dev/efidev to provide an ioctl interface to
userland.  It supports userland interfaces to UEFI Runtime Services. This is
indended to the the MI portion of EFI RuntimeServices support.

Differential Revision: https://reviews.freebsd.org/D8128
Reviewed by: kib@, wblock@, Ganael Laplanche
2016-10-11 22:24:30 +00:00
kib
559623d89a Re-apply r306516 (by cem):
Reduce the cost of TLB invalidation on x86 by using per-CPU completion flags

Reduce contention during TLB invalidation operations by using a per-CPU
completion flag, rather than a single atomically-updated variable.

On a Westmere system (2 sockets x 4 cores x 1 threads), dtrace measurements
show that smp_tlb_shootdown is about 50% faster with this patch; observations
with VTune show that the percentage of time spent in invlrng_single_page on an
interrupt (actually doing invalidation, rather than synchronization) increases
from 31% with the old mechanism to 71% with the new one.  (Running a basic file
server workload.)

Submitted by:	Anton Rang <rang at acm.org>
Reviewed by:	cem (earlier version)
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8041
2016-10-04 17:01:24 +00:00
cem
de42bf751c Revert r306516 for now, it is incomplete on i386
Noted by:	kib
2016-09-30 18:58:50 +00:00
cem
22e3a710d0 Reduce the cost of TLB invalidation on x86 by using per-CPU completion flags
Reduce contention during TLB invalidation operations by using a per-CPU
completion flag, rather than a single atomically-updated variable.

On a Westmere system (2 sockets x 4 cores x 1 threads), dtrace measurements
show that smp_tlb_shootdown is about 50% faster with this patch; observations
with VTune show that the percentage of time spent in invlrng_single_page on an
interrupt (actually doing invalidation, rather than synchronization) increases
from 31% with the old mechanism to 71% with the new one.  (Running a basic file
server workload.)

Submitted by:	Anton Rang <rang at acm.org>
Reviewed by:	cem (earlier version), kib
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8041
2016-09-30 18:12:16 +00:00
hselasky
5e41da7ccd Move the ConnectX-3 and ConnectX-2 driver from sys/ofed into sys/dev/mlx4
like other PCI network drivers. The sys/ofed directory is now mainly
reserved for generic infiniband code, with exception of the mthca driver.

- Add new manual page, mlx4en(4), describing how to configure and load
mlx4en.

- All relevant driver C-files are now prefixed mlx4, mlx4_en and
mlx4_ib respectivly to avoid object filename collisions when compiling
the kernel. This also fixes an issue with proper dependency file
generation for the C-files in question.

- Device mlxen is now device mlx4en and depends on device mlx4, see
mlx4en(4). Only the network device name remains unchanged.

- The mlx4 and mlx4en modules are now built by default on i386 and
amd64 targets. Only building the mlx4ib module depends on
WITH_OFED=YES .

Sponsored by:	Mellanox Technologies
2016-09-30 08:23:06 +00:00
kib
7dd86df41b Handle TLB shootdown IPI during the EFI runtime calls, on SandyBridge
and IvyBridge machines, which support PCID but do not have INVPCID
instruction.

MFC after:	1 week
2016-09-26 17:25:25 +00:00
kib
3deaeb8d22 For machines which support PCID but not have INVPCID instruction,
i.e. SandyBridge and IvyBridge, correct a race between pmap_activate()
and invltlb_pcid_handler().

Reported by and tested by:	Slawa Olhovchenkov <slw@zxy.spb.ru>
MFC after:	1 week
2016-09-26 17:22:44 +00:00
bde
94a237c17c Fix vm86 initialization, part 3 of 2 and a half. (Actually, just fix
early printfs and debugging of vm86 initialization and some other early
initialization in some cases.)  Add an option debug.late_console (with
default 1=off) to move console and kdb initialization back where it was.
Do the same for amd64 although there is no vm86 there.

On my test system, debug.late_console=0 works for the syscons, sio and
uart console drivers on amd64 and i386, and for vt on i386 but not on
amd64.

The early printfs fixed by debug.late_console=0 are:
- on i386, the message about lost memory above 4G
- with -v in otherwise normal use, about 20 printfs for SMAP
- other debugging messages for memory sizing.  Mostly under -v and
  not printed in normal use.

Document in a comment how much earlier the initialization and early
printf()s can be.  That is very early for the console.  Not much more
than curthread is needed.  kdb use obviously needs to be not so early,
since it needs IDT initialization and that is done relatively late
for convenience and historical reasons.
2016-09-25 14:56:24 +00:00
imp
5720aec884 Change the efi_get_table interface to a void ** so we can return the
pointer by dereferencing the pointer.

Reviewed by: kib@
MFC After: 2 weeks
Sponsored by: Netflix, Inc
2016-09-22 19:04:51 +00:00
markj
f2a1dd4de8 Regenerate syscall provider argument strings. 2016-09-22 04:50:03 +00:00
kib
df422cbea3 Add kernel interfaces to call EFI Runtime Services.
Runtime services require special execution environment for the call.
Besides that, OS must inform firmware about runtime virtual memory map
which will be active during the calls, with the SetVirtualAddressMap()
runtime call, done while the 1:1 mapping is still used.  There are two
complication: the SetVirtualAddressMap() effectively must be done from
loader, which needs to know kernel address map in advance.  More,
despite not explicitely mentioned in the specification, both 1:1 and
the map passed to SetVirtualAddressMap() must be active during the
SetVirtualAddressMap() call.  Second, there are buggy BIOSes which
require both mappings active during runtime calls as well, most likely
because they fail to identify all relocations to perform.

On amd64, we can get rid of both problems by providing 1:1 mapping for
the duration of runtime calls, by temprorary remapping user addresses.
As result, we avoid the need for loader to know about future kernel
address map, and avoid bugs in BIOSes.  Typically BIOS only maps
something in low 4G.  If not runtime bugs, we would take advantage of
the DMAP, as previous versions of this patch did.

Similar but more complicated trick can be used even for i386 and 32bit
runtime, if and when the EFI boot on i386 is supported.  We would need
a trampoline page, since potentially whole 4G of VA would be switched
on calls, instead of only userspace portion on amd64.

Context switches are disabled for the duration of the call, FPU access
is granted, and interrupts are not disabled.  The later is possible
because kernel is mapped during calls.

To test, the sysctl mib debug.efi_time is provided, setting it to 1
makes one call to EFI get_time() runtime service, on success the efitm
structure is printed to the control terminal.  Load efirt.ko, or add
EFIRT option to the kernel config, to enable code.

Discussed with:	emaste, imp
Tested by:	emaste (mac, qemu)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-09-21 11:31:58 +00:00
kib
eaff6cf773 Rename efi_systbl to efi_systbl_phys, the variable contains the
physical address of the EFI System Table.  Add _KERNEL guard around
its declaration in sys/efi.h.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-09-21 10:55:28 +00:00
kib
0b0178b3a6 Add a way for the architecture to specify the calling ABI for methods
in the EFI Runtime Services Table.  On amd64, the calling conventions
are MS.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-09-21 10:35:44 +00:00
kib
ff88ae0aef Add amd64 functions to load/store GDT register, store IDT and TR registers.
Note that lgdt() name is already used for function which, besides
loading GDT, also reloads segment descriptors cache, thus new function
is named bare_lgdt().

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-09-21 10:10:36 +00:00
kib
5aff5acf9c Export the pmap_cache_bits() and pmap_pinit_pml4() functions from the
amd64 pmap.

The new pmap_pinit_pml4() function initializes the level 4 page table
with entries for the kernel mappings.  Both functions are needed for
upcoming EFI Runtime Services support.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-09-21 10:05:51 +00:00
kib
e9e29e2686 MFC r305939:
Remove trailing space.
2016-09-21 08:14:55 +00:00
kib
3c5296b91a Move pmap_p*e_index() inline functions from pmap.c to pmap.h.
They are already used in minidump code.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-09-20 09:38:07 +00:00
emaste
a8b2a6847f Catch up to sys/capability.h rename to sys/capsicum.h in r263232
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-09-19 18:44:43 +00:00
kib
90ee5f51f2 Consolidate four efi_next_descriptor() definitions.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-09-18 17:38:02 +00:00
kib
3fd792803a Remove trailing space.
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2016-09-18 17:33:49 +00:00
bde
b8aaa2c367 Fix decoding of tf_rsp on amd64, and move TF_HAS_STACKREGS() to the
i386-only section, and fix a comment about the amd64 kernel trapframe
not having stackregs.

tf_rsp doesn't need decoding on amd64, but had an old clone of i386
code to do this in 1 place, and since the amd64 kernel trapframe does
have stackregs, the result was an off-by-16 error for %rsp in an error
message.
2016-09-16 07:09:35 +00:00
bde
d6a5db2944 (1) Ifdef the new dr6 variable for KDB.
While here, avoid using the old variable 'code' and remove it
in trap().  ('code' was meant for holding things like %dr6,
but is too small to hold %dr6 on amd64 and was reduced to an
obfuscation of tf_err, with early truncation on amd64.)

Submitted by:	Michael Butler (imb@...)
2016-09-16 04:58:37 +00:00
bde
634d4e4a33 Decode some REX prefixes in inst_call(). This makes the 'next' and
'until' commands work in more cases.
2016-09-15 18:30:53 +00:00
bde
bf8d177543 Abort single stepping in ddb if the trap is not for single-stepping.
This is not very easy to do, since ddb didn't know when traps are
for single-stepping.  It more or less assumed that traps are either
breakpoints or single-step, but even for x86 this became inadequate
with the release of the i386 in ~1986, and FreeBSD passes it other
trap types for NMIs and panics.

On x86, teach ddb when a trap is for single stepping using the %dr6
register.  Unknown traps are now treated almost the same as breakpoints
instead of as the same as single-steps.  Previously, the classification
of breakpoints was almost correct and everything else was unknown so
had to be treated as a single-step.  Now the classification of single-
steps is precise, the classification of breakpoints is almost correct
(as before) and everything else is unknown and treated like a
breakpoint.

This fixes:
- breakpoints not set by ddb, including the main one in kdb_enter(),
  were treated as single-steps and not stopped on when stepping
  (except for the usual, simple case of a step with residual count 1).
  As special cases, kdb_enter() didn't stop for fatal traps or panics
- similarly for "hardware breakpoints".

Use a new MD macro IS_SSTEP_TRAP(type, code) to code to classify
single-steps.  This is excessively complicated for bug-for-bug and
backwards compatibilty.  Design errors apparently started in Mach
in ~1990 or perhaps in the FreeBSD interface in ~1993.  Common trap
types like single steps should have a unique MI code (like the TRAP*
codes for user SIGTRAP) so that debuggers don't need macros like
IS_SSTEP_TRAP() to decode them.  But 'type' is actually an ambiguous
MD trap number, and code was always 0 (now it is (int)%dr6 on x86).
So it was impossible to determine the trap type from the args.
Global variables had to be used.

There is already a classification macro db_pc_is_single_step(), but
this just gets in the way.  It is only used to recover from bugs in
IS_BREAKPOINT_TRAP().  On some arches, IS_BREAKPOINT_TRAP() just
duplicates the ambiguity in 'type' and misclassifies single-steps as
breakpoints.  It defaults to 'false', which is the opposite of what is
needed for bug-for-bug compatibility.

When this is cleaned up, MI classification bits should be passed in
'code'.  This could be done now for positive-logic bits, since 'code'
was always 0, but some negative logic is needed for compatibility so
a simple MI classificition is not usable yet.

After reading %dr6, clear the single-step bit in it so that the type
of the next debugger trap can be decoded.  This is a little
ddb-specific.  ddb doesn't understand the need to clear this bit and
doing it before calling kdb is easiest.  gdb would need to reverse
this to support hardware breakpoints, but it just doesn't support
them now since gdbstub doesn't support %dr*.

Fix a bug involving %dr6: when emulating a single-step trap for vm86,
set the bit for it in %dr6.  Userland debuggers need this.  ddb now
needs this for vm86 bios calls.  The bit gets copied to 'code' then
cleared again.

Fix related style bugs:
- when clearing bits for hardware breakpoints in %dr6, spell the mask
  as ~0xf on both amd64 and i386 to get the correct number of bits
  using sign extension and not need a comment about using the wrong
  mask on amd64 (amd64 traps for invalid results but clearing the
  reserved top bits didn't trap since they are 0).
- rewrite my old wrong comments about using %dr6 for ddb watchpoints.
2016-09-15 17:24:23 +00:00
jhb
bc4a384597 Remove 'cpu' and 'cpu_class' on amd64.
The 'cpu' and 'cpu_class' variables were always set to the same value
on amd64 and are legacy holdovers from i386.  Remove them entirely on
amd64.

Reviewed by:	imp, kib (older version)
Differential Revision:	https://reviews.freebsd.org/D7888
2016-09-15 17:05:54 +00:00
bz
de4b915bfb Try to fix LINT builds after r305807. Seems to be a simple s&r error
I missed while reading through the 1st time as well.
2016-09-14 16:08:23 +00:00
bde
d58cd5baa4 Use the MI macro TRAPF_USERMODE() instead of open-coded checks for
SEL_UPL and sometimes PSL_VM.  This is just a style change on amd64,
but on i386 it fixes 1 unimportant place where the PSL_VM check was
missing and starts fixing 1 important place where the PSL_VM check
had a logic error.

Fix logic errors in treating vm86 bioscall mode as kernel mode.  The
main place checked all the necessary flags, but put the necessary
parentheses for the PSL_VM and PCB_VM86CALL checks in the wrong
place.  The broken case is only reached if a vm86 bioscall uses a
%cs which is nonzero mod 4, but that is unusual -- most bios calls
start with %cs = 0xc000 or 0xf000 and rarely change it.  Another
place was missing the check for PCB_VM86CALL, but was only reachable
if there are bugs virtualizing PSL_I.

Add a macro TF_HAS_STACKREGS() and use this instead of converting
open-coded checks of SEL_UPL, etc. to TRAPF_USERMODE() when we only
care about whether the frame has stack registers.  This fixes 3
places in my recent fix for register variables in vm86 mode where I
messed up the PSL_VM check and cleans up other places.
2016-09-14 12:57:40 +00:00
kib
cb80ddd115 Add FPU_KERN_NOCTX flag to the fpu_kern_enter() function on amd64.
The flag specifies that the block which uses FPU must be executed in
critical section, i.e. take no context switches, and does not need an
FPU save area during the execution.

It is intended to be applied around fast and short code pathes where
save area allocation is impossible or undesirable, due to context or
due to the relative cost of calculation vs. allocation.

Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-09-11 09:14:07 +00:00
alc
44f29780e8 Various changes to pmap_ts_referenced()
Move PMAP_TS_REFERENCED_MAX out of the various pmap implementations and
into vm/pmap.h, and describe what its purpose is.  Eliminate the archaic
"XXX" comment about its value.  I don't believe that its exact value, e.g.,
5 versus 6, matters.

Update the arm64 and riscv pmap implementations of pmap_ts_referenced()
to opportunistically update the page's dirty field.

On amd64, use the PDE value already cached in a local variable rather than
dereferencing a pointer again and again.

Reviewed by:	kib, markj
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D7836
2016-09-10 16:49:25 +00:00
jhb
7380cda6f5 MFC 303713: Correct assertion on vcpuid argument to vm_gpa_hold().
PR:		208168
2016-09-09 20:30:36 +00:00
jhb
a40f879158 MFC 304637: Fix build for !SMP kernels after the Xen MSIX workaround.
Move msix_disable_migration under #ifdef SMP since it doesn't make sense
for !SMP kernels.

PR:		212014
2016-09-09 19:57:32 +00:00
avg
e62f31d754 work around AMD erratum 793 for family 16h, models 00h-0Fh 2016-09-07 14:24:29 +00:00
jhb
b83d0562bd Reset PCI pass through devices via PCI-e FLR during VM start and end.
Add routines to trigger a function level reset (FLR) of a PCI-express
device via the PCI-express device control register.  This also includes
support routines to wait for pending transactions to complete as well
as calculating the maximum completion timeout permitted by a device.

Change the ppt(4) driver to reset pass through devices before attaching
to a VM during startup and before detaching from a VM during shutdown.

Reviewed by:	imp, wblock (earlier version)
MFC after:	1 month
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D7751
2016-09-06 21:15:35 +00:00
jhb
9b7bf59c96 Update the I/O MMU in bhyve when PCI devices are added and removed.
When the I/O MMU is active in bhyve, all PCI devices need valid entries
in the DMAR context tables. The I/O MMU code does a single enumeration
of the available PCI devices during initialization to add all existing
devices to a domain representing the host. The ppt(4) driver then moves
pass through devices in and out of domains for virtual machines as needed.
However, when new PCI devices were added at runtime either via SR-IOV or
HotPlug, the I/O MMU tables were not updated.

This change adds a new set of EVENTHANDLERS that are invoked when PCI
devices are added and deleted. The I/O MMU driver in bhyve installs
handlers for these events which it uses to add and remove devices to
the "host" domain.

Reviewed by:	imp
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D7667
2016-09-06 20:17:54 +00:00
jhb
31bd6f147b Remove remnants of PERFMON and I586_PMC_GUPROF from amd64.
These options were never fully ported over from i386.
2016-09-06 19:25:32 +00:00
jhb
5f56f30076 Leave ppt devices in the host domain when they are not attached to a VM.
This allows a pass through device to be reset to a normal device driver
on the host and reused on the host.  ppt devices are now always active in
some I/O MMU domain when the I/O MMU is active, either the host domain
or the domain of a VM they are attached to.

Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D7666
2016-09-06 18:53:17 +00:00
markj
fb5804c98d Remove support for idle page zeroing.
Idle page zeroing has been disabled by default on all architectures since
r170816 and has some bugs that make it seemingly unusable. Specifically,
the idle-priority pagezero thread exacerbates contention for the free page
lock, and yields the CPU without releasing it in non-preemptive kernels. The
pagezero thread also does not behave correctly when superpage reservations
are enabled: its target is a function of v_free_count, which includes
reserved-but-free pages, but it is only able to zero pages belonging to the
physical memory allocator.

Reviewed by:	alc, imp, kib
Differential Revision:	https://reviews.freebsd.org/D7714
2016-09-03 20:38:13 +00:00
alc
24a2d27767 As an optimization to the machine-independent layer, change the machine-
dependent pmap_ts_referenced() so that it updates the page's dirty field
if a modified bit is found while counting reference bits.  This
opportunistic update can be performed at low cost and can eliminate the
need for some future calls to pmap_is_modified() by the machine-
independent layer.

Reviewed by:	kib, markj
MFC after:	3 weeks
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D7722
2016-09-01 15:57:44 +00:00
bde
aa589d0481 Shorten banal comments about zeroing and copying pages. Don't give
implementation details that last echoed the code 15-20 years ago.
But add a detail about pagezero() on i386.  Switch from Mach style
to BSD style.
2016-08-29 14:38:31 +00:00
bde
eb85aca715 On amd64, declare sse2_pagezero() and start using it again, but only
for zeroing pages in idle where nontemporal writes are clearly best.
This is almost a no-op since zeroing in idle works does nothing good
and is off by default.  Fix END() statement forgotten in previous
commit.

Align the loop in sse2_pagezero().  Since it writes to main memory,
the loop doesn't have to be very carefully written to keep up.
Unrolling it was considered useless or harmful and was not done on
i386, but that was too careless.

Timing for i386: the loop was not unrolled at all, and moved only 4
bytes/iteration.  So on a 2GHz CPU, it needed to run at 2 cycles/
iteration to keep up with a memory speed of just 4GB/sec.  But when
it crossed a 16-byte boundary, on old CPUs it ran at 3 cycles/
iteration so it gave a maximum speed of 2.67GB/sec and couldn't even
keep up with PC3200 memory.  Fix the alignment so that it keep up with
4GB/sec memory, and unroll once to get nearer to 8GB/sec.  Further
unrolling might be useless or harmful since it would prevent the loop
fitting in 16-bytes.  My test system with an old CPU and old DDR1 only
needed 5+ GB/sec.  My test system with a new CPU and DDR3 doesn't need
any changes to keep up ~16GB/sec.

Timing for amd64: with 8-byte accesses and newer faster CPUs it is
easy to reach 16GB/sec but not so easy to go much faster.  The
alignment doesn't matter much if the CPU is not very old.  The loop
was already unrolled 4 times, but needs 32 bytes and uses a fancy
method that doesn't work for 2-way unrolling in 16 bytes.  Just
align it to 32-bytes.
2016-08-29 13:07:21 +00:00
bde
61cf0d838a Restore the nontemporal pagezero() under the name sse2_pagezero() (the
same name as for i386).  It is not reconnected yet.

Which method is better is too machine-dependent and system-dependent
to replace the old method unconditionally.
2016-08-29 06:07:43 +00:00
jhb
731da97ac4 Enable I/O MMU when PCI pass through is first used.
Rather than enabling the I/O MMU when the vmm module is loaded,
defer initialization until the first attempt to pass a PCI device
through to a guest.  If the I/O MMU fails to initialize or is not
present, than fail the attempt to pass a PCI device through to a
guest.

The hw.vmm.force_iommu tunable has been removed since the I/O MMU is
no longer enabled during boot.  However, the I/O MMU support can be
disabled by setting the hw.vmm.iommu.enable tunable to 0 to prevent
use of the I/O MMU on any systems where it is buggy.

Reviewed by:	grehan
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D7448
2016-08-26 20:15:22 +00:00
ed
d81be03d3f Make execution of 32-bit CloudABI executables work on amd64.
A nice thing about requiring a vDSO is that it makes it incredibly easy
to provide full support for running 32-bit processes on 64-bit systems.
Instead of letting the kernel be responsible for composing/decomposing
64-bit arguments across multiple registers/stack slots, all of this can
now be done in the vDSO. This means that there is no need to provide
duplicate copies of certain system calls, like the sys_lseek() and
freebsd32_lseek() we have for COMPAT_FREEBSD32.

This change imports a new vDSO from the CloudABI repository that has
automatically generated code in it that copies system call arguments
into a buffer, padding them to eight bytes and zero-extending any
pointers/size_t arguments. After returning from the kernel, it does the
inverse: extracting return values, in the process truncating
pointers/size_t values to 32 bits.

Obtained from:	https://github.com/NuxiNL/cloudabi
2016-08-24 10:51:33 +00:00
ed
c0aa6fd209 Convert pointers obtained from the threadattr_t structure with TO_PTR().
In all of these source files, the userspace pointer size corresponds
with the kernelspace pointer size, meaning that casting directly works.
As I'm planning on making 32-bit execution on 64-bit systems work as
well, use TO_PTR() here as well, so that the changes between source
files remain minimal.
2016-08-24 10:13:18 +00:00
jhb
4e659fa057 Fix build for !SMP kernels after the Xen MSIX workaround.
Move msix_disable_migration under #ifdef SMP since it doesn't make sense
for !SMP kernels.

PR:		212014
Reported by:	Glyn Grinstead <glyn@grinstead.org>
MFC after:	3 days
2016-08-22 21:23:17 +00:00
jhb
e24281ea43 Remove the si(4) driver and sicontrol(8) for Specialix serial cards.
The si(4) driver supported multiport serial adapters for ISA, EISA, and
PCI buses.  This driver does not use bus_space, instead it depends on
direct use of the pointer returned by rman_get_virtual().  It is also
still locked by Giant and calls for patch testing to convert it to use
bus_space were unanswered.

Relnotes:	yes
2016-08-19 21:14:27 +00:00
kib
e07c032881 MFC r303913:
Unconditionally perform checks that FPU region was entered, when #NM
exception is caught in kernel mode.
2016-08-17 07:09:22 +00:00
avg
5461062c89 MFC r302835: fix-up for configuration of AMD Family 10h processors
borrowed from Linux
2016-08-15 09:04:31 +00:00
kib
04bce34a47 The pmap_delayed_invl_wait() function blocks on turnstile, it does not
spin, in the committed version.  Remove stray '*' in the text.

Sponsored by:	The FreeBSD Foundation.
MFC after:	3 days
2016-08-11 12:37:11 +00:00
ed
cc2c089a3f Provide the CloudABI vDSO to its executables.
CloudABI executables already provide support for passing in vDSOs. This
functionality is used by the emulator for OS X to inject system call
handlers. On FreeBSD, we could use it to optimize calls to
gettimeofday(), etc.

Though I don't have any plans to optimize any system calls right now,
let's go ahead and already pass in a vDSO. This will allow us to
simplify the executables, as the traditional "syscall" shims can be
removed entirely. It also means that we gain more flexibility with
regards to adding and removing system calls.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D7438
2016-08-10 21:02:41 +00:00
kib
acae466016 Unconditionally perform checks that FPU region was entered, when #NM
exception is caught in kernel mode.  There are third-party modules
which trigger the issue, and since the problem causes usermode state
corruption at least, panic in production kernels as well.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-08-10 13:44:03 +00:00
jhb
fbbd0c6298 MFC 302181,302635: Disable MSI-X migration on older Xen hypervisors.
302181:
Add a tunable to disable migration of MSI-X interrupts.

The new 'machdep.disable_msix_migration' tunable can be set to 1 to
disable migration of MSI-X interrupts.

Xen versions prior to 4.6.0 do not properly handle updates to MSI-X
table entries after the initial write.  In particular, the operation
to unmask a table entry after updating it during migration is not
propagated to the "real" table for passthrough devices causing the
interrupt to remain masked.  At least some systems in EC2 are
affected by this bug when using SRIOV.  The tunable can be set in
loader.conf as a workaround.

302635:
xen: automatically disable MSI-X interrupt migration

If the hypervisor version is smaller than 4.6.0. Xen commits 74fd00 and
70a3cb are required on the hypervisor side for this to be fixed, and those
are only included in 4.6.0, so stay on the safe side and disable MSI-X
interrupt migration on anything older than 4.6.0.

It should not cause major performance degradation unless a lot of MSI-X
interrupts are allocated.
2016-08-05 17:13:25 +00:00
jhb
1c2702c041 Don't permit mappings of invalid physical addresses on amd64 via /dev/mem.
Discussed with:	kib
2016-08-04 17:55:23 +00:00
jhb
b1de97ff2b Correct assertion on vcpuid argument to vm_gpa_hold().
PR:		208168
Submitted by:	Dave Cameron <daverabbitz@ihug.co.nz>
Reviewed by:	grehan
MFC after:	1 month
2016-08-03 15:20:10 +00:00
kib
9a5f028012 Merge i386 and amd64 variants of mp_watchdog.c into x86/, there is no
difference between files.
For pc98, put x86/mp_x86.c into the same place as used by i386 file list.
Fix typo in comment.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-08-03 13:51:53 +00:00
mjg
b584a8e1ae amd64: implement pagezero using rep stos
The current implementation uses non-temporal writes. This turns out to
be detrimental to performance if the page is used shortly after, which
is the typical case with page faults.

Switch to rep stos.

Reviewed by:	kib
MFC after:	1 week
2016-07-31 11:34:08 +00:00
brooks
017f31c108 Don't create pointless backups of generated files in "make sysent".
Any sensible workflow will include a revision control system from which
to restore the old files if required.  In normal usage, developers just
have to clean up the mess.

Reviewed by:	jhb
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D7353
2016-07-28 21:29:04 +00:00
mav
fcb5c9368b Add more UEFI/e820 memory types from latest specifications.
This is only cosmetics.

MFC after:	2 weeks
2016-07-24 09:15:11 +00:00
dchagin
98960de4ab MFC r302517:
Fix a copy/paste bug introduced during X86_64 Linuxulator work.
FreeBSD support NX bit on X86_64 processors out of the box, for i386 emulation
use READ_IMPLIES_EXEC flag, introduced in r302515.

While here move common part of mmap() and mprotect() code to the files in compat/linux
to reduce code dupcliation between Linuxulator's.

MFC r302518, r302626:

Add linux_mmap.c to the appropriate conf/files.
2016-07-17 15:23:32 +00:00
dchagin
1886392a37 Regen for r302962 (Linux personality), record mergeinfo for r320516. 2016-07-17 15:11:23 +00:00
dchagin
8576a4ebaa MFC r302515:
Implement Linux personality() system call mainly due to READ_IMPLIES_EXEC flag.
In Linux if this flag is set, PROT_READ implies PROT_EXEC for mmap().
Linux/i386 set this flag automatically if the binary requires executable stack.

READ_IMPLIES_EXEC flag will be used in the next Linux mmap() commit.
2016-07-17 15:07:33 +00:00
mav
480936abd1 Increase number of I/O APIC pins from 24 to 32 to give PCI up to 16 IRQs.
Move HPET to the top of the supported 0-31 range.

Proposed by:	jhb@, grehan@
2016-07-14 14:35:25 +00:00
avg
00ea714475 remove a stray change from r302834
MFC after:	3 weeks
X-MFC with:	r302834
2016-07-14 11:13:26 +00:00
avg
555e531ac1 fix-up for configuration of AMD Family 10h processors borrowed from Linux
http://lxr.free-electrons.com/source/arch/x86/kernel/cpu/amd.c#L643
BIOS may configure Family 10h processors to convert WC+ cache type
to CD.  That can hurt performance of guest VMs using nested paging.

Reviewed by:	kib
MFC after:	3 weeks
Differential Revision: https://reviews.freebsd.org/D6059
2016-07-14 11:03:05 +00:00
badger
5908cb719e Add explicit detection of KVM hypervisor
Set vm_guest to a new enum value (VM_GUEST_KVM) when kvm is detected and use
vm_guest in conditionals testing for KVM.

Also, fix a conditional checking if we're running in a VM which caught only
the generic VM case, but not more specific VMs (KVM, VMWare, etc.).  (Spotted
by: vangyzen).

Differential revision:	https://reviews.freebsd.org/D7172
Sponsored by:	Dell Inc.
Approved by:	kib (mentor), vangyzen (mentor)
Reviewed by:	alc
MFC after:	4 weeks
2016-07-13 19:19:18 +00:00
royger
844ce8697a xen: automatically disable MSI-X interrupt migration
If the hypervisor version is smaller than 4.6.0. Xen commits 74fd00 and
70a3cb are required on the hypervisor side for this to be fixed, and those
are only included in 4.6.0, so stay on the safe side and disable MSI-X
interrupt migration on anything older than 4.6.0.

It should not cause major performance degradation unless a lot of MSI-X
interrupts are allocated.

Sponsored by:		Citrix Systems R&D
MFC after:		3 days
Reviewed by:		jhb
Differential revision:	https://reviews.freebsd.org/D7148
2016-07-12 08:43:09 +00:00
dchagin
c93d4a7bde Fix a copy/paste bug introduced during X86_64 Linuxulator work.
FreeBSD support NX bit on X86_64 processors out of the box, for i386 emulation
use READ_IMPLIES_EXEC flag, introduced in r302515.

While here move common part of mmap() and mprotect() code to the files in compat/linux
to reduce code dupcliation between Linuxulator's.

Reported by:    Johannes Jost Meixner, Shawn Webb

MFC after:	1 week
XMFC with:	r302515, r302516
2016-07-10 08:22:04 +00:00
dchagin
7acd3da18d Regen for r302215 (Linux personality). 2016-07-10 08:17:16 +00:00
dchagin
50efd461d3 Implement Linux personality() system call mainly due to READ_IMPLIES_EXEC flag.
In Linux if this flag is set, PROT_READ implies PROT_EXEC for mmap().
Linux/i386 set this flag automatically if the binary requires executable stack.

READ_IMPLIES_EXEC flag will be used in the next Linux mmap() commit.
2016-07-10 08:15:50 +00:00
ed
887bfdc0a4 Don't forget to set sa->narg for CloudABI system calls.
It turns out that this value is not used within the system call code
under normal conditions, except when using tracing tools like ktrace.
If we forget to set this value, it is set to random garbage. This may
cause ktrace to hang indefinitely, making it impossible to kill.

Reported by: Michael Plass
PR: 210800
MFC before: 11.0-RELEASE
2016-07-08 20:09:21 +00:00
nwhitehorn
89d01c24d1 Replace a number of conflations of mp_ncpus and mp_maxid with either
mp_maxid or CPU_FOREACH() as appropriate. This fixes a number of places in
the kernel that assumed CPU IDs are dense in [0, mp_ncpus) and would try,
for example, to run tasks on CPUs that did not exist or to allocate too
few buffers on systems with sparse CPU IDs in which there are holes in the
range and mp_maxid > mp_ncpus. Such circumstances generally occur on
systems with SMT, but on which SMT is disabled. This patch restores system
operation at least on POWER8 systems configured in this way.

There are a number of other places in the kernel with potential problems
in these situations, but where sparse CPU IDs are not currently known
to occur, mostly in the ARM machine-dependent code. These will be fixed
in a follow-up commit after the stable/11 branch.

PR:		kern/210106
Reviewed by:	jhb
Approved by:	re (glebius)
2016-07-06 14:09:49 +00:00
sephe
ad3696213e MFC 301015
hyperv/vmbus: Rename ISR functions

    MFC after:  1 week
    Sponsored by:       Microsoft OSTC
    Differential Revision:      https://reviews.freebsd.org/D6601
2016-06-24 01:20:33 +00:00