Commit Graph

13439 Commits

Author SHA1 Message Date
Jeff Roberson
ab3185d15e Implement NUMA support in uma(9) and malloc(9). Allocations from specific
domains can be done by the _domain() API variants.  UMA also supports a
first-touch policy via the NUMA zone flag.

The slab layer is now segregated by VM domains and is precise.  It handles
iteration for round-robin directly.  The per-cpu cache layer remains
a mix of domains according to where memory is allocated and freed.  Well
behaved clients can achieve perfect locality with no performance penalty.

The direct domain allocation functions have to visit the slab layer and
so require per-zone locks which come at some expense.

Reviewed by:	Attilio (a slightly older version)
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
2018-01-12 23:25:05 +00:00
Ed Maste
d162bbc1d3 Temporarily disable VIMAGE on i386
An lld-linked i386 kernel hangs on boot, after em(4) starts.  This seems
similar to the issue that prompted us to disable VIMAGE on arm64 in
r326179.  Disable VIMAGE on i386 for now while we investigate.

PR:		225077
Sponsored by:	The FreeBSD Foundation
2018-01-11 19:08:43 +00:00
Konstantin Belousov
c751f90c0c Fix grammar.
Submitted by:	alc
MFC after:	3 days
2018-01-11 16:50:03 +00:00
Konstantin Belousov
6da5c56ae5 Remove redundand CLD instructions.
We already clear %RFLAGS.DF on the kernel entry due to the compiler's
ABI requirements.

Suggested by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2018-01-11 13:22:13 +00:00
Konstantin Belousov
3ee6e65875 Update comment explaining the check, to reality.
Discussed with:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2018-01-11 12:07:24 +00:00
Conrad Meyer
e6fcf7898d x86: Document purpose of _safe variants of {rd,wr}msr()
Sponsored by:	Dell EMC Isilon
2018-01-10 22:41:00 +00:00
Warner Losh
a333cdf369 inittodr(0) actually sets the time, so there's no need to call
tc_setclock(). It's redundant. Tweak UPDATING based on code review of
past releases.

Relnotes: yes (for the removal of pmtimer)
2018-01-10 17:25:08 +00:00
Warner Losh
41add9e267 Move prof_machdep.c to it's more traditional place under i386/i386. 2018-01-10 16:51:55 +00:00
Warner Losh
695d254365 Retire pmtimer driver. Move time fixing into apm driver. Move
Iwasaki-san's copyright over. Remove FIXME code that couldn't possibly
work. Call tc_settime() with our estimate of the delta we've been
alseep (the one we print) to adjust the time. Not sure what to do
about callouts, so keep the small #ifdef in place there.

Differential Revision: https://reviews.freebsd.org/D13823
2018-01-10 14:59:19 +00:00
Warner Losh
9c5a114886 Remove ccbque.h from i386/isa.
inline ccbque.h into scsi_low.h. The file isn't MD, so shouldn't live
in i386/isa. It's only used by scsi_low, so move it there so no new
clients accidentally grow. scsi_low may not even still work, and the
locking here is still SPL based. CAM should do the right thing, but
I've received no reports of these cards still working. At least it
compiles still and there's one fewer files in sys/i386/isa. While I'm
here, ansify and de-splize. CCB_MWANTED appears to be a clear-only
flag, but I've not changed that.

Differential Review: https://reviews.freebsd.org/D13672
2018-01-09 16:11:33 +00:00
Konstantin Belousov
84874cc151 Avoid re-check of usermode condition.
It does not change anything in the behavior of trap_pfault(), while
eliminating obfuscation of jumping to the code which checks for the
condition reversed of the goto cause.  Also avoid force initialize the
rv variable, since it is now only accessed after storing vm_fault()
return value.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D13725
2018-01-01 20:47:03 +00:00
Konstantin Belousov
abbfe9e5d1 Move i386/isa/elink.[hc] to dev/ep.
The ep(4) driver is the only consumer of the two functions from
elink.c.  I removed the standalone module as well, and most likely,
the module metadata is not needed anywhere, but this is for later
cleanup.

Discussed with:	imp, jhb
Sponsored by:	The FreeBSD Foundation
2017-12-30 11:42:49 +00:00
Konstantin Belousov
0cc997d237 Move i386/isa/npx.c to i386i386/npx.c.
The i386 FPU (AKA npx) code does not depend on ISA devices at all,
after the support for IRQ13 FPU exceptions was removed.  Put the file
into the expected place in the kernel source tree.

Discussed with:	jhb
Sponsored by:	The FreeBSD Foundation
2017-12-30 11:33:04 +00:00
Warner Losh
d14e21a831 Remove two stray references to wl driver, removed some time ago. 2017-12-30 07:59:32 +00:00
John Baldwin
9a1bf495b1 Remove a few more dangling references to ie(4). 2017-12-28 21:35:53 +00:00
Pedro F. Giffuni
aa0e6abb41 sys/i386/isa/elink*: sync with (older) NetBSD.
Make it easier to identify the point where we started diverging from
NetBSD. Newer versions of these files have been updated to a new license
so this information may become useful some day.

Obtained from:	NetBSD
2017-12-28 14:14:35 +00:00
Eitan Adler
caa7e52f3f kernel: Fix several typos and minor errors
- duplicate words
- typos
- references to old versions of FreeBSD

Reviewed by:	imp, benno
2017-12-27 03:23:21 +00:00
Konstantin Belousov
30d4f9e888 Add atomic_load(9) and atomic_store(9) operations.
They provide relaxed-ordered atomic access semantic.  Due to the
FreeBSD memory model, the operations are syntaxical wrappers around
the volatile accesses.  The volatile qualifier is used to ensure that
the access not optimized out and in turn depends on the volatile
semantic as implemented by supported compilers.

The motivation for adding the operation is to help people coming from
other systems or knowing the C11/C++ standards where atomics have
special type and require use of the special access operations.  It is
still the case that FreeBSD requires plain load and stores of aligned
integer types to be atomic.

Suggested by:	jhb
Reviewed by:	alc, jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D13534
2017-12-19 09:59:20 +00:00
Bruce Evans
d5d5600725 Also forgotten in the previous that removed the permanent double mapping
of low physical memory:

Update the comment about leaving the permanent mapping in place.  This
also improves the wording of the comment.  PTD 0 is still left alone
because it is fairly important that it was unmapped earlier, and the
comment now describes the unmapping of the other low PTDs that the code
actually does.

Reviewed by:	kib
2017-12-18 14:29:48 +00:00
Bruce Evans
2ba6fe0009 Remove the permanent double mapping of low physical memory and replace
it by a transient double mapping for the one instruction in ACPI wakeup
where it is needed (and for many surrounding instructions in ACPI resume).
Invalidate the TLB as soon as convenient after undoing the transient
mapping.  ACPI resume already has the strict ordering needed for this.

This fixes the non-trapping of null pointers and other garbage pointers
below NBPDR (except transiently).  NBPDR is quite large (4MB, or 2MB for
PAE).

This fixes spurious traps at the first instruction in VM86 bioscalls.
The traps are for transiently missing read permission in the first
VM86 page (physical page 0) which was just written to at KERNBASE in
the kernel.  The mechanism is unknown (it is not simply PG_G).

locore uses a similar but larger transient double mapping and needs
it for 2 instructions instead of 1.  Unmap the first PDE in it after
the 2 instructions to detect most garbage pointers while bootstrapping.
pmap_bootstrap() finishes the unmapping.

Remove the avoidance of the double mapping for a recently fixed special
case.  ACPI resume could use this avoidance (made non-special) to avoid
any problems with the transient double mapping, but no such problems
are known.

Update comments in locore.  Many were for old versions of FreeBSD which
tried to map low memory r/o except for special cases, or might have
allowed access to low memory via physical offsets.  Now all kernel
maps are r/w, and removal of of the double map disallows use of physical
offsets again.
2017-12-18 13:53:22 +00:00
Bruce Evans
4a5eb9ac99 Fix the undersupported option KERNLOAD, part 2: fix crashes in locore
when KERNLOAD is smaller than NBPDR (not the default) and PG_G is
enabled (the default if the CPU supports it).  This case has relatively
minor problems with coherency of the permanent double mapping, but the
fix in r167869 to improve coherency creates page tables with 3 different
errors so never worked.

The permanent double mapping is fundamentally broken and will be removed
soon.  It fundamentally breaks trapping for null pointers and requires
complications to avoid cache coherency bugs.  It is currently used for
only a single instruction in ACPI resume,

Many fixes VM86 and/or ACPI and/or the double map were attempted near
r1200000.  r167869 attempted to fix cache coherency bugs in an unusual
case, but the bugs were unreachable because older errors in page tables
caused a crash first.

This commit just makes r167869 work as intended.  Part 1 of these fixes
fixed the other errors, but also stopped mapping the PDE for KERNBASE
as a large page, so double mapping of this PDE only causes the same
problems as when KERNLOAD is the default.  Except for the problem of
trapping null pointers, r167869 could be used to fix these problems,
but it is inactive in usual cases.  The only known other problem is
that incoherent permissions for page 0 cause spurious traps in VM86
BIOS calls.

Reviewed by:	kib
2017-12-18 11:57:05 +00:00
Bruce Evans
3f21dc2985 Fix the undersupported option KERNLOAD, part 1: fix crashes in locore
when KERNLOAD is not a multiple of NBPDR (not the default) and PSE is
enabled (the default if the CPU supports it).  Addresses in PDEs must
be a multiple of NBPDR in the PSE case, but were not so in the crashing
case.

KERNLOAD defaults to NBPDR.  NBPDR is 4 MB for !PAE and 2 MB for PAE.
The default can be changed by editing i386/include/vmparam.h or using
makeoptions.  It can be changed to less than NBPDR to save real and
virtual memory at a small cost in time, or to more than NBPDR to waste
real and virtual memory.  It must be larger than 1 MB and a multiple of
PAGE_SIZE.  When it is less than NBPDR, it is necessarily not a multiple
of NBPDR.  This case has much larger bugs which will be fixed in part 2.

The fix is to only use PSE for physical addresses above <KERNLOAD
rounded _up_ to an NBPDR boundary>.  When the rounding is non-null,
this leaves part of the kernel not using large pages.  Rounding down
would avoid this pessimization, but would break setting of PAT bits
on i/o pages if it goes below 1MB.  Since rounding down always goes
below 1MB when KERNLOAD < NBPDR and the KERNLOAD > NBPDR case is not
useful, never round down.

Fix related style bugs (e.g., wrong literal values for NBPDR in comments).

Reviewed by:	kib
2017-12-18 09:32:56 +00:00
Bruce Evans
c93ea91c54 Minor cleanups found while fixing a bug involving double mapping of low
memory:

Load the kernel eflags less magically, as in locore.  The magic increased
when I removed eflags from the pcb in r305899.

Remove a jump to low memory that became garbage when the i386 version was
mostly replaced by the amd64 version in r235622.

The amd64 version is very similar.  It still loads the flags magically,
but is not missing comments about using the special page table.

Reviewed by:	kib
2017-12-15 03:05:14 +00:00
Mark Johnston
5bab623438 Pass the trap frame to fasttrap hooks.
The DTrace fasttrap entry points expect a struct reg containing the
register values of the calling thread. Perform the conversion in
fasttrap rather than in the trap handler: this reduces the number of
ifdefs and avoids wasting stack space for traps that don't involve
DTrace.

MFC after:	2 weeks
2017-12-11 19:21:39 +00:00
Conrad Meyer
3f6867ef63 i386: Bump KSTACK_PAGES default to match amd64
Logically, extend r286288 to cover all threads, by default.

The world has largely moved on from i386.  Most FreeBSD users and developers
test on amd64 hardware.  For better or worse, we have written a non-trivial
amount of kernel code that relies on stacks larger than 8 kB, and it "just
works" on amd64, so there has been little incentive to shrink it.

amd64 had its KSTACK_PAGES bumped to 4 back in Peter's initial AMD64 commit,
r114349, in 2003.  Since that time, i386 has limped along on a stack half
the size.  We've even observed the stack overflows years ago, but neglected
to fix the issue; see the 20121223 and 20150728 entries in UPDATING.

If anyone is concerned with this change, I suggest they configure their
AMD64 kernels with KSTACK_PAGES 2 and fix the fallout there first.  Eugene
has identified a list of high stack usage functions in the first PR below.

PR:		219476, 224218
Reported by:	eugen@, Shreesh Holla <hshreesh AT yahoo.com>
Relnotes:	maybe
Sponsored by:	Dell EMC Isilon
2017-12-11 04:32:37 +00:00
Bruce Evans
fb3cc1c37d Move instantiation of msgbufp from 9 MD files to subr_prf.c.
This variable should be pure MI except possibly for reading it in MD
dump routines.  Its initialization was pure MD in 4.4BSD, but FreeBSD
changed this in r36441 in 1998.  There were many imperfections in
r36441.  This commit fixes only a small one, to simplify fixing the
others 1 arch at a time.  (r47678 added support for
special/early/multiple message buffer initialization which I want in
a more general form, but this was too fragile to use because hacking
on the msgbufp global corrupted it, and was only used for 5 hours in
-current...)
2017-12-07 07:55:38 +00:00
Pedro F. Giffuni
64de3fdd58 SPDX: use the Beerware identifier. 2017-11-30 20:33:45 +00:00
Scott Long
c15269ccb8 It's time to retire AHC_REG_PRETTY_PRINT and AHD_REG_PRETTY_PRINT from
the standard kernels.  They are still available as custom compile
options.
2017-11-29 23:41:49 +00:00
Brooks Davis
5cd667e65f Disable vim syntax highlighting.
Vim's default pick doesn't understand that ';' is a comment character
and the result looks horrible.

Reviewed by:	emaste
2017-11-28 18:23:17 +00:00
Fedor Uporov
cd76ee1ee3 Remap ENOATTR to ENODATA in the linuxulator.
In the linux ENOADATA is frequently #defined as ENOATTR.
The change is required for an xattrs support implementation.

MFC after: 1 week
Discussed with: netchild
Approved by: pfg

Differential Revision: https://reviews.freebsd.org/D13221
2017-11-27 17:03:11 +00:00
Pedro F. Giffuni
83ef78be95 sys/i386: further adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
2017-11-27 15:08:52 +00:00
Ed Schouten
ee13ffbe03 Use TO_PTR() to convert integers to pointers.
For FreeBSD/arm64's cloudabi32 support, I'm going to need a TO_PTR() in
this place. Also use it for all of the other source files, so that the
difference remains as minimal as possible.

MFC after:	2 weeks
2017-11-26 14:45:56 +00:00
Hans Petter Selasky
8a53e1340f Merge ^/head r326132 through r326161. 2017-11-24 12:13:27 +00:00
Hans Petter Selasky
322f006ecc Implement atomic_fetchadd_64() for i386. This function is needed by the
atomic64 header file in the LinuxKPI for i386.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2017-11-24 12:10:42 +00:00
Ed Schouten
814629dd64 Don't let cpu_set_syscall_retval() clobber exec_setregs().
Upon successful completion, the execve() system call invokes
exec_setregs() to initialize the registers of the initial thread of the
newly executed process. What is weird is that when execve() returns, it
still goes through the normal system call return path, clobbering the
registers with the system call's return value (td->td_retval).

Though this doesn't seem to be problematic for x86 most of the times (as
the value of eax/rax doesn't matter upon startup), this can be pretty
frustrating for architectures where function argument and return
registers overlap (e.g., ARM). On these systems, exec_setregs() also
needs to initialize td_retval.

Even worse are architectures where cpu_set_syscall_retval() sets
registers to values not derived from td_retval. On these architectures,
there is no way cpu_set_syscall_retval() can set registers to the way it
wants them to be upon the start of execution.

To get rid of this madness, let sys_execve() return EJUSTRETURN. This
will cause cpu_set_syscall_retval() to leave registers intact. This
makes process execution easier to understand. It also eliminates the
difference between execution of the initial process and successive ones.
The initial call to sys_execve() is not performed through a system call
context.

Reviewed by:	kib, jhibbits
Differential Revision:	https://reviews.freebsd.org/D13180
2017-11-24 07:35:08 +00:00
Hans Petter Selasky
82725ba9bf Merge ^/head r325999 through r326131. 2017-11-23 14:28:14 +00:00
Konstantin Belousov
383f241dce Remove lint support from system headers and MD x86 headers.
Reviewed by:	dim, jhb
Discussed with:	imp
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D13156
2017-11-23 11:40:16 +00:00
Pedro F. Giffuni
51369649b0 sys: further adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
2017-11-20 19:43:44 +00:00
Hans Petter Selasky
937d37fc6c Merge ^/head r325842 through r325998. 2017-11-19 12:36:03 +00:00
Pedro F. Giffuni
df57947f08 spdx: initial adoption of licensing ID tags.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.

Initially, only tag files that use BSD 4-Clause "Original" license.

RelNotes:	yes
Differential Revision:	https://reviews.freebsd.org/D13133
2017-11-18 14:26:50 +00:00
Konstantin Belousov
4e421792ec Remove i386 XBOX support.
It is for console presented at 2001 and featuring Pentium III
processor.  Even if any of them are still alive and run FreeBSD, we do
not have any sign of life from their users.  While removing another
dozens of #ifdefs from the i386 sources reduces the aversion from
looking at the code and improves the platform vitality.

Reviewed by:	cem, pfg, rink (XBOX support author)
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D13016
2017-11-16 14:27:02 +00:00
Hans Petter Selasky
8dee9a7a44 Remove no longer supported mthca driver.
Sponsored by:	Mellanox Technologies
2017-11-13 10:59:38 +00:00
Konstantin Belousov
b1499f0e77 Remove useless DEBUG printfs in i386 sendsig() implementations.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-11-08 13:05:14 +00:00
Konstantin Belousov
5b9a3721e6 x86: Do not emit unused TD_TID symbols.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-11-04 10:51:52 +00:00
Konstantin Belousov
ec1d28b8df Eliminate unused load.
Based on github pull request:	#117
Submitted by:	Wuyang-Chung@github
MFC after:	1 week
2017-11-04 10:50:47 +00:00
Konstantin Belousov
aa788cc387 Consistently ensure that we do not load MXCSR with reserved bits set.
Some callers of fpusetregs()/npxsetregs(), most importantly
set_fpcontext(), clear reserved bits.  But some did not.  Do the
clearing in fpusetregs() and remove now redundand operation from
set_fpcontext().

Reported by:	Maxime Villard <max@m00nbsd.net>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-11-01 10:32:44 +00:00
Tijl Coosemans
f236378b54 Set the return address for stack entry points to zero.
Stack unwinders treat zero as a stop condition.  The value on the stack can
be non-zero because thread stacks may be arbitrary memory provided via
pthread_attr_setstack(3) or may be recycled from previous threads.

Reference:
https://lists.freebsd.org/pipermail/freebsd-current/2017-August/066855.html
https://lists.freebsd.org/pipermail/freebsd-current/2017-October/067254.html

Discussed with:	kib
MFC after:	1 week
2017-10-31 11:51:34 +00:00
Eitan Adler
a2aef24aa3 Update several more URLs
- Primarily http -> https
- Primarily FreeBSD project URLs
2017-10-29 08:17:03 +00:00
Mark Johnston
5fca1d90c1 Fix the VM_NRESERVLEVEL == 0 build.
Add VM_NRESERVLEVEL guards in the pmaps that implement transparent
superpage promotion using reservations.

Reviewed by:	alc, kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D12764
2017-10-23 15:34:05 +00:00
Bjoern A. Zeeb
8e94025b41 With r181803 on 2008-08-17 23:27:27Z the first VIMAGE commit went into
HEAD.  Enable VIMAGE in GENERIC kernels and some others (where GENERIC does
not exist) on HEAD.

Disable building LINT-VIMAGE with VIMAGE being default.

This should give it a lot more exposure in the run-up to 12 to help
us evaluate whether to keep it on by default or not.
We are also hoping to get better performance testing.
The feature can be disabled using nooptions.

Requested by:		many
Reviewed by:		kristof, emaste, hiren
X-MFC after:		never
Relnotes:		yes
Differential Revision:	https://reviews.freebsd.org/D12639
2017-10-20 21:40:59 +00:00
Mark Johnston
46fcd1af63 Move kernel dump offset tracking into MI code.
All of the kernel dump implementations keep track of the current offset
("dumplo") within the dump device. However, except for textdumps, they
all write the dump sequentially, so we can reduce code duplication by
having the MI code keep track of the current offset. The new
dump_append() API can be used to write at the current offset.

This is needed to implement support for kernel dump compression in the
MI kernel dump code.

Also simplify dump_encrypted_write() somewhat: use dump_write() instead
of duplicating its bounds checks, and get rid of the redundant offset
tracking.

Reviewed by:	cem
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D11722
2017-10-18 15:38:05 +00:00
Konstantin Belousov
00d37313f9 Change i386_get_ldt() to return 'EOF' when the requested range of
descriptors does not fit into currently allocated LDT, or trim the
return if the range fits partially.  Before, the function returned
EINVAL.

Fix two bugs in r324366: use capped num counter for malloc size, and
do not leak allocated buffer on EINVAL (by handling EINVAL case as
normal, see above).

Reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-09 16:19:26 +00:00
Konstantin Belousov
13a5d9112c Improvements to set_user_ldt().
Remove mtx_owned() checks from set_user_ldt().  Split the function
into _locked() version which requires the dt_lock spinlock owned, and
make set_user_ldt() a wrapper.  Add a comment in swtch.s noting that
the call to the new set_user_ldt() cannot recurse on dt_lock.

Remove #ifdef SMP block, the addend is always zero on UP.

Fix type of set_user_ldt_rv(), making it match the type used for
smb_rendezvous() callback, and remove the cast.  Use curproc.

Reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-09 16:07:27 +00:00
Konstantin Belousov
787f835bf0 Reset the fs and gs bases on exec(2).
The values from the old address space do not make sense for the new
program.  In particular, gsbase might be the TLS base for the old
program but the new program has no TLS now.

amd64 already handles this correctly.

Reported and reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-09 15:39:43 +00:00
Konstantin Belousov
4f7c931a11 More style.
Reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-09 15:24:18 +00:00
Konstantin Belousov
a4730cdb82 Improve i386_get_ldt().
Provide consistent snapshot of the requested descriptors by preventing
other threads from modifying LDT while we fetch the data, lock dt_lock
around the read.  Copy the data into intermediate buffer, which is
copied out after the lock is dropped.

Comparing with the amd64 version, the read is done byte by byte, since
there is no atomic 64bit read (cmpxchg8b method is too heavy comparing
with the avoided issues).

Improve overflow checking for the descriptors range calculations and
remove unneeded casts.  Use unsigned types for sizes.

Allow zero num argument to i386_get_ldt() and i386_set_ldt().  This
case is handled naturally by the code flow.

Reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-06 14:29:53 +00:00
Konstantin Belousov
57f99aede6 Remove unneeded cast.
Reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-06 10:17:50 +00:00
Konstantin Belousov
98c2674e93 Style.
Reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-06 10:16:57 +00:00
Konstantin Belousov
6cdf28b7ad Use ANSI C declarations.
Reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-05 19:11:25 +00:00
Konstantin Belousov
1f32d2a7ba Correct format specifiers in the debug code. Style.
Requested by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-05 18:58:28 +00:00
Konstantin Belousov
711d5183b1 Style.
Reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-05 18:42:13 +00:00
Konstantin Belousov
349216589d A different fix for the issue from r323722.
Split the handlers for pop of invalid selectors from the trap frame
into usermode and kernel variants.  Usermode handler is kept as is, it
restores the already loaded parts of the trap frame and jumps to set
up a signal delivery to the user process.

New kernel part of the handler emulates IRET treatment of the segments
which would violate access right.  It loads NUL selector in the
segment register which load causes the fault, and then continues the
return to interrupted kernel code.  Since invalid selectors in the
segment registers in the kernel mode can only exist while kernel still
enters or exits from userspace, we only zero invalid userspace
selectors.  If userspace tries to use the segment register, it gets a
signal, as if the processor segment descriptor cache was reloaded.

Reported by:	Maxime Villard <max@m00nbsd.net>
Suggested and reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-09-28 09:01:28 +00:00
Konstantin Belousov
053e8ce538 Restore a part of r323722.
Do not return from interrupt using the POP_FRAME;iret instruction
sequence, always jump to doreti.

The user segments selectors saved on the stack might become invalid
because userspace manipulated LDT in a parallel thread.  trap() is
aware of such issue, but it is only prepared to handle it at iret and
segment registers load operations in doreti path.

Also remove POP_FRAME macro because it is no longer used.

Reviewed by:	bde, jhb (as part of r323722)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-09-28 08:46:15 +00:00
Konstantin Belousov
d3c968bf84 Revert r323722. A better fix will be committed shortly, as well as
some still useful bits of the reverted revision.

The problem with the committed fix is that there are still issues with
returning from NMI, when NMI interrupted kernel in a moment where the
kernel segments selectors were still not loaded into registers.  If
this happens, the NMI return would loose the userspace selectors
because r323722 does not reload segment registers on return to kernel
mode.

Fixing the problem is complicated.  Since an alternative approach to
handle the original bug exists, it makes sence to stop adding more
complexity.

Discussed with:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-09-28 08:38:24 +00:00
Josh Paetzel
c77037f16f Fix indentation for r323068
PR:	220170
Reported by:	lidl
MFC after:	3 days
Pointyhat to:	jpaetzel
2017-09-19 20:40:05 +00:00
Konstantin Belousov
3cabd93e26 Do not do torn writes to active LDTs.
Care must be taken when updating the active LDT, since parallel
threads might try to load a segment descriptor which is currently
updated. Since the results are undefined, this cannot be ignored by
claiming to be an application race.

Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D12413
2017-09-19 17:57:04 +00:00
Konstantin Belousov
5efe338f3d Fix handling of the segment registers on i386.
Suppose that userspace is executing with the non-standard segment
descriptors.  Then, until exception or interrupt handler executed
SET_KERNEL_SEGS, kernel is still executing with user %ds, %es and %fs.
If an interrupt occurs in this window, the interrupt handler is
executed unsafely, relying on usability of the usermode registers.  If
the interrupt results in the context switch on return, the
contamination of the kernel state spreads to the thread we switched
to.  As result, kernel data accesses might fault or, if only the base
is changed, completely messed up.

More, if the user segment was allocated in LDT, another thread might
mark the descriptor as invalid before doreti code tried to reload
them.  In this case kernel panics.

The issue exists for all exception entry points which use trap gate,
and thus do not automatically disable interrupts on entry, and for
lcall_handler.

Fix is two-fold: first, we need to disable interrupts for all kernel
entries, changing the IDT descriptor types from trap gate to interrupt
gate.  Interrupts are re-enabled not earlier than the kernel segments
are loaded into the segment registers.  Second, we only load the
segment registers from the trap frame when returning to usermode.  For
the later, all interrupt return paths must happen through the doreti
common code.

There is no way to disable interrupts on call gate, which is the
supposed mode of servicing for lcall $7,$0 syscalls.  Change the LDT
descriptor 0 into a code segment type and point it to the userspace
trampoline which redirects the syscall to int $0x80.

All the measures make the segment register handling similar to that of
amd64.  We do not apply amd64 optimizations of not reloading segment
registers on return from the syscall.

Reported by:	Maxime Villard <max@m00nbsd.net>
Tested by:	pho (the non-lcall part)
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D12402
2017-09-18 20:22:42 +00:00
Josh Paetzel
9d0ec2a920 Revert r323087
This needs more thinking out and consensus, and the commit message
was wrong AND there was a typo in the commit.

pointyhat:	jpaetzel
2017-09-01 17:03:48 +00:00
Josh Paetzel
0be04b100c Take options IPSEC out of GENERIC
PR:	220170
Submitted by:	delphij
Reviewed by:	ae, glebius
MFC after:	2 weeks
Differential Revision:	D11806
2017-09-01 15:54:53 +00:00
Josh Paetzel
3b65550eec Allow kldload tcpmd5
PR:	220170
MFC after:	2 weeks
2017-08-31 20:16:28 +00:00
Alexander Motin
ed9652da5f Add NTB driver for PLX/Avago/Broadcom PCIe switches.
This driver supports both NTB-to-NTB and NTB-to-Root Port modes (though
the second with predictable complications on hot-plug and reboot events).
I tested it with PEX 8717 and PEX 8733 chips, but expect it should work
with many other compatible ones too.  It supports up to two NT bridges
per chip, each of which can have up to 2 64-bit or 4 32-bit memory windows,
6 or 12 scratchpad registers and 16 doorbells.  There are also 4 DMA engines
in those chips, but they are not yet supported.

While there, rename Intel NTB driver from generic ntb_hw(4) to more specific
ntb_hw_intel(4), so now it is on par with this new ntb_hw_plx(4) driver and
alike to Linux naming.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2017-08-30 21:16:32 +00:00
Conrad Meyer
2744a0b69b Drop CACHE_LINE_SIZE to 64 bytes on x86
The actual cache line size has always been 64 bytes.

The 128 number arose as an optimization for Core 2 era Intel processors.  By
default (configurable in BIOS), these CPUs would prefetch adjacent cache
lines unintelligently.  Newer CPUs prefetch more intelligently.

The latest Core 2 era CPU was introduced in September 2008 (Xeon 7400
series, "Dunnington").  If you are still using one of these CPUs, especially
in a multi-socket configuration, consider locating the "adjacent cache line
prefetch" option in BIOS and disabling it.

Reported by:	mjg
Reviewed by:	np
Discussed with:	jhb
Sponsored by:	Dell EMC Isilon
2017-08-28 22:28:41 +00:00
Konstantin Belousov
5e52cd4324 MFamd64 r322720, r322723:
Simplify i386 trap().

- Use more relevant name 'signo' instead of 'i' for the local variable
  which contains a signal number to send for the current exception.
- Eliminate two labels 'userout' and 'out' which point to the very end
  of the trap() function.  Instead use return directly.
- Re-indent the prot_fault_translation block by reducing if() nesting.
- Some more monor style changes.

Reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-08-26 18:12:25 +00:00
Konstantin Belousov
aefdf821cd Remove unused code.
The machdep.uprintf_signal sysctl replaced it in more convenient way,
not requiring recompilation to use and providing more information on
fault.

Reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-08-26 18:09:27 +00:00
Konstantin Belousov
325b406c49 MFamd64 r322718:
Use ANSI C declaration for trap_pfault().  Style.

Reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-08-26 18:06:29 +00:00
Konstantin Belousov
8ce5520255 MFamd64 r322719:
Trim excessive 'extern'.

Reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-08-26 18:04:29 +00:00
Konstantin Belousov
cef5e2bd91 Use the known valid segment when accessing memory in #UD handler.
Make sure that %eflags.D flag is cleared for hook.
Improve comments.

When #UD dtrace code checks for a registered hook before checking that
the exception was raised from kernel mode, we might run with the user
%ds, trapping on access.  Exception entry from userspace automatically
load valid %ss, which we can use there instead.

Noted and reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2017-08-19 21:00:02 +00:00
Konstantin Belousov
828304899d When checking that #UD comes from kernel mode, check that the
exception did not happen in vm86 mode.  A vm86 userland process could
have a %cs that matches GSEL_KPL, while dtrace cannot hook it.

Submitted by:	Maxime Villard <max@m00nbsd.net>
MFC after:	3 days
2017-08-18 17:11:15 +00:00
Mark Johnston
01938d3666 Rename mkdumpheader() and group EKCD functions in kern_shutdown.c.
This helps simplify the code in kern_shutdown.c and reduces the number
of globally visible functions.

No functional change intended.

Reviewed by:	cem, def
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D11603
2017-08-18 04:04:09 +00:00
Mark Johnston
50ef60dabe Factor out duplicated kernel dump code into dump_{start,finish}().
dump_start() and dump_finish() are responsible for writing kernel dump
headers, optionally writing the key when encryption is enabled, and
initializing the initial offset into the dump device.

Also remove the unused dump_pad(), and make some functions static now that
they're only called from kern_shutdown.c.

No functional change intended.

Reviewed by:	cem, def
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D11584
2017-08-18 03:52:35 +00:00
Conrad Meyer
dc6a82801d x86: Add dynamic interrupt rebalancing
Add an option to dynamically rebalance interrupts across cores
(hw.intrbalance); off by default.

The goal is to minimize preemption. By placing interrupt sources on distinct
CPUs, ithreads get preferentially scheduled on distinct CPUs.  Overall
preemption is reduced and latency is reduced. In our workflow it reduced
"fighting" between two high-frequency interrupt sources.  Reduced latency
was proven by, e.g., SPEC2008.

Submitted by:	jeff@ (earlier version)
Reviewed by:	kib@
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D10435
2017-08-16 18:48:53 +00:00
Jung-uk Kim
b5669d0aa8 Split identify_cpu() into two functions for amd64 as we do for i386. This
reduces diff between amd64 and i386.  Also, it fixes a regression introduced
in r322076, i.e., identify_hypervisor() failed to identify some hypervisors.
This function assumes cpu_feature2 is already initialized.

Reported by:	dexuan
Tested by:	dexuan
2017-08-09 18:09:09 +00:00
Jung-uk Kim
0105034487 Detect hypervisors early. We used to set lower hz on hypervisors by default
but it was broken since r273800 (and r278522, its MFC to stable/10) because
identify_cpu() is called too late, i.e., after init_param1().

MFC after:	3 days
2017-08-05 06:56:46 +00:00
Conrad Meyer
6f240e18b5 x86: Tag some intrinsics with __pure2
Some C wrappers for x86 instructions do not touch global memory and only act
on their arguments; they can be marked __pure2, aka __const__.  Without this
annotation, Clang 3.9.1 is not intelligent enough on its own to grok that
these functions are __const__.

Submitted by:	Anton Rang <anton.rang AT isilon.com>
Sponsored by:	Dell EMC Isilon
2017-08-03 22:28:30 +00:00
Konstantin Belousov
6632a4330f Do not call trapsignal() after handling usermode fault or interrupt,
when a signal is not intended to be sent.

The variable holding the signal number to send is left uninitialized,
which sometimes triggers invalid signal checks.

For NMI, a return to usermode without ast processing is done.  On the
other hand, for spurious dtrace probe interrupt it is usermode which
triggered the interrupt, so handle it through userret() as any other
fault.

Reported by:	Nils Beyer <nbe@renzel.net>
PR:	221151
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-08-02 10:12:10 +00:00
Mark Johnston
2375aaa8e9 Batch updates to v_wire_count when freeing page table pages on x86.
The removed release stores are not needed since stores are totally
ordered on i386 and amd64.

Reviewed by:	alc, kib (previous revision)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D11790
2017-08-01 05:26:30 +00:00
Konstantin Belousov
9adf30b0c3 Remove unused symbols.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-07-30 21:52:22 +00:00
Dmitry Chagin
c151945c86 Avoid using [LINUX_]SHAREDPAGE constant directly in the vdso code.
This is needed for https://reviews.freebsd.org/D11780.

Reported by:	kib@
2017-07-30 21:24:20 +00:00
Konstantin Belousov
f2c18deb0e Fix handling of one more possible exception on return to usermode.
If %ss is loaded with a segment pointing to a non-present descriptor
by the IRETD instruction, a kernel-mode #SS exception is generated.
Resulting T_STKFLT trap must be checked against doreti_iret_fault
location and handled, otherwise userspace may panic the kernel.

Note that this is i386 variant of FreeBSD-SA-15:21.amd64, but unlike
amd64, there is no swapgs on i386 and the issue is arguably not
exploitable.

Reported by:	Maxime Villard <max@m00nbsd.net>
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-07-08 11:07:39 +00:00
Sean Bruno
6ef9566177 Garbage collect kernel option TWA_FLASH_FIRMWARE
Submitted by:	 kevin.bowling0kev009.com
Differential Revision:	https://reviews.freebsd.org/D11387
2017-07-03 19:33:50 +00:00
Dmitry Chagin
a0c59c7afd Add support for musl consumers to the Linuxulator.
PR:		213809
Submitted by:	Yonas Yanfa
Reported by:	Yonas Yanfa
MFC after:	1 week
Relnotes:	yes
2017-07-03 10:24:49 +00:00
Alan Cox
510cdf22a2 When "force" is specified to pmap_invalidate_cache_range(), the given
start address is not required to be page aligned.  However, the loop
within pmap_invalidate_cache_range() that performs the actual cache
line invalidations requires that the starting address be truncated to
a multiple of the cache line size.  This change corrects an error in
that truncation.

Submitted by:	Brett Gutstein <bgutstein@rice.edu>
Reviewed by:	kib
MFC after:	1 week
2017-07-01 16:42:09 +00:00
Jason A. Harmening
eb36b1d0bc Clean up MD pollution of bus_dma.h:
--Remove special-case handling of sparc64 bus_dmamap* functions.
  Replace with a more generic mechanism that allows MD busdma
  implementations to generate inline mapping functions by
  defining WANT_INLINE_DMAMAP in <machine/bus_dma.h>.  This
  is currently useful for sparc64, x86, and arm64, which all
  implement non-load dmamap operations as simple wrappers
  around map objects which may be bus- or device-specific.

--Remove NULL-checked bus_dmamap macros.  Implement the
  equivalent NULL checks in the inlined x86 implementation.
  For non-x86 platforms, these checks are a minor pessimization
  as those platforms do not currently allow NULL maps.  NULL
  maps were originally allowed on arm64, which appears to have
  been the motivation behind adding arm[64]-specific barriers
  to bus_dma.h, but that support was removed in r299463.

--Simplify the internal interface used by the bus_dmamap_load*
  variants and move it to bus_dma_internal.h

--Fix some drivers that directly include sys/bus_dma.h
  despite the recommendations of bus_dma(9)

Reviewed by:	kib (previous revision), marius
Differential Revision:	https://reviews.freebsd.org/D10729
2017-07-01 05:35:29 +00:00
Konstantin Belousov
f7df80f4ed Fix indent.
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2017-06-24 10:19:06 +00:00
Konstantin Belousov
746e20fdb1 Correct translations between abridged and full x87 tags.
Reported and tested by:	karnajit wangkhem <karnajitw@gmail.com>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-06-17 11:25:31 +00:00
Konstantin Belousov
2d88da2f06 Move struct syscall_args syscall arguments parameters container into
struct thread.

For all architectures, the syscall trap handlers have to allocate the
structure on the stack.  The structure takes 88 bytes on 64bit arches
which is not negligible.  Also, it cannot be easily found by other
code, which e.g. caused duplication of some members of the structure
to struct thread already.  The change removes td_dbg_sc_code and
td_dbg_sc_nargs which were directly copied from syscall_args.

The structure is put into the copied on fork part of the struct thread
to make the syscall arguments information correct in the child after
fork.

This move will also allow several more uses shortly.

Reviewed by:	jhb (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
X-Differential revision:	https://reviews.freebsd.org/D11080
2017-06-12 21:03:23 +00:00
Konstantin Belousov
43f41dd393 Make struct syscall_args visible to userspace compilation environment
from machine/proc.h, consistently on all architectures.

Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
X-Differential revision:	https://reviews.freebsd.org/D11080
2017-06-12 20:53:44 +00:00
John Baldwin
9d24f98ca8 Remove the BSD/OS 2.1 system call gate LDT entry.
An extra copy of the system call gate was added to the default LDT back
in 1996 (r18513 / r18514).  However, the ability to run BSD/OS 2.1
i386 binaries under FreeBSD's native ABI is most likely no longer
needed.

Discussed with:	kib
2017-05-23 22:34:18 +00:00
Ed Maste
3e85b721d6 Remove register keyword from sys/ and ANSIfy prototypes
A long long time ago the register keyword told the compiler to store
the corresponding variable in a CPU register, but it is not relevant
for any compiler used in the FreeBSD world today.

ANSIfy related prototypes while here.

Reviewed by:	cem, jhb
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D10193
2017-05-17 00:34:34 +00:00
Jung-uk Kim
af1973281e Use kmem_malloc() instead of malloc(9) for the native amd64 filter.
r316767 broke the BPF JIT compiler for amd64 because malloc()'d space is no
longer executable.

Discussed with:	kib, alc
2017-04-17 22:02:09 +00:00
Jung-uk Kim
e329e330d4 Move declarations for a machine-dependent function to the header file. 2017-04-17 21:51:26 +00:00
Jung-uk Kim
3a0e8b7373 Reduce diff with amd64 version. 2017-04-17 21:46:54 +00:00
Gleb Smirnoff
83c9dea1ba - Remove 'struct vmmeter' from 'struct pcpu', leaving only global vmmeter
in place.  To do per-cpu stats, convert all fields that previously were
  maintained in the vmmeters that sit in pcpus to counter(9).
- Since some vmmeter stats may be touched at very early stages of boot,
  before we have set up UMA and we can do counter_u64_alloc(), provide an
  early counter mechanism:
  o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter.
  o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter,
    so that at early stages of boot, before counters are allocated we already
    point to a counter that can be safely written to.
  o For sparc64 that required a whole dummy pcpu[MAXCPU] array.

Further related changes:
- Don't include vmmeter.h into pcpu.h.
- vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit,
  to match kernel representation.
- struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exclusion.

This is based on benno@'s 4-year old patch:
https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html

Reviewed by:	kib, gallatin, marius, lidl
Differential Revision:	https://reviews.freebsd.org/D10156
2017-04-17 17:34:47 +00:00
Gleb Smirnoff
75c4b0b5ac Remove unused assembly symbols pointing to vmmeter. 2017-04-17 17:18:07 +00:00
Patrick Kelsey
67d955aab4 Corrected misspelled versions of rendezvous.
The MFC will include a compat definition of smp_no_rendevous_barrier()
that calls smp_no_rendezvous_barrier().

Reviewed by:	gnn, kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D10313
2017-04-09 02:00:03 +00:00
Tai-hwa Liang
7ece126ed8 Trying to be more compatible with Linux if.h definitions:
- renaming l_ifreq::ifru_metric to l_ifreq::ifru_ivalue;
	- adding a definition for ifr_ifindex which points to l_ifreq::ifru_ivalue.

A quick search indicates that Linux already got the above changes since 2.1.14.

Reviewed by:	kib, marcel, dchagin
MFC after:	1 week
2017-04-08 14:41:39 +00:00
Mark Johnston
5788c2bde1 Adjust the constraint for "src" in atomic_(f)cmpset_8.
"r" is not sufficient to prevent the use of invalid byte-width registers
with at least gcc.

Reported and reviewed by:	bde
X-MFC-With:	r315718
2017-03-27 16:18:19 +00:00
Andriy Gapon
978f3da16f revert r315959 because it causes build problems
The change introduced a dependency between genassym.c and header files
generated from .m files, but that dependency is not specified in the
make files.

Also, the change could be not as useful as I thought it was.

Reported by:	dchagin, Manfred Antar <null@pozo.com>, and many others
2017-03-27 12:34:29 +00:00
Bruce Evans
f434f3515b Fix printing of negative offsets (typically from frame pointers) again.
I fixed this in 1997, but the fix was over-engineered and fragile and
was broken in 2003 if not before.  i386 parameters were copied to 8
other arches verbatim, mostly after they stopped working on i386, and
mostly without the large comment saying how the values were chosen on
i386.  powerpc has a non-verbatim copy which just changes the uncritical
parameter and seems to add a sign extension bug to it.

Just treat negative offsets as offsets if they are no more negative than
-db_offset_max (default -64K), and remove all the broken parameters.

-64K is not very negative, but it is enough for frame and stack pointer
offsets since kernel stacks are small.

The over-engineering was mainly to go more negative than -64K for the
negative offset format, without affecting printing for more than a
single address.

Addresses in the top 64K of a (full 32-bit or 64-bit) address space
are now printed less well, but there aren't many interesting ones.
For arches that have many interesting ones very near the top (e.g.,
68k has interrupt vectors there), there would be no good limit for
the negative offset format and -64K is a good as anything.
2017-03-26 18:46:35 +00:00
Andriy Gapon
a7b4c009e1 specific end of interrupt implementation for AMD Local APIC
The change is more intrusive than I would like because the feature
requires that a vector number is written to a special register.
Thus, now the vector number has to be provided to lapic_eoi().
It was readily available in the IO-APIC and MSI cases, but the IPI
handlers required more work.
Also, we now store the VMM IPI number in a global variable, so that it
is available to the justreturn handler for the same reason.

Reviewed by:	kib
MFC after:	6 weeks
Differential Revision: https://reviews.freebsd.org/D9880
2017-03-25 18:45:09 +00:00
Dmitry Chagin
6c2a934b79 Implement Linux mincore() system call.
This is necessary for the upcoming drm-next.

Suggested by:	hselasky@
MFC after:	1 month
2017-03-25 15:47:29 +00:00
Bruce Evans
4e501eb7cc Remove buggy adjustment of page tables in db_write_bytes().
Long ago, perhaps only on i386, kernel text was mapped read-only and
it was necessary to change the mapping to read-write to set breakpoints
in kernel text.  Other writes by ddb to kernel text were also allowed.
This write protection is harder to implement with 4MB pages, and was
lost even for 4K pages when 4MB pages were implemented.  So changing
the mapping became useless.  It was actually worse than useless since
it followed followed various null and otherwise garbage pointers to
not change random memory instead of the mapping.  (On i386s, the
pointers became good in pmap_bootstrap(), and on amd64 the pointers
became bad in pmap_bootstrap() if not before.)

Another bug broke detection of following of null pointers on i386,
except early in boot where not detecting this was a feature.  When
I fixed the bug, I accidentally broke the feature and soon got traps
in db_write_bytes().  Setting breakpoints early in ddb was broken.

kib pointed out that a clean way to do the adjustment would be to use
a special [sub]map giving a small window on the bytes to be written.

The trap handler didn't know how to fix up errors for pagefaults
accessing the map itself.  Such errors rarely need fixups, since most
traps for the map are for the first access which is a read.

Reviewed by:	kib
2017-03-24 17:34:55 +00:00
Gleb Smirnoff
061f01b16e Remove Solaris 2.6 syscalls selector.
Discussed with:	kib
2017-03-23 19:54:41 +00:00
Ed Schouten
ebfc28088b Stop providing the compat_3_brand.
As of r315860, the ELF image activator works fine for CloudABI without it.

Reviewed by:	kib
MFC after:	2 weeks
2017-03-23 14:12:21 +00:00
Konstantin Belousov
2274ab3d7b Update r315753 with the proper flag name.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-03-22 22:28:13 +00:00
Konstantin Belousov
1438fe3cf2 Add a flag BI_BRAND_ONLY_STATIC to specify that the brand only
matches static binaries.

Interpretation of the 'static' there is that the binary must not
specify an interpreter.  In particular, shared objects are matched by
the brand if BI_CAN_EXEC_DYN is also set.

This improves precision of the brand matching, which should eliminate
surprises due to brand ordering.

Revert r315701.

Discussed with and tested by:	ed (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-03-22 22:23:01 +00:00
Mark Johnston
3d6732549d Add support for 8- and 16-bit atomic_(f)cmpset to x86.
Reviewed by:	kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D10068
2017-03-22 17:29:04 +00:00
Ed Schouten
ae2373da91 Set the interpreter path to /nonexistent.
CloudABI executables are statically linked and don't have an
interpreter. Setting the interpreter path to NULL used to work
previously, but r314851 introduced code that checks the string
unconditionally. Running CloudABI executables now causes a null pointer
dereference.

Looking at the rest of imgact_elf.c, it seems various other codepaths
already leaned on the fact that the interpreter path is set. Let's just
go ahead and pick an obviously incorrect interpreter path to appease
imgact_elf.c.

MFC after:	1 week
2017-03-22 07:05:27 +00:00
Dmitry Chagin
b1ba0846f1 Implement getrandom() syscall.
Note. GRND_RANDOM option is not supported for now.

MFC after:	1 month
2017-03-18 18:34:29 +00:00
Dmitry Chagin
857129394d To reduce code duplication move socket defines to the MI path.
MFC after:	1 week
2017-03-18 18:23:30 +00:00
Bruce Evans
ff17a6773e Don't access the reserved registers %dr4 and %dr5 on i386.
On the original i386, %dr[4-5] were unimplemented but not very clearly
reserved, so debuggers read them to print them.  i386 was still doing
this.

On the original athlon64, %dr[4-5] are documented as reserved but are
aliased to %dr[6-7] unless CR4_DE is set, when accessing them traps.

On 2 of my systems, accessing %dr[4-5] trapped sometimes.  On my Haswell
system, the apparent randomness was because the boot CPU starts with
CR4_DE set while all other CPUs start with CR4_DE clear.  FreeBSD
doesn't support the data breakpoints enabled by CR4_DE and it never
changes this flag, so the flag remains different across CPUs and
the behaviour seemed inconsistent except while booting when the CPU
doesn't change.

The invalid accesses broke:
- read access for printing the registers in ddb "show watches" on CPUs
  with CR4_DE set
- read accesses in fill_dbregs() on CPUs with CR4_DE set.  This didn't
  implement panic(3) since the user case always skipped %dr[4-5].
- write accesses in set_dbregs().  This also didn't affect userland.
  When it didn't trap, the aliasing made it fragile.

Don't print the dummy (zero) values of %dr[4-5] in "show watches" for
i386 or amd64.  Fix style bugs near this printing.

amd64 also has space in the dbregs struct for the reserved %dr[8-15]
and already didn't print the dummy values for these, and never accessed
any of the 10 reserved debug registers.

Remove cpufuncs for making the invalid accesses.  Even amd64 had these.
2017-03-17 13:49:05 +00:00
Emmanuel Vadot
aa6b345634 Remove i915drm and radeondrm from NOTES and conf.
This unbreak LINT kernel.

Reported by:	lwhsu
2017-03-12 00:52:16 +00:00
Dmitry Chagin
ab60bc8488 Reduce code duplication between MD Linux code by moving SYSV IPC 64-bit
related struct definitions out into the MI path.

Invert the native ipc structs to the Linux ipc structs convesion logic.
Since 64-bit variant of ipc structs has more precision convert native ipc
structs to the 64-bit Linux ipc structs and then truncate 64-bit values
into the non 64-bit if needed. Unlike Linux, return EOVERFLOW if the
values do not fit.

Fix SYSV IPC for 64-bit Linuxulator which never sets IPC_64 bit.

MFC after:	1 month
2017-03-07 17:07:16 +00:00
Mahdi Mokhtari
881b1219aa Regenerated Linuxulator syscall tables for r314782
Approved by:	dchagin
MFC after:	1 month
2017-03-06 18:20:37 +00:00
Mahdi Mokhtari
8049c6bfb8 Add UNIMPLEMENTED() placeholder macro for
the syscalls that are not implemented in Linux kernel itself.
Cleanup DUMMY() macros.

Reviewed by:	dchagin, trasz
Approved by:	dchagin
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D9804
2017-03-06 18:11:38 +00:00
Pedro F. Giffuni
25ef829b03 Revert r314669, r314670:
Bring back the i486 option in GENERIC by default.

The code related to i386 CPU variants configuration has received many
changes in the last years: most of the features are detected automatically,
so there are no performance penalties from keeping the 486 support enabled.

Re-instate the 486 support: while the general configuration could still be
cleaned a bit, there is no advantage in removing it.

Differential Revision:	https://reviews.freebsd.org/D9879
2017-03-06 03:52:15 +00:00
Pedro F. Giffuni
a5730cc510 Drop i486 from the default i386 GENERIC kernel configuration.
80486 production was stopped by Intel on September 2007. Dropping the 486
configuration option from the GENERIC kernel improves performance
slightly.

Removing I486_CPU is consistent at this time: we don't support any
processor without a FPU and the PC-98 arch, which frequently involved i486
CPUs, is also gone so we don't test such platforms anymore.

Relnotes:	yes
MFC after:	2 weeks
https://reviews.freebsd.org/D9879
2017-03-04 15:04:17 +00:00
Warner Losh
fbbd9655e5 Renumber copyright clause 4
Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.

Submitted by:	Jan Schaumann <jschauma@stevens.edu>
Pull Request:	https://github.com/freebsd/freebsd/pull/96
2017-02-28 23:42:47 +00:00
Konstantin Belousov
2e6e48fb59 Initialize pcb_save for thread0.
Otherwise kernel traps on NULL dereference if fpu_kern(9) is used from the
thread0 context.

Reported by:	cem
Reviewed by:	cem, jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-02-28 22:54:52 +00:00
Gleb Smirnoff
efe3b0de14 Remove SVR4 (System V Release 4) binary compatibility support.
UNIX System V Release 4 is operating system released in 1988. It ceased
to exist in early 2000-s.
2017-02-28 05:14:42 +00:00
Dmitry Chagin
af68739567 Regen for r314312 (Linux epoll_pwait).
MFC after:	1 month
2017-02-26 19:59:28 +00:00
Dmitry Chagin
f8ae1bb64d Change Linux epoll_pwait syscall definition to match Linux actual one.
MFC after:	1 month
2017-02-26 19:57:18 +00:00
Alan Cox
0314966858 Refine the fix from r312954. Specifically, add a new PDE-only flag,
PG_PROMOTED, that indicates whether lingering 4KB page mappings might
need to be flushed on a PDE change that restricts or destroys a 2MB
page mapping.  This flag allows the pmap to avoid range invalidations
that are both unnecessary and costly.

Reviewed by:	kib, markj
MFC after:	6 weeks
Differential Revision:	https://reviews.freebsd.org/D9665
2017-02-26 19:54:02 +00:00
Dmitry Chagin
dd93b628e9 Implement timerfd family syscalls.
MFC after:	1 month
2017-02-26 09:48:18 +00:00
Dmitry Chagin
354aa2dd56 Regen after r314291 (timerfd definition).
MFC after:	1 month
2017-02-26 09:37:25 +00:00
Dmitry Chagin
1064d53fde Change Linuxulator timerfd syscalls definition to match actual Linux one.
MFC after:	1 month
2017-02-26 09:35:44 +00:00
Edward Tomasz Napierala
e801ac7852 Fix linux_fstatfs() to return proper value for f_frsize. Without it,
linux df(1) binary from Xenial shows garbage.

Reviewed by:	dchagin
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D9692
2017-02-25 20:32:37 +00:00
Mahdi Mokhtari
bd911530b7 Add linux_preadv() and linux_pwritev() syscalls to Linuxulator.
Reviewed by:	dchagin
Approved by:	dchagin, trasz (src committers)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D9722
2017-02-24 20:04:02 +00:00
Dmitry Chagin
8665c4d9cd Revert r314217. Commit is not match that I have approved. 2017-02-24 19:47:27 +00:00
Mahdi Mokhtari
21d23e3249 Add linux_preadv() and linux_pwritev() syscalls to Linuxulator.
Reviewed by:	dchagin
Approved by:	dchagin, trasz (src committers)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D9722
2017-02-24 19:22:17 +00:00
Dmitry Chagin
486a06bdf0 Implement rt_tgsigqueueinfo system call used by glibc for pthread_sigqueue(3).
MFC after:	2 week
2017-02-19 07:38:11 +00:00
Konstantin Belousov
dab486441f MFamd64 r313933: microoptimize pmap_protect_pde().
Noted by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-02-19 06:14:33 +00:00
Konstantin Belousov
b1fa987835 Merge i386 and amd64 mtrr drivers.
Reviewed by:	royger, jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D9648
2017-02-17 21:08:32 +00:00
Roger Pau Monné
43b00aeb88 x86: fix MTRR initialization if EARLY_AP_STARTUP is used
MTRR handlers are set in {amd64/i686}_mem_drvinit, which is called at
SI_SUB_DRIVERS, and that's too late when EARLY_AP_STARTUP is set because APs
have already started at this point. {amd64/i686}_mrinit is also called too late
for the BSP, since that happens when the memory device is attached, also after
APs have already started.

Move the position to SI_SUB_CPU, and also initialize the state for the BSP, so
that the APs can correctly get to the same state as the BSP.

Sponsored by:		Citrix Systems R&D
MFC after:		1 week
Reviewed by:		jhb, kib
Differential Revision:	https://reviews.freebsd.org/D9630
2017-02-17 12:47:51 +00:00
Warner Losh
86d99b6884 Remove EISA bus support for add-in cards. Remove related kernel and
compile options. Remove doxygen pointers to now deleted files. Remove
EISA and VME as examples in bus_space.9.

Retained EISA mode code for IO PIC and MPTABLES because that's not
EISA bus, per se, and some people have abused EISA to mean "EISA-like
behavior as opposed to ISA" rather than using it for EISA add-in
cards.

Relnotes: yes
2017-02-16 21:57:35 +00:00
Warner Losh
5625fe9246 Remove Micro Channel Architecture support. Of the commonly available
machines, only a few 486 machines that used it, and those haven't had
enough memory to run FreeBSD for quite some time (often limited to
16MB).

Not to be confused with the Machine Check Architecture, which is still
very much alive and used (and untouched by this commit).

No Objection From: arch@
2017-02-15 23:04:25 +00:00
John Baldwin
bb9b710477 Regenerate all the system call tables to drop "created from" lines.
One of the ibcs2 files contains some actual changes (new headers) as
it hasn't been regenerated after older changes to makesyscalls.sh.
2017-02-10 19:45:02 +00:00
Dmitry Chagin
12bc0fb56f Regen after r313284.
MFC after:	2 week
2017-02-05 14:19:19 +00:00
Dmitry Chagin
8b756d40a7 Update syscall.master to 4.10-rc6. Also fix comments, a typo,
and wrong numbering for a few unimplemented syscalls.

For 32-bit Linuxulator, socketcall() syscall was historically
the entry point for the sockets API. Starting in Linux 4.3, direct
syscalls are provided for the sockets API. Enable it.

The initial version of patch was provided by trasz@ and extended by me.

Submitted by:	trasz
MFC after:	2 week
Differential Revision:	https://reviews.freebsd.org/D9381
2017-02-05 14:17:09 +00:00
Konstantin Belousov
57f6622f92 For i386, remove config options CPU_DISABLE_CMPXCHG, CPU_DISABLE_SSE
and device npx.

This means that FPU is always initialized and handled when available,
and SSE+ register file and exception are handled when available.  This
makes the kernel FPU code much easier to maintain by the cost of
slight bloat for CPUs older than 25 years.

CPU_DISABLE_CMPXCHG outlived its usefulness, see the removed comment
explaining the original purpose.

Suggested by and discussed with:	bde
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
2017-02-03 12:51:40 +00:00
Konstantin Belousov
9c16356ccd Use ANSI definitions for some i386 functions.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-02-02 22:02:10 +00:00
Mateusz Guzik
ed869ff019 i386: fixup fcmpset
An incorrect output specifier was used which worked with clang by accident,
but breaks with the in-tree gcc version.

While here plug a whitespace nit.

Reported by:	bde
2017-02-02 01:33:08 +00:00
Edward Tomasz Napierala
ae6b6ef6cb Replace sys_ftruncate() with kern_ftruncate() in various compats.
Reviewed by:	kib@
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D9368
2017-01-30 11:50:54 +00:00
Mateusz Guzik
e7a98aef79 i386: add atomic_fcmpset
Tested by:	pho
2017-01-30 02:24:54 +00:00
Konstantin Belousov
a0f64f38a1 Do not leave stale 4K TLB entries on pde (superpage) removal or
protection change.

On superpage promotion, x86 pmaps do not invalidate existing 4K
entries for the superpage range, because they are compatible with the
promoted 2/4M entry.  But the invalidation on superpage removal or
protection change only did single INVLPG with the base address of the
superpage.  This reliably flushed superpage TLB entry, and 4K entry
for the first page of the superpage, potentially leaving other 4K TLB
entries lingering.  Do the invalidation of the whole superpage range
to correct the problem.

Note that the precise invalidation is done by x86 code for kernel_pmap
only, for user pmaps whole (per-AS) TLB is flushed.  This made the bug
well hidden, because promotions of the kernel mappings require
specific load.

Reported and tested by:	Jonathan Looney <jtl@netflix.com> (previous version)
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-01-29 19:14:48 +00:00
Jason A. Harmening
d986450859 Implement get_pcpu() for i386 and use it to replace pcpu_find(curcpu)
in the i386 pmap.

The curcpu macro loads the per-cpu data pointer as its first step,
so the remaining steps of pcpu_find(curcpu) are circular.

get_pcpu() is already implemented for arm, arm64, and risc-v.
My plan is to implement it for the remaining architectures and use
it to replace several instances of pcpu_find(curcpu) in MI code.

Reviewed by:	kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D9370
2017-01-29 16:54:55 +00:00
Yoshihiro Takahashi
f7c79dd679 Garbage collect the FPU_ERROR_BROKEN option.
It is for pc98 only.
2017-01-28 03:53:53 +00:00
Yoshihiro Takahashi
2b375b4edd Remove pc98 support completely.
I thank all developers and contributors for pc98.

Relnotes:	yes
2017-01-28 02:22:15 +00:00
Konstantin Belousov
5611aaa195 Use SFENCE for ordering CLFLUSHOPT.
SDM states that CLFLUSHOPT instructions can be ordered with other
writes by SFENCE, heavier MFENCE is not required.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2017-01-20 19:08:44 +00:00
Ed Schouten
4423244072 Catch up with changes to structure member names.
Pointer/length pairs are now always named ${name} and ${name}_len.
2017-01-17 22:05:52 +00:00
Jason A. Harmening
86785b54a5 Add comment explaining relative order of sched_unpin() and mtx_unlock().
Suggested by:	alc
MFC after:	1 week
2017-01-14 19:35:36 +00:00
Jason A. Harmening
28699efd43 For i386 temporary mappings, unpin the thread before releasing
the cmap lock.  Releasing the lock first may result in the thread
being immediately rescheduled and bound to the same CPU, only to
unpin itself upon resuming execution.

Noted by:	skra (in review for armv6 equivalent)
MFC after:	1 week
2017-01-14 09:56:01 +00:00
Mark Johnston
bd7abab0c9 Coalesce TLB shootdowns of global PTEs in pmap_advise() on x86.
We would previously invalidate such entries individually, resulting in more
IPIs than necessary.

Reviewed by:	alc, kib
MFC after:	3 weeks
Differential Revision:	https://reviews.freebsd.org/D9094
2017-01-10 21:52:48 +00:00
Sean Bruno
f2d6ace4a6 Migrate e1000 to the IFLIB framework:
- em(4) igb(4) and lem(4)
- deprecate the igb device from kernel configurations
- create a symbolic link in /boot/kernel from if_em.ko to if_igb.ko

Devices tested:
- 82574L
- I218-LM
- 82546GB
- 82579LM
- I350
- I217

Please report problems to freebsd-net@freebsd.org

Partial review from jhb and suggestions on how to *not* brick folks who
originally would have lost their igbX device.

Submitted by:	mmacy@nextbsd.org
MFC after:	2 weeks
Relnotes:	yes
Sponsored by:	Limelight Networks and Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8299
2017-01-10 03:23:22 +00:00
Konstantin Belousov
2f304845e2 Do not allocate struct statfs on kernel stack.
Right now size of the structure is 472 bytes on amd64, which is
already large and stack allocations are indesirable.  With the ino64
work, MNAMELEN is increased to 1024, which will make it impossible to have
struct statfs on the stack.

Extracted from:	ino64 work by gleb
Discussed with:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-01-05 17:19:26 +00:00
Jason A. Harmening
43aabbefd8 Move the objects used to create temporary mappings for i386 pmap zero and copy
operations to the MD PCPU region.  Change sysmap initialization to only
allocate KVA pages for CPUs that are actually present.  As a minor
optimization, this also prevents false sharing between adjacent sysmap objects
since the pcpu struct is already cacheline-aligned.

While here, move pc_qmap_addr initialization for the BSP into
pmap_bootstrap(), which allows use of pmap_quick* functions during early boot.

Reviewed by:	kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D8833
2016-12-23 15:14:56 +00:00
John Baldwin
b663816443 Enable EARLY_AP_STARTUP on amd64 and i386 kernels by default.
PR:		199321, 203682
MFC after:	2 months
Sponsored by:	Netflix
2016-12-16 21:10:37 +00:00
Konrad Witaszczyk
480f31c214 Add support for encrypted kernel crash dumps.
Changes include modifications in kernel crash dump routines, dumpon(8) and
savecore(8). A new tool called decryptcore(8) was added.

A new DIOCSKERNELDUMP I/O control was added to send a kernel crash dump
configuration in the diocskerneldump_arg structure to the kernel.
The old DIOCSKERNELDUMP I/O control was renamed to DIOCSKERNELDUMP_FREEBSD11 for
backward ABI compatibility.

dumpon(8) generates an one-time random symmetric key and encrypts it using
an RSA public key in capability mode. Currently only AES-256-CBC is supported
but EKCD was designed to implement support for other algorithms in the future.
The public key is chosen using the -k flag. The dumpon rc(8) script can do this
automatically during startup using the dumppubkey rc.conf(5) variable.  Once the
keys are calculated dumpon sends them to the kernel via DIOCSKERNELDUMP I/O
control.

When the kernel receives the DIOCSKERNELDUMP I/O control it generates a random
IV and sets up the key schedule for the specified algorithm. Each time the
kernel tries to write a crash dump to the dump device, the IV is replaced by
a SHA-256 hash of the previous value. This is intended to make a possible
differential cryptanalysis harder since it is possible to write multiple crash
dumps without reboot by repeating the following commands:
# sysctl debug.kdb.enter=1
db> call doadump(0)
db> continue
# savecore

A kernel dump key consists of an algorithm identifier, an IV and an encrypted
symmetric key. The kernel dump key size is included in a kernel dump header.
The size is an unsigned 32-bit integer and it is aligned to a block size.
The header structure has 512 bytes to match the block size so it was required to
make a panic string 4 bytes shorter to add a new field to the header structure.
If the kernel dump key size in the header is nonzero it is assumed that the
kernel dump key is placed after the first header on the dump device and the core
dump is encrypted.

Separate functions were implemented to write the kernel dump header and the
kernel dump key as they need to be unencrypted. The dump_write function encrypts
data if the kernel was compiled with the EKCD option. Encrypted kernel textdumps
are not supported due to the way they are constructed which makes it impossible
to use the CBC mode for encryption. It should be also noted that textdumps don't
contain sensitive data by design as a user decides what information should be
dumped.

savecore(8) writes the kernel dump key to a key.# file if its size in the header
is nonzero. # is the number of the current core dump.

decryptcore(8) decrypts the core dump using a private RSA key and the kernel
dump key. This is performed by a child process in capability mode.
If the decryption was not successful the parent process removes a partially
decrypted core dump.

Description on how to encrypt crash dumps was added to the decryptcore(8),
dumpon(8), rc.conf(5) and savecore(8) manual pages.

EKCD was tested on amd64 using bhyve and i386, mipsel and sparc64 using QEMU.
The feature still has to be tested on arm and arm64 as it wasn't possible to run
FreeBSD due to the problems with QEMU emulation and lack of hardware.

Designed by:	def, pjd
Reviewed by:	cem, oshogbo, pjd
Partial review:	delphij, emaste, jhb, kib
Approved by:	pjd (mentor)
Differential Revision:	https://reviews.freebsd.org/D4712
2016-12-10 16:20:39 +00:00
Mark Johnston
7f68a896dc Add a COMPAT_FREEBSD11 kernel option.
Use it wherever COMPAT_FREEBSD10 is currently specified.

Reviewed by:	glebius, imp, jhb
Differential Revision:	https://reviews.freebsd.org/D8736
2016-12-09 18:54:12 +00:00
Alan Cox
e94965d82e Previously, vm_radix_remove() would panic if the radix trie didn't
contain a vm_page_t at the specified index.  However, with this
change, vm_radix_remove() no longer panics.  Instead, it returns NULL
if there is no vm_page_t at the specified index.  Otherwise, it
returns the vm_page_t.  The motivation for this change is that it
simplifies the use of radix tries in the amd64, arm64, and i386 pmap
implementations.  Instead of performing a lookup before every remove,
the pmap can simply perform the remove.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D8708
2016-12-08 04:29:29 +00:00
John Baldwin
3cf1f0c347 MFamd64: Various fatal page fault fixes.
- If a page fault is triggered due to reserved bits in a PTE, treat it
  as a fatal fault and panic.
- If PG_NX is in use, report whether a fatal page fault is due to an
  instruction fetch or a data access.
- If a fatal page fault is due to reserved bits in a PTE, report that as
  the page fault type rather than a protection violation.

MFC after:	1 month
2016-11-19 01:36:44 +00:00
Bryan Drewery
28323add09 Fix improper use of "its".
Sponsored by:	Dell EMC Isilon
2016-11-08 23:59:41 +00:00
Konstantin Belousov
d3e4d71f1d Handle pmap_enter() over an existing 4/2M page in KVA on i386.
The userspace case was already handled by pmap_allocpte().  For kernel
VA, page table page must exist, and demote cannot fail, so we need to
just call pmap_demote_pde().  Also note that due to the machine AS
layout, promotions in the KVA on i386 are highly unlikely, so this
change is mostly for completeness.

Reviewed by:	alc, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D8323
2016-10-28 11:53:22 +00:00
John Baldwin
16dcd7734f MFamd64: Add bounds checks on addresses used with /dev/mem.
Reject attempts to read from or memory map offsets in /dev/mem that are
beyond the maximum-supported physical address of the current CPU.

Reviewed by:	kib
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D7408
2016-10-27 21:23:14 +00:00
John Baldwin
726f4773ec Enable EFER_NXE properly on APs.
EFER_NXE is set in the EFER MSR by initializecpu() and must be set on all
CPUs in the system.  When PG_NX support was added to PAE on i386, the
block to enable EFER_NXE was placed in a section of initializecpu() that
only runs if 'cpu == CPU_686'.  During early boot, locore does an
initial pass to set cpu that sets it to CPU_686 on all CPUs later than
a Pentium.  Later, printcpuinfo() adjusts the 'cpu' variable on
PII and later CPUs to one of CPU_PII, CPU_PIII, or CPU_P4.  However,
printcpuinfo() is called after initializecpu() on the BSP, so the BSP
would enable EFER_NXE and pg_nx.  The APs execute initializecpu() much
later after printcpuinfo() has run.  The end result on a modern CPU was
that cpu was set to CPU_PIII when the APs invoked initializecpu(), so
they did not enable EFER_NXE.  As a result, the APs would fault when
trying to access any pages marked with PG_NX set.

When booting a 2 CPU PAE kernel in bhyve this manifested as a hang before
single user mode.  The attempt to execute /bin/init tried to copy out
the exec strings (argv, etc.) to a non-executable mapping while running
on the AP.  The instruction kept faulting due to invalid bits in the PTE
in an infinite loop.

Fix this by moving the code to enable EFER_NXE out of the switch statement
on 'cpu' and always doing it if 'amd_feature' supports AMDID_NX.

MFC after:	2 weeks
2016-10-26 18:47:47 +00:00
Konstantin Belousov
295f4b6cfe Follow-up to r307866:
- Make !KDB config buildable.
- Simplify interface to nmi_handle_intr() by evaluating panic_on_nmi
  in one place, namely nmi_call_kdb().  This allows to remove do_panic
  argument from the functions, and to remove i386/amd64 duplication of
  the variable and sysctl definitions.  Note that now NMI causes
  panic(9) instead of trap_fatal() reporting and then panic(9),
  consistently for NMIs delivered while CPU operated in ring 0 and 3.

Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-10-24 20:47:46 +00:00
Konstantin Belousov
835c2787be Handle broadcast NMIs.
On several Intel chipsets, diagnostic NMIs sent from BMC or NMIs
reporting hardware errors are broadcasted to all CPUs.

When kernel is configured to enter kdb on NMI, the outcome is
problematic, because each CPU tries to enter kdb.  All CPUs are
executing NMI handlers, which set the latches disabling the nested NMI
delivery; this means that stop_cpus_hard(), used by kdb_enter() to
stop other cpus by broadcasting IPI_STOP_HARD NMI, cannot work.  One
indication of this is the harmless but annoying diagnostic "timeout
stopping cpus".

Much more harming behaviour is that because all CPUs try to enter kdb,
and if ddb is used as debugger, all CPUs issue prompt on console and
race for the input, not to mention the simultaneous use of the ddb
shared state.

Try to fix this by introducing a pseudo-lock for simultaneous attempts
to handle NMIs.  If one core happens to enter NMI trap handler, other
cores see it and simulate reception of the IPI_STOP_HARD.  More,
generic_stop_cpus() avoids sending IPI_STOP_HARD and avoids waiting
for the acknowledgement, relying on the nmi handler on other cores
suspending and then restarting the CPU.

Since it is impossible to detect at runtime whether some stray NMI is
broadcast or unicast, add a knob for administrator (really developer)
to configure debugging NMI handling mode.

The updated patch was debugged with the help from Andrey Gapon (avg)
and discussed with him.

Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D8249
2016-10-24 16:40:27 +00:00
Jung-uk Kim
69d410eeb1 Implement BPF_MOD and BPF_XOR instructions.
These two ALU instructions first appeared on Linux.  Then, libpcap adopted
and made them available since 1.6.2.  Now more platforms including NetBSD
have them in kernel.  So do we.
 --이 줄 이하는 자동으로 제거됩니다--
2016-10-21 06:55:07 +00:00
Jung-uk Kim
730b3be34f Redude code for conditional jumps. 2016-10-21 06:09:30 +00:00
Jung-uk Kim
99e3ae6839 Fix compiler warnings for user land. 2016-10-21 06:06:54 +00:00
John Baldwin
31dc1e9681 Drop support for using mmap() with /dev/kmem.
Using the device pager with /dev/kmem is not stable since KVA mappings
are transient, but the device pager caches the PA associated with a
given offset forever.  Interestingly, mips' implementation of
memmap() already refused requests for /dev/kmem.

Note that kvm_read/kvm_write do not use mmap, but use read and write on
/dev/kmem, so this should not affect libkvm users.

Reviewed by:	kib
MFC after:	2 months
2016-10-14 20:01:07 +00:00
Warner Losh
b2a7ac4802 Fix building on i386 and arm. But 'public domain' headers on the files
with no creative content. Include "lost" changes from git:
o Use /dev/efi instead of /dev/efidev
o Remove redundant NULL checks.

Submitted by: kib@, dim@, zbb@, emaste@
2016-10-13 06:56:23 +00:00
Jonathan T. Looney
bd79708dbf In the TCP stack, the hhook(9) framework provides hooks for kernel modules
to add actions that run when a TCP frame is sent or received on a TCP
session in the ESTABLISHED state. In the base tree, this functionality is
only used for the h_ertt module, which is used by the cc_cdg, cc_chd, cc_hd,
and cc_vegas congestion control modules.

Presently, we incur overhead to check for hooks each time a TCP frame is
sent or received on an ESTABLISHED TCP session.

This change adds a new compile-time option (TCP_HHOOK) to determine whether
to include the hhook(9) framework for TCP. To retain backwards
compatibility, I added the TCP_HHOOK option to every configuration file that
already defined "options INET". (Therefore, this patch introduces no
functional change. In order to see a functional difference, you need to
compile a custom kernel without the TCP_HHOOK option.) This change will
allow users to easily exclude this functionality from their kernel, should
they wish to do so.

Note that any users who use a custom kernel configuration and use one of the
congestion control modules listed above will need to add the TCP_HHOOK
option to their kernel configuration.

Reviewed by:	rrs, lstewart, hiren (previous version), sjg (makefiles only)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D8185
2016-10-12 02:16:42 +00:00
Warner Losh
f79d484dff Create /dev/efidev to provide an ioctl interface to
userland.  It supports userland interfaces to UEFI Runtime Services. This is
indended to the the MI portion of EFI RuntimeServices support.

Differential Revision: https://reviews.freebsd.org/D8128
Reviewed by: kib@, wblock@, Ganael Laplanche
2016-10-11 22:24:30 +00:00
Konstantin Belousov
83c001d3c2 Re-apply r306516 (by cem):
Reduce the cost of TLB invalidation on x86 by using per-CPU completion flags

Reduce contention during TLB invalidation operations by using a per-CPU
completion flag, rather than a single atomically-updated variable.

On a Westmere system (2 sockets x 4 cores x 1 threads), dtrace measurements
show that smp_tlb_shootdown is about 50% faster with this patch; observations
with VTune show that the percentage of time spent in invlrng_single_page on an
interrupt (actually doing invalidation, rather than synchronization) increases
from 31% with the old mechanism to 71% with the new one.  (Running a basic file
server workload.)

Submitted by:	Anton Rang <rang at acm.org>
Reviewed by:	cem (earlier version)
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8041
2016-10-04 17:01:24 +00:00
Hans Petter Selasky
97549c34ec Move the ConnectX-3 and ConnectX-2 driver from sys/ofed into sys/dev/mlx4
like other PCI network drivers. The sys/ofed directory is now mainly
reserved for generic infiniband code, with exception of the mthca driver.

- Add new manual page, mlx4en(4), describing how to configure and load
mlx4en.

- All relevant driver C-files are now prefixed mlx4, mlx4_en and
mlx4_ib respectivly to avoid object filename collisions when compiling
the kernel. This also fixes an issue with proper dependency file
generation for the C-files in question.

- Device mlxen is now device mlx4en and depends on device mlx4, see
mlx4en(4). Only the network device name remains unchanged.

- The mlx4 and mlx4en modules are now built by default on i386 and
amd64 targets. Only building the mlx4ib module depends on
WITH_OFED=YES .

Sponsored by:	Mellanox Technologies
2016-09-30 08:23:06 +00:00
Bruce Evans
9eeaa0ea1f Minor fixes for 160-bit disassembly:
(1) Print the default segment %ss before adresses relative to %bp.
    This is too cluttered for me, but so is printing some other default
    prefixes, and this is a reasonable reminder that %ss is quite
    likely to be different from %ds in 16-bit mode.

    db_disasm still handles prefixes poorly, by trying to discard
    redundant ones.  This loses information, and sometimes the result
    is wrong or misleading.

    Clean up nearby initializations and dead code.

(2) Fix decoding of operand and address size prefixes in 16-bit mode.
    They reverse the default in all modes.

Obtained from:            (1) is partly from r1.4 (2003/11/08) in DFlyBSD (?)
2016-09-25 18:39:24 +00:00
Tijl Coosemans
81d7ca7761 MFamd64: r266901
Allocate a zeroed LDT.

Failing to do this might result in the LDT appearing to run out of free
descriptors because of random junk in the descriptor's 'sd_type' field.

http://lists.freebsd.org/pipermail/freebsd-amd64/2014-May/016088.html

PR:		212639
Submitted by:	wheelcomplex@gmail.com
MFC after:	2 weeks
2016-09-25 18:29:02 +00:00
Bruce Evans
808cf02c24 Determine the operand/address size of %cs in a new function
db_segsize().

Use db_segsize() to set the default operand/address size for
disassembling.  Allow overriding this with the "alternate" display
format /I.  The API of db_disasm() should be debooleanized to pass a
more general request (amd64 needs overrides to sizes of 16, 32, and
64, but this commit doesn't implement anything for amd64 since much
larger changes are needed to restore the amd64 disassmbler's support
for non-default sizes).

Fix db_print_loc_and_inst() to ask for the normal format and not the
alternate in normal operation.

This is most useful for vm86 mode, but also works for 16-bit protected
mode.

Use db_segsize() to avoid trying to print a garbage stack trace if %cs
is 16 bits.  Print something like the stack trace termination message
for a trap boundary instead.

Document that the alternate format is now useful on i386.
2016-09-25 16:30:29 +00:00
Bruce Evans
f5435b8bbe Fix vm86 initialization, part 3 of 2 and a half. (Actually, just fix
early printfs and debugging of vm86 initialization and some other early
initialization in some cases.)  Add an option debug.late_console (with
default 1=off) to move console and kdb initialization back where it was.
Do the same for amd64 although there is no vm86 there.

On my test system, debug.late_console=0 works for the syscons, sio and
uart console drivers on amd64 and i386, and for vt on i386 but not on
amd64.

The early printfs fixed by debug.late_console=0 are:
- on i386, the message about lost memory above 4G
- with -v in otherwise normal use, about 20 printfs for SMAP
- other debugging messages for memory sizing.  Mostly under -v and
  not printed in normal use.

Document in a comment how much earlier the initialization and early
printf()s can be.  That is very early for the console.  Not much more
than curthread is needed.  kdb use obviously needs to be not so early,
since it needs IDT initialization and that is done relatively late
for convenience and historical reasons.
2016-09-25 14:56:24 +00:00
Mark Johnston
bdaf6d6913 Regenerate syscall provider argument strings. 2016-09-22 04:50:03 +00:00
Bruce Evans
1d3c0fa7b2 Remove all kernel uses of pcb_psl, but keep in in the struct to
preserve the ABI and API for applications.  It was removed in the port
to amd64, but was remained as garbage giving a micro-pessimization and
spurious single-step traps on i386.

pcb_psl was intended to be used just to do a context switch of PSL_I,
but this context switch was null in most or all versions, and
mis-switching of PSL_T was done instead.

Some history:
- in 386BSD-0.0, cpu_switch() ran at splhigh() and splhigh() did too
  much interrupt disabling, so interrupts were hard-disabled across
  cpu_switch() and too many other places
- in 386BSD-0.0-patchkit through FreeBSD-4 and FreeBSD-5 before
  SMPng, splhigh() did soft interrupt masking, and cpu_switch() was
  excessively cautious and did a cli at the start and a sti at the
  end to hard-disable interrupts across the switch
- SMPng replaced the spl's and cli's by spinlocks (just sched_lock?),
  so interrupts were hard-disabled across cpu_switch() and too many
  other places again
- initial attempts to fix this intended to restore some soft
  interrupt disabling, but to support variations in this cpu_switch()
  used pushfl/popfl into pcb_psl to avoid hard-coding the assumption
  that the initial and final states have PSL_I enabled.  But the
  version with soft interrupt disabling wasn't used for long, or was
  never committed, (except I always used my different version of it
  for UP) so the pushfl/popl and pcb_psl to hold them have been doing
  less than nothing for about 14 years.
2016-09-17 14:00:52 +00:00
Bruce Evans
c2d4aad4e0 (1) Ifdef the new dr6 variable for KDB.
While here, avoid using the old variable 'code' and remove it
in trap().  ('code' was meant for holding things like %dr6,
but is too small to hold %dr6 on amd64 and was reduced to an
obfuscation of tf_err, with early truncation on amd64.)

Submitted by:	Michael Butler (imb@...)
2016-09-16 04:58:37 +00:00
Bruce Evans
bd20334ca0 Abort single stepping in ddb if the trap is not for single-stepping.
This is not very easy to do, since ddb didn't know when traps are
for single-stepping.  It more or less assumed that traps are either
breakpoints or single-step, but even for x86 this became inadequate
with the release of the i386 in ~1986, and FreeBSD passes it other
trap types for NMIs and panics.

On x86, teach ddb when a trap is for single stepping using the %dr6
register.  Unknown traps are now treated almost the same as breakpoints
instead of as the same as single-steps.  Previously, the classification
of breakpoints was almost correct and everything else was unknown so
had to be treated as a single-step.  Now the classification of single-
steps is precise, the classification of breakpoints is almost correct
(as before) and everything else is unknown and treated like a
breakpoint.

This fixes:
- breakpoints not set by ddb, including the main one in kdb_enter(),
  were treated as single-steps and not stopped on when stepping
  (except for the usual, simple case of a step with residual count 1).
  As special cases, kdb_enter() didn't stop for fatal traps or panics
- similarly for "hardware breakpoints".

Use a new MD macro IS_SSTEP_TRAP(type, code) to code to classify
single-steps.  This is excessively complicated for bug-for-bug and
backwards compatibilty.  Design errors apparently started in Mach
in ~1990 or perhaps in the FreeBSD interface in ~1993.  Common trap
types like single steps should have a unique MI code (like the TRAP*
codes for user SIGTRAP) so that debuggers don't need macros like
IS_SSTEP_TRAP() to decode them.  But 'type' is actually an ambiguous
MD trap number, and code was always 0 (now it is (int)%dr6 on x86).
So it was impossible to determine the trap type from the args.
Global variables had to be used.

There is already a classification macro db_pc_is_single_step(), but
this just gets in the way.  It is only used to recover from bugs in
IS_BREAKPOINT_TRAP().  On some arches, IS_BREAKPOINT_TRAP() just
duplicates the ambiguity in 'type' and misclassifies single-steps as
breakpoints.  It defaults to 'false', which is the opposite of what is
needed for bug-for-bug compatibility.

When this is cleaned up, MI classification bits should be passed in
'code'.  This could be done now for positive-logic bits, since 'code'
was always 0, but some negative logic is needed for compatibility so
a simple MI classificition is not usable yet.

After reading %dr6, clear the single-step bit in it so that the type
of the next debugger trap can be decoded.  This is a little
ddb-specific.  ddb doesn't understand the need to clear this bit and
doing it before calling kdb is easiest.  gdb would need to reverse
this to support hardware breakpoints, but it just doesn't support
them now since gdbstub doesn't support %dr*.

Fix a bug involving %dr6: when emulating a single-step trap for vm86,
set the bit for it in %dr6.  Userland debuggers need this.  ddb now
needs this for vm86 bios calls.  The bit gets copied to 'code' then
cleared again.

Fix related style bugs:
- when clearing bits for hardware breakpoints in %dr6, spell the mask
  as ~0xf on both amd64 and i386 to get the correct number of bits
  using sign extension and not need a comment about using the wrong
  mask on amd64 (amd64 traps for invalid results but clearing the
  reserved top bits didn't trap since they are 0).
- rewrite my old wrong comments about using %dr6 for ddb watchpoints.
2016-09-15 17:24:23 +00:00
John Baldwin
38605d7312 Remove 'cpu' and 'cpu_class' on amd64.
The 'cpu' and 'cpu_class' variables were always set to the same value
on amd64 and are legacy holdovers from i386.  Remove them entirely on
amd64.

Reviewed by:	imp, kib (older version)
Differential Revision:	https://reviews.freebsd.org/D7888
2016-09-15 17:05:54 +00:00
Bruce Evans
701ac88055 Use the MI macro TRAPF_USERMODE() instead of open-coded checks for
SEL_UPL and sometimes PSL_VM.  This is just a style change on amd64,
but on i386 it fixes 1 unimportant place where the PSL_VM check was
missing and starts fixing 1 important place where the PSL_VM check
had a logic error.

Fix logic errors in treating vm86 bioscall mode as kernel mode.  The
main place checked all the necessary flags, but put the necessary
parentheses for the PSL_VM and PCB_VM86CALL checks in the wrong
place.  The broken case is only reached if a vm86 bioscall uses a
%cs which is nonzero mod 4, but that is unusual -- most bios calls
start with %cs = 0xc000 or 0xf000 and rarely change it.  Another
place was missing the check for PCB_VM86CALL, but was only reachable
if there are bugs virtualizing PSL_I.

Add a macro TF_HAS_STACKREGS() and use this instead of converting
open-coded checks of SEL_UPL, etc. to TRAPF_USERMODE() when we only
care about whether the frame has stack registers.  This fixes 3
places in my recent fix for register variables in vm86 mode where I
messed up the PSL_VM check and cleans up other places.
2016-09-14 12:57:40 +00:00
Alan Cox
8cb0c1029d Various changes to pmap_ts_referenced()
Move PMAP_TS_REFERENCED_MAX out of the various pmap implementations and
into vm/pmap.h, and describe what its purpose is.  Eliminate the archaic
"XXX" comment about its value.  I don't believe that its exact value, e.g.,
5 versus 6, matters.

Update the arm64 and riscv pmap implementations of pmap_ts_referenced()
to opportunistically update the page's dirty field.

On amd64, use the PDE value already cached in a local variable rather than
dereferencing a pointer again and again.

Reviewed by:	kib, markj
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D7836
2016-09-10 16:49:25 +00:00
Bruce Evans
8b530941f4 Fix single-stepping of instructions emulated by vm86.
In vm86.c, fix 2 (rarely used) cases where the return code lost the
single-step indicator.  While here, fix 2 misspellings of PSL_T as
PSL_TF (TF is the CPU manufacturer's spelling, but we use T).

In trap.c, turn T_PROTFLT and T_STKFLT into T_TRCTRAP if
vm86_emulate() asked for this (it does this when the instruction is
being traced and was successully emulated).  In the kernel case, we
used to deliver the trap as SIGTRAP to the process, where it always
terminated the process; now we deliver the trap as T_TRCTRAP to kdb,
where it normally gives single-stepping.  In the user case, the only
difference is that we now clear PSL_T and initialize ucode properly.

Reviewed by:	kib
2016-09-08 14:43:39 +00:00
Mark Johnston
dbbaf04f1e Remove support for idle page zeroing.
Idle page zeroing has been disabled by default on all architectures since
r170816 and has some bugs that make it seemingly unusable. Specifically,
the idle-priority pagezero thread exacerbates contention for the free page
lock, and yields the CPU without releasing it in non-preemptive kernels. The
pagezero thread also does not behave correctly when superpage reservations
are enabled: its target is a function of v_free_count, which includes
reserved-but-free pages, but it is only able to zero pages belonging to the
physical memory allocator.

Reviewed by:	alc, imp, kib
Differential Revision:	https://reviews.freebsd.org/D7714
2016-09-03 20:38:13 +00:00
Alan Cox
53aadae680 As an optimization to the machine-independent layer, change the machine-
dependent pmap_ts_referenced() so that it updates the page's dirty field
if a modified bit is found while counting reference bits.  This
opportunistic update can be performed at low cost and can eliminate the
need for some future calls to pmap_is_modified() by the machine-
independent layer.

Reviewed by:	kib, markj
MFC after:	3 weeks
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D7722
2016-09-01 15:57:44 +00:00
Bruce Evans
ef209971e9 Shorten banal comments about zeroing and copying pages. Don't give
implementation details that last echoed the code 15-20 years ago.
But add a detail about pagezero() on i386.  Switch from Mach style
to BSD style.
2016-08-29 14:38:31 +00:00
Bruce Evans
1a5735873e On amd64, declare sse2_pagezero() and start using it again, but only
for zeroing pages in idle where nontemporal writes are clearly best.
This is almost a no-op since zeroing in idle works does nothing good
and is off by default.  Fix END() statement forgotten in previous
commit.

Align the loop in sse2_pagezero().  Since it writes to main memory,
the loop doesn't have to be very carefully written to keep up.
Unrolling it was considered useless or harmful and was not done on
i386, but that was too careless.

Timing for i386: the loop was not unrolled at all, and moved only 4
bytes/iteration.  So on a 2GHz CPU, it needed to run at 2 cycles/
iteration to keep up with a memory speed of just 4GB/sec.  But when
it crossed a 16-byte boundary, on old CPUs it ran at 3 cycles/
iteration so it gave a maximum speed of 2.67GB/sec and couldn't even
keep up with PC3200 memory.  Fix the alignment so that it keep up with
4GB/sec memory, and unroll once to get nearer to 8GB/sec.  Further
unrolling might be useless or harmful since it would prevent the loop
fitting in 16-bytes.  My test system with an old CPU and old DDR1 only
needed 5+ GB/sec.  My test system with a new CPU and DDR3 doesn't need
any changes to keep up ~16GB/sec.

Timing for amd64: with 8-byte accesses and newer faster CPUs it is
easy to reach 16GB/sec but not so easy to go much faster.  The
alignment doesn't matter much if the CPU is not very old.  The loop
was already unrolled 4 times, but needs 32 bytes and uses a fancy
method that doesn't work for 2-way unrolling in 16 bytes.  Just
align it to 32-bytes.
2016-08-29 13:07:21 +00:00
Bruce Evans
be1ed810b1 Fix vm86 initialization, part 1 of 2 and a half.
Early use of vm86 depends on the PIC being reset to mask interrupts,
but r286667 moved PIC initialization to after where vm86 may be first
used.

Move the PIC initialization up to immdiately before vm86 initialization.
All invocations of diff that I tried display this move poorly so that it
looks like PIC and vm86 initialization was moved later.

r286667 was to move console initialization later.  The diffs are again
unreadable -- they show a large move that doesn't seem to involve the
console.  The PIC initialization stayed just below the console
initialization where it could still be debugged but no longer works.

Later console initialization breaks mainly debugging vm86 initialization
and memory sizing using ddb and printf().  There are several printf()s
in the memory sizing that now go nowhere since message buffer
initialization has always been too late.  Memory sizing is done by loader
for most users, but the lost messages for this case are even more
interesting than for an auto-probe since they tell you what the loader
found.
2016-08-28 15:23:44 +00:00
Bruce Evans
441ead70cd Fix vm86 initialization, part 1 of 2 and a half.
vm86 uses the tss, but r273995 moved tss initialization to after where
it may be first used, just because tss_esp0 now depends on later
initializations and/or amd64 does it later.

vm86 is first used for memory sizing in cases where the loader can't
figure out the size or is not used.  Its initialization is placed
immediately before memory sizing to support this, and the tss was
initialized a little earlier.

Move everything in the tss initialization except for tss_esp0 back to
almost where it was, immediately before vm86 initialization (the
combined move is from before dblflt_tss initialization to after).  Add
only early initialization of tss_esp0, later reloading of the tss, and
comments.  The initial tss_esp0 no longer has space for the pcb since
initially the size of the pcb is not known and no pcb is needed.
(Later changes broke debugging at this point, so the nonexistent pcb
cannot be used by debuggers, and at the time of 273995 when ddb was
almost able to debug this problem it didn't need the pcb.)  The
iniitial tss_esp0 still has a magic 16 bytes reserved for vm86
although I think this is unused too.
2016-08-28 14:03:25 +00:00
Ed Schouten
48734c99d3 Convert pointers obtained from the threadattr_t structure with TO_PTR().
In all of these source files, the userspace pointer size corresponds
with the kernelspace pointer size, meaning that casting directly works.
As I'm planning on making 32-bit execution on 64-bit systems work as
well, use TO_PTR() here as well, so that the changes between source
files remain minimal.
2016-08-24 10:13:18 +00:00
John Baldwin
a47632d45b Fix build for !SMP kernels after the Xen MSIX workaround.
Move msix_disable_migration under #ifdef SMP since it doesn't make sense
for !SMP kernels.

PR:		212014
Reported by:	Glyn Grinstead <glyn@grinstead.org>
MFC after:	3 days
2016-08-22 21:23:17 +00:00
Ed Schouten
8b0a83cce2 Make CloudABI work on i386.
Copy over amd64's cloudabi64_sysvec.c into i386 and tailor it to work.
Again, we use a system call convention similar to FreeBSD, except that
there is no support for indirect system calls (%eax == 0).

Where i386 differs from amd64 is that we have to store thread/process
entry arguments on the stack instead of using registers. We also have to
put an extra pointer on the stack for TLS (for GSBASE). Place that
pointer in the empty slot that is normally used to hold return
addresses. That seems to keep the code simple.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D7590
2016-08-22 17:37:31 +00:00
John Baldwin
21768fa9c0 Remove the ie(4) driver for Intel 82586 ISA Ethernet adapters.
This driver only supports 10Mb Ethernet using PIO (the hardware supports
DMA, but the driver only does PIO).  There are not any PCCard adapters
supported by this driver, only ISA cards.  In addition, it does not use
bus_space but instead uses bcopy with volatile pointers triggering a
host of warnings.  (if_ie.c is one of 3 files always built with
-Wno-error)

Relnotes:	yes
2016-08-20 00:49:29 +00:00
John Baldwin
354b6f0fd9 Remove the spic(4) driver for the Sony Vaoi Jogdial.
This hardware is not present on any modern systems.  The driver is quite
hackish (raw inb/outb instead of bus_space, and raw inb/outb to random
I/O ports to enable ACPI since it predated proper ACPI support).

Relnotes:	yes
2016-08-19 23:39:08 +00:00
John Baldwin
09b9789b28 Remove the wl(4) driver and wlconfig(8) utility.
The wl(4) driver supports pre-802.11 PCCard wireless adapters that
are slower than 802.11b.  They do not work with any of the 802.11
framework and the driver hasn't been reported to actually work in a
long time.

Relnotes:	yes
2016-08-19 22:27:14 +00:00
John Baldwin
c1c9764296 Remove the si(4) driver and sicontrol(8) for Specialix serial cards.
The si(4) driver supported multiport serial adapters for ISA, EISA, and
PCI buses.  This driver does not use bus_space, instead it depends on
direct use of the pointer returned by rman_get_virtual().  It is also
still locked by Giant and calls for patch testing to convert it to use
bus_space were unanswered.

Relnotes:	yes
2016-08-19 21:14:27 +00:00
Bruce Evans
5bd90da0ef Remove duplicate definition of get_pcb_td(). gcc works for detecting
this error.
2016-08-15 10:46:33 +00:00
Bruce Evans
258b53d151 Fix the variables $esp, $ds, $es, $fs, $gs and $ss in vm86 mode.
Fix PC_REGS() so that printing of instructions works in some useful
cases.  ddb only understands a single flat address space, but this
macro allows mapping $cs:$eip into vm86's flat address space well
enough for the MI parts of ddb.  This doesn't work for the MD parts
that do stack traces, and there are no similar macros for data addresses.

PC_REGS() has to use the trapframe pointer instead of the pcb for this.
For other CPUs, the trapframe pointer is not available except by tracing
back to it.  But tracing back through vm86 trapframes is broken even
starting with one.
2016-08-14 16:51:25 +00:00
Konstantin Belousov
e42f8233fc Unconditionally perform checks that FPU region was entered, when #NM
exception is caught in kernel mode.  There are third-party modules
which trigger the issue, and since the problem causes usermode state
corruption at least, panic in production kernels as well.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-08-10 13:44:03 +00:00
Konstantin Belousov
fa03524a9f Merge i386 and amd64 variants of mp_watchdog.c into x86/, there is no
difference between files.
For pc98, put x86/mp_x86.c into the same place as used by i386 file list.
Fix typo in comment.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-08-03 13:51:53 +00:00
Brooks Davis
40018b91dd Don't create pointless backups of generated files in "make sysent".
Any sensible workflow will include a revision control system from which
to restore the old files if required.  In normal usage, developers just
have to clean up the mess.

Reviewed by:	jhb
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D7353
2016-07-28 21:29:04 +00:00
Alexander Motin
fb112f72a8 Add more UEFI/e820 memory types from latest specifications.
This is only cosmetics.

MFC after:	2 weeks
2016-07-24 09:15:11 +00:00
John Baldwin
a5c6318fc3 Rename PTRACE_SYSCALL to LINUX_PTRACE_SYSCALL.
Suggested by:	kib
2016-07-16 00:54:46 +00:00
Eric Badger
fdb6320d45 Add explicit detection of KVM hypervisor
Set vm_guest to a new enum value (VM_GUEST_KVM) when kvm is detected and use
vm_guest in conditionals testing for KVM.

Also, fix a conditional checking if we're running in a VM which caught only
the generic VM case, but not more specific VMs (KVM, VMWare, etc.).  (Spotted
by: vangyzen).

Differential revision:	https://reviews.freebsd.org/D7172
Sponsored by:	Dell Inc.
Approved by:	kib (mentor), vangyzen (mentor)
Reviewed by:	alc
MFC after:	4 weeks
2016-07-13 19:19:18 +00:00
Jung-uk Kim
0aed566c32 Remove a tunable and always reset system clock while resuming with ACPI.
Requested by:	bde (long ago)
2016-07-13 19:16:32 +00:00
Roger Pau Monné
302244700f xen: automatically disable MSI-X interrupt migration
If the hypervisor version is smaller than 4.6.0. Xen commits 74fd00 and
70a3cb are required on the hypervisor side for this to be fixed, and those
are only included in 4.6.0, so stay on the safe side and disable MSI-X
interrupt migration on anything older than 4.6.0.

It should not cause major performance degradation unless a lot of MSI-X
interrupts are allocated.

Sponsored by:		Citrix Systems R&D
MFC after:		3 days
Reviewed by:		jhb
Differential revision:	https://reviews.freebsd.org/D7148
2016-07-12 08:43:09 +00:00
Konstantin Belousov
b42dfd6dff Fill tf_trapno for trap frames created for syscall.
If tf_trapno contains garbage which appears to be equal to T_NMI,
e.g. due to thread previously entered kernel due to NMI, doreti
sequence skips ast, and does so until a trap or hardware interrupt
occur.

The visible effects of the issue are quite confusing.  First, signals
delivery is postponed in observable ways.  In particular, the
guarantee that unblocked async signals queue is flushed before a
return from syscall, is broken.  Second, if there are pending signals,
all interruptible sleeps of the stuck thread are aborted immediately.

Since modern CPUs are relatively fast and tickless kernel generates
low interrupt rate, the faulty condition might exist for long time (in
an application time scale).

In collaboration with:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-07-11 15:52:52 +00:00
Dmitry Chagin
97d06da692 Fix a copy/paste bug introduced during X86_64 Linuxulator work.
FreeBSD support NX bit on X86_64 processors out of the box, for i386 emulation
use READ_IMPLIES_EXEC flag, introduced in r302515.

While here move common part of mmap() and mprotect() code to the files in compat/linux
to reduce code dupcliation between Linuxulator's.

Reported by:    Johannes Jost Meixner, Shawn Webb

MFC after:	1 week
XMFC with:	r302515, r302516
2016-07-10 08:22:04 +00:00
Dmitry Chagin
ab231b83ea Regen for r302215 (Linux personality). 2016-07-10 08:17:16 +00:00
Dmitry Chagin
23e8912c60 Implement Linux personality() system call mainly due to READ_IMPLIES_EXEC flag.
In Linux if this flag is set, PROT_READ implies PROT_EXEC for mmap().
Linux/i386 set this flag automatically if the binary requires executable stack.

READ_IMPLIES_EXEC flag will be used in the next Linux mmap() commit.
2016-07-10 08:15:50 +00:00
Nathan Whitehorn
96c85efb4b Replace a number of conflations of mp_ncpus and mp_maxid with either
mp_maxid or CPU_FOREACH() as appropriate. This fixes a number of places in
the kernel that assumed CPU IDs are dense in [0, mp_ncpus) and would try,
for example, to run tasks on CPUs that did not exist or to allocate too
few buffers on systems with sparse CPU IDs in which there are holes in the
range and mp_maxid > mp_ncpus. Such circumstances generally occur on
systems with SMT, but on which SMT is disabled. This patch restores system
operation at least on POWER8 systems configured in this way.

There are a number of other places in the kernel with potential problems
in these situations, but where sparse CPU IDs are not currently known
to occur, mostly in the ARM machine-dependent code. These will be fixed
in a follow-up commit after the stable/11 branch.

PR:		kern/210106
Reviewed by:	jhb
Approved by:	re (glebius)
2016-07-06 14:09:49 +00:00
Konstantin Belousov
5c2cf81845 Update comments for the MD functions managing contexts for new
threads, to make it less confusing and using modern kernel terms.

Rename the functions to reflect current use of the functions, instead
of the historic KSE conventions:
  cpu_set_fork_handler -> cpu_fork_kthread_handler (for kthreads)
  cpu_set_upcall -> cpu_copy_thread (for forks)
  cpu_set_upcall_kse -> cpu_set_upcall (for new threads creation)

Reviewed by:	jhb (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (hrs)
Differential revision:	https://reviews.freebsd.org/D6731
2016-06-16 12:05:44 +00:00
Dmitry Chagin
5437e1d103 Add macro to convert errno and use it when appropriate.
MFC after:	1 week
2016-05-22 12:46:34 +00:00
Dmitry Chagin
f26a190f65 Regen after r300359 (struct l_sched_param removal).
MFC after:	1 week
2016-05-21 08:03:13 +00:00
Dmitry Chagin
8cc96fb43a Correct an argument param of linux_sched_* system calls as a struct l_sched_param
does not defined due to it's nature.

MFC after:	1 week
2016-05-21 08:01:14 +00:00
Konstantin Belousov
0bfad8e4a3 Check for overflow and return EINVAL if detected. Backport this and
r300305 to i386.

PR:	209661
Reported and reviewed by:	cturt
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2016-05-20 19:50:32 +00:00
Sepherosa Ziehau
dfdc9a05c6 atomic: Add testandclear on i386/amd64
Reviewed by:	kib
Sponsored by:	Microsoft OSTC
Differential Revision:	https://reviews.freebsd.org/D6381
2016-05-16 07:19:33 +00:00
John Baldwin
8d791e5af1 Add a new bus method to fetch device-specific CPU sets.
bus_get_cpus() returns a specified set of CPUs for a device.  It accepts
an enum for the second parameter that indicates the type of cpuset to
request.  Currently two valus are supported:

 - LOCAL_CPUS (on x86 this returns all the CPUs in the package closest to
   the device when DEVICE_NUMA is enabled)
 - INTR_CPUS (like LOCAL_CPUS but only returns 1 SMT thread for each core)

For systems that do not support NUMA (or if it is not enabled in the kernel
config), LOCAL_CPUS fails with EINVAL.  INTR_CPUS is mapped to 'all_cpus'
by default.  The idea is that INTR_CPUS should always return a valid set.

Device drivers which want to use per-CPU interrupts should start using
INTR_CPUS instead of simply assigning interrupts to all available CPUs.
In the future we may wish to add tunables to control the policy of
INTR_CPUS (e.g. should it be local-only or global, should it ignore
SMT threads or not).

The x86 nexus driver exposes the internal set of interrupt CPUs from the
the x86 interrupt code via INTR_CPUS.

The ACPI bus driver and PCI bridge drivers use _PXM to return a suitable
LOCAL_CPUS set when _PXM exists and DEVICE_NUMA is enabled.  They also and
the global INTR_CPUS set from the nexus driver with the per-domain set from
_PXM to generate a local INTR_CPUS set for child devices.

Compared to the r298933, this version uses 'struct _cpuset' in
<sys/bus.h> instead of 'cpuset_t' to avoid requiring <sys/param.h>
(<sys/_cpuset.h> still requires <sys/param.h> for MAXCPU even though
<sys/_bitset.h> does not after recent changes).
2016-05-09 20:50:21 +00:00
John Baldwin
82cb5c3b5b Native PCI-express HotPlug support.
PCI-express HotPlug support is implemented via bits in the slot
registers of the PCI-express capability of the downstream port along
with an interrupt that triggers when bits in the slot status register
change.

This is implemented for FreeBSD by adding HotPlug support to the
PCI-PCI bridge driver which attaches to the virtual PCI-PCI bridges
representing downstream ports on HotPlug slots. The PCI-PCI bridge
driver registers an interrupt handler to receive HotPlug events. It
also uses the slot registers to determine the current HotPlug state
and drive an internal HotPlug state machine. For simplicty of
implementation, the PCI-PCI bridge device detaches and deletes the
child PCI device when a card is removed from a slot and creates and
attaches a PCI child device when a card is inserted into the slot.

The PCI-PCI bridge driver provides a bus_child_present which claims
that child devices are present on HotPlug-capable slots only when a
card is inserted. Rather than requiring a timeout in the RC for
config accesses to not-present children, the pcib_read/write_config
methods fail all requests when a card is not present (or not yet
ready).

These changes include support for various optional HotPlug
capabilities such as a power controller, mechanical latch,
electro-mechanical interlock, indicators, and an attention button.
It also includes support for devices which require waiting for
command completion events before initiating a subsequent HotPlug
command. However, it has only been tested on ExpressCard systems
which support surprise removal and have none of these optional
capabilities.

PCI-express HotPlug support is conditional on the PCI_HP option
which is enabled by default on arm64, x86, and powerpc.

Reviewed by:	adrian, imp, vangyzen (older versions)
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D6136
2016-05-05 22:26:23 +00:00
Roger Pau Monné
71718d3c73 xen/i386: enable the platform hypercall for i386
Not sure why the platform hypercall was disabled on i386, just enable it in
order to fix compilation of the PV timer on i386.

Sponsored by: Citrix Systems R&D
2016-05-03 08:05:14 +00:00
Konstantin Belousov
e1da986b54 Make it explicit that D_MEM cdevsw d_flag is to signify that the
driver is (or behaves identically to) /dev/mem.  Remove the D_MEM flag
from random drivers.

Note that currently the D_MEM flag does not affect any behaviour, but
this going to change in the next commit.

Noted and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
X-Differential revision:	https://reviews.freebsd.org/D6149
2016-05-01 17:46:56 +00:00
John Baldwin
e131ba36e8 Move 'device pci' for the PCI bus driver to the MI NOTES file.
The PCI bus was already listed in all of the MD NOTES files and the
driver should at least compile on all platforms.
2016-04-29 23:53:55 +00:00
Pedro F. Giffuni
b66bb393f2 Cleanup redundant parenthesis from existing howmany()/roundup() macro uses. 2016-04-22 16:57:42 +00:00
Pedro F. Giffuni
a380994fff Yet more redundant parenthesis from r298431.
Mea culpa.
2016-04-21 20:30:38 +00:00
Pedro F. Giffuni
d9c9c81c08 sys: use our roundup2/rounddown2() macros when param.h is available.
rounddown2 tends to produce longer lines than the original code
and when the code has a high indentation level it was not really
advantageous to do the replacement.

This tries to strike a balance between readability using the macros
and flexibility of having the expressions, so not everything is
converted.
2016-04-21 19:57:40 +00:00
Pedro F. Giffuni
ea24b0561f X86: use our nitems() macro when it is avaliable through param.h.
No functional change, only trivial cases are done in this sweep,

Discussed in:	freebsd-current
2016-04-19 23:41:46 +00:00
Sepherosa Ziehau
0c29fe6db8 hyperv: Deprecate HYPERV option by moving Hyper-V IDT vector into vmbus
Submitted by:	Jun Su <junsu microsoft com>
Reviewed by:	jhb, kib, sephe
Sponsored by:	Microsoft OSTC
Differential Revision:	https://reviews.freebsd.org/D5910
2016-04-15 02:20:18 +00:00
Pedro F. Giffuni
a3269b0863 x86: for pointers replace 0 with NULL.
These are mostly cosmetical, no functional change.

Found with devel/coccinelle.
2016-04-14 17:04:06 +00:00
John Baldwin
4478441145 Expose doreti as a global symbol on amd64 and i386.
doreti provides the common code path for returning from interrupt
andlers on x86.  Exposing doreti as a global symbol allows kernel
modules to include low-level interrupt handlers instead of requiring
all low-level handlers to be statically compiled into the kernel.

Submitted by:	Howard Su <howard0su@gmail.com>
Reviewed by:	kib
2016-04-13 17:37:31 +00:00
Andriy Gapon
0d63fc3ed8 re-enable AMD Topology extension on certain models if disabled by BIOS
Some BIOSes disable AMD Topology extension on AMD Family 15h notebook
processors.  We re-enable the extension, so that we can properly discover
core and cache topology.  Linux seems to do the same.

Reported by:	Johannes Dieterich <dieterich.joh@gmail.com>
Reviewed by:	jhb, kib
Tested by:	Johannes Dieterich <dieterich.joh@gmail.com>
		(earlier version)
MFC after:	3 weeks
Differential Revision:	https://reviews.freebsd.org/D5883
2016-04-12 13:30:39 +00:00
Baptiste Daroussin
b6348be7b9 Add kern.features flags for linux and linux64 modules
kern.features.linux: 1 meaning linux 32 bits binaries are supported
kern.features.linux64: 1 meaning linux 64 bits binaries are supported

The goal here is to help 3rd party applications (including ports) to determine
if the host do support linux emulation

Reviewed by:	dchagin
MFC after:	1 week
Relnotes:	yes
Differential Revision:	D5830
2016-04-05 22:36:48 +00:00
John Baldwin
2b1e924b69 Move i386/i386/autoconf.c to sys/x86/x86 and use it on both amd64 and i386. 2016-04-03 23:03:54 +00:00
Konstantin Belousov
0df87548b9 Type of the interrupt handlers on x86 cannot be expressed in C.
Simplify and unify placeholder type definitions.

Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D5771
2016-03-29 19:56:48 +00:00
Dmitry Chagin
7c5982000d Revert r297310 as the SOL_XXX are equal to the IPPROTO_XX except SOL_SOCKET.
Pointed out by:	ae@
2016-03-27 10:09:10 +00:00
Dmitry Chagin
c826fcfe22 iConvert Linux SOL_IPV6 level.
MFC after:	1 week
2016-03-27 08:12:01 +00:00
Alexander Motin
baa7dd65be Polish wbwd(4) driver and add more supported chips.
MFC after:	1 month
2016-03-24 20:52:35 +00:00
John Baldwin
7a2c1d8c60 Enable interrupts on the BSP once all PICs are initialized.
This moves the enabling of interrupts slightly earlier (the old location
was still before devices were enumerated and probed) and does it in the
interrupt code (rather than in the device configuration code).  This
also avoids tripping over an assertion on the first TLB shootdown with
earlier AP startup.

Reviewed by:	kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D5710
2016-03-24 00:24:07 +00:00
Dmitry Chagin
351cf753eb Regen for r297061 (fstatfs64 Linux syscall).
MFC after:	1 week
2016-03-20 13:23:01 +00:00
Dmitry Chagin
99546279d6 Implement fstatfs64 system call.
PR:		181012
Submitted by:	John Wehle
MFC after:	1 week
2016-03-20 13:21:20 +00:00
Konstantin Belousov
3ef966c4c0 The PKRU state size is 4 bytes, its support makes the XSAVE area size
non-multiple of 64 bytes.  Thereafter, the user state save area is
misaligned, which triggers assertion in the debugging kernels, or
segmentation violation on accesses for non-debugging configs.

Force the desired alignment of the user save area as the fix
(workaround is to disable bit 9 in the hw.xsave_mask loader tunable).
This correction is required for booting on the upcoming Intel' Purley
platform.

Reported and tested by:	"Pieper, Jeffrey E" <jeffrey.e.pieper@intel.com>,
	jimharris
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2016-03-15 15:42:53 +00:00
Justin Hibbits
c47476d7e6 Migrate many bus_alloc_resource() calls to bus_alloc_resource_anywhere().
Most calls to bus_alloc_resource() use "anywhere" as the range, with a given
count.  Migrate these to use the new bus_alloc_resource_anywhere() API.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D5370
2016-02-27 03:38:01 +00:00
Svatopluk Kraus
a1e1814d76 As <machine/pmap.h> is included from <vm/pmap.h>, there is no need to
include it explicitly when <vm/pmap.h> is already included.

Reviewed by:	alc, kib
Differential Revision:	https://reviews.freebsd.org/D5373
2016-02-22 09:02:20 +00:00
Ian Lepore
37d65816e1 Minor style cleanups.
Submitted by:		bde
2016-02-21 18:17:09 +00:00
John Baldwin
aa949be551 Convert ss_sp in stack_t and sigstack to void *.
POSIX requires these members to be of type void * rather than the
char * inherited from 4BSD.  NetBSD and OpenBSD both changed their
fields to void * back in 1998.  No new build failures were reported
via an exp-run.

PR:		206503 (exp-run)
Reviewed by:	kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D5092
2016-01-27 17:55:01 +00:00
Xin LI
669414e4fb Implement AT_SECURE properly.
AT_SECURE auxv entry has been added to the Linux 2.5 kernel to pass a
boolean flag indicating whether secure mode should be enabled. 1 means
that the program has changes its credentials during the execution.
Being exported AT_SECURE used by glibc issetugid() call.

Submitted by:	imp, dchagin
Security:	FreeBSD-SA-16:10.linux
Security:	CVE-2016-1883
2016-01-27 07:20:55 +00:00
Konstantin Belousov
2de38f7ec7 Adjust i386 comment to match amd64 one after r294311.
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-01-19 08:09:09 +00:00
Gleb Smirnoff
de44d808ef Regen after r293907. 2016-01-14 10:15:21 +00:00
Gleb Smirnoff
037f750877 Change linux get_robust_list system call to match actual linux one.
The set_robust_list system call request the kernel to record the head
of the list of robust futexes owned by the calling thread. The head
argument is the list head to record.
The get_robust_list system call should return the head of the robust
list of the thread whose thread id is specified in pid argument.
The list head should be stored in the location pointed to by head
argument.

In contrast, our implemenattion of get_robust_list system call copies
the known portion of memory pointed by recorded in set_robust_list
system call pointer to the head of the robust list to the location
pointed by head argument.

So, it is possible for a local attacker to read portions of kernel
memory, which may result in a privilege escalation.

Submitted by:	mjg
Security:	SA-16:03.linux
2016-01-14 10:13:58 +00:00
Dmitry Chagin
038c720553 Implement vsyscall hack. Prior to 2.13 glibc uses vsyscall
instead of vdso. An upcoming linux_base-c6 needs it.

Differential Revision:  https://reviews.freebsd.org/D1090

Reviewed by:	kib, trasz
MFC after:	1 week
2016-01-09 20:18:53 +00:00
Ed Maste
0e42ee5dd8 Move amd64 metadata.h to x86 and share with i386
MFC after:	1 week
2016-01-07 19:47:26 +00:00
Ian Lepore
69dcb7e771 Make the 'env' directive described in config(5) work on all architectures,
providing compiled-in static environment data that is used instead of any
data passed in from a boot loader.

Previously 'env' worked only on i386 and arm xscale systems, because it
required the MD startup code to examine the global envmode variable and
decide whether to use static_env or an environment obtained from the boot
loader, and set the global kern_envp accordingly.  Most startup code wasn't
doing so.  Making things even more complex, some mips startup code uses an
alternate scheme that involves calling init_static_kenv() to pass an empty
buffer and its size, then uses a series of kern_setenv() calls to populate
that buffer.

Now all MD startup code calls init_static_kenv(), and that routine provides
a single point where envmode is checked and the decision is made whether to
use the compiled-in static_kenv or the values provided by the MD code.

The routine also continues to serve its original purpose for mips; if a
non-zero buffer size is passed the routine installs the empty buffer ready
to accept kern_setenv() values.  Now if the size is zero, the provided buffer
full of existing env data is installed.  A NULL pointer can be passed if the
boot loader provides no env data; this allows the static env to be installed
if envmode is set to do so.

Most of the work here is a near-mechanical change to call the init function
instead of directly setting kern_envp.  A notable exception is in xen/pv.c;
that code was originally installing a buffer full of preformatted env data
along with its non-zero size (like mips code does), which would have allowed
kern_setenv() calls to wipe out the preformatted data.  Now it passes a zero
for the size so that the buffer of data it installs is treated as
non-writeable.
2016-01-02 02:53:48 +00:00
John Baldwin
9e8d8b4b0c Move shared variables from {amd64,i386}/initcpu.c to x86/identcpu.c.
While here, move the common bits of <machine/cputypes.h> to
<x86/cputypes.h> as well.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D4670
2015-12-23 21:41:42 +00:00
Konstantin Belousov
7c958a41fe Merge common parts of i386 and amd64 md_var.h and smp.h into
new headers x86/include x86_var.h and x86_smp.h.

Reviewed by:	emaste, jhb
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D4358
2015-12-07 17:41:20 +00:00
John Baldwin
d99775308a Set %esp correctly in the extended TSS.
The pcb is saved at the top of the kernel stack on x86 platforms.
The initial kenrel stack pointer is set in the TSS so that the trapframe
from user -> kernel transitions begins directly below the pcb and grows
down.

The XSAVE changes moved the FPU save area out of the pcb and into a
variable-sized area after the pcb.  This required updating the expressions
to calculate the initial stack pointer from 'stacktop - sizeof(pcb)' to
'stacktop - sizeof(pcb) + FPU save area size'.

The i386_set_ioperm() system call allows user applications to access
individual I/O ports via the I/O port permission bitmap in the TSS.
On FreeBSD this requires allocating a custom per-process TSS instead of
using the shared per-CPU TSS.

The expression to initialize the initial kernel stack pointer in the
per-process TSS created for i386_set_ioperm() was not properly updated
after the XSAVE changes.  Processes that used i386_set_ioperm() would
trash the trapframe during subsequent context switches resulting in
panics from memory corruption.

This changes fixes the kernel stack pointer calculation for the per-process
TSS.

Reviewed by:	kib, n_hibma
Reported by:	n_hibma
MFC after:	1 week
2015-12-07 16:27:11 +00:00
Conrad Meyer
10386b56ad pmap_invalidate_range: For very large ranges, flush the whole TLB
Typical TLBs have 40-512 entries available.  At some point, iterating
every single page in a requested invalidation range and issuing invlpg
on it is more expensive than flushing the TLB and allowing it to reload
on demand.

Broadwell CPUs have 1536 L2 TLB entries, so I've picked the arbitrary
number 4096 entries as a hueristic at which point we flush TLB rather
than invalidating every single potential page.

Reviewed by:	alc
Feedback from:	jhb, kib
MFC notes:	Depends on r291688
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D4280
2015-12-06 17:39:13 +00:00
Konstantin Belousov
1ac627629b In the pmap_set_pg() function, which enables the global bit on the
ptes mapping the kernel on CPUs where global TLB entries are
supported, revert to flushing only non-global entries, i.e. to the
pre-r291688 state.  There is no need to flush global TLB entries,
since only global entries created during the previous iterations of
the loop could exist at this moment.

Submitted by:	alc
Differential revision:	https://reviews.freebsd.org/D4368
2015-12-05 08:46:41 +00:00
Konstantin Belousov
27691a24ab For amd64 non-PCID machines, and for i386 machines with support for
the PG_G global pte flag, pmap_invalidate_all() fails to flush global
TLB entries [*].  This is because TLB shootdown handler for such
configs reloads CR3, and on i386 pmap_invalidate_all() does the same
for the initiating CPU.  Note that current code does not issue total
invalidation requests for the kernel_pmap.

Rename amd64 function invltlb_globpcid() to invltlb_glob(), it is not
specific for PCID for quite some time, and implement the same
functionality for i386.  Use the function instead of invltlb() in
shootdown handlers and in i386 pmap_invalidate_all(), but only for the
kernel pmap (which maps pages with the PG_G attribute set), which
takes care of PG_G TLB entries on flush.

To detect the affected pmap in i386 TLB shootdown handler, pmap should
be passed to the smp_masked_invltlb() function, which makes amd64 and
i386 TLB shootdown code almost identical.  Merge the code under x86/.

Noted by:	jhb [*]
Reviewed by:	cem, jhb, pho
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D4346
2015-12-03 11:14:14 +00:00
Konstantin Belousov
724f4b62b0 Remove sv_prepsyscall, sv_sigsize and sv_sigtbl members of the struct
sysent.

sv_prepsyscall is unused.

sv_sigsize and sv_sigtbl translate signal number from the FreeBSD
namespace into the ABI domain.  It is only utilized on i386 for iBCS2
binaries.  The issue with this approach is that signals for iBCS2 were
delivered with the FreeBSD signal frame layout, which does not follow
iBCS2.  The same note is true for any other potential user if
sv_sigtbl.  In other words, if ABI needs signal number translation, it
really needs custom sv_sendsig method instead.

Sponsored by:	The FreeBSD Foundation
2015-11-28 08:49:07 +00:00
Konstantin Belousov
493a48901f Disconnect iBCS2 emulator from the build. The ibcs2 option, the build
glue and the sources are not removed for now.

Discussed with:	emaste
Sponsored by:	The FreeBSD Foundation
2015-11-28 08:31:32 +00:00
Ed Maste
2e0002c18e Fix whitespace on addition of IPSEC option 2015-11-26 21:35:50 +00:00
Konstantin Belousov
5e27d79314 Split kerne timekeep ABI structure vdso_sv_tk out of the struct
sysentvec.  This allows the timekeep data to be shared between similar
ABIs which cannot share sysentvec.

Make the timekeep_push_vdso() tick callback to the timekeep structures
instead of sysentvecs.  If several sysentvec share the vdso_sv_tk
structure, we would update the userspace data several times on each
tick, without the change.

Only allocate vdso_sv_tk in the exec_sysvec_init() sysinit when
sysentvec is marked with the new SV_TIMEKEEP flag.  This saves
allocation and update of unneeded vdso_sv_tk for ABIs which do not
provide userspace gettimeofday yet, which are PowerPCs arches right
now.

Make vdso_sv_tk allocator public, namely split out and export
alloc_sv_tk() and alloc_sv_tk_compat32().  ABIs which share timekeep
data now can allocate it manually and share as appropriate.

Requested by:	nwhitehorn
Tested by:	nwhitehorn, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-11-23 07:09:35 +00:00
John Baldwin
645743ea99 Export various helper variables describing the layout and size of
certain kernel structures for use by debuggers. This mostly aids
in examining cores from a kernel without debug symbols as a debugger
can infer these values if debug symbols are available.

One set of variables describes the layout of 'struct linker_file' to
walk the list of loaded kernel modules.

A second set of variables describes the layout of 'struct proc' and
'struct thread' to walk the list of processes in the kernel and the
threads in each process.

The 'pcb_size' variable is used to index into the stoppcbs[] array.

The 'vm_maxuser_address' is used to distinguish kernel virtual addresses
from user addresses. This doesn't have to be perfect, and
'vm_maxuser_address' is a cheap and simple way to differentiate kernel
pointers from simple values like TIDs and PIDs.

While here, annotate the fields in struct pcb used by kgdb on amd64
and i386 to note that their ABI should be preserved.  Annotations for
other platforms will be added in the future.

Reviewed by:	kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D3773
2015-11-12 22:00:59 +00:00
Konstantin Belousov
05f1048743 The prefix for CLFLUSHOPT is 0x66. It was right on amd64.
Sponsored by:	The FreeBSD Foundation
2015-10-30 09:53:33 +00:00
John Baldwin
35aafbeda8 Use movw instead of movl (or plain mov) when moving segment registers
into memory.  This is a nop on clang's assembler, but some assemblers
complain if the size suffix is incorrect.

Submitted by:	bde
2015-10-29 21:25:46 +00:00
Hans Petter Selasky
8e9ef12d1e Build fix for i386/XBOX and pc98/GENERIC.
Reviewed by:	kib
2015-10-28 12:10:01 +00:00
Konstantin Belousov
af95bbf5bf Intel SDM before revision 56 described the CLFLUSH instruction as only
ordered with the MFENCE instruction.  Similar weak guarantees are also
specified by the AMD APM vol. 3 rev. 3.22.  x86 pmap methods
pmap_invalidate_cache_range() and pmap_invalidate_cache_pages() braced
CLFLUSH loop with MFENCE both before and after the loop.

In the revision 56 of SDM, Intel stated that all existing
implementations of CLFLUSH are strict, CLFLUSH instructions execution
is ordered WRT other CLFLUSH and writes.  Also, the strict behaviour
is made architectural.

A new instruction CLFLUSHOPT (which was documented for some time in
the Instruction Set Extensions Programming Reference) provides the
weak behaviour which was previously attributed to CLFLUSH.

Use CLFLUSHOPT when available.  When CLFLUSH is used on Intel CPUs, do
not execute MFENCE before and after the flushing loop.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
2015-10-24 21:37:47 +00:00
Konstantin Belousov
3f8e071052 Add CLFLUSHOPT instruction wrappers.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-10-23 11:45:38 +00:00
Roger Pau Monné
6a306bff7f x86/xen: Consolidate xen-os.h in a single place
amd64 and i386 platform code contain very similar xen/xen-os.h

The only differences are:
 - Functions/variables/types which were unused in i386/xen/xen-os.h:
    * xen_xchg
    * __xchg_dummy
    * __xg
    * __xchg
    * atomic_t
    * atomic_inc
    * rdtscll

The functions/variables/types unused in xen-os.h can be dropped and there
is no more differences betwen amd64 and i386.

The new header is placed in x86/include/xen and each platform will have
dummy headers include x86/xen/*.h. This is to be able to include
machine/xen/*.h in the PV drivers.

Submitted by:		Julien Grall <julien.grall@citrix.com>
Reviewed by:		royger
Differential Revision:	https://reviews.freebsd.org/D3880
Sponsored by:		Citrix Systems R&D
2015-10-21 10:04:35 +00:00
Alexander Motin
4a3760bae6 Remove compatibility shims for legacy ATA device names.
We got new ATA stack in FreeBSD 8.x, switched to it at 9.x, completely
removed old stack at 10.x, so at 11.x it is time to remove compat shims.
2015-10-11 13:01:51 +00:00
Conrad Meyer
1b8d039386 Fix missing semi-colon from r289055.
Obtained from:	mjg
Sponsored by:	EMC / Isilon Storage Division
2015-10-08 23:27:45 +00:00
Mateusz Guzik
3e15a670d2 linux: fix handling of out-of-bounds syscall attempts
Due to an off by one the code would read an entry past the table, as
opposed to the last entry which contains the nosys handler.

Reported by:	Pawel Biernacki <pawel.biernacki gmail.com>
2015-10-08 21:08:35 +00:00
Roger Pau Monné
a231723cc0 xen/console: Introduce a new console driver for Xen guest
The current Xen console driver is crashing very quickly when using it on
an ARM guest. This is because the console lock is recursive and it may
lead to recursion on the tty lock and/or corrupt the ring pointer.

Furthermore, the console lock is not always taken where it should be and has
to be released too early because of the way the console has been designed.

Over the years, code has been modified to support various new features but
the driver has not been reworked.

This new driver has been rewritten with the idea of only having a small set
of specific function to write either via the shared ring or the hypercall
interface.

Note that HVM support has been left aside for now because it requires
additional features which are not yet supported. A follow-up patch will be
sent with HVM guest support.

List of items that may be good to have but not mandatory:
 - Avoid to flush for each character written when using the tty
 - Support multiple consoles

Submitted by:		Julien Grall <julien.grall@citrix.com>
Reviewed by:		royger
Differential Revision:	https://reviews.freebsd.org/D3698
Sponsored by:		Citrix Systems R&D
2015-10-08 16:39:43 +00:00
Roger Pau Monné
1a52c10530 Update Xen headers from 4.2 to 4.6
Pull the latest headers for Xen which allow us to add support for ARM and
use new features in FreeBSD.

This is a verbatim copy of the xen/include/public so every headers which
don't exits anymore in the Xen repositories have been dropped.

Note the interface version hasn't been bumped, it will be done in a
follow-up. Although, it requires fix in the code to get it compiled:

 - sys/xen/xen_intr.h: evtchn_port_t is already defined in the headers so
   drop it.

 - {amd64,i386}/include/intr_machdep.h: NR_EVENT_CHANNELS now depends on
   xen/interface/event_channel.h, so include it.

 - {amd64,i386}/{amd64,i386}/support.S: It's not neccessary to include
   machine/intr_machdep.h. This is also fixing build compilation with the
   new headers.

 - dev/xen/blkfront/blkfront.c: The typedef for blkif_request_segmenthas
   been dropped. So directly use struct blkif_request_segment

Finally, modify xen/interface/xen-compat.h to throw a preprocessing error if
__XEN_INTERFACE_VERSION__ is not set. This is allow us to catch any file
where xen/xen-os.h is not correctly included.

Submitted by:		Julien Grall <julien.grall@citrix.com>
Reviewed by:		royger
Differential Revision:	https://reviews.freebsd.org/D3805
Sponsored by:		Citrix Systems R&D
2015-10-06 11:29:44 +00:00
Alan Cox
9f86aba61c Exploit r288122 to address a cosmetic issue. Since PV chunk pages don't
belong to a vm object, they can't be paged out.  Since they can't be paged
out, they are never enqueued in a paging queue.  Nonetheless, passing
PQ_INACTIVE to vm_page_unwire() creates the appearance that these pages
are being enqueued in the inactive queue.  As of r288122, we can avoid
this false impression by passing PQ_NONE.

Submitted by:	kmacy (an earlier version)
Differential Revision:	https://reviews.freebsd.org/D1674
2015-09-26 07:18:05 +00:00
Konstantin Belousov
cff8c6f2d1 Add support for weak symbols to the kernel linkers. It means that
linkers no longer raise an error when undefined weak symbols are
found, but relocate as if the symbol value was 0.  Note that we do not
repeat the mistake of userspace dynamic linker of making the symbol
lookup prefer non-weak symbol definition over the weak one, if both
are available.  In fact, kernel linker uses the first definition
found, and ignores duplicates.

Signature of the elf_lookup() and elf_obj_lookup() functions changed
to split result/error code and the symbol address returned.
Otherwise, it is impossible to return zero address as the symbol
value, to MD relocation code.  This explains the mechanical changes in
elf_machdep.c sources.

The powerpc64 R_PPC_JMP_SLOT handler did not checked error from the
lookup() call, the patch leaves the code as is (untested).

Reported by:	glebius
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-09-20 01:27:59 +00:00
Mark Johnston
610141cebb Add stack_save_td_running(), a function to trace the kernel stack of a
running thread.

It is currently implemented only on amd64 and i386; on these
architectures, it is implemented by raising an NMI on the CPU on which
the target thread is currently running. Unlike stack_save_td(), it may
fail, for example if the thread is running in user mode.

This change also modifies the kern.proc.kstack sysctl to use this function,
so that stacks of running threads are shown in the output of "procstat -kk".
This is handy for debugging threads that are stuck in a busy loop.

Reviewed by:	bdrewery, jhb, kib
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D3256
2015-09-11 03:54:37 +00:00
Mark Johnston
4db79feb8f Merge stack(9) implementations for i386 and amd64 under x86/.
Reviewed by:	jhb, kib
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D3255
2015-09-11 03:24:07 +00:00
Konstantin Belousov
1fa6712471 Do not hold the process around the vm_fault() call from the trap()s.
The only operation which is prevented by the hold is the kernel stack
swapout for the faulted thread, which should be fine to allow.

Remove useless checks for NULL curproc or curproc->p_vmspace from the
trap_pfault() wrappers on x86 and powerpc.

Reviewed by:	alc (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-09-10 17:46:48 +00:00
Roger Pau Monné
e8234cfef6 preload_search_info: make sure mod is set
Add a check to preload_search_info to make sure mod is set. Most of the
callers of preload_search_info don't check that the mod parameter is
set, which can cause page faults. While at it, remove some now unnecessary
checks before calling preload_search_info.

Sponsored by:		Citrix Systems R&D
Reviewed by:		kib
Differential Revision:	https://reviews.freebsd.org/D3440
2015-08-21 15:57:57 +00:00
Marcel Moolenaar
7ef5e8bc80 Better support memory mapped console devices, such as VGA and EFI
frame buffers and memory mapped UARTs.

1.  Delay calling cninit() until after pmap_bootstrap(). This makes
    sure we have PMAP initialized enough to add translations. Keep
    kdb_init() after cninit() so that we have console when we need
    to break into the debugger on boot.
2.  Unfortunately, the ATPIC code had be moved as well so as to
    avoid a spurious trap #30. The reason for which is not known
    at this time.
3.  In pmap_mapdev_attr(), when we need to map a device prior to the
    VM system being initialized, use virtual_avail as the KVA to map
    the device at. In particular, avoid using the direct map on amd64
    because we can't demote by virtue of not being able to allocate
    yet. Keep track of the translation.
    Re-use the translation after the VM has been initialized to not
    waste KVA and to satisfy the assumption in uart(4) that the handle
    returned for the low-level console is the same as later returned
    when the device is probed and attached.
4.  In pmap_unmapdev() remove the mapping from the table when called
    pre-init. Otherwise keep the mapping. During bus probe and attach
    device resources are mapped and unmapped multiple times, which
    would have us destroy the mapping used by the low-level console.
5.  In pmap_init(), set pmap_initialized to signal that we're not
    pre-init anymore. On amd64, bring the direct map in sync with the
    translations created at that time.
6.  Implement bus_space_map() and bus_space_unmap() for real: when
    the tag corresponds to memory space, call the corresponding
    pmap_mapdev() and pmap_unmapdev() functions to construct and
    actual handle.
7.  In efifb.c and vt_vga.c, remove the crutches and hacks and simply
    call pmap_mapdev_attr() or bus_space_map() as desired.

Notes:
1.  uart(4) already used bus_space_map() during low-level console
    setup but since serial ports have traditionally been I/O port
    based, the lack of a proper implementation for said function
    was not a problem. It has always supported memory mapped UARTs
    for low-level consoles by setting hw.uart.console accordingly.
2.  The use of the direct map on amd64 without setting caching
    attributes has been a bigger problem than previously thought.
    This change has the fortunate (and unexpected) side-effect of
    fixing various EFI frame buffer problems (though not all).

PR: 191564, 194952

Special thanks to:
1.  XipLink, Inc -- generously donated an Intel Bay Trail E3800
    based eval board (ADLE3800PC).
2.  The FreeBSD Foundation, in particular emaste@ -- for UEFI
    support in general and testing.
3.  Everyone who tested the proposed for PR 191564.
4.  jhb@ and kib@ for being a soundboard and applying a clue bat
    if so needed.
2015-08-12 15:26:32 +00:00
Konstantin Belousov
0e190a486f Initialization of smp_tlb_wait does not require release semantic, no
data is synchronized by store/load to the variable.  The
lapic_write_icr() function ensures that store buffers are flushed
before IPI command is issued.

Discussed with:	bde
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-08-12 09:46:39 +00:00
Konstantin Belousov
c77d57c8b4 AP should load aps_ready with acquire semantic to see BSP updates to
the SMP structures, synchronized with the load by release store in
release_aps().

The change is formal, x86 strong memory model implicitely provided
the guarantees.

Discussed with:	bde
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-08-12 09:43:12 +00:00
Konstantin Belousov
edc8222303 Make kstack_pages a tunable on arm, x86, and powepc. On i386, the
initial thread stack is not adjusted by the tunable, the stack is
allocated too early to get access to the kernel environment. See
TD0_KSTACK_PAGES for the thread0 stack sizing on i386.

The tunable was tested on x86 only.  From the visual inspection, it
seems that it might work on arm and powerpc.  The arm
USPACE_SVC_STACK_TOP and powerpc USPACE macros seems to be already
incorrect for the threads with non-default kstack size.  I only
changed the macros to use variable instead of constant, since I cannot
test.

On arm64, mips and sparc64, some static data structures are sized by
KSTACK_PAGES, so the tunable is disabled.

Sponsored by:	The FreeBSD Foundation
MFC after:	2 week
2015-08-10 17:18:21 +00:00
Konstantin Belousov
c3f62b5352 Remove unused i386 header privatespace.h. For the native kernel, its
use was removed in r173592 (Nov 2007), yet Xen PV bits continued
referencing the privatespace structure, and were removed in r282274
(Apr 2015).

Discussed with:	jhb
Sponsored by:	The FreeBSD Foundation
2015-08-07 05:59:58 +00:00
John Baldwin
3c790178c5 Remove some more vestiges of the Xen PV domu support. Specifically,
use vtophys() directly instead of vtomach() and retire the no-longer-used
headers <machine/xenfunc.h> and <machine/xenvar.h>.

Reported by:	bde (stale bits in <machine/xenfunc.h>)
Reviewed by:	royger (earlier version)
Differential Revision:	https://reviews.freebsd.org/D3266
2015-08-06 17:07:21 +00:00
Ed Maste
fc8c856029 Rationalize BSD license on sys/*/include/in_cksum.h
Remove the advertising clause from the Regents of the University of
California's license, per the letter dated July 22, 1999.

Update clause numbering.
2015-08-05 19:05:12 +00:00
Konstantin Belousov
c8fbdcc10d Fix UP build after r286296, ensure that CPU_FOREACH() is defined.
Sponsored by:	The FreeBSD Foundation
2015-08-05 10:50:33 +00:00
Jason A. Harmening
713841afb2 Add two new pmap functions:
vm_offset_t pmap_quick_enter_page(vm_page_t m)
void pmap_quick_remove_page(vm_offset_t kva)

These will create and destroy a temporary, CPU-local KVA mapping of a specified page.

Guarantees:
--Will not sleep and will not fail.
--Safe to call under a non-sleepable lock or from an ithread

Restrictions:
--Not guaranteed to be safe to call from an interrupt filter or under a spin mutex on all platforms
--Current implementation does not guarantee more than one page of mapping space across all platforms. MI code should not make nested calls to pmap_quick_enter_page.
--MI code should not perform locking while holding onto a mapping created by pmap_quick_enter_page

The idea is to use this in busdma, for bounce buffer copies as well as virtually-indexed cache maintenance on mips and arm.

NOTE: the non-i386, non-amd64 implementations of these functions still need review and testing.

Reviewed by:	kib
Approved by:	kib (mentor)
Differential Revision:	http://reviews.freebsd.org/D3013
2015-08-04 19:46:13 +00:00
Konstantin Belousov
3208d3ff46 Give large kernel stack to the initial thread . Otherwise, ZFS
overflows the stack during root mount in some configurations.

Tested by:	Fabian Keil <freebsd-listen@fabiankeil.de> (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-08-04 13:50:52 +00:00
Warner Losh
75333e6435 Add pmspvc device back to GENERIC. The issues with the device playing
grabby hands with other driver's devices has been solved.

MFC After: 3 weeks
2015-08-03 13:49:46 +00:00
Konstantin Belousov
f94cc23475 Clear the IA32_MISC_ENABLE MSR bit, which limits the max CPUID
reported, on APs.  We already did this on BSP.

Otherwise, the userspace software which depends on the features
reported by the high CPUID levels is misbehaving.  In particular, AVX
detection is non-functional, depending on which CPU thread happens to
execute when doing CPUID.  Another victim is the libthr signal
handlers interposer, which needs to save full FPU extended state.

Reported and tested by:	Andre Meiser <ortadur@web.de>
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-08-03 12:14:42 +00:00
Glen Barber
45e1c1a38d Pull pmspcv (pms(4)) from GENERIC. It has PCI ID conflicts
with ahd(4), mvs(4), and likely other drivers.

MFC after:	immediately
With hat:	re
Sponsored by:	The FreeBSD Foundation
2015-07-31 15:23:48 +00:00
Konstantin Belousov
0b6476ec5b Improve comments.
Submitted by:	bde
MFC after:	2 weeks
2015-07-30 15:47:53 +00:00
Konstantin Belousov
48cae112b5 Use private cache line for the locked nop in *mb() on i386.
Suggested by:	alc
Reviewed by:	alc, bde
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-07-30 00:13:20 +00:00
Konstantin Belousov
dd5b64258f MFamd64 r285934: Remove store/load (= full) barrier from the i386
atomic_load_acq_*().

Noted by:	alc (long time ago)
Reviewed by:	alc, bde
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-07-29 23:59:17 +00:00
Sean Bruno
79855a57e2 Remove dead functions pmap_pvdump and pads.
Differential Revision:	D3206
Submitted by:	kevin.bowling@kev009.com
Reviewed by:	alc
2015-07-29 20:47:27 +00:00
Konstantin Belousov
c48c590f63 Remove duplicate and useless declarations.
Submitted by:	bde
2015-07-22 09:12:40 +00:00
John Baldwin
9a2d6ab990 Various changes to the registers displayed in DDB for x86.
- Fix segment registers to only display the low 16 bits.
- Remove unused handlers and entries for the debug registers.
- Display xcr0 (if valid) in 'show sysregs'.
- Add '0x' prefix to MSR values to match other values in 'show sysregs'.
- MFamd64: Display various MSRs in 'show sysregs'.
- Add a 'show dbregs' to display the value of debug registers.
- Dynamically size the column width for register values to properly
  align columns on 64-bit platforms.
- Display %gs for i386 in 'show registers'.

Differential Revision:	https://reviews.freebsd.org/D2784
Reviewed by:	kib, markj
MFC after:	2 weeks
2015-07-22 01:09:02 +00:00
Mark Johnston
a5cbf8b9c0 Let the unwinder handle faults during function prologues or epilogues.
The i386 and amd64 DDB stack unwinders contain code to detect and handle
the case where the first frame is not completely set up or torn down. This
code was accidentally unused however, since db_backtrace() was never called
with a non-NULL trap frame. This change fixes that.

Also remove get_rsp() from the amd64 code. It appears to have come from
i386, which needs to take into account whether the exception triggered a
CPL switch, since SS:ESP is only pushed onto the stack if so. On amd64,
SS:RSP is pushed regardless, so get_rsp() was doing the wrong thing for
kernel-mode exceptions. As a result, we can also remove custom print
functions for these registers.

Reviewed by:	jhb
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D2881
2015-07-21 23:22:23 +00:00
Mark Johnston
f8a757d016 Improve stack unwinding on i386 and amd64 after an IP fault.
If we can't find a symbol corresponding to the faulting instruction, assume
that the previously-executed function is a call and attempt to find the
calling function using the return address on the stack. Otherwise we end
up associating the last stack frame with the current call, which is
incorrect and causes the unwinder to skip printing of the calling function,
resulting in a confusing backtrace.

Reviewed by:	jhb
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D2859
2015-07-21 23:13:11 +00:00
Mark Johnston
32cd0147fa Implement the lockstat provider using SDT(9) instead of the custom provider
in lockstat.ko. This means that lockstat probes now have typed arguments and
will utilize SDT probe hot-patching support when it arrives.

Reviewed by:	gnn
Differential Revision:	https://reviews.freebsd.org/D2993
2015-07-19 22:14:09 +00:00
Konstantin Belousov
a8e1bc2e14 Revert bit of the r285627, locore.s does not need include of
opt_kstack_pages.h.  The asm gets the right KSTACK_PAGES from the
assym.s.

Reported by:	bz
Sponsored by:	The FreeBSD Foundation
2015-07-19 10:45:58 +00:00
Benno Rice
eacbeb2b95 Merge driver for PMC Sierra's range of SAS/SATA HBAs.
Submitted by:	Achim Leubner <Achim.Leubner@pmcs.com>
Reviewed by:	scottl
2015-07-17 23:30:43 +00:00
Konstantin Belousov
888e282ab4 When checking for the valid value of the frame pointer, verify that it
belongs to the kernel stack address range for the thread.  Right now,
code checks that new frame is not farther then KSTACK_PAGES pages from
the current frame, which allows the address to point past the top of
the stack.

Reviewed by:	andrew, emaste, markj
Differential revision:	https://reviews.freebsd.org/D3108
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-07-16 19:40:18 +00:00
Zbigniew Bodek
721555e7ee Fix KSTACK_PAGES issue when the default value was changed in KERNCONF
If KSTACK_PAGES was changed to anything alse than the default,
the value from param.h was taken instead in some places and
the value from KENRCONF in some others. This resulted in
inconsistency which caused corruption in SMP envorinment.

Ensure all places where KSTACK_PAGES are used the opt_kstack_pages.h
is included.

The file opt_kstack_pages.h could not be included in param.h
because was breaking the toolchain compilation.

Reviewed by:   kib
Obtained from: Semihalf
Sponsored by:  The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D3094
2015-07-16 10:46:52 +00:00
Christian Brueffer
f4c1eac7cd Spell crypto correctly. 2015-07-14 10:47:56 +00:00
Konstantin Belousov
85237e335d Convert between abridged (from FXSAVE) and unabridged (from FSAVE)
versions of the x87 tags.  The conversion is naive, used abridged tag
is converted to valid unabridged, without additional checks for zero
and special values.

Noted by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-07-10 09:20:13 +00:00
Konstantin Belousov
bcfc2be186 Duplicate the copyright from the i386/i386/machdep.c into
i386/include/frame.h after a code was moved from machdep.c to frame.h
in r284925.

Use include guards style similar to other guards.

Noted by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-07-10 09:15:06 +00:00
John-Mark Gurney
e808e13b8b Now that aesni won't reuse fpu contexts (D3016), add seatbelts to the
fpu code to prevent other reuse of the contexts in the future...

Differential Revision:        https://reviews.freebsd.org/D3015
Reviewed by:	kib, gnn
2015-07-08 19:26:36 +00:00
Konstantin Belousov
8954a9a4e6 Add the atomic_thread_fence() family of functions with intent to
provide a semantic defined by the C11 fences with corresponding
memory_order.

atomic_thread_fence_acq() gives r | r, w, where r and w are read and
write accesses, and | denotes the fence itself.

atomic_thread_fence_rel() is r, w | w.

atomic_thread_fence_acq_rel() is the combination of the acquire and
release in single operation.  Note that reads after the acq+rel fence
could be made visible before writes preceeding the fence.

atomic_thread_fence_seq_cst() orders all accesses before/after the
fence, and the fence itself is globally ordered against other
sequentially consistent atomic operations.

Reviewed by:	alc
Discussed with:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
2015-07-08 18:12:24 +00:00
Achim Leubner
4e1bc9a039 Driver 'pmspcv' added. Supports PMC-Sierra PM8001/8081/8088/8089/8074/8076/8077 SAS/SATA HBA Controllers. 2015-07-07 13:17:02 +00:00
George V. Neville-Neil
0661a7c224 Fix up tabs vs. spaces 2015-07-04 20:31:06 +00:00
George V. Neville-Neil
3839369c03 Enable IPSEC in all GENERIC kernels.
Universe and kernel build tests passed 4 July 2015

PR:		128030
Sponsored by:	Rubicon Communications (Netgate)
2015-07-04 17:37:00 +00:00
Konstantin Belousov
6fdfd88220 Use single instance of the identical INKERNEL() and PMC_IN_KERNEL()
macros on amd64 and i386.  Move the definition to machine/param.h.
kgdb defines INKERNEL() too, the conflict is resolved by renaming kgdb
version to PINKERNEL().

On i386, correct the lowest kernel address.  After the shared page was
introduced, USRSTACK no longer points to the last user address + 1 [*]

Submitted by:	Oliver Pinter [*]
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-07-02 14:37:21 +00:00
Konstantin Belousov
b24de3ac40 Provide npx_get_fsave(9) and npx_set_fsave(9) functions to obtain and
restore the FPU state from the format of machine FSAVE area.  The
intended use is for ABI emulators to provide FSAVE-formatted FPU state
to usermode requiring it, while kernel could use FXSAVE due to
XMM/XSAVE.

The core functionality to convert from/to FXSAVE format is shared with
the fill_fpregs_xmm() and set_fpregs_xmm().  Move the later functions
to npx.c and rename them to npx_fill_fpregs_xmm() and
npx_set_fpregs_xmm().  They differ from nptx_get/set_fsave(9) since
our mcontext contains padding to be zeroed or ignored.

fill_fpregs() and set_fpregs() could be converted to use the new
interface, but there are small differences to handle.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-06-29 12:06:36 +00:00
Konstantin Belousov
f9343dacbd Move CS_SECURE() and EFL_SECURE() macros to the machine/frame.h. They
are useful for most implementations of sendsig().

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-06-29 10:35:00 +00:00
Konstantin Belousov
3ac3c0f269 Add a comment about too strong semantic of atomic_load_acq() on x86.
Submitted by:	bde
MFC after:	2 weeks
2015-06-29 09:58:40 +00:00
Konstantin Belousov
1817023775 Add x86 PT_GETFSBASE, PT_GETGSBASE machine-depended ptrace requests to
obtain the thread %fs and %gs bases.  Add x86 PT_SETFSBASE and
PT_SETGSBASE requests to set the bases from debuggers.  The set
requests, similarly to the sysarch({I386,AMD64}_SET_FSBASE),
override the corresponding segment registers.

The main purpose of the operations is to retrieve and modify the tcb
address for debuggee.

Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-06-29 07:07:24 +00:00
Konstantin Belousov
06d058bd23 Reduce code duplication. Add helper fill_based_sd(9) which creates a
based user data descriptor covering whole VA.

Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-06-29 06:59:08 +00:00
Konstantin Belousov
7626d062c3 Remove unneeded data dependency, currently imposed by
atomic_load_acq(9), on it source, for x86.

Right now, atomic_load_acq() on x86 is sequentially consistent with
other atomics, code ensures this by doing store/load barrier by
performing locked nop on the source.  Provide separate primitive
__storeload_barrier(), which is implemented as the locked nop done on
a cpu-private variable, and put __storeload_barrier() before load, to
keep seq_cst semantic but avoid introducing false dependency on the
no-modification of the source for its later use.

Note that seq_cst property of x86 atomic_load_acq() is not documented
and not carried by atomics implementations on other architectures,
although some kernel code relies on the behaviour.  This commit does
not intend to change this.

Reviewed by:	alc
Discussed with:	bde
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-06-28 05:04:08 +00:00
Mateusz Guzik
4da8456f0a Replace struct filedesc argument in getvnode with struct thread
This is is a step towards removal of spurious arguments.
2015-06-16 13:09:18 +00:00
John Baldwin
5eb95e11ba Report the values of x86 segment registers to remote debuggers.
While here, also report %eflags from the i386 trapframe.

Differential Revision:	https://reviews.freebsd.org/D2743
Reviewed by:	kib
Obtained from:	1 month
2015-06-12 15:14:08 +00:00
John Baldwin
9f9d9cf33e Ensure that the upper 16 bits of segment registers manually saved in
trapframes are cleared by explicitly pushing a zero and then moving
the segment register into the low 16 bits.  Certain Intel processors
treat a push of a segment register as a move of the segment register
into the low 16 bits leaving the upper 16 bits of the word in the
stack unchanged.

Reviewed by:	kib
MFC after:	1 month
2015-06-12 15:06:17 +00:00
Ruslan Bukin
4f4d15f0d0 Allow DTrace to be compiled-in to the kernel.
This will require for AArch64 as we dont have modules yet.

Sponsored by:	HEIF5
Sponsored by:	ARM Ltd.
Differential Revision:	https://reviews.freebsd.org/D1997
2015-06-10 15:53:39 +00:00
Mateusz Guzik
f6f6d24062 Implement lockless resource limits.
Use the same scheme implemented to manage credentials.

Code needing to look at process's credentials (as opposed to thred's) is
provided with *_proc variants of relevant functions.

Places which possibly had to take the proc lock anyway still use the proc
pointer to access limits.
2015-06-10 10:48:12 +00:00
Mateusz Guzik
4ea6a9a28f Generalised support for copy-on-write structures shared by threads.
Thread credentials are maintained as follows: each thread has a pointer to
creds and a reference on them. The pointer is compared with proc's creds on
userspace<->kernel boundary and updated if needed.

This patch introduces a counter which can be compared instead, so that more
structures can use this scheme without adding more comparisons on the boundary.
2015-06-10 10:43:59 +00:00
Alan Cox
65a9768f62 Account for superpage mappings that are created by pmap_copy(). 2015-06-09 18:04:28 +00:00
Dimitry Andric
90acdcbc88 Merge r283870 from amd64:
Remove unneeded NULL checks in trap_fatal().

Since td_name is an array member of struct thread, it can never be NULL,
so the check can be removed.  In addition, curproc can never be NULL,
so remove the if statement, and splice the two printfs() together.

While here, remove the u_long cast, and use the correct printf format
specifier for curproc->p_pid.

Requested by:	jhb
MFC after:	3 days
2015-06-08 20:12:44 +00:00
Alan Cox
966272ca33 Retire VM_FREEPOOL_CACHE as the next step in eliminating PG_CACHE pages.
Differential Revision:	https://reviews.freebsd.org/D2712
Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
2015-06-08 04:59:32 +00:00
Konstantin Belousov
32a1e9e4a5 Update print_INTEL_TLB() by the tag values from the Intel SDM
rev. 55.  The modern CPUs cache and TLB descriptions looked quite
questionable without the update, e.g. Haswell i7 4770S reported:
	Data TLB: 4 KB pages, 4-way set associative, 64 entries
	L2 cache: 256 kbytes, 8-way associative, 64 bytes/line
After the update, the report is:
	Data TLB: 1 GByte pages, 4-way set associative, 4 entries
	Data TLB: 4 KB pages, 4-way set associative, 64 entries
	Instruction TLB: 2M/4M pages, fully associative, 8 entries
	Instruction TLB: 4KByte pages, 8-way set associative, 64 entries
	64-Byte prefetching
	Shared 2nd-Level TLB: 4 KByte/2MByte pages, 8-way associative, 1024 entries
	L2 cache: 256 kbytes, 8-way associative, 64 bytes/line
Some tags were apparently removed from the table 3-21, Vol. 2A.  Keep
them around, but add a comment stating the removal.

Update the format line for cpu_stdext_feature according to the bits
from the SDM rev.55.  It appears that Haswells do not store %cs and
%ds values in the FPU save area.

Store content of the %ecx register from the CPUID leaf 0x7
subleaf 0 as cpu_stdext_feature2 and print defined bits from it,
again acording to SDM rev. 55.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-06-06 22:03:24 +00:00
Dmitry Chagin
d707582f83 When I merged the lemul branch I missied kib@'s r282708 commit.
This is not the final fix as I need properly cleanup thread resources
before other threads suicide.

Tested by:	Ruslan Makhmatkhanov
2015-05-25 20:44:46 +00:00
Dmitry Chagin
5c5aac2d45 Regen for r283492. 2015-05-24 18:09:01 +00:00
Dmitry Chagin
9802eb9ebc Implement Linux specific syncfs() system call. 2015-05-24 18:08:01 +00:00
Dmitry Chagin
c532a88cfc Regen for r283488. 2015-05-24 18:05:21 +00:00
Dmitry Chagin
e1ff74c0f7 Implement recvmmsg() and sendmmsg() system calls. 2015-05-24 18:04:04 +00:00
Dmitry Chagin
b7aaa9fdb0 Reduce duplication between MD Linux code by moving msg related
struct definitions out into the compat/linux/linux_socket.h
2015-05-24 18:03:14 +00:00
Dmitry Chagin
3ce05165b1 Regen for r283484. 2015-05-24 18:02:17 +00:00
Dmitry Chagin
6e4c8004dc Implement epoll_pwait() system call. 2015-05-24 18:00:14 +00:00
Dmitry Chagin
ca045164a7 Regen for r283480. 2015-05-24 17:58:24 +00:00
Dmitry Chagin
19d8b461f4 Add utimensat() system call.
The patch developed by Jilles Tjoelker and Andrew Wilcox and
adopted for lemul branch by me.
2015-05-24 17:57:07 +00:00
Dmitry Chagin
0c38abc250 The kernel sends signals to the processes via ABI specific sv_sendsig method.
Native ABI do not need signal conversion, only emulators may want this. Usually
emulators implements its own sv_sendsig method. For now only ibcs2 emulator does
not have own sv_sendsig implementation and depends on native sendsig() method.
So, remove any extra attempts to convert signal numbers from native sendsig()
methods except from i386 where ibsc2 is living.
2015-05-24 17:56:02 +00:00
Dmitry Chagin
4ab7403bbd Rework signal code to allow using it by other modules, like linprocfs:
1. Linux sigset always 64 bit on all platforms. In order to move Linux
sigset code to the linux_common module define it as 64 bit int. Move
Linux sigset manipulation routines to the MI path.

2. Move Linux signal number definitions to the MI path. In general, they
are the same on all platforms except for a few signals.

3. Map Linux RT signals to the FreeBSD RT signals and hide signal conversion
tables to avoid conversion errors.

4. Emulate Linux SIGPWR signal via FreeBSD SIGRTMIN signal which is outside
of allowed on Linux signal numbers.

PR:		197216
2015-05-24 17:47:20 +00:00
Dmitry Chagin
a7ac457613 According to Linux man sigaltstack(3) shall return EINVAL if the ss
argument is not a null pointer, and the ss_flags member pointed to by ss
contains flags other than SS_DISABLE. However, in fact, Linux also
allows SS_ONSTACK flag which is simply ignored.

For buggy apps (at least mono) ignore other than SS_DISABLE
flags as a Linux do.

While here move MI part of sigaltstack code to the appropriate place.

Reported by:	abi at abinet dot ru
2015-05-24 17:44:08 +00:00
Dmitry Chagin
657100de57 Regen for r283467. 2015-05-24 17:39:18 +00:00
Dmitry Chagin
fcdffc03f8 Call nosys in case when the incorrect syscall number is specified.
Reported by:	trinity
2015-05-24 17:38:02 +00:00
Dmitry Chagin
8d939ad405 Regen for r283465. 2015-05-24 17:35:42 +00:00
Dmitry Chagin
b6aeb7d5dd Add preliminary fallocate system call implementation
to emulate posix_fallocate() function.

Differential Revision:	https://reviews.freebsd.org/D1523
Reviewed by:	emaste
2015-05-24 17:33:21 +00:00
Dmitry Chagin
274d2df2e5 Regen for r283451. 2015-05-24 17:00:43 +00:00
Dmitry Chagin
a6b40812ec Implement ppoll() system call.
Differential Revision:	https://reviews.freebsd.org/D1105
Reviewed by:	trasz
2015-05-24 16:59:25 +00:00
Dmitry Chagin
22f3dfdc12 Regen for r283444. 2015-05-24 16:50:17 +00:00
Dmitry Chagin
a31d76867d Implement eventfd system call.
Differential Revision:	https://reviews.freebsd.org/D1094
In collaboration with:	Jilles Tjoelker
2015-05-24 16:49:14 +00:00
Dmitry Chagin
3e89b64168 Put the correct value for the abi_nfdbits parameter of kern_select() for
all supported Linuxulators.

Differential Revision:	https://reviews.freebsd.org/D1093
Reviewed by:	trasz
2015-05-24 16:47:13 +00:00
Dmitry Chagin
28fb55359b Regen for r283441. 2015-05-24 16:42:49 +00:00
Dmitry Chagin
e16fe1c730 Implement epoll family system calls. This is a tiny wrapper
around kqueue() to implement epoll subset of functionality.
The kqueue user data are 32bit on i386 which is not enough for
epoll user data, so we keep user data in the proc emuldata.

Initial patch developed by rdivacky@ in 2007, then extended
by Yuri Victorovich @ r255672 and finished by me
in collaboration with mjg@ and jillies@.

Differential Revision:	https://reviews.freebsd.org/D1092
2015-05-24 16:41:39 +00:00
Dmitry Chagin
4d0f380d87 To avoid code duplication move open/fcntl definitions to the MI
header file.

Differential Revision:	https://reviews.freebsd.org/D1087
Reviewed by:	trasz
2015-05-24 16:31:44 +00:00
Dmitry Chagin
26c68e1fe5 Use the BSD_TO_LINUX_SIGNAL() wherever there is no need
to check the ABI as it is known.

Differential Revision:	https://reviews.freebsd.org/D1086
2015-05-24 16:30:23 +00:00
Dmitry Chagin
437c43c1cb Being exported through vdso the note.Linux section used by glibc
to determine the kernel version (this saves one uname call).
Temporarily disable the export of a note.Linux section until I figured
out how to change the kernel version in the note.Linux on the fly.

Differential Revision:	https://reviews.freebsd.org/D1081
Reviewed by:	trasz
2015-05-24 16:25:44 +00:00
Dmitry Chagin
4048f59cd0 Add AT_RANDOM and AT_EXECFN auxiliary vector entries which are used by
glibc. At list since glibc version 2.16 using AT_RANDOM is mandatory.

Differential Revision:	https://reviews.freebsd.org/D1080
2015-05-24 16:24:24 +00:00
Dmitry Chagin
0a1884d768 Regen for r283428. 2015-05-24 16:19:57 +00:00
Dmitry Chagin
baa232bbfd Change linux faccessat syscall definition to match actual linux one.
The AT_EACCESS and AT_SYMLINK_NOFOLLOW flags are actually implemented
within the glibc wrapper function for faccessat().  If either of these
flags are specified, then the wrapper function employs fstatat() to
determine access permissions.

Differential Revision:	https://reviews.freebsd.org/D1078
Reviewed by:	trasz
2015-05-24 16:18:03 +00:00
Dmitry Chagin
67d3974849 Introduce a new module linux_common.ko which is intended for the
following primary purposes:

1. Remove the dependency of linsysfs and linprocfs modules from linux.ko,
which will be architecture specific on amd64.

2. Incorporate into linux_common.ko general code for platforms on which
we'll support two Linuxulator modules (for both instruction set - 32 & 64 bit).

3. Move malloc(9) declaration to linux_common.ko, to enable getting memory
usage statistics properly.

Currently linux_common.ko incorporates a code from linux_mib.c and linux_util.c
and linprocfs, linsysfs and linux kernel modules depend on linux_common.ko.

Temporarily remove dtrace garbage from linux_mib.c and linux_util.c

Differential Revision:	https://reviews.freebsd.org/D1072
In collaboration with:	Vassilis Laganakos.

Reviewed by:	trasz
2015-05-24 15:51:18 +00:00
Dmitry Chagin
31eb438886 x86_64 Linux do not use multiplexing on ipc system calls.
Move struct ipc_perm definition to the MD path as it differs for 64 and
32 bit platform.

Differential Revision:	https://reviews.freebsd.org/D1068
Reviewed by:	trasz
2015-05-24 15:44:41 +00:00
Dmitry Chagin
26cf41d6ca Remove stale comment about a signal trampoline which
is moved to the shared page at r219609.

Differential Revision:	https://reviews.freebsd.org/D1063
Reviewed by:	trasz
2015-05-24 15:32:52 +00:00
Dmitry Chagin
0020bdf13a Put linux_platform into the vdso to avoid copying it onto the stack at
every exec.

Differential Revision:	https://reviews.freebsd.org/D1062
Reviewed by:	trasz
2015-05-24 15:30:52 +00:00
Dmitry Chagin
32084836c0 Eliminate a now unused global declaration of elf_linux_sysvec.
Differential Revision:	https://reviews.freebsd.org/D1061
Reviewed by:	trasz
2015-05-24 15:29:20 +00:00
Dmitry Chagin
bdc379344a Implement vdso - virtual dynamic shared object. Through vdso Linux
exposes functions from kernel with proper DWARF CFI information so that
it becomes easier to unwind through them.
Using vdso is a mandatory for a thread cancelation && cleanup
on a modern glibc.

Differential Revision:	https://reviews.freebsd.org/D1060
2015-05-24 15:28:17 +00:00
Dmitry Chagin
b2e0aad9e5 Regen for r283403. 2015-05-24 15:22:33 +00:00
Dmitry Chagin
ae50b4d7b5 Implement pselect6() system call.
Differential Revision:	https://reviews.freebsd.org/D1051
Reviewed by:	trasz
2015-05-24 15:21:25 +00:00
Dmitry Chagin
e7fa9de6eb Regen for r283401. 2015-05-24 15:19:44 +00:00
Dmitry Chagin
c3978c7bb1 Implement prlimit64() system call.
Differential Revision:	https://reviews.freebsd.org/D1050
Reviewed by:	emaste, trasz
2015-05-24 15:18:19 +00:00
Dmitry Chagin
737325a46d Regen for r283399. 2015-05-24 15:15:46 +00:00
Dmitry Chagin
254a937ee5 Implement dup3() system call.
Differential Revision:	https://reviews.freebsd.org/D1049
Reviewed by:	emaste
2015-05-24 15:14:51 +00:00
Dmitry Chagin
f680d990e8 Regen for r283396. 2015-05-24 15:12:38 +00:00
Dmitry Chagin
7ac9766db4 Implement rt_sigqueueinfo() system call.
Differential Revision:	https://reviews.freebsd.org/D1047
Reviewed by:	trasz
2015-05-24 15:11:32 +00:00
Dmitry Chagin
e4454275a5 Regen for r283394. 2015-05-24 15:08:25 +00:00
Dmitry Chagin
e5fe4ccf59 Implement waitid() system call.
Differential Revision:	https://reviews.freebsd.org/D1046
2015-05-24 15:06:39 +00:00
Dmitry Chagin
18599d43e9 Regen for r283392. 2015-05-24 15:05:22 +00:00
Dmitry Chagin
28ca9c9e0b struct l_rusage does not defined for i386 Linuxulator due to it's nature.
Differential Revision:	https://reviews.freebsd.org/D2139
2015-05-24 15:04:12 +00:00
Dmitry Chagin
001398c4c5 To reduce code duplication introduce linux_copyout_rusage() method.
Use it in linux_wait4() system call and move linux_wait4() to the MI path.
While here add a prototype for the static bsd_to_linux_rusage().

Differential Revision:	https://reviews.freebsd.org/D2138
Reviewed by:	trasz
2015-05-24 15:03:09 +00:00
Dmitry Chagin
af682d487b Some style(9) && whitespaces fixes. No functional changes.
Differential Revision:	https://reviews.freebsd.org/D1041
Reviewed by:	emaste
2015-05-24 14:55:12 +00:00
Dmitry Chagin
81338031c4 Switch linuxulator to use the native 1:1 threads.
The reasons:
1. Get rid of the stubs/quirks with process dethreading,
   process reparent when the process group leader exits and close
   to this problems on wait(), waitpid(), etc.
2. Reuse our kernel code instead of writing excessive thread
   managment routines in Linuxulator.

Implementation details:

1. The thread is created via kern_thr_new() in the clone() call with
   the CLONE_THREAD parameter. Thus, everything else is a process.
2. The test that the process has a threads is done via P_HADTHREADS
   bit p_flag of struct proc.
3. Per thread emulator state data structure is now located in the
   struct thread and freed in the thread_dtor() hook.
   Mandatory holdig of the p_mtx required when referencing emuldata
   from the other threads.
4. PID mangling has changed. Now Linux pid is the native tid
   and Linux tgid is the native pid, with the exception of the first
   thread in the process where tid and pid are one and the same.

Ugliness:

   In case when the Linux thread is the initial thread in the thread
   group thread id is equal to the process id. Glibc depends on this
   magic (assert in pthread_getattr_np.c). So for system calls that
   take thread id as a parameter we should use the special method
   to reference struct thread.

Differential Revision:	https://reviews.freebsd.org/D1039
2015-05-24 14:53:16 +00:00
Dmitry Chagin
91d1786f65 In preparation for switching linuxulator to the use the native 1:1
threads add a hook for cleaning thread resources before the thread die.

Differential Revision:	https://reviews.freebsd.org/D1038
2015-05-24 14:51:29 +00:00
Dmitry Chagin
64cfe4dc38 Regen for r283379. 2015-05-24 14:47:00 +00:00
Dmitry Chagin
2003907d45 Implement a Linux version of sched_getparam() && sched_setparam().
Temporarily use the first thread in proc.

Differential Revision:	https://reviews.freebsd.org/D1036
Reviewed by:	trasz
2015-05-24 14:45:57 +00:00
Dmitry Chagin
37588665e4 Regen for r283375. 2015-05-24 14:43:06 +00:00