Commit Graph

6295 Commits

Author SHA1 Message Date
kib
c6e5735bc4 Change (unused) prototype for stmxcsr() to match reality.
Noted by:	jhb
MFC after:	1 week
2012-07-30 19:26:02 +00:00
alc
dcd4c15b16 Shave off a few more cycles from pmap_enter()'s critical section. In
particular, do a little less work with the PV list lock held.
2012-07-29 18:20:49 +00:00
kib
b9c519314f Forcibly shut up clang warning about NULL pointer dereference.
MFC after:	3 weeks
2012-07-23 19:16:31 +00:00
kib
bff2a9c29d Constently use 2-space sentence breaks.
Submitted by:	 bde
MFC after:	 1 week
2012-07-21 13:53:00 +00:00
kib
8175cd6cd3 Stop caching curpcb in the local variable.
Requested by:	    bde
MFC after:	    1 week
2012-07-21 13:47:37 +00:00
kib
125bff5612 The PT_I386_{GET,SET}XMMREGS and PT_{GET,SET}XSTATE operate on the
stopped threads. Implementation assumes that the thread's FPU context
is spilled into the PCB due to stop. This is mostly true, except when
FPU state for the thread is not initialized. Then the requests operate
on the garbage state which is currently left in the PCB, causing
confusion.

The situation is indeed observed after a signal delivery and before
#NM fault on execution of any FPU instruction in the signal handler,
since sendsig(9) drops FPU state for current thread, clearing
PCB_FPUINITDONE. When inspecting context state for the signal handler,
debugger sees the FPU state of the main program context instead of the
clear state supposed to be provided to handler.

Fix this by forcing clean FPU state in PCB user FPU save area by
performing getfpuregs(9) before accessing user FPU save area in
ptrace_machdep.c.

Note: this change will be merged to i386 kernel as well, where it is
much more important, since e.g. gdb on i386 uses PT_I386_GETXMMREGS to
inspect FPU context on CPUs that support SSE. Amd64 version of gdb
uses PT_GETFPREGS to inspect both 64 and 32 bit processes, which does
not exhibit the bug.

Reported by:	bde
MFC after:	1 week
2012-07-21 13:06:37 +00:00
kib
3584ef8115 Stop clearing x87 exceptions in the #MF handler on amd64. If user code
understands FPU hardware enough to catch SIGFPE and unmask exceptions
in control word, then it may as well properly handle return from
SIGFPE without causing an infinite loop of #MF exceptions due to
faulting instruction restart, when needed.

Clearing exceptions causes information loss for handlers which do
understand FPU hardware, and struct siginfo si_code member cannot be
considered adequate replacement for en_sw content due to translation.

Supposed reason for clearing the exceptions, which is IRQ13 handling
oddities, were never applicable to amd64.

Note: this change will be merged to i386 kernel as well, since we do
not support IRQ13 delivery of #MF notifications for some time.

Requested by:	bde
MFC after:	1 week
2012-07-21 13:05:34 +00:00
kib
ebf0cf4fd1 Introduce curpcb magic variable, similar to curthread, which is MD
amd64.  It is implemented as __pure2 inline with non-volatile asm read
from pcpu, which allows a compiler to cache its results.

Convert most PCPU_GET(pcb) and curthread->td_pcb accesses into curpcb.

Note that __curthread() uses magic value 0 as an offsetof(struct pcpu,
pc_curthread). It seems to be done this way due to machine/pcpu.h
needs to be processed before sys/pcpu.h, because machine/pcpu.h
contributes machine-depended fields to the struct pcpu definition. As
result, machine/pcpu.h cannot use struct pcpu yet.

The __curpcb() also uses a magic constant instead of offsetof(struct
pcpu, pc_curpcb) for the same reason. The constants are now defined as
symbols and CTASSERTs are added to ensure that future KBI changes do
not break the code.

Requested and reviewed by: bde
MFC after:    3 weeks
2012-07-19 19:09:12 +00:00
alc
1561669204 Don't unnecessarily set PGA_REFERENCED in pmap_enter(). 2012-07-19 05:34:19 +00:00
kib
e69c4af4ad On AMD64, provide siginfo.si_code for floating point errors when error
occurs using the SSE math processor.  Update comments describing the
handling of the exception status bits in coprocessors control words.

Remove GET_FPU_CW and GET_FPU_SW macros which were used only once.
Prefer to use curpcb to access pcb_save over the longer path of
referencing pcb through the thread structure.

Based on the submission by:	Ed Alley <wea llnl gov>
PR:	  amd64/169927
Reviewed by:	bde
MFC after:	3 weeks
2012-07-18 15:43:47 +00:00
kib
abd7bc0665 Add stmxcsr.
Submitted by:	Ed Alley <wea llnl gov>
PR:	  amd64/169927
MFC after:	3 weeks
2012-07-18 15:36:03 +00:00
kib
5bd2b3d73a Add support for the XSAVEOPT instruction use. Our XSAVE/XRSTOR usage
mostly meets the guidelines set by the Intel SDM:
1. We use XRSTOR and XSAVE from the same CPL using the same linear
   address for the store area
2. Contrary to the recommendations, we cannot zero the FPU save area
   for a new thread, since fork semantic requires the copy of the
   previous state. This advice seemingly contradicts to the advice
   from the item 6.
3. We do use XSAVEOPT in the context switch code only, and the area
   for XSAVEOPT already always contains the data saved by XSAVE.
4. We do not modify the save area between XRSTOR, when the area is
   loaded into FPU context, and XSAVE. We always spit the fpu context
   into save area and start emulation when directly writing into FPU
   context.
5. We do not use segmented addressing to access save area, or rather,
   always address it using %ds basing.
6. XSAVEOPT can be only executed in the area which was previously
   loaded with XRSTOR, since context switch code checks for FPU use by
   outgoing thread before saving, and thread which stopped emulation
   forcibly get context loaded with XRSTOR.
7. The PCB cannot be paged out while FPU emulation is turned off, since
   stack of the executing thread is never swapped out.

The context switch code is patched to issue XSAVEOPT instead of XSAVE
if supported. This approach eliminates one conditional in the context
switch code, which would be needed otherwise.

For user-visible machine context to have proper data, fpugetregs()
checks for unsaved extension blocks and manually copies pristine FPU
state into them, according to the description provided by CPUID leaf
0xd.

MFC after:  1 month
2012-07-14 15:48:30 +00:00
alc
c99370c950 Wring a few cycles out of pmap_enter(). In particular, on a user-space
pmap, avoid walking the page table twice.
2012-07-13 04:10:41 +00:00
jhb
501beca671 Add a clts() wrapper around the 'clts' instruction to <machine/cpufunc.h>
on x86 and use that to implement stop_emulating() in the fpu/npx code.
Reimplement start_emulating() in the non-XEN case by using load_cr0() and
rcr0() instead of the 'lmsw' and 'smsw' instructions.  Intel explicitly
discourages the use of 'lmsw' and 'smsw' on 80386 and later processors in
the description of these instructions in Volume 2 of the ADM.

Reviewed by:	kib
MFC after:	1 month
2012-07-09 20:55:39 +00:00
jhb
fb20c6fee6 Partially revert r217515 so that the mem_range_softc variable is always
present on x86 kernels.  This fixes the build of kernels that include
'device acpi' but do not include 'device mem'.

MFC after:	1 month
2012-07-09 20:42:08 +00:00
kib
8cabc35aa0 Use assembler mnemonic instead of manually assembling, contination for r238142.
Reviewed by:	jhb
MFC after:	1 month
2012-07-06 20:11:58 +00:00
jhb
56c146fdcb Several fixes to the amd64 disassembler:
- Add generic support for opcodes that are escape bytes used for
  multi-byte opcodes (such as the 0x0f prefix).  Use this to replace
  the hard-coded 0x0f special case and add support for three-byte
  opcodes that use the 0x0f38 prefix.
- Decode all Intel VMX instructions.  invept and invvpid in particular are
  three-byte opcodes that use the 0x0f38 escape prefix.
- Rework how the special 'SDEP' size flag works such that the default
  instruction name (i_name) is the instruction when the data size
  prefix (0x66) is not specified, and the alternate name in i_extra is
  used when the prefix is included.
- Add a new 'ADEP' size flag similar to 'SDEP' except that it chooses
  between i_name and i_extra based on the address size prefix (0x67).
  Use this to fix the decoding for jrcxz vs jecxz which is determined
  by the address size prefix, not the operand size prefix.  Also, jcxz
  is not possible in 64-bit mode, but jrcxz is the default instruction
  for that opcode.
- Add support for handling instructions that have a mandatory 'rep'
  prefix (this means not outputting the 'repe ' prefix until determining
  if it is used as part of an opcode).  Make 'pause' less of a special
  case this way.
- Decode 'cmpxchg16b' and 'cdqe' which are variants of other instructions
  but with a REX.W prefix.

MFC after:	1 month
2012-07-06 14:25:59 +00:00
alc
6c7a2e716b Make pmap_enter()'s management of PV entries consistent with the other pmap
functions that manage PV entries.  Specifically, remove the PV entry from
the containing PV list only after the corresponding PTE is destroyed.

Update the pmap's wired mapping count in pmap_enter() before the PV list
lock is acquired.
2012-07-06 06:42:25 +00:00
jhb
01cf370271 Now that our assembler supports the xsave family of instructions, use them
natively rather than hand-assembled versions.  For xgetbv/xsetbv, add a
wrapper API to deal with xcr* registers: rxcr() and load_xcr().

Reviewed by:	kib
MFC after:	1 month
2012-07-05 18:19:35 +00:00
alc
6f8105b9e4 Calculate the new PTE value in pmap_enter() before acquiring any locks.
Move an assertion to the beginning of pmap_enter().
2012-07-05 07:20:16 +00:00
alc
c6d735204d Correct an error in r237513. The call to reserve_pv_entries() must come
before pmap_demote_pde() updates the PDE.  Otherwise, pmap_pv_demote_pde()
can crash.

Crash reported by:	kib
Patch tested by:	kib
2012-07-05 00:08:47 +00:00
jhb
850f973fef Decode the 'xsave', 'xrstor', 'xsaveopt', 'xgetbv', 'xsetbv', and
'rdtscp' instructions.

MFC after:	1 month
2012-07-04 16:47:39 +00:00
delphij
7bd211d5cb tws(4) is interfaced with CAM so move it to the same section.
Reported by:	joel
MFC after:	3 days
2012-07-01 08:10:49 +00:00
alc
aacced50df Optimize reserve_pv_entries() using the popcnt instruction. 2012-06-30 20:25:12 +00:00
alc
96954b494e In r237592, I forgot that pmap_enter() might already hold a PV list lock
at the point that it calls get_pv_entry().  Thus, pmap_enter()'s PV list
lock pointer must be passed to get_pv_entry() for those rare occasions
when get_pv_entry() calls reclaim_pv_chunk().

Update some related comments.
2012-06-29 18:15:56 +00:00
alc
44ab6b3053 Avoid some unnecessary PV list locking in pmap_enter(). 2012-06-28 22:03:59 +00:00
alc
0c1e664993 Optimize pmap_pv_demote_pde(). 2012-06-28 05:42:04 +00:00
alc
c5e6daff9d Add new pmap layer locks to the predefined lock order. Change the names
of a few existing VM locks to follow a consistent naming scheme.
2012-06-27 03:45:25 +00:00
alc
e1ae76f31c Introduce RELEASE_PV_LIST_LOCK(). 2012-06-26 16:45:18 +00:00
alc
2cb386d2ce Add PV list locking to pmap_enter(). Its execution is no longer serialized
by the pvh global lock.

Add a needed atomic operation to pmap_object_init_pt().
2012-06-26 06:02:43 +00:00
alc
ba4bea6755 Add PV chunk and list locking to pmap_change_wiring(), pmap_protect(), and
pmap_remove().  The execution of these functions is no longer serialized
by the pvh global lock.

Make some stylistic changes to the affected code for the sake of
consistency with related code elsewhere in the pmap.
2012-06-25 07:13:25 +00:00
alc
2d5bdc7fff Introduce reserve_pv_entry() and use it in pmap_pv_demote_pde(). In order
to add PV list locking to pmap_pv_demote_pde(), it is necessary to change
the way that pmap_pv_demote_pde() allocates PV entries.  Specifically,
once pmap_pv_demote_pde() begins modifying the PV lists, it can't allocate
any new PV chunks, because that could require the PV list lock to be
dropped.  So, all necessary PV chunks must be allocated in advance.  To my
surprise, this new approach is a few percent faster than the old one.
2012-06-23 22:54:25 +00:00
kib
7b36a08108 Implement mechanism to export some kernel timekeeping data to
usermode, using shared page.  The structures and functions have vdso
prefix, to indicate the intended location of the code in some future.

The versioned per-algorithm data is exported in the format of struct
vdso_timehands, which mostly repeats the content of in-kernel struct
timehands. Usermode reading of the structure can be lockless.
Compatibility export for 32bit processes on 64bit host is also
provided. Kernel also provides usermode with indication about
currently used timecounter, so that libc can fall back to syscall if
configured timecounter is unknown to usermode code.

The shared data updates are initiated both from the tc_windup(), where
a fast task is queued to do the update, and from sysctl handlers which
change timecounter. A manual override switch
kern.timecounter.fast_gettime allows to turn off the mechanism.

Only x86 architectures export the real algorithm data, and there, only
for tsc timecounter. HPET counters page could be exported as well, but
I prefer to not further glue the kernel and libc ABI there until
proper vdso-based solution is developed.

Minimal stubs neccessary for non-x86 architectures to still compile
are provided.

Discussed with:	bde
Reviewed by:	jhb
Tested by:	flo
MFC after:	1 month
2012-06-22 07:06:40 +00:00
kib
d98ad62d7e Reserve AT_TIMEKEEP auxv entry for providing usermode the pointer to
timekeeping information.

MFC after:  1 week
2012-06-22 06:38:31 +00:00
alc
c64bfa11c5 Introduce CHANGE_PV_LIST_LOCK_TO_{PHYS,VM_PAGE}() to avoid duplication of
code.
2012-06-22 05:01:36 +00:00
alc
dd83f274ed Update the PV stats in free_pv_entry() using atomics. After which, it is
no longer necessary for free_pv_entry() to be serialized by the pvh global
lock.

Retire pmap_insert_entry() and pmap_remove_entry().  Once upon a time,
these functions were called from multiple places within the pmap.  Now,
each has only one caller.
2012-06-21 16:37:36 +00:00
alc
5199d5838c Add PV list locking to pmap_copy(), pmap_enter_object(), and
pmap_enter_quick().  These functions are no longer serialized by the pvh
global lock.

There is no need to release the PV list lock before calling free_pv_chunk()
in pmap_remove_pages().
2012-06-20 07:25:20 +00:00
alc
a6f2f68675 Condition the implementation of pv_entry_count on PV_STATS. On amd64,
pv_entry_count is purely informational.  It does not serve any functional
purpose.

Add PV chunk locking to get_pv_entry().
2012-06-19 08:12:44 +00:00
np
67d5f1a727 - Updated TOE support in the kernel.
- Stateful TCP offload drivers for Terminator 3 and 4 (T3 and T4) ASICs.
  These are available as t3_tom and t4_tom modules that augment cxgb(4)
  and cxgbe(4) respectively.  The cxgb/cxgbe drivers continue to work as
  usual with or without these extra features.

- iWARP driver for Terminator 3 ASIC (kernel verbs).  T4 iWARP in the
  works and will follow soon.

Build-tested with make universe.

30s overview
============
What interfaces support TCP offload?  Look for TOE4 and/or TOE6 in the
capabilities of an interface:
# ifconfig -m | grep TOE

Enable/disable TCP offload on an interface (just like any other ifnet
capability):
# ifconfig cxgbe0 toe
# ifconfig cxgbe0 -toe

Which connections are offloaded?  Look for toe4 and/or toe6 in the
output of netstat and sockstat:
# netstat -np tcp | grep toe
# sockstat -46c | grep toe

Reviewed by:	bz, gnn
Sponsored by:	Chelsio communications.
MFC after:	~3 months (after 9.1, and after ensuring MFC is feasible)
2012-06-19 07:34:13 +00:00
kib
4eede7506a Adjust the fix in r236953, by not generating the signal manually, but
performing the return to usermode using full return path.  This
consolidates the handling of exceptional situations in less number of
places, and is less code as well.

Reviewed by:   jhb
MFC after:     1 week
2012-06-18 21:08:48 +00:00
alc
857159a401 Add PV chunk and list locking to pmap_page_exists_quick(),
pmap_page_is_mapped(), and pmap_remove_pages().  These functions
are no longer serialized by the pvh global lock.
2012-06-18 16:21:59 +00:00
alc
6eeaee04e4 The page flag PGA_WRITEABLE is set and cleared exclusively by the pmap
layer, but it is read directly by the MI VM layer.  This change introduces
pmap_page_is_write_mapped() in order to completely encapsulate all direct
access to PGA_WRITEABLE in the pmap layer.

Aesthetics aside, I am making this change because amd64 will likely begin
using an alternative method to track write mappings, and having
pmap_page_is_write_mapped() in place allows me to make such a change
without further modification to the MI VM layer.

As an added bonus, tidy up some nearby comments concerning page flags.

Reviewed by:	kib
MFC after:	6 weeks
2012-06-16 18:56:19 +00:00
adrian
979df60a78 Oops - use the actual 11n enable option. 2012-06-15 15:32:16 +00:00
adrian
6f922a3302 Ok, ok. 802.11n can be on by default in GENERIC in -HEAD.
God help me.
2012-06-15 02:16:29 +00:00
alc
2038da6f9a Update a couple comments to reflect r235598.
X-MFC after:	r235598
2012-06-14 17:47:54 +00:00
alc
06117384c5 Correctly identify the function in a KASSERT().
MFC after:	3 days
2012-06-14 17:40:49 +00:00
jkim
d0039ee2cf - Remove unused code for CR3 and CR4.
- Fix few style(9) nits while I am here.
2012-06-13 22:53:56 +00:00
jkim
d1d32ebbe5 - Fix resumectx() prototypes to reflect reality.
- For i386, simply jump to resumectx() with PCB in %ecx.
- Fix a style(9) nit while I am here.
2012-06-13 21:03:01 +00:00
bz
5f1573508a Fix a problem where zero-length RDATA fields can cause named(8) to crash.
[12:03]

Correct a privilege escalation when returning from kernel if
running FreeBSD/amd64 on non-AMD processors. [12:04]

Fix reference count errors in IPv6 code. [EN-12:02]

Security:	CVE-2012-1667
Security:	FreeBSD-SA-12:03.bind
Security:	CVE-2012-0217
Security:	FreeBSD-SA-12:04.sysret
Security:	FreeBSD-EN-12:02.ipv6refcount
Approved by:	so (simon, bz)
2012-06-12 12:10:10 +00:00
iwasaki
fb4fac5af8 Share IPI init and startup code of mp_machdep.c with acpi_wakeup.c
as ipi_startup().
2012-06-12 00:14:54 +00:00