- Move mwlfw from {amd64,i386}/conf/NOTES to sys/conf/NOTES (mwl(4) is
already present in sys/conf/NOTES).
- Remove duplicate mwl(4) entries from {amd64,i386}/conf/NOTES.
- While here, add a description to the sfxge line in amd64/conf/NOTES.
reason for generated trap. The dump of basic signal information and 8
bytes of the faulting instruction are printed on the controlling
terminal of the process, if the machdep.uprintf_signal syscal is
enabled.
The print is the only practical way to debug traps from a.out
processes I am aware of. Because I have to reimplement it each time I
debug an issue with a.out support on amd64, commit the hack to main
tree.
MFC after: 1 week
in long mode which transfers control to 32bit code segment. Unbreak
the lcall $7,$0 implementation on amd64 by putting the 64bit user code
segment' selector into call gate, and execute the 64bit trampoline
which converts the return frame into 32bit format and switches back to
32bit mode for executing int $0x80 trampoline.
Note that all jumps over the hoops are performed in the user mode.
MFC after: 1 week
It is not listed in the boot sequence in the MP specification (1.4),
and it is explicitly ignored on modern CPUs. It was only ever required
when bootstrapping systems with external APICs (that is, SMP machines
with 486s), which FreeBSD has never supported (and never will).
While here, tidy some comments and remove some banal ones.
matches the algorithm in the MP specification (1.4). Previously we
were sending out the deassert INIT IPI immediately after the initial
INIT IPI was sent.
typical hypervisor does not implement access to the required MSR,
causing #GP on boot.
Reported and tested by: olgeni
PR: amd64/170388
MFC after: 3 days
PTE's PG_M and PG_RW bits but not the physical page frame. First,
only perform vm_page_dirty() on a managed vm_page when the PG_M bit is
being cleared. If the updated PTE continues to have PG_M set, then
there is no requirement to perform vm_page_dirty(). Second, flush the
mapping from the TLB when PG_M alone is cleared, not just when PG_M
and PG_RW are cleared. Otherwise, a stale TLB entry may stop PG_M
from being set again on the next store to the virtual page. However,
since the vm_page's dirty field already shows the physical page as
being dirty, no actual harm comes from the PG_M bit not being set.
Nonetheless, it is potentially confusing to someone expecting to see
the PTE change after a store to the virtual page.
stopped threads. Implementation assumes that the thread's FPU context
is spilled into the PCB due to stop. This is mostly true, except when
FPU state for the thread is not initialized. Then the requests operate
on the garbage state which is currently left in the PCB, causing
confusion.
The situation is indeed observed after a signal delivery and before
#NM fault on execution of any FPU instruction in the signal handler,
since sendsig(9) drops FPU state for current thread, clearing
PCB_FPUINITDONE. When inspecting context state for the signal handler,
debugger sees the FPU state of the main program context instead of the
clear state supposed to be provided to handler.
Fix this by forcing clean FPU state in PCB user FPU save area by
performing getfpuregs(9) before accessing user FPU save area in
ptrace_machdep.c.
Note: this change will be merged to i386 kernel as well, where it is
much more important, since e.g. gdb on i386 uses PT_I386_GETXMMREGS to
inspect FPU context on CPUs that support SSE. Amd64 version of gdb
uses PT_GETFPREGS to inspect both 64 and 32 bit processes, which does
not exhibit the bug.
Reported by: bde
MFC after: 1 week
understands FPU hardware enough to catch SIGFPE and unmask exceptions
in control word, then it may as well properly handle return from
SIGFPE without causing an infinite loop of #MF exceptions due to
faulting instruction restart, when needed.
Clearing exceptions causes information loss for handlers which do
understand FPU hardware, and struct siginfo si_code member cannot be
considered adequate replacement for en_sw content due to translation.
Supposed reason for clearing the exceptions, which is IRQ13 handling
oddities, were never applicable to amd64.
Note: this change will be merged to i386 kernel as well, since we do
not support IRQ13 delivery of #MF notifications for some time.
Requested by: bde
MFC after: 1 week
amd64. It is implemented as __pure2 inline with non-volatile asm read
from pcpu, which allows a compiler to cache its results.
Convert most PCPU_GET(pcb) and curthread->td_pcb accesses into curpcb.
Note that __curthread() uses magic value 0 as an offsetof(struct pcpu,
pc_curthread). It seems to be done this way due to machine/pcpu.h
needs to be processed before sys/pcpu.h, because machine/pcpu.h
contributes machine-depended fields to the struct pcpu definition. As
result, machine/pcpu.h cannot use struct pcpu yet.
The __curpcb() also uses a magic constant instead of offsetof(struct
pcpu, pc_curpcb) for the same reason. The constants are now defined as
symbols and CTASSERTs are added to ensure that future KBI changes do
not break the code.
Requested and reviewed by: bde
MFC after: 3 weeks
occurs using the SSE math processor. Update comments describing the
handling of the exception status bits in coprocessors control words.
Remove GET_FPU_CW and GET_FPU_SW macros which were used only once.
Prefer to use curpcb to access pcb_save over the longer path of
referencing pcb through the thread structure.
Based on the submission by: Ed Alley <wea llnl gov>
PR: amd64/169927
Reviewed by: bde
MFC after: 3 weeks
mostly meets the guidelines set by the Intel SDM:
1. We use XRSTOR and XSAVE from the same CPL using the same linear
address for the store area
2. Contrary to the recommendations, we cannot zero the FPU save area
for a new thread, since fork semantic requires the copy of the
previous state. This advice seemingly contradicts to the advice
from the item 6.
3. We do use XSAVEOPT in the context switch code only, and the area
for XSAVEOPT already always contains the data saved by XSAVE.
4. We do not modify the save area between XRSTOR, when the area is
loaded into FPU context, and XSAVE. We always spit the fpu context
into save area and start emulation when directly writing into FPU
context.
5. We do not use segmented addressing to access save area, or rather,
always address it using %ds basing.
6. XSAVEOPT can be only executed in the area which was previously
loaded with XRSTOR, since context switch code checks for FPU use by
outgoing thread before saving, and thread which stopped emulation
forcibly get context loaded with XRSTOR.
7. The PCB cannot be paged out while FPU emulation is turned off, since
stack of the executing thread is never swapped out.
The context switch code is patched to issue XSAVEOPT instead of XSAVE
if supported. This approach eliminates one conditional in the context
switch code, which would be needed otherwise.
For user-visible machine context to have proper data, fpugetregs()
checks for unsaved extension blocks and manually copies pristine FPU
state into them, according to the description provided by CPUID leaf
0xd.
MFC after: 1 month
on x86 and use that to implement stop_emulating() in the fpu/npx code.
Reimplement start_emulating() in the non-XEN case by using load_cr0() and
rcr0() instead of the 'lmsw' and 'smsw' instructions. Intel explicitly
discourages the use of 'lmsw' and 'smsw' on 80386 and later processors in
the description of these instructions in Volume 2 of the ADM.
Reviewed by: kib
MFC after: 1 month
- Add generic support for opcodes that are escape bytes used for
multi-byte opcodes (such as the 0x0f prefix). Use this to replace
the hard-coded 0x0f special case and add support for three-byte
opcodes that use the 0x0f38 prefix.
- Decode all Intel VMX instructions. invept and invvpid in particular are
three-byte opcodes that use the 0x0f38 escape prefix.
- Rework how the special 'SDEP' size flag works such that the default
instruction name (i_name) is the instruction when the data size
prefix (0x66) is not specified, and the alternate name in i_extra is
used when the prefix is included.
- Add a new 'ADEP' size flag similar to 'SDEP' except that it chooses
between i_name and i_extra based on the address size prefix (0x67).
Use this to fix the decoding for jrcxz vs jecxz which is determined
by the address size prefix, not the operand size prefix. Also, jcxz
is not possible in 64-bit mode, but jrcxz is the default instruction
for that opcode.
- Add support for handling instructions that have a mandatory 'rep'
prefix (this means not outputting the 'repe ' prefix until determining
if it is used as part of an opcode). Make 'pause' less of a special
case this way.
- Decode 'cmpxchg16b' and 'cdqe' which are variants of other instructions
but with a REX.W prefix.
MFC after: 1 month
functions that manage PV entries. Specifically, remove the PV entry from
the containing PV list only after the corresponding PTE is destroyed.
Update the pmap's wired mapping count in pmap_enter() before the PV list
lock is acquired.
natively rather than hand-assembled versions. For xgetbv/xsetbv, add a
wrapper API to deal with xcr* registers: rxcr() and load_xcr().
Reviewed by: kib
MFC after: 1 month
at the point that it calls get_pv_entry(). Thus, pmap_enter()'s PV list
lock pointer must be passed to get_pv_entry() for those rare occasions
when get_pv_entry() calls reclaim_pv_chunk().
Update some related comments.
pmap_remove(). The execution of these functions is no longer serialized
by the pvh global lock.
Make some stylistic changes to the affected code for the sake of
consistency with related code elsewhere in the pmap.
to add PV list locking to pmap_pv_demote_pde(), it is necessary to change
the way that pmap_pv_demote_pde() allocates PV entries. Specifically,
once pmap_pv_demote_pde() begins modifying the PV lists, it can't allocate
any new PV chunks, because that could require the PV list lock to be
dropped. So, all necessary PV chunks must be allocated in advance. To my
surprise, this new approach is a few percent faster than the old one.
usermode, using shared page. The structures and functions have vdso
prefix, to indicate the intended location of the code in some future.
The versioned per-algorithm data is exported in the format of struct
vdso_timehands, which mostly repeats the content of in-kernel struct
timehands. Usermode reading of the structure can be lockless.
Compatibility export for 32bit processes on 64bit host is also
provided. Kernel also provides usermode with indication about
currently used timecounter, so that libc can fall back to syscall if
configured timecounter is unknown to usermode code.
The shared data updates are initiated both from the tc_windup(), where
a fast task is queued to do the update, and from sysctl handlers which
change timecounter. A manual override switch
kern.timecounter.fast_gettime allows to turn off the mechanism.
Only x86 architectures export the real algorithm data, and there, only
for tsc timecounter. HPET counters page could be exported as well, but
I prefer to not further glue the kernel and libc ABI there until
proper vdso-based solution is developed.
Minimal stubs neccessary for non-x86 architectures to still compile
are provided.
Discussed with: bde
Reviewed by: jhb
Tested by: flo
MFC after: 1 month
no longer necessary for free_pv_entry() to be serialized by the pvh global
lock.
Retire pmap_insert_entry() and pmap_remove_entry(). Once upon a time,
these functions were called from multiple places within the pmap. Now,
each has only one caller.
pmap_enter_quick(). These functions are no longer serialized by the pvh
global lock.
There is no need to release the PV list lock before calling free_pv_chunk()
in pmap_remove_pages().
- Stateful TCP offload drivers for Terminator 3 and 4 (T3 and T4) ASICs.
These are available as t3_tom and t4_tom modules that augment cxgb(4)
and cxgbe(4) respectively. The cxgb/cxgbe drivers continue to work as
usual with or without these extra features.
- iWARP driver for Terminator 3 ASIC (kernel verbs). T4 iWARP in the
works and will follow soon.
Build-tested with make universe.
30s overview
============
What interfaces support TCP offload? Look for TOE4 and/or TOE6 in the
capabilities of an interface:
# ifconfig -m | grep TOE
Enable/disable TCP offload on an interface (just like any other ifnet
capability):
# ifconfig cxgbe0 toe
# ifconfig cxgbe0 -toe
Which connections are offloaded? Look for toe4 and/or toe6 in the
output of netstat and sockstat:
# netstat -np tcp | grep toe
# sockstat -46c | grep toe
Reviewed by: bz, gnn
Sponsored by: Chelsio communications.
MFC after: ~3 months (after 9.1, and after ensuring MFC is feasible)
performing the return to usermode using full return path. This
consolidates the handling of exceptional situations in less number of
places, and is less code as well.
Reviewed by: jhb
MFC after: 1 week
layer, but it is read directly by the MI VM layer. This change introduces
pmap_page_is_write_mapped() in order to completely encapsulate all direct
access to PGA_WRITEABLE in the pmap layer.
Aesthetics aside, I am making this change because amd64 will likely begin
using an alternative method to track write mappings, and having
pmap_page_is_write_mapped() in place allows me to make such a change
without further modification to the MI VM layer.
As an added bonus, tidy up some nearby comments concerning page flags.
Reviewed by: kib
MFC after: 6 weeks
- Remove cpuset stopped_cpus which is no longer used.
- Add a short comment for cpuset suspended_cpus clearing.
- Fix the un-ordered x86/acpica/acpi_wakeup.c in conf/files.amd64 and i386.
Pointed-out by: attilio@
suspend/resume procedures are minimized among them.
common:
- Add global cpuset suspended_cpus to indicate APs are suspended/resumed.
- Remove acpi_waketag and acpi_wakemap from acpivar.h (no longer used).
- Add some variables in acpi_wakecode.S in order to minimize the difference
among amd64 and i386.
- Disable load_cr3() because now CR3 is restored in resumectx().
amd64:
- Add suspend/resume related members (such as MSR) in PCB.
- Modify savectx() for above new PCB members.
- Merge acpi_switch.S into cpu_switch.S as resumectx().
i386:
- Merge(and remove) suspendctx() into savectx() in order to match with
amd64 code.
Reviewed by: attilio@, acpi@
Constify pc_freemask[].
pmap_pv_reclaim()
Eliminate "freemask" because it was a pessimization. Add a comment about
the resident count adjustment.
free_pv_entry() [i386 only]
Merge an optimization from amd64 (r233954).
get_pv_entry()
Eliminate the move to tail of the pv_chunk on the global pv_chunks list.
(The right strategy needs more thought. Moreover, there were unintended
differences between the amd64 and i386 implementation.)
pmap_remove_pages()
Eliminate unnecessary ()'s.
locked xchg instruction. IA32 memory model guarantees that store has
release semantic, since stores cannot pass loads or stores.
Reviewed by: bde, jhb
Tested by: pho
MFC after: 2 weeks
(described in ACPICA source code).
- Move intr_disable() and intr_restore() from acpi_wakeup.c to acpi.c
and call AcpiLeaveSleepStatePrep() in interrupt disabled context.
- Add acpi_wakeup_machdep() to execute wakeup MD procedures and call
it twice in interrupt disabled/enabled context (ia64 version is
just dummy).
- Rename wakeup_cpus variable in acpi_sleep_machdep() to suspcpus in
order to be shared by acpi_sleep_machdep() and acpi_wakeup_machdep().
- Move identity mapping related code to acpi_install_wakeup_handler()
(i386 version) for preparation of x86/acpica/acpi_wakeup.c
(MFC candidate).
Reviewed by: jkim@
MFC after: 2 days
in_cksum.h required ip.h to be included for struct ip. To be
able to use some general checksum functions like in_addword()
in a non-IPv4 context, limit the (also exported to user space)
IPv4 specific functions to the times, when the ip.h header is
present and IPVERSION is defined (to 4).
We should consider more general checksum (updating) functions
to also allow easier incremental checksum updates in the L3/4
stack and firewalls, as well as ponder further requirements by
certain NIC drivers needing slightly different pseudo values
in offloading cases. Thinking in terms of a better "library".
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
Reviewed by: gnn (as part of the whole)
MFC After: 3 days
to this pmap.c. This new r/w lock is used primarily to synchronize access
to the PV lists. However, it will be used in a somewhat unconventional
way. As finer-grained PV list locking is added to each of the pmap
functions that acquire this r/w lock, its acquisition will be changed from
write to read, enabling concurrent execution of the pmap functions with
finer-grained locking.
Reviewed by: kib
X-MFC after: r235598
longer uses the active and inactive paging queues. Instead, the pmap now
maintains an LRU-ordered list of pv entry pages, and pmap_pv_reclaim() uses
this list to select pv entries for reclamation.
Note: The old pmap_collect() tried to avoid reclaiming mappings for pages
that have either a hold_count or a busy field that is non-zero. However,
this isn't necessary for correctness, and the locking in pmap_collect() was
insufficient to guarantee that such mappings weren't reclaimed. The new
pmap_pv_reclaim() doesn't even try.
Reviewed by: kib
MFC after: 6 weeks
ataraid(4) previously was present there and having GEOM RAID is convinient.
Unlike other classes GEOM RAID can be set up from BIOS before install and
users are expecting it to be detected automatically.
- DTrace scripts to check for errors, performance, ...
they serve mostly as examples of what you can do with the static probe;s
with moderate load the scripts may be overwhelmed, excessive lock-tracing
may influence program behavior (see the last design decission)
Design decissions:
- use "linuxulator" as the provider for the native bitsize; add the
bitsize for the non-native emulation (e.g. "linuxuator32" on amd64)
- Add probes only for locks which are acquired in one function and released
in another function. Locks which are aquired and released in the same
function should be easy to pair in the code, inter-function
locking is more easy to verify in DTrace.
- Probes for locks should be fired after locking and before releasing to
prevent races (to provide data/function stability in DTrace, see the
man-page of "dtrace -v ..." and the corresponding DTrace docs).
intr_bind() on x86.
This has been requested by jhb and I strongly disagree with this,
but as long as he is the x86 and interrupt subsystem maintainer I will
follow his directives.
The disagreement cames from what we should really consider as a
public KPI. IMHO, if we really need a selection between the kernel
functions, we may need an explicit protection like _KERNEL_KPI, which
defines which subset of the kernel function might really be considered
as part of the KPI (for thirdy part modules) and which not.
As long as we don't have this mechanism I just consider any possible
function as usable by thirdy part code, thus intr_bind() included.
MFC after: 1 week
Includes instruction emulation for memory r/w access. This
opens the door for io-apic, local apic, hpet timer, and
legacy device emulation.
Submitted by: ryan dot berryhill at sandvine dot com
Reviewed by: grehan
Obtained from: Sandvine
discrepancy between modules and kernel, but deal with SMP differences
within the functions themselves.
As an added bonus this also helps in terms of code readability.
Requested by: gibbs
Reviewed by: jhb, marius
MFC after: 1 week
but GNU libc used it without checking its kernel version, e. g., Fedora 10.
- Move pipe(2) implementation for Linuxulator from MD files to MI file,
sys/compat/linux/linux_file.c. There is no MD code for this syscall at all.
- Correct an argument type for pipe() from l_ulong * to l_int *. Probably
this was the source of MI/MD confusion.
Reviewed by: emulation
- Mark 'sdp' as requiring 'inet'.
- Always include "opt_inet.h" and "opt_inet6.h" and modify the IB
driver Makefiles to honor WITH/WITHOUT_INET/INET6/_SUPPORT options
to determine what should be enabled during a module build.
- Fix the mlxen(4) driver and the core IB code to compile without
if INET is disabled (including when both INET and INET6 are disabled).
Reviewed by: bz
MFC after: 2 weeks
in set_apic_interrupt_ids(). Besides, set_apic_interrupts_ids() is not
called in the !SMP case too.
Fix this by:
- Adding the BSP as an interrupt target directly in cpu_startup().
- Remove an obsolete optimization where the BSP are skipped in
set_apic_interrupt_ids().
Reported by: jh
Reviewed by: jhb
MFC after: 3 days
X-MFC: r233961
Pointy hat to: me
an uncorrected ECC error tends to fire on all CPUs in a package
simultaneously and the current printf hacks are not sufficient to make
the messages legible. Instead, use the existing mca_lock spinlock to
serialize calls to mca_log() and change the machine check code to panic
directly when an unrecoverable error is encoutered rather than falling
back to a trap_fatal() call in trap() (which adds nearly a screen-full of
logging messages that aren't useful for machine checks).
MFC after: 2 weeks
"Under a highly specific and detailed set of internal timing conditions,
the processor may incorrectly update the stack pointer after a long series
of push and/or near-call instructions, or a long series of pop and/or
near-return instructions. The processor must be in 64-bit mode for this
erratum to occur."
MFC after: 3 days
bridges. Rather than blindly enabling the windows on all of them, only
enable the window when an MSI interrupt is enabled for a device behind
the bridge, similar to what already happens for HT PCI-PCI bridges.
To implement this, each x86 Host-PCI bridge driver has to be able to
locate it's actual backing device on bus 0. For ACPI, use the _ADR
method to find the slot and function of the device. For the non-ACPI
case, the legacy(4) driver already scans bus 0 looking for Host-PCI
bridge devices. Now it saves the slot and function of each bridge that
it finds as ivars that the Host-PCI bridge driver can then use in its
pcib_map_msi() method.
This fixes machines where non-MSI interrupts were broken by the previous
round of HT MSI changes.
Tested by: bapt
MFC after: 1 week
be less ambiguous and more clearly identify what it means. This
attribute is what Intel refers to as UC-, and it's only difference
relative to normal UC memory is that a WC MTRR will override a UC-
PAT entry causing the memory to be treated as WC, whereas a UC PAT
entry will always override the MTRR.
- Remove the VM_MEMATTR_UNCACHED alias from powerpc.
New kernel events can be added at various location for sampling or counting.
This will for example allow easy system profiling whatever the processor is
with known tools like pmcstat(8).
Simultaneous usage of software PMC and hardware PMC is possible, for example
looking at the lock acquire failure, page fault while sampling on
instructions.
Sponsored by: NETASQ
MFC after: 1 month
<20120222095239.Horde.0hpYHJjmRSRPRKzXsoFRbYk@webmail.leidinger.net>.
According to some private emails received, it apparently is not unpopular
to use at least Quad GigaSwift cards driven by cas(4) in x86 machines.
MFC after: 1 week
The GPL infected parts which were blocking the inclusion of snd_csa
and snd_emu10kx in GENERIC have recently been removed from the tree.
I'm also adding snd_cmi to GENERIC, which I originally intended to
add when we enabled sound support by default.
Discussed with: jhb, pfg, Yuriy Tsibizov <yuriy.tsibizov@gfk.ru>
Approved by: jhb
kernel.
When access restrictions are added to a page table entry, we flush the
corresponding virtual address mapping from the TLB. In contrast, when
access restrictions are removed from a page table entry, we do not
flush the virtual address mapping from the TLB. This is exactly as
recommended in AMD's documentation. In effect, when access
restrictions are removed from a page table entry, AMD's MMUs will
transparently refresh a stale TLB entry. In short, this saves us from
having to perform potentially costly TLB flushes. In contrast,
Intel's MMUs are allowed to generate a spurious page fault based upon
the stale TLB entry. Usually, such spurious page faults are handled
by vm_fault() without incident. However, when we are executing
no-fault sections of the kernel, we are not allowed to execute
vm_fault(). This change introduces special-case handling for spurious
page faults that occur in no-fault sections of the kernel.
In collaboration with: kib
Tested by: gibbs (an earlier version)
I would also like to acknowledge Hiroki Sato's assistance in
diagnosing this problem.
MFC after: 1 week
As of FreeBSD 8, this driver should not be used. Applications that use
posix_openpt(2) and openpty(3) use the pts(4) that is built into the
kernel unconditionally. If it turns out high profile depend on the
pty(4) module anyway, I'd rather get those fixed. So please report any
issues to me.
The pty(4) module is still available as a kernel module of course, so a
simple `kldload pty' can be used to run old-style pseudo-terminals.
longer serve any purpose. Prior to r157446, they served a purpose
because there was a fixed amount of kernel virtual address space
reserved for pv entries at boot time. However, since that change pv
entries are accessed through the direct map, and so there is no limit
imposed by a fixed amount of kernel virtual address space.
Fix a couple of nearby style issues.
Reviewed by: jhb, kib
MFC after: 1 week
AcpiEnterSleepState() executes (optional) _GTS method since ACPICA 20120215
(r231844). To evaluate the method, we need malloc(9), which may sleep.
Reported by: bschmidt
MFC after: 3 days
segments.h to a new x86 segments.h.
Add __packed attribute to some structs (just to be sure).
Also make it clear that i386 GDT and LDT entries are used in ia64 code.
excluded from superpage promotions. At least one of the reason is
that pv_table is sized for non-fictitious pages only.
Consistently check for the page to be non-fictitious before accesing
superpage pv list.
Sponsored by: The FreeBSD Foundation
Reviewed by: alc
MFC after: 2 weeks
reg.h with stubs.
The tREGISTER macros are only made visible on i386. These macros are
deprecated and should not be available on amd64.
The i386 and amd64 versions of struct reg have been renamed to struct
__reg32 and struct __reg64. During compilation either __reg32 or __reg64
is defined as reg depending on the machine architecture. On amd64 the i386
struct is also available as struct reg32 which is used in COMPAT_FREEBSD32
code.
Most of compat/ia32/ia32_reg.h is now IA64 only.
Reviewed by: kib (previous version)
field are synchronized, there is no need for pmap_protect() to acquire
the page queues lock unless it is going to access the pv lists.
Reviewed by: kib
Remove FPU types from compat/ia32/ia32_reg.h that are no longer needed.
Create machine/npx.h on amd64 to allow compiling i386 code that uses
this header.
The original npx.h and fpu.h define struct envxmm differently. Both
definitions have been included in the new x86 header as struct __envxmm32
and struct __envxmm64. During compilation either __envxmm32 or __envxmm64
is defined as envxmm depending on machine architecture. On amd64 the i386
struct is also available as struct envxmm32.
Reviewed by: kib