value of the tag or data field.
Add macros for getting the page shift, size and mask for the physical page
that a tte maps (which may be one of several sizes).
Use the new cache functions for invalidating single pages.
a floating point instruction into a 6-bit register number for
double and quad arguments.
Make use of the new INSFPdq_RN macro where apporpriate; this
is required for correctly handling the "high" fp registers
(>= %f32).
Fix a number of bugs related to the handling of the high registers
which were caused by using __fpu_[gs]etreg() where __fpu_[gs]etreg64()
should be used (the former can only access the low, single-precision,
registers).
Submitted by: tmm
Rearrange things slightly so that the contents of the tag access
register are read and restored outside of the macros. The intention
is to pass the page size to look up as an argument to the macros.
i386/ia64/alpha - catch up to sparc64/ppc:
- replace pmap_kernel() with refs to kernel_pmap
- change kernel_pmap pointer to (&kernel_pmap_store)
(this is a speedup since ld can set these at compile/link time)
all platforms (as suggested by jake):
- gc unused pmap_reference
- gc unused pmap_destroy
- gc unused struct pmap.pm_count
(we never used pm_count - we track address space sharing at the vmspace)
the symbol index defined by the relocation. The elf_lookup() support
function is to be used by elf_reloc() when symbol lookups need to be
done. The elf_lookup() function operates on the symbol index and
will do a symbol name based lookup when such is required, otherwise
it uses the symbol index directly. This solves the problem seen on
ia64 where the symbol hash table does not contain local symbols and
a symbol name based lookup would fail for those symbols.
Don't pass the symbol name to elf_reloc(), as it isn't used any more.
user stack in response to a failed window fill, allowing the process to be
killed if its wrong. This caused user programs which misalign their stack
pointer to get stuck in an infinite loop at the kernel-userland boundary,
which is mostly harmless.
The same thing causes a fatal RED state exception on OpenBSD and probably
NetBSD.
Inspired by: art@openbsd.org
and pmap_copy_page(). This gets rid of a couple more physical addresses
in upper layers, with the eventual aim of supporting PAE and dealing with
the physical addressing mostly within pmap. (We will need either 64 bit
physical addresses or page indexes, possibly both depending on the
circumstances. Leaving this to pmap itself gives more flexibilitly.)
Reviewed by: jake
Tested on: i386, ia64 and (I believe) sparc64. (my alpha was hosed)
_BYTE_ORDER. These are far more useful than their non-underscored
equivalents as these can be used in restricted namespace environments.
Mark the non-underscored variants as deprecated.
and add some compatibility defines. Add fields for ins and locals to
struct reg also for the same reason; these aren't filled in yet because
getting at those registers sucks and I'd rather not save them in the
trapframe just for this. Reorder struct reg to be ABI compatible as
well. Add needed include of machine/emul.h.
This gets pmdb (poor man's debugger) from OpenBSD mostly compiling but it
doesn't work yet :(
most cases NULL is passed, but in some cases such as network driver locks
(which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used.
Tested on: i386, alpha, sparc64
they aren't in the usual path of execution for syscalls and traps.
The main complication for this is that we have to set flags to control
ast() everywhere that changes the signal mask.
Avoid locking in userret() in most of the remaining cases.
Submitted by: luoqi (first part only, long ago, reorganized by me)
Reminded by: dillon
various machdep.c's to being declared in kern_mutex.c.
- Add a new function mutex_init() used to perform early initialization
needed for mutexes such as setting up thread0's contested lock list
and initializing MI mutexes. Change the various MD startup routines
to call this function instead of duplicating all the code themselves.
Tested on: alpha, i386
hold the kernel text, data and loader metadata by not using a fixed slot
to store the TSB page(s) into. Enter fake 8k page entries into the kernel
TSB that cover the 4M kernel page(s), sot that pmap_kenter() will work
without having to treat these pages as a special case.
Problem reported by: mjacob, obrien
Problem spotted and 4M page handling proposed by: jake
and cpu_critical_exit() and moves associated critical prototypes into their
own header file, <arch>/<arch>/critical.h, which is only included by the
three MI source files that need it.
Backout and re-apply improperly comitted syntactical cleanups made to files
that were still under active development. Backout improperly comitted program
structure changes that moved localized declarations to the top of two
procedures. Partially re-apply one of the program structure changes to
move 'mask' into an intermediate block rather then in three separate
sub-blocks to make the code more readable. Re-integrate bug fixes that Jake
made to the sparc64 code.
Note: In general, developers should not gratuitously move declarations out
of sub-blocks. They are where they are for reasons of structure, grouping,
readability, compiler-localizability, and to avoid developer-introduced bugs
similar to several found in recent years in the VFS and VM code.
Reviewed by: jake
code can use it. This takes a single constant argument and fails to compile
if it is 0 (false). The main application of this is to make assertions about
structure sizes at compile time, in order to validate assumptions made in
other code. Examples:
CTASSERT(sizeof(struct foo) == FOO_SIZEOF);
CTASSERT(sizeof(struct foo) == (1 << FOO_SHIFT));
Requested by: jhb, phk
dump the trace buffer feasible.
- Remove KTR_EXTEND. This changes the format of the trace entries when
activated, making writing a userland tool which is not tied to a specific
kernel configuration difficult.
- Use get_cyclecount() for timestamps. nanotime() is much too heavy weight
and requires recursion protection due to ktr traces occuring as a result
of ktr traces. KTR_VERBOSE may still require recursion protection, which
is now conditional on it.
- Allow KTR_CPU to be overridden by MD code. This is so that it is possible
to trace early in startup before pcpu and/or curthread are setup.
- Add a version number for the ktr interface. A userland tool can check this
to detect mismatches.
- Use an array for the parameters to make decoding in userland easier.
- Add file and line recording to the non-extended traces now that the extended
version is no more.
These changes will break gdb macros to decode the extended version of the
trace buffer which are floating around. Users of these macros should either
use the show ktr command in ddb, or use the userland utility which can be run
on a core dump.
Approved by: jhb
Tested on: i386, sparc64
back into the calling MD code. The MD code must ensure no races between
checking the astpening flag and returning to usermode.
Submitted by: peter (ia64 bits)
Tested on: alpha (peter, jeff), i386, ia64 (peter), sparc64
with this flag. Remove the dup_list and dup_ok code from subr_witness. Now
we just check for the flag instead of doing string compares.
Also, switch the process lock, process group lock, and uma per cpu locks over
to this interface. The original mechanism did not work well for uma because
per cpu lock names are unique to each zone.
Approved by: jhb
disablement assumptions in kern_fork.c by adding another API call,
cpu_critical_fork_exit(). Cleanup the td_savecrit field by moving it
from MI to MD. Temporarily move cpu_critical*() from <arch>/include/cpufunc.h
to <arch>/<arch>/critical.c (stage-2 will clean this up).
Implement interrupt deferral for i386 that allows interrupts to remain
enabled inside critical sections. This also fixes an IPI interlock bug,
and requires uses of icu_lock to be enclosed in a true interrupt disablement.
This is the stage-1 commit. Stage-2 will occur after stage-1 has stabilized,
and will move cpu_critical*() into its own header file(s) + other things.
This commit may break non-i386 architectures in trivial ways. This should
be temporary.
Reviewed by: core
Approved by: core
- change the IOMMU support code so that it supports overcommittting the
available DVMA memory, while still allocating as lazily as possible.
This is achieved by limiting the preallocation, and deferring the
allocation to map load time when it fails. In the latter case, the
DVMA memory reserved for unloaded maps can be stolen to free up enough
memory for loading a map.
- allow NULL settings in the method tables, and search the parent tags
until an appropriate implementation is found. This allows to remove some
kluges in the old implementation.
the bus-dependent code and to be able to support more systems. The core
of the new code is mostly obtained from NetBSD.
Kluge the interrupt routing methods of the psycho and apb drivers so
that an intline of 0 can be handled for now; real routing is still not
possible (all intline registers are preinitialized instead); this will
require a sparc64-specific adaption of the driver for generic PCI-PCI
bridges with a custom routing method to work right.
not blocked by raising the pil, a reciever may be interrupted while holding
a spinlock. If the sender does not defer interrupts throughout the entire
operation it may be interrupted and try to acquire a spinlock held by a
reciever, leading to a deadlock due to the synchronization used by the
ipi handlers themselves.
Submitted by: tmm
Instead of caching the ucred reference, just go ahead and eat the
decerement and increment of the refcount. Now that Giant is pushed down
into crfree(), we no longer have to get Giant in the common case. In the
case when we are actually free'ing the ucred, we would normally free it on
the next kernel entry, so the cost there is not new, just in a different
place. This also removse td_cache_ucred from struct thread. This is
still only done #ifdef DIAGNOSTIC.
Tested on: i386, alpha
the user mappings from the tlb due to the context numbers rolling over. The
store to the internal mmu register must be followed by a membar #Sync before
much else happens to "avoid data corruption", so we use special inlines which
both disable interrupts and ensure that the compiler will not insert extra
instructions between the two. Also, load the tte tag and check if the context
is nucleus context, rather than relying on the priviledged bit which doesn't
actually serve any purpose in our design, and check the lock bit too for
sanity.
wait for those cpus, instead of all of them by using a count. Oops.
Make the pointer to the mask that the primary cpu spins on volatile, so
gcc doesn't optimize out an important load. Oops again.
Activate tlb shootdown ipi synchronization now that it works. We have
all involved cpus wait until all the others are done. This may not be
necessary, it is mostly for sanity.
Make the trigger level interrupt ipi handler work.
Submitted by: tmm
the number of physical pages per KVA page allocated scales properly with
memory size. This fixes problems with kmem_map being too small.
Noticed by: mike, wollman
Submitted by: tmm
o In i386's <machine/endian.h>, macros have some advantages over
inlines, so change some inlines to macros.
o In i386's <machine/endian.h>, ungarbage collect word_swap_int()
(previously __uint16_swap_uint32), it has some uses on i386's with
PDP endianness.
Submitted by: bde
o Move a comment up in <machine/endian.h> that was accidentially moved
down a few revisions ago.
o Reenable userland's use of optimized inline-asm versions of
byteorder(3) functions.
o Fix ordering of prototypes vs. redefinition of byteorder(3)
functions, so that the non-GCC (libc asm) case has proper
prototypes.
o Add proper prototypes for byteorder(3) functions in <sys/param.h>.
o Prevent redundant duplicate prototypes by making use of the
_BYTEORDER_PROTOTYPED define.
o Move the bswap16(), bswap32(), bswap64() C functions into MD space
for platforms in which asm versions don't exist. This significantly
reduces the complexity of some things at the cost of duplicate code.
Reviewed by: bde
than the other implementations; we have complete control over the tlb, so we
only demap specific pages. We take advantage of the ranged tlb flush api
to send one ipi for a range of pages, and due to the pm_active optimization
we rarely send ipis for demaps from user pmaps.
Remove now unused routines to load the tlb; this is only done once outside
of the tlb fault handlers.
Minor cleanups to the smp startup code.
This boots multi user with both cpus active on a dual ultra 60 and on a
dual ultra 2.
Due to allocating tlb contexts on the fly, we only ever need to demap the
primary context, non-primary contexts have already been implicitly flushed
by context switching. All we really need to tell is if its a kernel demap
or not, and its easier just to compare against the kernel_pmap which is a
constant.
the context is not actually stolen, as it would be for i386. Instead of
deactivating a user vmspace immediately when switching out, and recycling
its tlb context, wait until the next context switch to a different user
vmspace. In this way we can switch from a user process to any number of
kernel threads and back to the same user process again, without losing any
of its mappings in the tlb that would not already be knocked by the automatic
replacement algorithm. This is not expected to have a measurable performance
improvement on the machines we currently run on, but it sounds cool and makes
the sparc64 port SMPng buzz word compliant.
on the loader to do it. Improve smp startup code to be less racy and to
defer certain things until the right time. This almost boots single user
on my dual ultra 60, it is still very fragile:
SMP: AP CPU #1 Launched!
Enter full pathname of shell or RETURN for /bin/sh:
# ls
Debugger("trapsig")
Stopped at Debugger+0x1c: ta %xcc, 1
db> heh
No such command
db>
with pmaps. When the context numbers wrap around we flush all user mappings
from the tlb. This makes use of the array indexed by cpuid to allow a pmap
to have a different context number on a different cpu. If the context numbers
are then divided evenly among cpus such that none are shared, we can avoid
sending tlb shootdown ipis in an smp system for non-shared pmaps. This also
removes a limit of 8192 processes (pmaps) that could be active at any given
time due to running out of tlb contexts.
Inspired by: the brown book
Crucial bugfix from: tmm
clobbered by the child. This is more complicated than usual because the
window that could get clobbered is pushed in kernel mode, so a lot of
registers would have to be saved in other registers in userland and we
don't have enough. What we do have is space in the pcb to temporarily
store user windows that were spilled in kernel mode, but could not be
immediately stored to the user stack. So we copy in the parent's topmost
window and save it in the pcb, and arrange for it to be copied back out
when the child is done frobbing the stack.
Reviewed by: tmm
Previously, the UPAGES/KSTACK area of processes/threads would leak memory
at the time that a previously swapped process was terminated. Lukcily, the
leak was only 12K/proc, so it was unlikely to be a major problem unless you
had an undersized swap partition.
Submitted by: dillon
Reviewed by: silby
MFC after: 1 week
In order to determine what to page out, the vm_daemon checks
reference bits on all pages belonging to all processes. Unfortunately,
the algorithm used reacted badly with shared pages; each shared page
would be checked once per process sharing it; this caused an O(N^2)
growth of tlb invalidations. The algorithm has been changed so that
each page will be checked only 16 times.
Prior to this change, a fork/sleepbomb of 1300 processes could cause
the vm_daemon to take over 60 seconds to complete, effectively
freezing the system for that time period. With this change
in place, the vm_daemon completes in less than a second. Any system
with hundreds of processes sharing pages should benefit from this change.
Note that the vm_daemon is only run when the system is under extreme
memory pressure. It is likely that many people with loaded systems saw
no symptoms of this problem until they reached the point where swapping
began.
Special thanks go to dillon, peter, and Chuck Cranor, who helped me
get up to speed with vm internals.
PR: 33542, 20393
Reviewed by: dillon
MFC after: 1 week
device drivers for bus system with other endinesses than the CPU (using
interfaces compatible to NetBSD):
- bwap16() and bswap32(). These have optimized implementations on some
architectures; for those that don't, there exist generic implementations.
- macros to convert from a certain byte order to host byte order and vice
versa, using a naming scheme like le16toh(), htole16().
These are implemented using the bswap functions.
- stream bus space access functions, which do not perform a byte order
conversion (while the normal access functions would if the bus endianess
differs from the CPU endianess).
htons(), htonl(), ntohs() and ntohl() are implemented using the new
functions above for kernel usage. None of the above interfaces is currently
exported to user land.
Make use of the new functions in a few places where local implementations
of the same functionality existed.
Reviewed by: mike, bde
Tested on alpha by: mike
work loads. It tapers off after that as gcc's working set generally just fits.
compiling bin/csh:
TSB_PAGES = 2
213.33 real 77.59 user 110.01 sys
TSB_PAGES = 4
116.43 real 75.78 user 19.16 sys
TSB_PAGES = 8
119.27 real 76.38 user 18.12 sys
Testing by: tmm
due to them being faster in certain cases. Therefore we need to save
and restore the v8 %y register around traps in kernel mode as well as
traps in usermode.
Tested by: obrien, tmm
until we do some testing to see what's best. This gives a massive reduction
in system time for processes with a relatively large working set. The size
of the tsb directly affects the rss size that a user process can keep mapped.
When it starts to get full replacements occur and the process takes a lot of
soft vm faults. Increasing the default from 1 page to 2 gives the following
before and after numbers for compiling vfs_bio.c:
before:
14.27 real 6.56 user 5.69 sys
after:
8.57 real 6.11 user 1.62 sys
This should make self hosted builds more tolerable.
I don't believe anyone is quite using the sparc64 kernel sources in CVS
yet -- things aren't just quite ready (but almost). So this commit should
be OK to make.
window to the user stack while in a nested kernel trap. We do this for
entry to the kernel from user mode, but if we get an interrupt in kernel
mode while there are still user windows in the cpu, and we attempt to spill
to the user stack, we may take too many nested traps and overflow the trap
stack, causing a red state exception. This is needed by upcoming changes
to allow the user tsb to not be locked in the tlb.
Reviewed by: tmm
virtual page number in a much more convenient way; all in one piece. This
greatly simplifies the comparison for a matching tte, and allows the fault
handlers to be much simpler due to not having to load wierd masks.
Rewrite the tlb fault handlers to account for the new format. These are also
written to allow faults on the user tsb inside of the fault handlers; the
kernel fault handler must be aware of this and not clobber the other's
registers. The faults do not yet occur due to other support that is needed
(and still under my desk).
Bug fixes from: tmm
Most of the contents are commented out as they are as-yet untested.
However, I wanted the contents to match our other arches, so that when
people make changes to {i386,alpha,ia64}, they will also make the same
changes here.
pmap_qenter and pmap_qremove in preference to pmap_kenter/pmap_kremove.
The former maps in multiple pages at a time, and so can do a ranged
flush. Don't assume that pmap_kenter and pmap_kremove will flush the tlb,
even though they still do. It will not once the MI code is updated to use
pmap_qenter and pmap_qremove.
will be used to reduce the number of tlb shootdown ipis in an smp system
by sending one ipi for a whole range of pages, instead of one per page.
Munge the context demap operations slightly to support demapping a non-primary
context.
If we don't do this here there's a 1 instruction race where an interrupt
could come in and crash the user process due to having no stack.
2. Pass %fsr to the user trap handler in %l4. Since %fsr can only be loaded
from or stored to memory, we need to do some contortions and temporarily
save it to the alternate global stack.
3. Reload the pcb and pcpu registers for traps in kernel mode, for sanity.
Submitted by: tmm (1, 2)
While in userland, keep the thread's ucred reference in a shadow
field so that the usual place to store it is NULL.
If DIAGNOSTIC is not set, the thread ucred is kept valid until the next
kernel entry, at which time it is checked against the process cred
and possibly corrected. Produces a BIG speedup in
kernels with INVARIANTS set. (A previous commit corrected it
for the non INVARIANTS case already)
Reviewed by: dillon@freebsd.org
deprecated in favor of the POSIX-defined lowercase variants.
o Change all occurrences of NTOHL() and associated marcros in the
source tree to use the lowercase function variants.
o Add missing license bits to sparc64's <machine/endian.h>.
Approved by: jake
o Clean up <machine/endian.h> files.
o Remove unused __uint16_swap_uint32() from i386's <machine/endian.h>.
o Remove prototypes for non-existent bswapXX() functions.
o Include <machine/endian.h> in <arpa/inet.h> to define the
POSIX-required ntohl() family of functions.
o Do similar things to expose the ntohl() family in libstand, <netinet/in.h>,
and <sys/param.h>.
o Prepend underscores to the ntohl() family to help deal with
complexities associated with having MD (asm and inline) versions, and
having to prevent exposure of these functions in other headers that
happen to make use of endian-specific defines.
o Create weak aliases to the canonical function name to help deal with
third-party software forgetting to include an appropriate header.
o Remove some now unneeded pollution from <sys/types.h>.
o Add missing <arpa/inet.h> includes in userland.
Tested on: alpha, i386
Reviewed by: bde, jake, tmm
patch from a year ago: give file flags their own type. This does not
(yet) change the type used by system calls or library functions.
The underlying type was chosen to match what is returned by stat().
support for managing both streaming caches on psycho pairs).
Use explicit bus space accesses instead of mapping the device memory into
kva.
Move DVMA allocation to the map creation/dma memory allocation functions.
disable interrupts completely, and stxa_sync(), which performs a store
immediately followed by a membar #Sync with interrupts disabled (this
is needed for writes to diagnostic registers).
this is a low-functionality change that changes the kernel to access the main
thread of a process via the linked list of threads rather than
assuming that it is embedded in the process. It IS still embeded there
but remove all teh code that assumes that in preparation for the next commit
which will actually move it out.
Reviewed by: peter@freebsd.org, gallatin@cs.duke.edu, benno rice,
some arches and the syscall table is machine-independent. It was
(bogusly) conditional on COMPAT_43, so this usually makes no difference.
ia64: in addition:
- replace the bogus cloned comment before osigreturn() by a correct one.
osigreturn() is just a stub fo ia64's.
- fix the formatting of cloned comment before sigreturn().
- fix the return code. use nosys() instead of returning ENOSYS to get
the same semantics as if the syscall is not in the syscall table.
Generating SIGSYS is actually correct here.
- fix style bugs.
powerpc: copy the cleaned up ia64 stub. This mainly fixes a bogus comment.
sparc64: copy the cleaned up the ia64 stub, since there was no stub before.
cpu(s) into the kernel, and sync-ing them up to "kernel" mode so we can
send them ipis, which also work.
Thanks to John Baldwin for providing me with access to the hardware
that made this possible.
Parts obtained from: bsd/os
Call critical_enter/critical_exit around (fast) interrupt handlers. All
non-threaded interrupts are fast, and the threaded interrupt scheduler is
itself a fast interrupt.
Assert that an interrupt handler we are about to call is non-zero.
Be paranoid about restoring the users global registers. Do it as the
last thing before switching to alternate globals (when we magically get
our preloaded registers back), and do it with interrupts disabled. Any
kind of kernel trap when the globals are not setup properly is bad news.
Don't save and restore the kernel g6, it invariably points to the current
pcb now.
data word in an interrupt packet is non-zero, it points to code to execute
to handle the ipi, so jump to it instead of enqueueing the packet. It
is unclear if we will need queued ipis.
Interrupt g7 now points to pcpu, instead of to the per-cpu interrupt queue
itself, so use that instead. Interrupt g6 is no longer reserved.