the user mappings from the tlb due to the context numbers rolling over. The
store to the internal mmu register must be followed by a membar #Sync before
much else happens to "avoid data corruption", so we use special inlines which
both disable interrupts and ensure that the compiler will not insert extra
instructions between the two. Also, load the tte tag and check if the context
is nucleus context, rather than relying on the priviledged bit which doesn't
actually serve any purpose in our design, and check the lock bit too for
sanity.
wait for those cpus, instead of all of them by using a count. Oops.
Make the pointer to the mask that the primary cpu spins on volatile, so
gcc doesn't optimize out an important load. Oops again.
Activate tlb shootdown ipi synchronization now that it works. We have
all involved cpus wait until all the others are done. This may not be
necessary, it is mostly for sanity.
Make the trigger level interrupt ipi handler work.
Submitted by: tmm
the number of physical pages per KVA page allocated scales properly with
memory size. This fixes problems with kmem_map being too small.
Noticed by: mike, wollman
Submitted by: tmm
o In i386's <machine/endian.h>, macros have some advantages over
inlines, so change some inlines to macros.
o In i386's <machine/endian.h>, ungarbage collect word_swap_int()
(previously __uint16_swap_uint32), it has some uses on i386's with
PDP endianness.
Submitted by: bde
o Move a comment up in <machine/endian.h> that was accidentially moved
down a few revisions ago.
o Reenable userland's use of optimized inline-asm versions of
byteorder(3) functions.
o Fix ordering of prototypes vs. redefinition of byteorder(3)
functions, so that the non-GCC (libc asm) case has proper
prototypes.
o Add proper prototypes for byteorder(3) functions in <sys/param.h>.
o Prevent redundant duplicate prototypes by making use of the
_BYTEORDER_PROTOTYPED define.
o Move the bswap16(), bswap32(), bswap64() C functions into MD space
for platforms in which asm versions don't exist. This significantly
reduces the complexity of some things at the cost of duplicate code.
Reviewed by: bde
than the other implementations; we have complete control over the tlb, so we
only demap specific pages. We take advantage of the ranged tlb flush api
to send one ipi for a range of pages, and due to the pm_active optimization
we rarely send ipis for demaps from user pmaps.
Remove now unused routines to load the tlb; this is only done once outside
of the tlb fault handlers.
Minor cleanups to the smp startup code.
This boots multi user with both cpus active on a dual ultra 60 and on a
dual ultra 2.
Due to allocating tlb contexts on the fly, we only ever need to demap the
primary context, non-primary contexts have already been implicitly flushed
by context switching. All we really need to tell is if its a kernel demap
or not, and its easier just to compare against the kernel_pmap which is a
constant.
the context is not actually stolen, as it would be for i386. Instead of
deactivating a user vmspace immediately when switching out, and recycling
its tlb context, wait until the next context switch to a different user
vmspace. In this way we can switch from a user process to any number of
kernel threads and back to the same user process again, without losing any
of its mappings in the tlb that would not already be knocked by the automatic
replacement algorithm. This is not expected to have a measurable performance
improvement on the machines we currently run on, but it sounds cool and makes
the sparc64 port SMPng buzz word compliant.
on the loader to do it. Improve smp startup code to be less racy and to
defer certain things until the right time. This almost boots single user
on my dual ultra 60, it is still very fragile:
SMP: AP CPU #1 Launched!
Enter full pathname of shell or RETURN for /bin/sh:
# ls
Debugger("trapsig")
Stopped at Debugger+0x1c: ta %xcc, 1
db> heh
No such command
db>
with pmaps. When the context numbers wrap around we flush all user mappings
from the tlb. This makes use of the array indexed by cpuid to allow a pmap
to have a different context number on a different cpu. If the context numbers
are then divided evenly among cpus such that none are shared, we can avoid
sending tlb shootdown ipis in an smp system for non-shared pmaps. This also
removes a limit of 8192 processes (pmaps) that could be active at any given
time due to running out of tlb contexts.
Inspired by: the brown book
Crucial bugfix from: tmm
clobbered by the child. This is more complicated than usual because the
window that could get clobbered is pushed in kernel mode, so a lot of
registers would have to be saved in other registers in userland and we
don't have enough. What we do have is space in the pcb to temporarily
store user windows that were spilled in kernel mode, but could not be
immediately stored to the user stack. So we copy in the parent's topmost
window and save it in the pcb, and arrange for it to be copied back out
when the child is done frobbing the stack.
Reviewed by: tmm
Previously, the UPAGES/KSTACK area of processes/threads would leak memory
at the time that a previously swapped process was terminated. Lukcily, the
leak was only 12K/proc, so it was unlikely to be a major problem unless you
had an undersized swap partition.
Submitted by: dillon
Reviewed by: silby
MFC after: 1 week
In order to determine what to page out, the vm_daemon checks
reference bits on all pages belonging to all processes. Unfortunately,
the algorithm used reacted badly with shared pages; each shared page
would be checked once per process sharing it; this caused an O(N^2)
growth of tlb invalidations. The algorithm has been changed so that
each page will be checked only 16 times.
Prior to this change, a fork/sleepbomb of 1300 processes could cause
the vm_daemon to take over 60 seconds to complete, effectively
freezing the system for that time period. With this change
in place, the vm_daemon completes in less than a second. Any system
with hundreds of processes sharing pages should benefit from this change.
Note that the vm_daemon is only run when the system is under extreme
memory pressure. It is likely that many people with loaded systems saw
no symptoms of this problem until they reached the point where swapping
began.
Special thanks go to dillon, peter, and Chuck Cranor, who helped me
get up to speed with vm internals.
PR: 33542, 20393
Reviewed by: dillon
MFC after: 1 week
device drivers for bus system with other endinesses than the CPU (using
interfaces compatible to NetBSD):
- bwap16() and bswap32(). These have optimized implementations on some
architectures; for those that don't, there exist generic implementations.
- macros to convert from a certain byte order to host byte order and vice
versa, using a naming scheme like le16toh(), htole16().
These are implemented using the bswap functions.
- stream bus space access functions, which do not perform a byte order
conversion (while the normal access functions would if the bus endianess
differs from the CPU endianess).
htons(), htonl(), ntohs() and ntohl() are implemented using the new
functions above for kernel usage. None of the above interfaces is currently
exported to user land.
Make use of the new functions in a few places where local implementations
of the same functionality existed.
Reviewed by: mike, bde
Tested on alpha by: mike
work loads. It tapers off after that as gcc's working set generally just fits.
compiling bin/csh:
TSB_PAGES = 2
213.33 real 77.59 user 110.01 sys
TSB_PAGES = 4
116.43 real 75.78 user 19.16 sys
TSB_PAGES = 8
119.27 real 76.38 user 18.12 sys
Testing by: tmm
due to them being faster in certain cases. Therefore we need to save
and restore the v8 %y register around traps in kernel mode as well as
traps in usermode.
Tested by: obrien, tmm
until we do some testing to see what's best. This gives a massive reduction
in system time for processes with a relatively large working set. The size
of the tsb directly affects the rss size that a user process can keep mapped.
When it starts to get full replacements occur and the process takes a lot of
soft vm faults. Increasing the default from 1 page to 2 gives the following
before and after numbers for compiling vfs_bio.c:
before:
14.27 real 6.56 user 5.69 sys
after:
8.57 real 6.11 user 1.62 sys
This should make self hosted builds more tolerable.