This avoids mem.c having to include ofw_machdep.h, which pulls in
all OFW related headers.
o Provide a stub for OF_decode_addr(), which is used by low-level
console drivers to obtain a tag and handle given an OFW phandle.
This is different from sparc64, where a fake bus tag needs to be
created explicitly.
There is a race with the current locking scheme and removing
it should have no measurable performance impact.
This fixes page faults leading to panics in pmap_enter_quick_locked()
on amd64/i386.
Reviewed by: alc,jhb,peter,ps
Rename struct thread's td_sticks to td_pticks; we will need the
other name for a more appropriately named use shortly. Reduce it
from uint64_t to u_int.
Clear td_pticks whenever we enter the kernel instead of recording
its value as reference for userret(). Use the absolute value of
td->td_pticks in userret() and eliminate its third argument.
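The bookkeeping described above can be sketched in userland C. This is only a model under stated assumptions: the cut-down struct thread and the helper names kernel_enter(), clock_tick() and userret_ticks() are hypothetical, and the real userret() does considerably more.

```c
#include <assert.h>

/* Hypothetical, cut-down stand-in for struct thread. */
struct thread {
	unsigned int td_pticks;	/* ticks charged since the last kernel entry */
};

/* Model of kernel entry (trap or syscall): start counting from zero. */
void
kernel_enter(struct thread *td)
{
	td->td_pticks = 0;
}

/* Model of the clock handler charging one tick to the running thread. */
void
clock_tick(struct thread *td)
{
	td->td_pticks++;
}

/*
 * Model of the userret() side: no reference value is passed in,
 * because td_pticks already holds the ticks since kernel entry.
 */
unsigned int
userret_ticks(const struct thread *td)
{
	return (td->td_pticks);
}
```

Because the counter is cleared on entry, its absolute value on the way out is the number of ticks spent in the kernel, which is why a separate reference argument becomes unnecessary.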
to old-style signals, to be the DAR register for DSI miss exceptions.
This gives the address of the access rather than the instruction
address. The behaviour is now the same as on i386.
Found by: libsigsegv tests
- provide an interface (macros) to the page coloring part of the VM system,
this allows trying different coloring algorithms without the need to
touch every file [1]
- make the page queue tuning values readable: sysctl vm.stats.pagequeue
- autotuning of the page coloring values based upon the cache size instead
of options in the kernel config (disabling of the page coloring as a
kernel option is still possible)
MD changes:
- detection of the cache size: only IA32 and AMD64 (untested) contain
cache size detection code, every other arch just comes with a dummy
function (this results in the use of default values like it was the
case without the autotuning of the page coloring)
- print some more info on Intel CPUs (like we do on AMD and Transmeta
CPUs)
Note to AMD owners (IA32 and AMD64): please run "sysctl vm.stats.pagequeue"
and report if the cache* values are zero (= bug in the cache detection code)
or not.
Based upon work by: Chad David <davidc@acns.ab.ca> [1]
Reviewed by: alc, arch (in 2004)
Discussed with: alc, Chad David, arch (in 2004)
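As an illustration of the cache-size-based autotuning mentioned above, here is a minimal sketch. The formula (one color per page-sized slice of a single cache way), the 4096-byte page size, the fallback of one color, and every SKETCH_/sketch_ name are assumptions for illustration; the commit message does not state the rule the kernel actually uses.

```c
#include <assert.h>
#include <stddef.h>

#define	SKETCH_PAGE_SIZE	4096u	/* assumed page size */
#define	SKETCH_DEFAULT_COLORS	1u	/* assumed fallback value */

/*
 * Hypothetical autotuning rule: one color per page-sized slice of a
 * single cache way.  A detected size of 0 models the dummy MD
 * detection function, so the default is used instead.
 */
unsigned int
sketch_page_colors(size_t cache_bytes, unsigned int ways)
{
	size_t colors;

	if (cache_bytes == 0 || ways == 0)
		return (SKETCH_DEFAULT_COLORS);
	colors = cache_bytes / ((size_t)ways * SKETCH_PAGE_SIZE);
	return (colors == 0 ? SKETCH_DEFAULT_COLORS : (unsigned int)colors);
}

/* Hypothetical interface macro: map a physical page index to a color. */
#define	SKETCH_PAGE_COLOR(pindex, ncolors)	((pindex) % (ncolors))
```

Hiding the mapping behind a macro is what lets a different coloring algorithm be tried without touching every caller.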
handling code so the stack trace unwinders don't start trying to
go into user-space.
Found by trying to create core dumps with a KTR_COMPILE/KTR_GEOM
kernel, which results in a stack_save() call in the ast() coredump
path - this created a panic, and then calling 'trace' in ddb resulted
in the black screen of death after printing out most of the backtrace.
passing a pointer to an opaque clockframe structure and requiring the
MD code to supply CLKF_FOO() macros to extract needed values out of the
opaque structure, just pass the needed values directly. In practice this
means passing the pair (usermode, pc) to hardclock() and profclock() and
passing the boolean (usermode) to hardclock_cpu() and hardclock_process().
Other details:
- Axe clockframe and CLKF_FOO() macros on all architectures. Basically,
all the archs were taking a trapframe and converting it into a clockframe
one way or another. Now they can just extract the PC and usermode values
directly out of the trapframe and pass it to fooclock().
- Renamed hardclock_process() to hardclock_cpu() as the latter is more
accurate.
- On Alpha, we now run profclock() at hz (profhz == hz) rather than at
the slower stathz.
- On Alpha, for the TurboLaser machines that don't have an 8254
timecounter, call hardclock() directly. This removes an extra
conditional check from every clock interrupt on Alpha on the BSP.
There is probably room for even further pruning here by changing Alpha
to use the simplified timecounter we use on x86 with the lapic timer
since we don't get interrupts from the 8254 on Alpha anyway.
- On x86, clkintr() shouldn't ever be called now unless using_lapic_timer
is false, so add a KASSERT() to that effect and remove a condition
to slightly optimize the non-lapic case.
- Change prototype of arm_handler_execute() so that its first arg is a
trapframe pointer rather than a void pointer for clarity.
- Use KCOUNT macro in profclock() to look up the kernel profiling bucket.
Tested on: alpha, amd64, arm, i386, ia64, sparc64
Reviewed by: bde (mostly)
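The interface change described above, passing (usermode, pc) directly instead of an opaque clockframe, might look roughly like this. The trapframe layout, the sketch_ names and the recording structure are hypothetical; real trapframes differ on every architecture.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, cut-down trapframe; real layouts are per-architecture. */
struct trapframe {
	uintptr_t tf_pc;	/* interrupted program counter */
	int	  tf_usermode;	/* nonzero if interrupted in user mode */
};

/* Records what the MI code received, so the sketch can be inspected. */
struct clock_record {
	int	  usermode;
	uintptr_t pc;
	int	  calls;
};

struct clock_record sketch_last;

/* MI side: takes the two values directly, no opaque clockframe. */
void
sketch_hardclock(int usermode, uintptr_t pc)
{
	sketch_last.usermode = usermode;
	sketch_last.pc = pc;
	sketch_last.calls++;
}

/* MD side: pull the pair out of the trapframe and pass it along. */
void
sketch_clkintr(const struct trapframe *tf)
{
	sketch_hardclock(tf->tf_usermode != 0, tf->tf_pc);
}
```

The MD code no longer needs to supply accessor macros; it only has to know how to read two fields from its own trapframe.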
the interface. This allows run-time selection of MMU code, based
on CPU-type detection, or tunable-overrides when testing new code.
Pre-requisite for G5 support.
conf/files.powerpc
- remove pmap.c
- add mmu_if.h, mmu_oea.c, pmap_dispatch.c
powerpc/include/mmuvar.h
- definitions for MMU implementations
powerpc/include/pmap.h
- remove pmap_pte_spill declaration
- add pmap_mmu_install declaration
- size the phys_avail array
- pmap_bootstrapped is now global-scope
powerpc/powerpc/machdep.c
- call kobj_machdep_init early in the boot sequence to allow
kobj usage prior to SI_SUB_LOCK
- install the OEA pmap code. This will be moved to CPU-specific
init code in the future.
powerpc/powerpc/mmu_if.m
- Kobj MMU interface definitions
powerpc/powerpc/pmap_dispatch.c
- central dispatch for pmap calls
- contains the global mmu kobj and the routine to locate the
mmu implementation and init the kobj
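The run-time selection and central dispatch described above can be approximated with a plain table of function pointers. This is not the kobj machinery itself: struct mmu_ops, the "other" implementation and all sketch_ names are hypothetical stand-ins.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical method table standing in for the kobj MMU interface. */
struct mmu_ops {
	const char *name;
	int (*enter)(unsigned long va, unsigned long pa);
};

static int
oea_enter(unsigned long va, unsigned long pa)
{
	(void)va;
	(void)pa;
	return (1);	/* marker value identifying this implementation */
}

static int
other_enter(unsigned long va, unsigned long pa)
{
	(void)va;
	(void)pa;
	return (2);	/* marker value identifying this implementation */
}

static const struct mmu_ops mmu_table[] = {
	{ "oea", oea_enter },
	{ "other", other_enter },
};

static const struct mmu_ops *installed_mmu;

/* Model of installing an MMU implementation, selected by name. */
int
sketch_mmu_install(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(mmu_table) / sizeof(mmu_table[0]); i++) {
		if (strcmp(mmu_table[i].name, name) == 0) {
			installed_mmu = &mmu_table[i];
			return (0);
		}
	}
	return (-1);
}

/* Model of a dispatch entry point: forward to the installed MMU. */
int
sketch_pmap_enter(unsigned long va, unsigned long pa)
{
	assert(installed_mmu != NULL);
	return (installed_mmu->enter(va, pa));
}
```

Callers only ever see the dispatch function, so the implementation can be chosen from CPU-type detection or a tunable without changing them.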
OpenFirmware. FreeBSD/ppc uses SPRG0 as the per-cpu data area pointer,
and SPRG1-3 as temporary registers during exception handling. There
have been a few instances where OpenFirmware does require these to
be part of its context, such as cd-booting an eMac.
reported by: many
MFC after: 3 days
changes in MD code are trivial: before this change, trapsignal and
sendsig used discrete parameters; now they use member fields of the
ksiginfo_t structure. For sendsig, this change allows us to pass a
POSIX realtime signal value to user code.
2. Remove cpu_thread_siginfo, it is no longer needed because we now always
generate ksiginfo_t data and feed it to libpthread.
3. Add p_sigqueue to proc structure to hold shared signals which were
blocked by all threads in the proc.
4. Add td_sigqueue to thread structure to hold all signals delivered to
thread.
5. i386 and amd64 now return POSIX standard si_code, other arches will
be fixed.
6. In this sigqueue implementation, the pending signal set is kept as before;
an extra siginfo list holds additional siginfo_t data for signals.
Kernel code that uses psignal() still behaves as before: it won't fail
even under memory pressure. The only exception is when deleting a signal:
we should call sigqueue_delete to remove the signal from the sigqueue
rather than SIGDELSET. Currently no kernel code delivers a signal
with additional data, so the kernel should be as stable as before.
A ksiginfo can carry more information, for example, allowing a signal to
be delivered but throwing away the siginfo data if memory is not enough.
SIGKILL and SIGSTOP have a fast path in sigqueue_add, because they can
not be caught or masked.
The sigqueue() syscall allows user code to queue a signal to a target
process; if resources are unavailable, EAGAIN is returned as the
specification requires.
Just before a thread exits, its signal queue memory is freed by
sigqueue_flush.
Currently, all signals are allowed to be queued, not only realtime signals.
Earlier patch reviewed by: jhb, deischen
Tested on: i386, amd64
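A rough userland model of the queue behaviour described in item 6: a pending bit set plus a side list of extra data, a fast path for SIGKILL and SIGSTOP, and EAGAIN when the data cannot be queued. The fixed-size array stands in for memory pressure, and the structure layout, the must_queue flag and all sketch_ names are assumptions, not the kernel's definitions.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdint.h>

#define	SKETCH_MAXINFO	4	/* assumed cap standing in for memory pressure */

struct sketch_sigqueue {
	uint64_t pending;			/* bit (sig - 1) set if pending */
	int	 info_sig[SKETCH_MAXINFO];	/* queued signal numbers */
	int	 info_val[SKETCH_MAXINFO];	/* values attached to them */
	int	 ninfo;
};

/*
 * Hypothetical model of adding a signal.  With must_queue set (the
 * sigqueue() syscall case) a full list yields EAGAIN; otherwise the
 * signal is still marked pending and only the extra data is dropped.
 */
int
sketch_sigqueue_add(struct sketch_sigqueue *sq, int sig, int value,
    int must_queue)
{
	if (sig < 1 || sig > 64)
		return (EINVAL);
	/* SIGKILL and SIGSTOP skip the list: they only need the bit. */
	if (sig != SIGKILL && sig != SIGSTOP) {
		if (sq->ninfo < SKETCH_MAXINFO) {
			sq->info_sig[sq->ninfo] = sig;
			sq->info_val[sq->ninfo] = value;
			sq->ninfo++;
		} else if (must_queue)
			return (EAGAIN);
	}
	sq->pending |= (uint64_t)1 << (sig - 1);
	return (0);
}

/* Model of deleting a signal: drop its queued data and clear the bit. */
void
sketch_sigqueue_delete(struct sketch_sigqueue *sq, int sig)
{
	int i, j;

	if (sig < 1 || sig > 64)
		return;
	for (i = j = 0; i < sq->ninfo; i++) {
		if (sq->info_sig[i] != sig) {
			sq->info_sig[j] = sq->info_sig[i];
			sq->info_val[j] = sq->info_val[i];
			j++;
		}
	}
	sq->ninfo = j;
	sq->pending &= ~((uint64_t)1 << (sig - 1));
}
```

The delete routine shows why clearing only the bit would not be enough: the list entries for that signal have to go as well.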
pmap_bootstrap by using the sync;isync big hammer to make sure
all prior operations have completed.
Reported by: Nathan Whitehorn <nathan at uchicago edu>
MFC after: 2 days
trap_subr.S: declare a stub for the altivec-unavailable trap
that does an absolute jump to the vector-assist trap.
This is due to the fact that the vec-unavail trap
doesn't start at a 256-byte boundary, so the trick of
masking the bottom 8 bits of the link register to identify
the interrupt doesn't work, so let the vec-assist
case handle Altivec-disabled for the time being.
Note that this will be fixed in the future with a much
smaller vector code-stub (< 16 bytes) that will allow
use of strange vector offsets that are also present in
4xx processors, and also allow smaller differences in
vector codepaths on the G5.
trap.c: Treat altivec-unavailable/assist process traps as SIGILL.
Not quite correct, since altivec-assist should really be a panic,
but it is fine for the moment due to the above measure.
machdep.c: Install the stub code for the altivec-unavailable trap, and
the standard trap code at the altivec-assist.
Reported by: Andreas Tobler <toa at pop agri ch>
MFC after: 3 days
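The alignment problem described above can be shown with a little arithmetic. The vector offsets used here (0x0f20 for AltiVec unavailable, 0x1600 for AltiVec assist, 0x0f00 for the performance monitor) are the architectural values as best I know them and are not quoted from the commit; the SKETCH_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define	SKETCH_EXC_PERF		0x0f00u	/* performance monitor */
#define	SKETCH_EXC_VEC		0x0f20u	/* AltiVec unavailable */
#define	SKETCH_EXC_VECAST	0x1600u	/* AltiVec assist */

/*
 * Identify the vector from a link register value captured inside the
 * handler by masking off the bottom 8 bits.  This only works when the
 * vector itself starts on a 256-byte boundary.
 */
uint32_t
sketch_vector_from_lr(uint32_t lr)
{
	return (lr & ~(uint32_t)0xff);
}
```

A link register value taken a few instructions into the 0x1600 handler masks back to 0x1600, but one taken inside the 0x0f20 handler masks to 0x0f00, which is a different vector. That is the reason for jumping to the assist vector's code instead.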
address, writing a non-canonical address can cause a kernel panic,
by restricting base values to 0..VM_MAXUSER_ADDRESS, ensuring
only canonical values get written to the registers.
Reviewed by: peter, Joseph Koshy <joseph.koshy at gmail dot com>
Approved by: re (scottl)
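To make the range restriction above concrete, here is a small sketch of the canonical-address rule and the user-range check. It assumes 48-bit virtual addresses and a VM_MAXUSER_ADDRESS of 0x0000800000000000 with a strict less-than comparison; those values, the comparison and the sketch_ names are assumptions, not taken from the commit.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed top of the amd64 user address range with 48-bit VAs. */
#define	SKETCH_VM_MAXUSER_ADDRESS	UINT64_C(0x0000800000000000)

/* Canonical form: bits 63..47 are all copies of bit 47. */
int
sketch_is_canonical(uint64_t va)
{
	uint64_t top = va >> 47;	/* the 17 uppermost bits */

	return (top == 0 || top == 0x1ffff);
}

/* The restriction described above: accept only user-range bases. */
int
sketch_base_ok(uint64_t base)
{
	return (base < SKETCH_VM_MAXUSER_ADDRESS);
}
```

Every value below the assumed limit has bits 63..47 clear, so anything that passes the range check is canonical by construction.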
vm_page's machine-dependent fields. Use this function in
vm_pageq_add_new_page() so that the vm_page's machine-dependent and
machine-independent fields are initialized at the same time.
Remove code from pmap_init() for initializing the vm_page's
machine-dependent fields.
Remove stale comments from pmap_init().
Eliminate the Boolean variable pmap_initialized from the alpha, amd64,
i386, and ia64 pmap implementations. Its use is no longer required
because of the above changes and earlier changes that result in physical
memory that is being mapped at initialization time being mapped without
pv entries.
Tested by: cognet, kensmith, marcel
in other codes. Add cpu_set_user_tls and use it to tweak the user register
and set up user TLS. I once wanted to merge it into cpu_set_kse_upcall,
but since cpu_set_kse_upcall is also used by M:N threads, which may
not need this feature, I wrote a separate cpu_set_user_tls.
critical_enter() and critical_exit() are now solely a mechanism for
deferring kernel preemptions. They no longer have any effect on
interrupts. This means that standalone critical sections are now very
cheap as they are simply unlocked integer increments and decrements for the
common case.
Spin mutexes now use a separate KPI implemented in MD code: spinlock_enter()
and spinlock_exit(). This KPI is responsible for providing whatever MD
guarantees are needed to ensure that a thread holding a spin lock won't
be preempted by any other code that will try to lock the same lock. For
now all archs continue to block interrupts in a "spinlock section" as they
did formerly in all critical sections. Note that I've also taken this
opportunity to push a few things into MD code rather than MI. For example,
critical_fork_exit() no longer exists. Instead, MD code ensures that new
threads have the correct state when they are created. Also, we no longer
try to fixup the idlethreads for APs in MI code. Instead, each arch sets
the initial curthread and adjusts the state of the idle thread it borrows
in order to perform the initial context switch.
This change is largely a big NOP, but the cleaner separation it provides
will allow for more efficient alternative locking schemes in other parts
of the kernel (bare critical sections rather than per-CPU spin mutexes
for per-CPU data for example).
Reviewed by: grehan, cognet, arch@, others
Tested on: i386, alpha, sparc64, powerpc, arm, possibly more
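The "unlocked integer increments and decrements" that defer preemption can be modelled as below. The structure, its field names and the sketch_ functions are hypothetical; the sketch only shows the nesting counter and the deferred preemption being taken when the outermost section exits.

```c
#include <assert.h>

/* Hypothetical per-thread state for the sketch. */
struct sketch_thread {
	int critnest;		/* critical section nesting depth */
	int owe_preempt;	/* a preemption was requested while nested */
	int preemptions;	/* preemptions actually taken */
};

/* Entering a critical section is just an increment. */
void
sketch_critical_enter(struct sketch_thread *td)
{
	td->critnest++;
}

/* A preemption request is deferred while inside a critical section. */
void
sketch_maybe_preempt(struct sketch_thread *td)
{
	if (td->critnest > 0)
		td->owe_preempt = 1;
	else
		td->preemptions++;
}

/* Leaving the outermost section runs any deferred preemption. */
void
sketch_critical_exit(struct sketch_thread *td)
{
	assert(td->critnest > 0);
	if (--td->critnest == 0 && td->owe_preempt) {
		td->owe_preempt = 0;
		td->preemptions++;
	}
}
```

Nothing here touches interrupt state, which is the point of the split: blocking interrupts becomes the job of the separate spinlock_enter()/spinlock_exit() pair.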
since there are often significant holes in the memory map due to the
kernel, loader and OFW data structures not being included: Maxmem is
the highest available, so can be misleading.
the last action of kern_exit(). Instead, it is a MD callout to clean up
per-process state during exit.
- Add notes of concern to Alpha and ia64 about the possible need to drop
fp state in cpu_thread_exit() rather than in cpu_exit() since it is
per-thread state rather than per-process.
on entry and it assumes the responsibility for releasing the page queues
lock if it must sleep.
Remove a bogus comment from pmap_enter_quick().
Using the first change, modify vm_map_pmap_enter() so that the page queues
lock is acquired and released once, rather than each time that a page
is mapped.
In such cases, the busying of the page and the unlocking of the
containing object by vm_map_pmap_enter() and vm_fault_prefault() is
unnecessary overhead. To eliminate this overhead, this change
modifies pmap_enter_quick() so that it expects the object to be locked
on entry and it assumes the responsibility for busying the page and
unlocking the object if it must sleep. Note: alpha, amd64, i386 and
ia64 are the only implementations optimized by this change; arm,
powerpc, and sparc64 still conservatively busy the page and unlock the
object within every pmap_enter_quick() call.
Additionally, this change is the first case where we synchronize
access to the page's PG_BUSY flag and busy field using the containing
object's lock rather than the global page queues lock. (Modifications
to the page's PG_BUSY flag and busy field have asserted both locks for
several weeks, enabling an incremental transition.)
but with slightly cleaned up interfaces.
The KSE structure has become the same as the "per thread scheduler
private data" structure. In order to not make the diffs too great
one is #defined as the other at this time.
The KSE (or td_sched) structure is now allocated per thread and has no
allocation code of its own.
Concurrency for a KSEGRP is now kept track of via a simple pair of counters
rather than using KSE structures as tokens.
Since the KSE structure is different in each scheduler, kern_switch.c
is now included at the end of each scheduler. Nothing outside the
scheduler knows the contents of the KSE (aka td_sched) structure.
The fields in the ksegrp structure that are to do with the scheduler's
queueing mechanisms are now moved to the kg_sched structure.
(per ksegrp scheduler private data structure). In other words how the
scheduler queues and keeps track of threads is no-one's business except
the scheduler's. This should allow people to write experimental
schedulers with completely different internal structuring.
A scheduler call sched_set_concurrency(kg, N) has been added that
notifies the scheduler that no more than N threads from that ksegrp
should be allowed to be concurrently scheduled. This is also
used to enforce 'fairness' at this time so that a ksegrp with
10000 threads can not swamp the run queue and force out a process
with 1 thread, since the current code will not set the concurrency above
NCPU, and both schedulers will not allow more than that many
onto the system run queue at a time. Each scheduler should eventually develop
its own methods to do this now that they are effectively separated.
Rejig libthr's kernel interface to follow the same code paths as
libkse for scope system threads. This has slightly hurt libthr's performance
but I will work to recover as much of it as I can.
Thread exit code has been cleaned up greatly.
exit and exec code now transitions a process back to
'standard non-threaded mode' before taking the next step.
Reviewed by: scottl, peter
MFC after: 1 week
The removed argument could trivially be derived from the remaining one.
That in turn should be the same as curthread, but it is possible that
curthread could be expensive to derive on some systems, so leave it as
an argument.
Having both proc and thread as arguments just gives an opportunity for
them to get out of sync.
MFC after: 3 days
in diagnostics. It has outlived its usefulness and has started
causing panics for people who turn on DIAGNOSTIC, in what is otherwise
good code.
MFC after: 2 days
those architectures without pmap locking.
- Eliminate the acquisition and release of Giant from vm_map_protect().
(Translation: mprotect(2) runs to completion without touching Giant on
alpha, amd64, i386 and ia64.)
and saved link register as per the ABI call sequence. Update code
that uses this (fork_trampoline etc) to use the correct genassym'd
offsets.
This fixes the 'invalid LR' message when backtracing kernel
threads in DDB.
pmap_protect() and pmap_remove(). In general, they require the lock in
order to modify a page's pv list or flags. In some cases, however,
pmap_protect() can avoid acquiring the lock.
- ddb -> db for low-level trapcode
- implement makectx. I think it only matters that the stack is set up
correctly.
- bring over ddb_trap_glue and rename to db_trap_glue
so setfault would return correctly when a page fault was invalid
(e.g. a syscall with a bad parameter).
This caused an endless DSI loop, seen when running sendmail which
does a setlogin() call with a NULL pointer.
- introduce KTR_SYSC tracing. expose the syscallnames[] array to
make the tracing more readable.
dependent function by the same name and a machine-independent function,
sf_buf_mext(). Aside from the virtue of making more of the code machine-
independent, this change also makes the interface more logical. Before,
sf_buf_free() did more than simply undo an sf_buf_alloc(); it also
unwired and if necessary freed the page. That is now the purpose of
sf_buf_mext(). Thus, sf_buf_alloc() and sf_buf_free() can now be used
as a general-purpose ephemeral map cache.
ever since alpha/alpha/pmap.c revision 1.81 introduced the list allpmaps,
there has been no reason for having this function on Alpha. Briefly,
when pmap_growkernel() relied upon the list of all processes to find and
update the various pmaps to reflect a growth in the kernel's valid
address space, pmap_pinit2() served to avoid a race between pmap
initialization and pmap_growkernel(). Specifically, pmap_pinit2() was
responsible for initializing the kernel portions of the pmap and
pmap_pinit2() was called after the process structure contained a pointer
to the new pmap for use by pmap_growkernel(). Thus, an update to the
kernel's address space might be applied to the new pmap unnecessarily,
but an update would never be lost.
for user copyinout down to 12, and keeping segments 13/14 for
kernel VA.
It would be nice to have more available, but segments lower than
this are reserved for either memory or 1:1 mapped device i/o,
and seg 15 is OpenFirmware ROM. Also, the effort to keep OpenFirmware
available for callbacks limits the use of VA-mapped segments.
Fortunately UMA_MD_SMALL_ALLOC takes away a lot of VM pressure.
Obtained from: NetBSD
include/ucontext.h
- remove trapframe and switch over to 'generic' description of machine
state. Include version field to help with future modifications.
Include floating point and altivec state, and hopefully align
correctly
powerpc/copyinout.c
- fill out casuptr() sync primitive, required by kern_umtx.c
powerpc/machdep.c
- shifted proc0/thread0/pcpu setup to before cninit, since
syscons -> make_dev -> devlock requires a valid curthread
- implemented get_mcontext/set_mcontext
- recast sendsig/sigreturn to use get/set_mcontext and new
ucontext struct. floating point now saved
- TODO: save/restore altivec state
powerpc/vm_machdep.c
- implemented cpu_thread_setup/cpu_set_upcall/cpu_set_upcall_kse
- eliminated trailing whitespace
Submitted by: Suleiman Souhlal <refugee@segfaulted.com>, ucontext by grehan
- culled long-dead #define's
- segment register defs moved to sr.h
- NPMAPS moved to pmap.h
- KERNBASE moved to vmparam.h
- removed include of <machine/cpu.h> and fixed src files that
relied on this.
Modifying segment register code no longer causes gcc rebuilds :-)
using the direct-mapping of physmem to force PTE data structures
to be physically addressable so the interrupt-time real-mode
DSI trap handler could perform PTE spills. However, the memory
may have been > 256Mb, which would have caused a BAT spill and
double-interrupt.
The new trap code no longer handles PTE spills, so the requirement
that these pages be direct-mapped no longer applies. The irony is
UMA_MD_SMALL_ALLOC will return direct mappings for these structs :-)
- remove unused 601 and tlb exception code
- remove interrupt-time PTE spill code. The pmap code
will now take care of pinning kernel PTEs, and there
are no longer issues about physical mapping of PTE
data structures
- All segment registers are switched on kernel entry/exit,
allowing the kernel to have more virtual space and for
user virtual space to extend to 4G.
- The temporary register save area has been shifted from
unused exception vector space to the per-cpu data area.
This allows interrupts to be delivered to multiple CPUs.
- ISI traps no longer spill to BAT tables. It is assumed
that all of kernel instruction memory is pinned.
- shift from 'ldmw/stmw' instructions to individual register
loads/stores when saving context. All PPC manuals indicate
this should be much faster.
- use '%r' for register names throughout.
TODO: need to test if DSI traps were the result of kernel stack
guard-page hits.
Reworked from: NetBSD
for direct-mapped addresses. Assume that any address less than KVA
is one of these and return it. Also assert that an address that is KVA
does have a valid mapping - callers of pmap_kextract don't check
the return value, since they assume that they have a valid virtual
address.
addressing of memory. Makes a substantial improvement for apps that
stress the limited amount of KVM on PPC (e.g. untarring the ports tree).
uma_machdep.c stolen from amd64/ia64.
and add_child entry point to allow devices to use the identify
method to add themselves if need be (e.g. openpic, syscons).
Export interrupt-controller-add routine for extern int cntlr drivers.
Eliminate recursive OFW device-tree walk and only iterate the
top-level ala sparc64. Allow child devices to set the device
type with write_ivars.
Step 1 of many in removing the hard-dependency on OpenFirmware.
is useless for threaded programs, multiple threads can not share same
stack.
The alternative signal stack is private to each thread, so no lock is needed;
the original P_ALTSTACK is now moved into td_pflags and renamed to
TDP_ALTSTACK.
For single-threaded or Linux clone()-based threaded programs, there is no
semantic change, because those programs only have one kernel thread
in every process.
Reviewed by: deischen, dfr
correctly, resulting in the dreaded "vm_pageout_flush: partially
invalid page" panic. The caching issue will be revisited in the
future, but opt for safety over performance in the meantime.
Tested by: gallatin
- OpenFirmware returns overlapping memory regions. Use a simple
brute force algorithm to merge these into non-overlapping
regions. This fixes bugs in reporting of available memory
and also prevents pages from being added twice in the VM system.
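A simple brute-force merge of the kind described above could look like this. It is a sketch rather than the code from the commit: the region structure and function name are hypothetical, and it also merges regions that merely touch, which is an extra assumption.

```c
#include <assert.h>
#include <stdint.h>

struct sketch_region {
	uint64_t start;
	uint64_t size;
};

/*
 * Brute-force merge: find any pair of regions that overlap or touch,
 * fold the second into the first, and start over until no pair can be
 * merged.  Returns the new count; results are left in regs[0..n-1].
 */
int
sketch_merge_regions(struct sketch_region *regs, int n)
{
	uint64_t end_i, end_j, start, end;
	int i, j, merged;

	do {
		merged = 0;
		for (i = 0; i < n && !merged; i++) {
			for (j = i + 1; j < n; j++) {
				end_i = regs[i].start + regs[i].size;
				end_j = regs[j].start + regs[j].size;
				if (regs[i].start > end_j ||
				    regs[j].start > end_i)
					continue;	/* disjoint pair */
				start = regs[i].start < regs[j].start ?
				    regs[i].start : regs[j].start;
				end = end_i > end_j ? end_i : end_j;
				regs[i].start = start;
				regs[i].size = end - start;
				/* Drop j by moving the last entry into it. */
				regs[j] = regs[n - 1];
				n--;
				merged = 1;
				break;
			}
		}
	} while (merged);
	return (n);
}
```

The quadratic restart is wasteful in general, but firmware memory maps have only a handful of entries, so simplicity wins here.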
reboot, as calling OF_exit() just hangs a mac.
FreeBSD on my G4 800Mhz mac behaves identically to OSX for halt
and reboot now.
Reviewed by: grehan (who also supplied the concept and sample code)
physical mapping.
- Move the sf_buf API to its own header file; make struct sf_buf's
definition machine dependent. In this commit, we remove an
unnecessary field from struct sf_buf on the alpha, amd64, and ia64.
Ultimately, we may eliminate struct sf_buf on those architectures
except as an opaque pointer that references a vm page.
Since all callers either passed 0 or 1 for clear_ret, define bit 0 in
the flags for use as clear_ret. Reserve bits 1, 2 and 3 for use by MI
code for possible (but unlikely) future use. The remaining bits are for
use by MD code.
This change is triggered by a need on ia64 to have another knob for
get_mcontext().
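The flag layout described above can be written down as follows. The macro names, the cut-down mcontext and the function are hypothetical; only the bit assignments (bit 0 for clear_ret, bits 1 to 3 reserved for MI code, the rest for MD code) come from the text.

```c
#include <assert.h>

#define	SKETCH_GET_MC_CLEAR_RET	0x0001	/* bit 0: the old clear_ret argument */
#define	SKETCH_GET_MC_MI_MASK	0x000f	/* bits 0-3: reserved for MI code */
#define	SKETCH_GET_MC_MD_MASK	(~SKETCH_GET_MC_MI_MASK)	/* rest: MD code */

/* Hypothetical, cut-down machine context holding one return register. */
struct sketch_mcontext {
	long mc_ret;
};

/*
 * Model of a get_mcontext() that takes flags instead of a boolean:
 * the saved return value is zeroed when the CLEAR_RET bit is set.
 */
void
sketch_get_mcontext(struct sketch_mcontext *mc, long live_ret, int flags)
{
	if (flags & SKETCH_GET_MC_CLEAR_RET)
		mc->mc_ret = 0;
	else
		mc->mc_ret = live_ret;
}
```

Since existing callers passed only 0 or 1, putting clear_ret in bit 0 keeps them working unchanged while freeing the higher bits for new knobs.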
A small helper function pmap_is_prefaultable() is added. This function
encapsulates the few lines of pmap_prefault() that actually vary from
machine to machine. Note: pmap_is_prefaultable() and pmap_mincore() have
much in common. Going forward, it's worth considering their merger.
sockets into machine-dependent files. The rationale for this
migration is illustrated by the modified amd64 allocator. It uses the
amd64's direct map to avoid ephemeral mappings in the kernel's
address space. On an SMP, the ephemeral mappings result in an IPI
for TLB shootdown for each transmitted page. Yuck.
Maintainers of other 64-bit platforms with direct maps should be able
to use the amd64 allocator as a reference implementation.
created not only with UMA_ZONE_VM but also with UMA_ZONE_NOFREE. In
the i386 case in particular, the pmap code would hook a special
page allocation routine that allocated from kernel_map and not kmem_map,
and so when/if the pageout daemon drained the zones, it could actually
push out slabs from the PV ENTRY zone but call UMA's default page_free,
which resulted in pages allocated from kernel_map being freed to
kmem_map; bad. kmem_free() ignores the return value of the
vm_map_delete and just returns. I'm not sure what the exact
repercussions could be, but it doesn't look good.
In the PAE case on i386, we also set-up a zone in pmap, so be
conservative for now and make that zone also ZONE_NOFREE and
ZONE_VM. Do this for the pmap zones for the other archs too,
although in some cases it may not be entirely necessary. We'd
rather be safe than sorry at this point.
Perhaps all UMA_ZONE_VM zones should by default be also
UMA_ZONE_NOFREE?
May fix some of silby's crashes on the PV ENTRY zone.
order to avoid the overhead of later page faults. In general, it
implements two cases: one for vnode-backed objects and one for
device-backed objects. Only the device-backed case is really
machine-dependent, belonging in the pmap.
This commit moves the vnode-backed case into the (relatively) new
function vm_map_pmap_enter(). On amd64 and i386, this commit only
amounts to code rearrangement. On alpha and ia64, the new machine
independent (MI) implementation of the vnode case is smaller and more
efficient than their pmap-based implementations. (The MI
implementation takes advantage of the fact that objects in -CURRENT
are ordered collections of pages.) On sparc64, pmap_object_init_pt()
hadn't (yet) been implemented.
implementation of a largely MI pmap_object_init_pt() for vnode-backed
objects. pmap_enter_quick() is implemented via pmap_enter() on sparc64
and powerpc.
- Correct a mismatch between pmap_object_init_pt()'s prototype and its
various implementations. (I plan to keep pmap_object_init_pt() as
the MD hook for device-backed objects on i386 and amd64.)
- Correct an error in ia64's pmap_enter_quick() and adjust its interface
to match the other versions. Discussed with: marcel
to the machine-independent parts of the VM. At the same time, this
introduces vm object locking for the non-i386 platforms.
Two details:
1. KSTACK_GUARD has been removed in favor of KSTACK_GUARD_PAGES. The
different machine-dependent implementations used various combinations
of KSTACK_GUARD and KSTACK_GUARD_PAGES. To disable guard page, set
KSTACK_GUARD_PAGES to 0.
2. Remove the (unnecessary) clearing of PG_ZERO in vm_thread_new. In
5.x, (but not 4.x,) PG_ZERO can only be set if VM_ALLOC_ZERO is passed
to vm_page_alloc() or vm_page_grab().