include:
* Mutual exclusion is used instead of spl*(). See mutex(9). (Note: The
alpha port is still in transition and currently uses both.)
* Per-CPU idle processes.
* Interrupts are run in their own separate kernel threads and can be
preempted (i386 only).
Partially contributed by: BSDi (BSD/OS)
Submissions by (at least): cp, dfr, dillon, grog, jake, jhb, sheldonh
- stop using the evil 'struct trapframe' argument for mi_startup()
(formerly main()). There are much better ways of doing it.
- do not use prepare_usermode() - setregs() in execve() will do it
all for us as long as the p_md.md_regs pointer is set. (which is
now done in machdep.c rather than init_main.c. The Alpha port did it
this way all along and is much cleaner).
- collect all the magic %cr0 etc register settings into one place and
have the AP's call that instead of using magic numbers (!!) that keep
changing over and over again.
- Make it safe to call kthread_create() earlier, including during the
device probe sequence. It doesn't need the callback mechanism that
NetBSD's version uses.
- kthreads created this way are root-less as they exist before the root
filesystem is mounted. init(1) is set up so that it aquires the root
pointers prior to running. If other kthreads want filesystem acccess
we can make this code more generic.
- set all threads start times once we have decided what time it is.
- init uses a trampoline rather than the evil prepare_usermode() hack.
- kern_descrip.c has a couple of tweaks to deal with forking when there
is no rootdir or cwd etc.
- adjust the early SYSINIT() sequence so that a few prereqisites are in
place. eg: make sure the run queue is initialized before doing forks.
With this, the USB code can easily create a kthread to do the device
tree discovery. (I have tested it, it works nicely).
There are still some open issues before this is truely useful.
- tsleep() does not like working before the clock is running. It
sort-of tries to spin wait, but it can do more useful things now.
- stopping a kthread in kld code at unload time is "interesting" but
we have a solution for that.
The Alpha code needs no changes for this. It already uses pretty much the
same strategies, but a little cleaner.
Don't allow cpu entries in the MP table to contain APIC IDs out of range.
Don't write outside array boundaries if an IO APIC entry in the MP table
contains an APIC ID out of range.
Assign APIC IDs for all IO APICs according to section 3.6.6 in the
Intel MP spec:
- If the current APIC ID on an IO APIC doesn't conflict with other
IO APICs or CPUs, that APIC ID should be used. The copy of the MP
table must be updated if the corresponding APIC ID in the MP table
is different.
- If the current APIC ID was in conflict with other units, the
corresponding APIC ID specified in the MP table is checked for conflict.
- If a conflict is still found then fall back to using a new unique ID.
The copy of the MP table must be updated.
- IDs out of range is considered to be in conflict.
During these operations, the IO_TO_ID array cannot be used, since any
conflict would have caused information loss. The array is then corrected,
since all APIC ID conflicts should have been resolved.
PR: 20312, 18919
Further experimentation showed that some Dell 2450 machines with the
prevention kludge installed still got T_RESERVED traps. CPU interrupt
vector 0x7A was observed to be triggered. This might have been the
bitwise OR of two different vectors sent from each of the IOAPICs at
the same time.
IOAPIC #0: 0x68 --> irq 8: RTC timer interrupt
IOAPIC #1: 0x32 --> irq 18: scsi host adapter or network interface
----
0x7a --> T_RESERVED
Both IOAPICs had ID 0.
Appendix B.3 in the MP spec indicates that the operating system is
responsible for assigning unique IDs to the IOAPICs.
The enclosed patch programs the IOAPIC IDs according to the IOAPIC
entries in the MP table.
Submitted by: tegge
syscall path inward. A system call may select whether it needs the MP
lock or not (the default being that it does need it).
A great deal of conditional SMP code for various deadended experiments
has been removed. 'cil' and 'cml' have been removed entirely, and the
locking around the cpl has been removed. The conditional
separately-locked fast-interrupt code has been removed, meaning that
interrupts must hold the CPL now (but they pretty much had to anyway).
Another reason for doing this is that the original separate-lock for
interrupts just doesn't apply to the interrupt thread mechanism being
contemplated.
Modifications to the cpl may now ONLY occur while holding the MP
lock. For example, if an otherwise MP safe syscall needs to mess with
the cpl, it must hold the MP lock for the duration and must (as usual)
save/restore the cpl in a nested fashion.
This is precursor work for the real meat coming later: avoiding having
to hold the MP lock for common syscalls and I/O's and interrupt threads.
It is expected that the spl mechanisms and new interrupt threading
mechanisms will be able to run in tandem, allowing a slow piecemeal
transition to occur.
This patch should result in a moderate performance improvement due to
the considerable amount of code that has been removed from the critical
path, especially the simplification of the spl*() calls. The real
performance gains will come later.
Approved by: jkh
Reviewed by: current, bde (exception.s)
Some work taken from: luoqi's patch
the low level interrupt handler number should be used. Change
setup_apic_irq_mapping() to allocate low level interrupt handler X (Xintr${X})
for any ISA interrupt X mentioned in the MP table.
Remove an assumption in the driver for the system clock (clock.c) that
interrupts mentioned in the MP table as delivered to IOAPIC #0 intpin Y
is handled by low level interrupt handler Y (Xintr${Y}) but don't assume
that low level interrupt handler 0 (Xintr0) is used.
Don't allocate two low level interrupt handlers for the system clock.
Reviewed by: NOKUBI Hirotaka <hnokubi@yyy.or.jp>
and extend. The new function containing the code is named schedclock()
as in NetBSD, but it has slightly different semantics (it already handles
incrementation of p->p_cpticks, and it should handle any calling frequency).
Agreed with in principle by: dufault
the caller to specify a function to be guarded between an entry and exit
barrier, as well as pre- and post-barrier functions.
The primary use for this function is synchronised update of per-cpu private
data. The implementation is almost (but not quite) MI; with a better
mechanism for masking per-CPU interrupts it could probably be hoisted.
Reviewed by: peter (partially)
range attributes after they have been extracted from the master.
Hook up the i686 MP code to do this for each AP.
Be more careful about printing the default memory type for the i686.
Suggestions from: luoqi
- %fs register is added to trapframe and saved/restored upon kernel entry/exit.
- Per-cpu pages are no longer mapped at the same virtual address.
- Each cpu now has a separate gdt selector table. A new segment selector
is added to point to per-cpu pages, per-cpu global variables are now
accessed through this new selector (%fs). The selectors in gdt table are
rearranged for cache line optimization.
- fask_vfork is now on as default for both UP and SMP.
- Some aio code cleanup.
Reviewed by: Alan Cox <alc@cs.rice.edu>
John Dyson <dyson@iquest.net>
Julian Elischer <julian@whistel.com>
Bruce Evans <bde@zeta.org.au>
David Greenman <dg@root.com>
numbers as chars or use bogus casts in an attempt to unmisrepresnt
them. In top, don't assume that 0xff is the only negative cpu
number when cpu numbers are (mis)represented.
is the preparation step for moving pmap storage out of vmspace proper.
Reviewed by: Alan Cox <alc@cs.rice.edu>
Matthew Dillion <dillon@apollo.backplane.com>
and use this when masking/unmasking interrupts.
Maintain a mapping from (iopaic number, int pin) tuple to irq number,
and use this when configuring devices and programming the ioapics.
Previous code assumed that irq number was equal to int pin number, and
that the ioapic number was 0.
Don't let an AP enter _cpu_switch before all local apics are initialized.
in a SMP system. Unexpected things could happen if each cpu
has a different ldt setting and one cpu tries to use value
of currentldt set by another cpu.
The fix is to move currentldt to the per-cpu area. It includes
patches I filed in PR i386/6219 which are also user ldt related.
PR: i386/7591, i386/6219
Submitted by: Luoqi Chen <luoqi@watermarkgroup.com>
Cast pointers to (vm_offset_t) instead of to (u_long) (as before) or to
(uintptr_t)(void *) (as would be more correct). Don't cast vm_offset_t's
to (u_long) just to do arithmetic on them.
mp_machdep.c:
Cast pointers to (uintptr_t) instead of to (u_long). Don't forget
to cast pointers to (void *) first or to recover from integral
possible integral promotions, although this is too much work for
machine-dependent code.
vm code generally avoids warnings for pointer vs long size mismatches
by using vm_offset_t to represent pointers; pmap.c often uses plain
`unsigned int' instead of vm_offset_t and didn't use u_long elsewhere,
but this style was messed up by code apparently imported from mp_machdep.c.
update of cpu usage as shown by top when one process is cpu bound
(no system calls) while the system is otherwise idle (except for top).
Don't attempt to switch to the BSP in boot(). If the system was idle when
an interrupt caused a panic, this won't work. Instead, switch to the BSP
in cpu_reset.
Remove some spurious forward_statclock/forward_hardclock warnings.
ints. Remove some no longer needed casts. Initialize the per-cpu
global data area using the structs rather than knowing too much about
layout, alignment, etc.
- Attempt to handle PCI devices where the interrupt is
an ISA/EISA interrupt according to the mp table.
- Attempt to handle multiple IO APIC pins connected to
the same PCI or ISA/EISA interrupt source. Print a
warning if this happens, since performance is suboptimal.
This workaround is only used for PCI devices.
With these two workarounds, the -SMP kernel is capable of running on
my Asus P/I-P65UP5 motherboard when version 1.4 of the MP table is disabled.
f00f_hack has run.
Use the global r_idt descriptor in f00f_hack when in SMP mode,
so the APs find the relocated interrupt descriptor table.
Submitted by: Partially from David A Adkins <adkin003@tc.umn.edu>
interrupts are masked, and EOI is sent iff the corresponding ISR bit
is set in the local apic. If the CPU cannot obtain the interrupt
service lock (currently the global kernel lock) the interrupt is
forwarded to the CPU holding that lock.
Clock interrupts now have higher priority than other slow interrupts.
the signal handling latency for cpu-bound processes that performs very
few system calls.
The IPI for forcing an additional software trap is no longer dependent upon
BETTER_CLOCK being defined.
2) Do not unnecessarily force page blocking when paging
pages out.
3) Further improve swap pager performance and correctness,
including fixing the paging in progress deadlock (except
in severe I/O error conditions.)
4) Enable vfs_ioopt=1 as a default.
5) Fix and enable the page prezeroing in SMP mode.
All in all, SMP systems especially should show a significant
improvement in "snappyness."
- A nonprofiling version of s_lock (called s_lock_np) is used
by mcount.
- When profiling is active, more registers are clobbered in
seemingly simple assembly routines. This means that some
callers needed to save/restore extra registers.
- The stack pointer must have space for a 'fake' return address
in idle, to avoid stack underflow.
in a P6 SMP system. Some MB bios'es don't set the registers up correctly
for the AP's. Additionally, set the memory between 0xa0000 and 0xbffff
as write combining.
holding CPU along with the lock. When a CPU fails to get the lock
it compares its own id to the holder id. If they are the same it
panic()s, as simple locks are binary, and this would cause a deadlock.
Controlled by smptests.h: SL_DEBUG, ON by default.
Some minor cleanup.
Add a simplelock to deal with disable_intr()/enable_intr() as used in UP kernel.
UP kernel expects that this is enough to guarantee exclusive access to
regions of code bracketed by these 2 functions.
Add a simplelock to bracket clock accesses in clock.c: clock_lock.
Help from: Bruce Evans <bde@zeta.org.au>
smp_active = 1 used to indicate that the system had frozen previously
started AP's, while smp_active = 0 was "AP's not yet started". I have split
this into smp_started (which is set when the AP's come online), and
smp_active is left for turning on/off AP scheduling.
- We now have enough per-cpu idle context, the real idle loop has been
revived (cpu's halt now with nothing to do).
- Some preliminary support for running some operations outside the
global lock (eg: zeroing "free but not yet zeroed pages") is present
but appears to cause problems. Off by default.
- the smp_active sysctl now behaves differently. It's merely a 'true/false'
option. Setting smp_active to zero causes the AP's to halt in the idle
loop and stop scheduling processes.
- bootstrap is a lot safer. Instead of sharing a statically compiled in
stack a number of times (which has caused lots of problems) and then
abandoning it, we use the idle context to boot the AP's directly. This
should help >2 cpu support since the bootlock stuff was in doubt.
- print physical apic id in traps.. helps identify private pages getting
out of sync. (You don't want to know how much hair I tore out with this!)
More cleanup to follow, this is more of a checkpoint than a
'finished' thing.
Added a new variable, 'bsp_apic_ready', which is set as soon as the bootstrap
CPU has initialized its local APIC. Conditionalize the GENSPLR functions
to call ss_lock ONLY after bsp_apic_ready is TRUE; This should prevent
any problems with races between the time the 1st AP becomes ready and the
time smp_active is set.
Made NEW_STRATEGY default.
Removed misc. old cruft.
Centralized simple locks into mp_machdep.c
Centralized simple lock macros into param.h
More cleanup in the direction of making splxx()/cpl MP-safe.
Several new fine-grained locks.
New FAST_INTR() methods:
- separate simplelock for FAST_INTR, no more giant lock.
- FAST_INTR()s no longer checks ipending on way out of ISR.
sio made MP-safe (I hope).
- removed TEST_ALTTIMER.
- removed APIC_PIN0_TIMER.
- removed TIMER_ALL.
mplock.s:
- minor update of try_mplock for new algorithm where a CPU uses try_mplock
instead of get_mplock in the ISRs.
- s_lock_init()
- s_lock()
- s_lock_try()
- s_unlock()
Created lock for IO APIC and apic_imen (SMP version of imen)
- imen_lock
Code to use imen_lock for access from apic_ipl.s and apic_vector.s.
Moved this code *outside* of mp_lock.
It seems to work!!!
1) Make sure that the region mapped by a 4MB page is
properly aligned.
2) Don't turn on the PG_G flag in locore for SMP. I plan
to do that later in startup anyway.
3) Make sure the 2nd processor has PSE enabled, so that 4MB
pages don't hose it.
We don't use PG_G yet on SMP -- there is work to be done to make that
work correctly. It isn't that important anyway...
this code is controlled by smptests.h: TEST_CPUSTOP, OFF by default
new code for handling mixed-mode 8259/APIC programming without 'ExtInt'
this code is controlled by smptests.h: TEST_ALTTIMER, ON by default
General cleanup.
New functions to stop/start CPUs via IPIs:
- int stop_cpus( u_int map );
- int restart_cpus( u_int map );
Turned off by default, enabled via smptests.h:TEST_CPUSTOP.
Current version has a BUG, perhaps a deadlock?
Till now NMIs would be ignored. Now an NMI is caught by the BSP.
APs still ignore NMI, am working on code to allow a CPU to stop other CPUs
via an IPI.
This eliminates a lot of #ifdef SMP type code. Things like _curproc reside
in a data page that is unique on each cpu, eliminating the expensive macros
like: #define curproc (SMPcurproc[cpunumber()])
There are some unresolved bootstrap and address space sharing issues at
present, but Steve is waiting on this for other work. There is still some
strictly temporary code present that isn't exactly pretty.
This is part of a larger change that has run into some bumps, this part is
standalone so it should be safe. The temporary code goes away when the
full idle cpu support is finished.
Reviewed by: fsmp, dyson
Changes to pmap.c for lapic_t lapic && ioapic_t ioapic pointers,
currently equal to apic_base && io_apic_base, will stand alone with the
private page mapping.
This is now the default, it delays most of the MP startup to the function
machdep.c:cpu_startup(). It should be possible to move the 2 functions
found there (mp_start() & mp_announce()) even further down the path once
we know exactly where that should be...
Help from: Peter Wemm <peter@spinner.dialix.com.au>
- The 1st (preparse_mp_table()) counts the number of cpus, busses, etc. and
records the LOCAL and IO APIC addresses.
- The 2nd pass (parse_mp_table()) does the actual parsing of info and recording
into the incore MP table.
This will allow us to defer the 2nd pass untill malloc() & private pages
are available (but thats for another day!).
panic( "xxxxx\n" );
to:
printf( "xxxxx\n" );
panic( "\n" );
For some as yet undetermined reason the argument to panic() is often NOT
printed, and the system sometimes hangs before reaching the panic printout.
So we hopefully at least print some useful info before the hang, as oppossed to
leaving the user clueless as to what has happened.
This code re-numbers PCI busses in the MP table to match PCI semantics
when the MP BIOS fails to do it properly.
Reviewed by: Peter Wemm <peter@spinner.DIALix.COM>
Peter Wemm <peter@spinner.DIALix.COM>, Steve Passe <smp@csn.net>
removed all the IPI_INTS code.
made the XFAST_IPI32 code default, renaming Xfastipi32 to Xinvltlb.
There are various options documented in i386/conf/LINT, there is more to
come over the next few days.
The kernel should run pretty much "as before" without the options to
activate SMP mode.
There are a handful of known "loose ends" that need to be fixed, but
have been put off since the SMP kernel is in a moderately good condition
at the moment.
This commit is the result of the tinkering and testing over the last 14
months by many people. A special thanks to Steve Passe for implementing
the APIC code!