this code is controlled by smptests.h: TEST_CPUSTOP, OFF by default
new code for handling mixed-mode 8259/APIC programming without 'ExtInt'
this code is controlled by smptests.h: TEST_ALTTIMER, ON by default
POSIX.4. Additionally, there is some initial code that supports LIO.
This code supports AIO/LIO for all types of file descriptors, with
few if any restrictions. There will be a followup very soon that
will support significantly more efficient operation for VCHR type
files (raw.) This code is also dependent on some kernel features
that don't work under SMP yet. After I commit the changes to the
kernel to support proper address space sharing on SMP, this code
will also work under SMP.
General cleanup.
New functions to stop/start CPUs via IPIs:
- int stop_cpus( u_int map );
- int restart_cpus( u_int map );
Turned off by default, enabled via smptests.h:TEST_CPUSTOP.
Current version has a BUG, perhaps a deadlock?
Till now NMIs would be ignored. Now an NMI is caught by the BSP.
APs still ignore NMI, am working on code to allow a CPU to stop other CPUs
via an IPI.
Specifically, don't allow a value < 1 for any of them (it doesn't make
sense), and don't let the low water mark be greater than the corresponding
high water mark.
Pre-Approved by: wollman
Obtained from: NetBSD
nothing good except of opening a can of (potential or real) security
holes. People maintaining a machine with higher security requirements
need to be on the console anyway, so there's no point in not forcing
them to reboot before starting maintenance.
Agreed by: hackers, guido
This eliminates a lot of #ifdef SMP type code. Things like _curproc reside
in a data page that is unique on each cpu, eliminating the expensive macros
like: #define curproc (SMPcurproc[cpunumber()])
There are some unresolved bootstrap and address space sharing issues at
present, but Steve is waiting on this for other work. There is still some
strictly temporary code present that isn't exactly pretty.
This is part of a larger change that has run into some bumps, this part is
standalone so it should be safe. The temporary code goes away when the
full idle cpu support is finished.
Reviewed by: fsmp, dyson
flag wasn't being respected during vref(), et. al. Note that this
isn't the eventual fix for the locking problem. Fine grained SMP
in the VM and VFS code will require (lots) more work.
cause a problem of spiraling death due to buffer resource limitations.
The vfs_bio code in general had little ability to handle buffer resource
management, and now it does. Also, there are a lot more knobs for tuning the
vfs_bio code now. The knobs came free because of the need that there
always be some immediately available buffers (non-delayed or locked) for
use. Note that the buffer cache code is much less likely to get bogged
down with lots of delayed writes, even more so than before.
It is possible for multiple process to sleep concurrently waiting
for a buffer. When the buffer shortage is a shortage of space but
not a shortage of buffer headers, the processes took turns creating
empty buffers and waking each other to advertise the brelse() of
the empties; progress was never made because tsleep() always found
another high-priority process to run and everything was done at
splbio(), so vfs_update never had a chance to flush delayed writes,
not to mention that i/o never had a chance to complete.
The problem seems to be rare in practice, but it can easily be
reproduced by misusing block devices, at least for sufficently slow
devices on machines with a sufficiently small buffer cache. E.g.,
`tar cvf /dev/fd0 /kernel' on an 8MB system with no disk in fd0
causes the problem quickly; the same command with a disk in fd0
causes the problem not quite as quickly; and people have reported
problems newfs'ing file systems on block devices.
Block devices only cause this problem indirectly. They are pessimized
for time and space, and the space pessimization causes the shortage
(it manifests as internal fragmentation in buffer_map).
This should be fixed in 2.2.
cost since it is only done in cpu_switch(), not for every exception.
The extra state is kept in the pcb, and handled much like the npx state,
with similar deficiencies (the state is not preserved across signal
handlers, and error handling loses state).
shared function.
- use p->p_sleepend to try and get more accurate "time remaining" results
when the time has been adjusted.
- verify writeability of return address so that we can fail before sleeping
if the address for the result is bogus.
Changes to pmap.c for lapic_t lapic && ioapic_t ioapic pointers,
currently equal to apic_base && io_apic_base, will stand alone with the
private page mapping.
be (eventually) architecture independent. It provides an emulation
of the ISA interrupt registration function register_intr(), but that
function does no longer manipulated the interrupt controller and
interrupt descriptor table, but calls the architecture dependent
function setup_icu() for that purpose.
After theISA/EISA bus code has been modified to directly call the new
interrupt registartion functions (intr_create() and intr_connect()),
the emulation of register_intr() should be dropped.
The C level interrupt handler function should take a (void*) argument,
and the function pointer type (inthand2_t) should defined in some other
place than isa_device.h.
This commit is a pre-requisite for the removal of the PCI specific shared
interrupt code.
Reviewed by: dfr,bde
This is now the default, it delays most of the MP startup to the function
machdep.c:cpu_startup(). It should be possible to move the 2 functions
found there (mp_start() & mp_announce()) even further down the path once
we know exactly where that should be...
Help from: Peter Wemm <peter@spinner.dialix.com.au>
- The 1st (preparse_mp_table()) counts the number of cpus, busses, etc. and
records the LOCAL and IO APIC addresses.
- The 2nd pass (parse_mp_table()) does the actual parsing of info and recording
into the incore MP table.
This will allow us to defer the 2nd pass untill malloc() & private pages
are available (but thats for another day!).
When a panic occurs early in the SMP boot process 'cpunumber()' hangs,
causing the panic string to be lost. Now the system appears to hang
in 'breakpoint()', but at least the user sees the panic string before the
hang.
switch. I needed 'LINT' to compile for other reasons so I kinda got the
blood on my hands. Note: I don't know how to test this, I don't know if
it works correctly.
panic( "xxxxx\n" );
to:
printf( "xxxxx\n" );
panic( "\n" );
For some as yet undetermined reason the argument to panic() is often NOT
printed, and the system sometimes hangs before reaching the panic printout.
So we hopefully at least print some useful info before the hang, as oppossed to
leaving the user clueless as to what has happened.
and b_validend. The changes to vfs_bio.c are a bit ugly but hopefully
can be tidied up later by a slight redesign.
PR: kern/2573, kern/2754, kern/3046 (possibly)
Reviewed by: dyson
to fill in the nfs_diskless structure, at the cost of some kernel
bloat. The advantage is that this code works on a wider range of
network adapters than netboot. Several new kernel options are
documented in LINT.
Obtained from: parts of the code comes from NetBSD.
Serious:
- An important timevalfix() in settime[ofday]() was lost.
Not so serious:
- There was a race initializing `delta' in the check for setting the
time backwards.
- The `#ifdef notyet' check for setting the time more than a day forwards
was back to front.
[[I deleted the code, it's useless because of iteration - Peter]]
- The timespec was not checked for validity in clock_settime().
- The timespec was not fully checked for validity in nanotime(). The
check in itimerfix() is too late, since the conversion from a timespec
to a timeval may overflow.
- A garbage timeval was checked in settimeofday() for the (uap->tv == NULL
&& uap->tzp != NULL) case. I added the broken check this some time ago.
Cosmetic:
- The "inadvertantly (sic) sleeping forever" test always failed. hzto()
always returns >= 1.
- The style wasn't very KNFish. (I only changed new code.)
Submitted by: bde
in NetBSD. The core of settimeofday() is moved to a seperate static
function settime() which both clock_settime() and settimeofday() call.
Note that I picked up the securelevel > 1 check from NetBSD that prevents
the clock being set backwards in high securelevel mode (this was a hole
that allowed resetting of inode access timestamps to arbitary values)
Obtained from: mostly from NetBSD, but the settime() function is from
our gettimeofday(), some tweaks by me.
the patches in freefall:/home/dfr/ld.diffs to your ld sources and set
BINFORMAT to aoutkld when linking the kernel.
Library changes and userland utilities will appear in a later commit.
".." vnode. This is cheaper storagewise than keeping it in the
namecache, and it makes more sense since it's a 1:1 mapping.
2. Also handle the case of "." more intelligently rather than stuff
the namecache with pointless entries.
3. Add two lists to the vnode and hang namecache entries which go from
or to this vnode. When cleaning a vnode, delete all namecache
entries it invalidates.
4. Never reuse namecache enties, malloc new ones when we need it, free
old ones when they die. No longer a hard limit on how many we can
have.
5. Remove the upper limit on namelength of namecache entries.
6. Make a global list for negative namecache entries, limit their number
to a sysctl'able (debug.ncnegfactor) fraction of the total namecache.
Currently the default fraction is 1/16th. (Suggestions for better
default wanted!)
7. Assign v_id correctly in the face of 32bit rollover.
8. Remove the LRU list for namecache entries, not needed. Remove the
#ifdef NCH_STATISTICS stuff, it's not needed either.
9. Use the vnode freelist as a true LRU list, also for namecache accesses.
10. Reuse vnodes more aggresively but also more selectively, if we can't
reuse, malloc a new one. There is no longer a hard limit on their
number, they grow to the point where we don't reuse potentially
usable vnodes. A vnode will not get recycled if still has pages in
core or if it is the source of namecache entries (Yes, this does
indeed work :-) "." and ".." are not namecache entries any longer...)
11. Do not overload the v_id field in namecache entries with whiteout
information, use a char sized flags field instead, so we can get
rid of the vpid and v_id fields from the namecache struct. Since
we're linked to the vnodes and purged when they're cleaned, we don't
have to check the v_id any more.
12. NFS knew about the limitation on name length in the namecache, it
shouldn't and doesn't now.
Bugs:
The namecache statistics no longer includes the hits for ".."
and "." hits.
Performance impact:
Generally in the +/- 0.5% for "normal" workstations, but
I hope this will allow the system to be selftuning over a
bigger range of "special" applications. The case where
RAM is available but unused for cache because we don't have
any vnodes should be gone.
Future work:
Straighten out the namecache statistics.
"desiredvnodes" is still used to (bogusly ?) size hash
tables in the filesystems.
I have still to find a way to safely free unused vnodes
back so their number can shrink when not needed.
There is a few uses of the v_id field left in the filesystems,
scheduled for demolition at a later time.
Maybe a one slot cache for unused namecache entries should
be implemented to decrease the malloc/free frequency.
but now that we've widened the scope of the smp work to -current, it might
be an idea to warn new people that might not have read all the docs yet
that the SMP support needs to be activated via a sysctl.
This code re-numbers PCI busses in the MP table to match PCI semantics
when the MP BIOS fails to do it properly.
Reviewed by: Peter Wemm <peter@spinner.DIALix.COM>
replace invldebug with invltlb_ok for throttling smp_invltlb() during boot.
Reviewed by: informal discussion with Peter Wemm <peter@spinner.DIALix.COM>
Peter Wemm <peter@spinner.DIALix.COM>, Steve Passe <smp@csn.net>
removed all the IPI_INTS code.
made the XFAST_IPI32 code default, renaming Xfastipi32 to Xinvltlb.
This commit includes the following changes:
1) Old-style (pr_usrreq()) protocols are no longer supported, the compatibility
glue for them is deleted, and the kernel will panic on boot if any are compiled
in.
2) Certain protocol entry points are modified to take a process structure,
so they they can easily tell whether or not it is possible to sleep, and
also to access credentials.
3) SS_PRIV is no more, and with it goes the SO_PRIVSTATE setsockopt()
call. Protocols should use the process pointer they are now passed.
4) The PF_LOCAL and PF_ROUTE families have been updated to use the new
style, as has the `raw' skeleton family.
5) PF_LOCAL sockets now obey the process's umask when creating a socket
in the filesystem.
As a result, LINT is now broken. I'm hoping that some enterprising hacker
with a bit more time will either make the broken bits work (should be
easy for netipx) or dike them out.
There are various options documented in i386/conf/LINT, there is more to
come over the next few days.
The kernel should run pretty much "as before" without the options to
activate SMP mode.
There are a handful of known "loose ends" that need to be fixed, but
have been put off since the SMP kernel is in a moderately good condition
at the moment.
This commit is the result of the tinkering and testing over the last 14
months by many people. A special thanks to Steve Passe for implementing
the APIC code!