PR: 4486
Submitted by: tegge@idi.ntnu.no (Tor Egge)
Implement a function is_adapter_memory() in order to determine what
should nto be dumped at all. Currently, only populated with the ``ISA
memory hole''. Adapter regions of other busses should be added.
holding CPU along with the lock. When a CPU fails to get the lock
it compares its own id to the holder id. If they are the same it
panic()s, as simple locks are binary, and this would cause a deadlock.
Controlled by smptests.h: SL_DEBUG, ON by default.
Some minor cleanup.
Add a simplelock to deal with disable_intr()/enable_intr() as used in UP kernel.
UP kernel expects that this is enough to guarantee exclusive access to
regions of code bracketed by these 2 functions.
Add a simplelock to bracket clock accesses in clock.c: clock_lock.
Help from: Bruce Evans <bde@zeta.org.au>
smp_active = 1 used to indicate that the system had frozen previously
started AP's, while smp_active = 0 was "AP's not yet started". I have split
this into smp_started (which is set when the AP's come online), and
smp_active is left for turning on/off AP scheduling.
- We now have enough per-cpu idle context, the real idle loop has been
revived (cpu's halt now with nothing to do).
- Some preliminary support for running some operations outside the
global lock (eg: zeroing "free but not yet zeroed pages") is present
but appears to cause problems. Off by default.
- the smp_active sysctl now behaves differently. It's merely a 'true/false'
option. Setting smp_active to zero causes the AP's to halt in the idle
loop and stop scheduling processes.
- bootstrap is a lot safer. Instead of sharing a statically compiled in
stack a number of times (which has caused lots of problems) and then
abandoning it, we use the idle context to boot the AP's directly. This
should help >2 cpu support since the bootlock stuff was in doubt.
- print physical apic id in traps.. helps identify private pages getting
out of sync. (You don't want to know how much hair I tore out with this!)
More cleanup to follow, this is more of a checkpoint than a
'finished' thing.
Added a new variable, 'bsp_apic_ready', which is set as soon as the bootstrap
CPU has initialized its local APIC. Conditionalize the GENSPLR functions
to call ss_lock ONLY after bsp_apic_ready is TRUE; This should prevent
any problems with races between the time the 1st AP becomes ready and the
time smp_active is set.
Made NEW_STRATEGY default.
Removed misc. old cruft.
Centralized simple locks into mp_machdep.c
Centralized simple lock macros into param.h
More cleanup in the direction of making splxx()/cpl MP-safe.
Several new fine-grained locks.
New FAST_INTR() methods:
- separate simplelock for FAST_INTR, no more giant lock.
- FAST_INTR()s no longer checks ipending on way out of ISR.
sio made MP-safe (I hope).
We now tsleep() in kthread_init() between start_init()
and prepare_usermode() while waiting for ALL the idle_loop()
processes to come online.
Debugged & tested by: "Thomas D. Dean" <tomdean@ix.netcom.com>
Reviewed by: David Greenman <dg@root.com>
Work done by BSDI, Jonathan Lemon <jlemon@americantv.com>,
Mike Smith <msmith@gsoft.com.au>, Sean Eric Fagan <sef@kithrup.com>,
and probably alot of others.
Submitted by: Jnathan Lemon <jlemon@americantv.com>
This code was eliminated when the PEND_INTS algorithm was added. But it was
discovered that PEND_INTS only worsen latency for FAST_INTR() routines,
which can't be marked pending.
Noticed & debugged by: dave adkins <adkin003@gold.tc.umn.edu>
- removed TEST_ALTTIMER.
- removed APIC_PIN0_TIMER.
- removed TIMER_ALL.
mplock.s:
- minor update of try_mplock for new algorithm where a CPU uses try_mplock
instead of get_mplock in the ISRs.
Macros to convert the Lite2 lock manager primitives to the names used
in the kernel proper. This allows us to hide them from the lock
manager till they can be turned on.
smp.h:
declarations for the new simplelock functions.
- s_lock_init()
- s_lock()
- s_lock_try()
- s_unlock()
Created lock for IO APIC and apic_imen (SMP version of imen)
- imen_lock
Code to use imen_lock for access from apic_ipl.s and apic_vector.s.
Moved this code *outside* of mp_lock.
It seems to work!!!
1) Make sure that the region mapped by a 4MB page is
properly aligned.
2) Don't turn on the PG_G flag in locore for SMP. I plan
to do that later in startup anyway.
3) Make sure the 2nd processor has PSE enabled, so that 4MB
pages don't hose it.
We don't use PG_G yet on SMP -- there is work to be done to make that
work correctly. It isn't that important anyway...
of the kernel, and also most of the dynamic parts of the kernel. Additionally,
4MB pages will be allocated for display buffers as appropriate (only.)
The 4MB support for SMP isn't complete, but doesn't interfere with operation
either.
this code is controlled by smptests.h: TEST_CPUSTOP, OFF by default
new code for handling mixed-mode 8259/APIC programming without 'ExtInt'
this code is controlled by smptests.h: TEST_ALTTIMER, ON by default
- TEST_CPUSTOP adds stop_cpus()/restart_cpus(), OFF by default
- TEST_ALTTIMER new method for attaching 8259 PIC to APIC
this method avoids 'ExtInt' programming, ON by default
- TIMER_ALL sends 8259/8254 timer INTs to all CPUs, ON by default
- ASMPOSTCODExxx code to display bytes to POST hardware, OFF by default
General cleanup.
New functions to stop/start CPUs via IPIs:
- int stop_cpus( u_int map );
- int restart_cpus( u_int map );
Turned off by default, enabled via smptests.h:TEST_CPUSTOP.
Current version has a BUG, perhaps a deadlock?
Till now NMIs would be ignored. Now an NMI is caught by the BSP.
APs still ignore NMI, am working on code to allow a CPU to stop other CPUs
via an IPI.
available to the kernel (VM_KMEM_SIZE). The default (32 MB) is too low
when having 512 MB or more physical memory in a server environment. This is
relevant on systems where "panic: kmem_malloc: kmem_map too small" is a
problem.
This eliminates a lot of #ifdef SMP type code. Things like _curproc reside
in a data page that is unique on each cpu, eliminating the expensive macros
like: #define curproc (SMPcurproc[cpunumber()])
There are some unresolved bootstrap and address space sharing issues at
present, but Steve is waiting on this for other work. There is still some
strictly temporary code present that isn't exactly pretty.
This is part of a larger change that has run into some bumps, this part is
standalone so it should be safe. The temporary code goes away when the
full idle cpu support is finished.
Reviewed by: fsmp, dyson
cost since it is only done in cpu_switch(), not for every exception.
The extra state is kept in the pcb, and handled much like the npx state,
with similar deficiencies (the state is not preserved across signal
handlers, and error handling loses state).
Changes to pmap.c for lapic_t lapic && ioapic_t ioapic pointers,
currently equal to apic_base && io_apic_base, will stand alone with the
private page mapping.
apic.h has defines like:
#define lapic__id lapic->id
Once private pages and "known virtual addr" mapping of the APICs is
ready all 'lapic__XXX' will be changed to 'lapic.XXX', and the defines
will be removed.
Changes to smp.h for lapic_t lapic && ioapic_t ioapic pointers,
currently equal to apic_base && io_apic_base, will stand alone with the
private page mapping.
reality. There will be a new call interface, but for now the file
pci_compat.c (which is to be deleted, after all drivers are converted)
provides an emulation of the old PCI bus driver functions. The only
change that might be visible to drivers is, that the type pcici_t
(which had been meant to be just a handle, whose exact definition
should not be relied on), has been converted into a pcicfgregs* .
The Tekram AMD SCSI driver bogusly relied on the definition of pcici_t
and has been converted to just call the PCI drivers functions to access
configuration space register, instead of inventing its own ...
This code is by no means complete, but assumed to be fully operational,
and brings the official code base more in line with my development code.
A new generic device descriptor data type has to be agreed on. The PCI
code will then use that data type to provide new functionality:
1) userconfig support
2) "wired" PCI devices
3) conflicts checking against ISA/EISA
4) maps will depend on the command register enable bits
5) PCI to Anything bridges can be defined as devices,
and are probed like any "standard" PCI device.
The following features are currently missing, but will be added back,
soon:
1) unknown device probe message
2) suppression of "mirrored" devices caused by ancient, broken chip-sets
This code relies on generic shared interrupt support just commited to
kern_intr.c (plus the modifications of isa.c and isa_device.h).
This is now the default, it delays most of the MP startup to the function
machdep.c:cpu_startup(). It should be possible to move the 2 functions
found there (mp_start() & mp_announce()) even further down the path once
we know exactly where that should be...
Help from: Peter Wemm <peter@spinner.dialix.com.au>
- The 1st (preparse_mp_table()) counts the number of cpus, busses, etc. and
records the LOCAL and IO APIC addresses.
- The 2nd pass (parse_mp_table()) does the actual parsing of info and recording
into the incore MP table.
This will allow us to defer the 2nd pass untill malloc() & private pages
are available (but thats for another day!).
panic( "xxxxx\n" );
to:
printf( "xxxxx\n" );
panic( "\n" );
For some as yet undetermined reason the argument to panic() is often NOT
printed, and the system sometimes hangs before reaching the panic printout.
So we hopefully at least print some useful info before the hang, as oppossed to
leaving the user clueless as to what has happened.
all uses of it with the equivalent calls to setbits().
This change incidentally eliminates a problem building ELF kernels
that was caused by SETBITS.
Reviewed by: fsmp, peter
Submitted by: bde
simplifies some assumptions and stops some code compile problems.
This should fix the compile hiccup in PR#3491, but smp kernel profiling
isn't likely to be fixed by this.
- one-liners all become inline.
- multi-liners become functions.
- FAST_IPI defines go away.
re-worked APICIPI_BANDAID code.
- now refered to as DETECT_DEADLOCK.
- on by default.
This code re-numbers PCI busses in the MP table to match PCI semantics
when the MP BIOS fails to do it properly.
Reviewed by: Peter Wemm <peter@spinner.DIALix.COM>
Peter Wemm <peter@spinner.DIALix.COM>, Steve Passe <smp@csn.net>
removed all the IPI_INTS code.
made the XFAST_IPI32 code default, renaming Xfastipi32 to Xinvltlb.
There are various options documented in i386/conf/LINT, there is more to
come over the next few days.
The kernel should run pretty much "as before" without the options to
activate SMP mode.
There are a handful of known "loose ends" that need to be fixed, but
have been put off since the SMP kernel is in a moderately good condition
at the moment.
This commit is the result of the tinkering and testing over the last 14
months by many people. A special thanks to Steve Passe for implementing
the APIC code!
for syscalls, so one frame was lost in backtraces from syscalls.
This is handled better in the kernel by using a different mcount
entry point for profiling before the frame pointer is set up.
Expand RCSID().
Use .p2align instead of the ambiguous .align.
Added idempotency ifdef.
Removed unused macros ALTENTRY(), ALTASENTRY(), ASENTRY(), _MID_ENTRY.
Cleaned up formatting.
Reviewed by: jdp reviewed an old version
Obtained from: parts from NetBSD
have successfully built, booted, and run a number of different ELF
kernel configurations, including GENERIC. LINT also builds and
links cleanly, though I have not tried to boot it.
The impact on developers is virtually nil, except for two things.
All linker sets that might possibly be present in the kernel must be
listed in "sys/i386/i386/setdefs.h". And all C symbols that are
also referenced from assembly language code must be listed in
"sys/i386/include/asnames.h". It so happens that failure to do
these things will have no impact on the a.out kernel. But it will
break the build of the ELF kernel.
The ELF bootloader works, but it is not ready to commit quite yet.
Rename the PT* index KSTK* #defines to UMAX*, since we don't have a kernel
stack there any more..
These are used to calculate VM_MAXUSER_ADDRESS and USRSTACK, and really
do not want to be changed with UPAGES since BSD/OS 2.x binary compatability
depends on it.
space. (!)
Have each process use the kernel stack and pcb in the kvm space. Since
the stacks are at a different address, we cannot copy the stack at fork()
and allow the child to return up through the function call tree to return
to user mode - create a new execution context and have the new process
begin executing from cpu_switch() and go to user mode directly.
In theory this should speed up fork a bit.
Context switch the tss_esp0 pointer in the common tss. This is a lot
simpler since than swithching the gdt[GPROC0_SEL].sd.sd_base pointer
to each process's tss since the esp0 pointer is a 32 bit pointer, and the
sd_base setting is split into three different bit sections at non-aligned
boundaries and requires a lot of twiddling to reset.
The 8K of memory at the top of the process space is now empty, and unmapped
(and unmappable, it's higher than VM_MAXUSER_ADDRESS).
Simplity the pmap code to manage process contexts, we no longer have to
double map the UPAGES, this simplifies and should measuably speed up fork().
The following parts came from John Dyson:
Set PG_G on the UPAGES that are now in kernel context, and invalidate
them when swapping them out.
Move the upages object (upobj) from the vmspace to the proc structure.
Now that the UPAGES (pcb and kernel stack) are out of user space, make
rfork(..RFMEM..) do what was intended by sharing the vmspace
entirely via reference counting rather than simply inheriting the mappings.
convenient and makes life difficult for my next commit. We still need
an i386tss to point to for the tss slot in the gdt, so we use a common
tss shared between all processes.
Note that this is going to break debugging until this series of commits
is finished. core dumps will change again too. :-( we really need
a more modern core dump format that doesn't depend on the pcb/upages.
This change makes VM86 mode harder, but the following commits will remove
a lot of constraints for the VM86 system, including the possibility of
extending the pcb for an IO port map etc.
Obtained from: bde
supports All Cyrix CPUs, IBM Blue Lightning CPU and NexGen (now AMD)
Nx586 CPU, and initialize special registers of Cyrix CPU and msr of
IBM Blue Lightning CPU.
If revision of Cyrix 6x86 CPU < 2.7, CPU cache is enabled in
write-through mode. This can be disabled by kernel configuration
options.
Reviewed by: Bruce Evans <bde@freebsd.org> and
Jordan K. Hubbard <jkh@freebsd.org>
at runtime.
etc/make.conf:
Nuked HAVE_FPU option.
lib/msun/Makefile:
Always build the i387 objects. Copy the i387 source files at build
time so that the i387 objects have different names. This is simpler
than renaming the files in the cvs repository or repeating half of
bsd.lib.mk to add explicit rules.
lib/msun/src/*.c:
Renamed all functions that have an i387-specific version by adding
`__generic_' to their names.
lib/msun/src/get_hw_float.c:
New file for getting machdep.hw_float from the kernel.
sys/i386/include/asmacros.h:
Abuse the ENTRY() macro to generate jump vectors and associated code.
This works much like PIC PLT dynamic initialization. The PIC case is
messy. The old i387 entry points are renamed. Renaming is easier
here because the names are given by macro expansions.
Changed it from 4 to 16 for i386's. It can be anything for i386's,
but compiler options limit it to a power of 2, and assembler and
linker deficiencies limit it to a small power of 2 (<= 16).
We use 16 in the kernel to get smaller tables (see Makefile.i386 and
<machine/asmacros.h>). We still use the default of 4 in user mode.
Use HISTCOUNTER instead of (*kcount) in the definition of KCOUNT()
for consistency with other macros.
complained so it cannot be entirely bad :-)
I include the email that probably explains it for people who already know:
> >Compiling with -O3 inlines functions. However the function that is being
> >inlined in makeinfo.c (add_word_args()) is a vararg function and must not be
> >inlined.
> >
> >The code in question is K&R style, and AFIK, there is no way for the compiler
> >to determine that the function uses vararg. Either change the code to use
> >prototypes, or use stdarg, or add a directive to prevent inlining.
>
> Not declaring a varargs function as varargs before it is used gives
> undefined behaviour.
>
> However, in practice the bug is probably in FreeBSD's <varargs.h>, which
> doesn't use gcc's __builtin_next_arg(). gcc should notice that it is
> used and not inline functions that have it. <stdarg.h.> uses it, but I
> think there's another gcc builtin that it should be using.
Patch attached. The ellipsis causes gcc to flag this as a varargs function,
and the name "__builtin_va_alist" is special cased in gcc to hide the last
argument in the arglist.
Reviewed by: bde & phk
Submitted by: jlemon@americantv.com (Jonathan Lemon)
This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.
Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.
also implies VM_PROT_EXEC. We support it that way for now,
since the break system call by default gives VM_PROT_ALL. Now
we have a better chance of coalesing map entries when mixing
mmap/break type operations. This was contributing to excessive
numbers of map entries on the modula-3 runtime system. The
problem is still not "solved", but the situation makes more
sense.
Eventually, when we work on architectures where VM_PROT_READ
is orthogonal to VM_PROT_EXEC, we will have to visit this
issue carefully (esp. regarding security issues.)
(1) deleted #if 0
pc98/pc98/mse.c
(2) hold per-unit I/O ports in ed_softc
pc98/pc98/if_ed.c
pc98/pc98/if_ed98.h
(3) merge more files by segregating changes into headers.
new file (moved from pc98/pc98):
i386/isa/aic_98.h
deleted:
well, it's already in the commit message so I won't repeat the
long list here ;)
Submitted by: The FreeBSD(98) Development Team
I decided to do this for every hardclock() call instead of lazily
in microtime(). The lazy method is simpler but has more overhead
if microtime() is called a lot.
CPU_THISTICKLEN() is now a no-op and should probably go away.
Previously it did nothing directly but had the side effect of
setting i586_last_tick for CPU_CLOCKUPDATE() and i586_avg_tick for
debugging. CPU_CLOCKUPDATE() now uses a better method and
i586_avg_tick is too much trouble to maintain.
Reduced nesting of #includes in the usual case.
Increased nesting of #includes when CLOCK_HAIR is defined. This
is a kludge to get typedefs for inline functions only when the
inline functions are used. Normally only kern_clock.c defines
this. kern_clock.c can't include the i386 headers directly.
Removed unused LOCORE support.
- use a more accurate and more efficient method of compensating for
overheads. The old method counted too much time against leaf
functions.
- normally use the Pentium timestamp counter if available.
On Pentiums, the times are now accurate to within a couple of cpu
clock cycles per function call in the (unlikely) event that there
are no cache misses in or caused by the profiling code.
- optionally use an arbitrary Pentium event counter if available.
- optionally regress to using the i8254 counter.
- scaled the i8254 counter by a factor of 128. Now the i8254 counters
overflow slightly faster than the TSC counters for a 150MHz Pentium :-)
(after about 16 seconds). This is to avoid fractional overheads.
files.i386:
permon.c temporarily has to be classified as a profiling-routine
because a couple of functions in it may be called from profiling code.
options.i386:
- I586_CTR_GUPROF is currently unused (oops).
- I586_PMC_GUPROF should be something like 0x70000 to enable (but not
use unless prof_machdep.c is changed) support for Pentium event
counters. 7 is a control mode and the counter number 0 is somewhere
in the 0000 bits (see perfmon.h for the encoding).
profile.h:
- added declarations.
- cleaned up separation of user mode declarations.
prof_machdep.c:
Mostly clock-select changes. The default clock can be changed by
editing kmem. There should be a sysctl for this.
subr_prof.c:
- added copyright.
- calibrate overheads for the new method.
- documented new method.
- fixed races and and machine dependencies in start/stop code.
mcount.c:
Use the new overhead compensation method.
gmon.h:
- changed GPROF4 counter type from unsigned to int. Oops, this should
be machine-dependent and/or int32_t.
- reorganized overhead counters.
Submitted by: Pentium event counter changes mostly by wollman
previous snap. Specifically, kern_exit and kern_exec now makes a
call into the pmap module to do a very fast removal of pages from the
address space. Additionally, the pmap module now updates the PG_MAPPED
and PG_WRITABLE flags. This is an optional optimization, but helpful
on the X86.
- fixed a sloppy common-style declaration.
- removed an unused macro.
- moved once-used macros to the one file where they are used.
- removed unused forward struct declarations.
- removed __pure.
- declared inline functions as inline in their prototype as well
as in theire definition (gcc unfortunately allows the prototype
to be inconsistent).
- staticized.
dependent operation, and not really a correct name. invltlb and invlpg
are more descriptive, and in the case of invlpg, a real opcode.
Additionally, fix the tlb management code for 386 machines.
with this quite a while ago when somebody reported a BSD/OS 2.1 binary
that wouldn't run. I'm pretty sure they tried it and I'm pretty sure
they mentioned to me that the patch worked.
comparisons in the inb() and outb() macros. I decided that int args
are OK here. Any type that can hold a u_int16_t without overflow
is correct, and 32-bit types are optimal.
Introduced a few tens of warnings (100 in LINT) for use of pessimized
(short) types for the port arg. Only a few drivers are affected by
this. u_short pessimizations aren't detected.
Added `__extension__' before the statement-expression in inb() so
that it can be compiled without warnings by gcc -pedantic.
(1) Add PC98 support to apm_bios.h and ns16550.h, remove pc98/pc98/ic
(2) Move PC98 specific code out of cpufunc.h (to pc98.h)
(3) Let the boot subtrees look more alike
Submitted by: The FreeBSD(98) Development Team
<freebsd98-hackers@jp.freebsd.org>
Changed i586_ctr_bias from long long to u_int. Only the low 32 bits
are used now that microtime uses a multiplication to do the scaling.
Previously the high 32 bits had to match those of rdtsc() to prevent
overflow traps and invalid timeval adjustments.
problem with the 'shell scripts' was found, but there was a 'strange'
problem found with a 486 laptop that we could not find. This commit
backs the code back to 25-jul, and will be re-entered after the snapshot
in smaller (more easily tested) chunks.
performance issues.
1) The pmap module has had too many inlines, and so the
object file is simply bigger than it needs to be.
Some common code is also merged into subroutines.
2) Removal of some *evil* PHYS_TO_VM_PAGE macro calls.
Unfortunately, a few have needed to be added also.
The removal caused the need for more vm_page_lookups.
I added lookup hints to minimize the need for the
page table lookup operations.
3) Removal of some bogus performance improvements, that
mostly made the code more complex (tracking individual
page table page updates unnecessarily). Those improvements
actually hurt 386 processors perf (not that people who
worry about perf use 386 processors anymore :-)).
4) Changed pv queue manipulations/structures to be TAILQ's.
5) The pv queue code has had some performance problems since
day one. Some significant scalability issues are resolved
by threading the pv entries from the pmap AND the physical
address instead of just the physical address. This makes
certain pmap operations run much faster. This does
not affect most micro-benchmarks, but should help loaded system
performance *significantly*. DG helped and came up with most
of the solution for this one.
6) Most if not all pmap bit operations follow the pattern:
pmap_test_bit();
pmap_clear_bit();
That made for twice the necessary pv list traversal. The
pmap interface now supports only pmap_tc_bit type operations:
pmap_[test/clear]_modified, pmap_[test/clear]_referenced.
Additionally, the modified routine now takes a vm_page_t arg
instead of a phys address. This eliminates a PHYS_TO_VM_PAGE
operation.
7) Several rewrites of routines that contain redundant code to
use common routines, so that there is a greater likelihood of
keeping the cache footprint smaller.
Fixed profiling of system times. It was pre-4.4Lite and didn't support
statclocks. System times were too small by a factor of 8.
Handle deferred profiling ticks the 4.4Lite way: use addupc_task() instead
of addupc(). Call addupc_task() directly instead of using the ADDUPC()
macro.
Removed vestigial support for PROFTIMER.
switch.s:
Removed addupc().
resourcevar.h:
Removed ADDUPC() and declarations of addupc().
cpu.h:
Updated a comment. i386's never were tahoe's, and the deferred profiling
tick became (possibly) multiple ticks in 4.4Lite.
Obtained from: mostly from NetBSD
All new code is "#ifdef PC98"ed so this should make no difference to
PC/AT (and its clones) users.
Ok'd by: core
Submitted by: FreeBSD(98) development team
ansi and traditional cpp.
The nesting rules of macros are different, which required some changes.
Use __CONCAT(x,y) instead of /**/.
Redo some comments to use /* */ rather than "# comment" because the ansi
cpp cares about those, and also cares about quote matching.
contributions or ideas from Stephen McKay <syssgm@devetir.qld.gov.au>,
Alan Cox <alc@cs.rice.edu>, David Greenman <davidg@freebsd.org> and me:
More usage of the TAILQ macros. Additional minor fix to queue.h.
Performance enhancements to the pageout daemon.
Addition of a wait in the case that the pageout daemon
has to run immediately.
Slightly modify the pageout algorithm.
Significant revamp of the pmap/fork code:
1) PTE's and UPAGES's are NO LONGER in the process's map.
2) PTE's and UPAGES's reside in their own objects.
3) TOTAL elimination of recursive page table pagefaults.
4) The page directory now resides in the PTE object.
5) Implemented pmap_copy, thereby speeding up fork time.
6) Changed the pv entries so that the head is a pointer
and not an entire entry.
7) Significant cleanup of pmap_protect, and pmap_remove.
8) Removed significant amounts of machine dependent
fork code from vm_glue. Pushed much of that code into
the machine dependent pmap module.
9) Support more completely the reuse of already zeroed
pages (Page table pages and page directories) as being
already zeroed.
Performance and code cleanups in vm_map:
1) Improved and simplified allocation of map entries.
2) Improved vm_map_copy code.
3) Corrected some minor problems in the simplify code.
Implemented splvm (combo of splbio and splimp.) The VM code now
seldom uses splhigh.
Improved the speed of and simplified kmem_malloc.
Minor mod to vm_fault to avoid using pre-zeroed pages in the case
of objects with backing objects along with the already
existant condition of having a vnode. (If there is a backing
object, there will likely be a COW... With a COW, it isn't
necessary to start with a pre-zeroed page.)
Minor reorg of source to perhaps improve locality of ref.
Macroize locore.s' page table setup even more, now it's almost readable.
Rename PG_U to PG_A (so that I can...)
Rename PG_u to PG_U. "PG_u" was just too ugly...
Remove some unused vars in pmap.c
Remove PG_KR and PG_KW
Remove SSIZE
Remove SINCR
Remove BTOPKERNBASE
This concludes my spring cleaning, modulus any bug fixes for messes I
have made on the way.
(Funny to be back here in pmap.c, that's where my first significant
contribution to 386BSD was... :-)
time. The results are currently ignored unless certain temporary options
are used.
Added sysctls to support reading and writing the clock frequency variables
(not the frequencies themselves). Writing is supposed to atomically
adjust all related variables.
machdep.c:
Fixed spelling of a function name in a comment so that I can log this
message which should have been with the previous commit.
Initialize `cpu_class' earlier so that it can be used in startrtclock()
instead of in calibrate_cyclecounter() (which no longer exists).
Removed range checking of `cpu'. It is always initialized to CPU_XXX
so it is less likely to be out of bounds than most variables.
clock.h:
Removed I586_CYCLECTR(). Use rdtsc() instead.
clock.c:
TIMER_FREQ is now a variable timer_freq that defaults to the old value of
TIMER_FREQ. #define'ing TIMER_FREQ should still work and may be the best
way of setting the frequency.
Calibration involves counting cycles while watching the RTC for one second.
This gives values correct to within (a few ppm) + (the innaccuracy of the
RTC) on my systems.
page dir+table index.
pmap.h: remove NUPDE, it was wrong and not used. Sanitize KSTKPTEOFF.
vmparam.h: Calculate virtual addr from PDI+PTI from pmap.h rather than
using magic math. Remove UPDT, not used.
regarding apm to LINT
- Disabled the statistics clock on machines which have an APM BIOS and
have the options "APM_BROKEN_STATCLOCK" enabled (which is default
in GENERIC now)
- move around some of the code in clock.c dealing with the rtc to make
it more obvios the effects of disabling the statistics clock
Reviewed by: bde
in a suboptimal manner. I had also noticed some panics that appeared
to be at least superficially caused by this problem. Also, included
are some minor mods to support more general handling of page table page
faulting. More details in a future commit.
Always delay using one inb(0x84) after each i/o in rtcin() - don't
do this conditional on the bogus option DUMMY_NOPS not being defined.
If you want an optionally slightly faster rtcin() again, then inline
it and use a better named option or sysctl variable. It only needs
to be fast in rtcintr().
(This code is as yet untested; to come after man page is written.)
This also adds inlines to cpufunc.h for the RDTSC, RDMSR, WRMSR, and RDPMC
instructions. The user-mode interface is via a subdevice of mem.c;
there is also a kernel-size interface which might be used to aid
profiling.
netscape-2.0 for Linux running all the Java stuff. The scrollbars are now
working, at least on my machine. (whew! :-)
I'm uncomfortable with the size of this commit, but it's too
inter-dependant to easily seperate out.
The main changes:
COMPAT_LINUX is *GONE*. Most of the code has been moved out of the i386
machine dependent section into the linux emulator itself. The int 0x80
syscall code was almost identical to the lcall 7,0 code and a minor tweak
allows them to both be used with the same C code. All kernels can now
just modload the lkm and it'll DTRT without having to rebuild the kernel
first. Like IBCS2, you can statically compile it in with "options LINUX".
A pile of new syscalls implemented, including getdents(), llseek(),
readv(), writev(), msync(), personality(). The Linux-ELF libraries want
to use some of these.
linux_select() now obeys Linux semantics, ie: returns the time remaining
of the timeout value rather than leaving it the original value.
Quite a few bugs removed, including incorrect arguments being used in
syscalls.. eg: mixups between passing the sigset as an int, vs passing
it as a pointer and doing a copyin(), missing return values, unhandled
cases, SIOC* ioctls, etc.
The build for the code has changed. i386/conf/files now knows how
to build linux_genassym and generate linux_assym.h on the fly.
Supporting changes elsewhere in the kernel:
The user-mode signal trampoline has moved from the U area to immediately
below the top of the stack (below PS_STRINGS). This allows the different
binary emulations to have their own signal trampoline code (which gets rid
of the hardwired syscall 103 (sigreturn on BSD, syslog on Linux)) and so
that the emulator can provide the exact "struct sigcontext *" argument to
the program's signal handlers.
The sigstack's "ss_flags" now uses SS_DISABLE and SS_ONSTACK flags, which
have the same values as the re-used SA_DISABLE and SA_ONSTACK which are
intended for sigaction only. This enables the support of a SA_RESETHAND
flag to sigaction to implement the gross SYSV and Linux SA_ONESHOT signal
semantics where the signal handler is reset when it's triggered.
makesyscalls.sh no longer appends the struct sysentvec on the end of the
generated init_sysent.c code. It's a lot saner to have it in a seperate
file rather than trying to update the structure inside the awk script. :-)
At exec time, the dozen bytes or so of signal trampoline code are copied
to the top of the user's stack, rather than obtaining the trampoline code
the old way by getting a clone of the parent's user area. This allows
Linux and native binaries to freely exec each other without getting
trampolines mixed up.
pmap_activate since it's not used anymore. Changed cpu_fork so that
it uses one line of inline assembly rather than calling mvesp() to
get the current stack pointer. Removed mvesp() since it is no longer
being used.
clock interrupts.
Keep a 1-in-16 smoothed average of the length of each tick. If the
CPU speed is correctly diagnosed, this should give experienced users
enough information to figure out a more suitable value for `tick'.
fd and wt drivers need bounce buffers, so this normally saves 32K-1K
of kernel memory.
Keep track of which DMA channels are busy. isa_dmadone() must now be
called when DMA has finished or been aborted.
Panic for unallocated and too-small (required) bounce buffers.
fd.c:
There will be new warnings about isa_dmadone() not being called after
DMA has been aborted.
sound/dmabuf.c:
isa_dmadone() needs more parameters than are available, so temporarily
use a new interface isa_dmadone_nobounce() to avoid having to worry
about panics for fake parameters. Untested.
looking at a high resolution clock for each of the following events:
function call, function return, interrupt entry, interrupt exit,
and interesting branches. The differences between the times of
these events are added at appropriate places in a ordinary histogram
(as if very fast statistical profiling sampled the pc at those
places) so that ordinary gprof can be used to analyze the times.
gmon.h:
Histogram counters need to be 4 bytes for microsecond resolutions.
They will need to be larger for the 586 clock.
The comments were vax-centric and wrong even on vaxes. Does anyone
disagree?
gprof4.c:
The standard gprof should support counters of all integral sizes
and the size of the counter should be in the gmon header. This
hack will do until then. (Use gprof4 -u to examine the results
of non-statistical profiling.)
config/*:
Non-statistical profiling is configured with `config -pp'.
`config -p' still gives ordinary profiling.
kgmon/*:
Non-statistical profiling is enabled with `kgmon -B'. `kgmon -b'
still enables ordinary profiling (and distables non-statistical
profiling) if non-statistical profiling is configured.
bzero.
Deprecated blkclr (removed it).
Removed some old cruft from cpufunc.h.
The optimized bzero was submitted by Torbjorn Granlund <tege@matematik.su.se>
The kernel adaption and other changes by me.
overflows.
It sure would be nice if there was an unmapped page between the PCB and
the stack (and that the size of the stack was configurable!). With the
way things are now, the PCB will get clobbered before the double fault
handler gets control, making somewhat of a mess of things. Despite this,
it is still fairly easy to poke around in the overflowed stack to figure
out the cause.
Protected them with `#ifdef KERNEL' so that <sys/queue.h> is valid C++.
Added the necessary #includes of <sys/queue.h>.
These functions are bogus and should be replaced by the queue macros.
- Don't print out meaningless iCOMP numbers, those are for droids.
- Use a shorter wait to determine clock rate to avoid deficiencies
in DELAY().
- Use a fixed-point representation with 8 bits of fraction to store
the rate and rationalize the variable name. It would be
possible to use even more fraction if it turns out to be
worthwhile (I rather doubt it).
The question of source code arrangement remains unaddressed.
misplaced extern declarations (mostly prototypes of interrupt handlers)
that this exposed. The prototypes should be moved back to the driver
sources when the functions are staticalized.
Added idempotency guards to <machine/conf.h>. "ioconf.h" can't be
included when building LKMs so define a wart in bsd.kmod.mk to help
guard against including it.
free-run and doing a subtract in microtime() rather than resetting the
counter to zero at every clock tick. In combination with the changes to
kern_clock.c, this should eliminate all the immediately obvious sources
of systematic jitter in timekeeping on Pentium machines.
changes to allow devices that don't probe (e.g. /dev/mem)
to create devfs entries
this required giving 'configure' its own SYSINIT entry
so we could duck in just before it with a DEVFS init
and some device inits..
my devfs now looks like:
./misc
./misc/speaker
./misc/mem
./misc/kmem
./misc/null
./misc/zero
./misc/io
./misc/console
./misc/pcaudio
./misc/pcaudioctl
./disks
./disks/rfloppy
./disks/rfloppy/fd0.1440
./disks/rfloppy/fd1.1200
./disks/floppy
./disks/floppy/fd0.1440
./disks/floppy/fd1.1200
also some sligt cleanups.. DEVFS needs a lot of work
but I'm getting back to it..
didn't work are somewhat bogusly optimized away before the constraint
is checked. We still expect constants passed to inline functions to
remain constant, but if the compiler ever decides that they aren't
constant then it will just generate slightly slower code instead of
an error.
proc or any VM system structure will have to be rebuilt!!!
Much needed overhaul of the VM system. Included in this first round of
changes:
1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages,
haspage, and sync operations are supported. The haspage interface now
provides information about clusterability. All pager routines now take
struct vm_object's instead of "pagers".
2) Improved data structures. In the previous paradigm, there is constant
confusion caused by pagers being both a data structure ("allocate a
pager") and a collection of routines. The idea of a pager structure has
escentially been eliminated. Objects now have types, and this type is
used to index the appropriate pager. In most cases, items in the pager
structure were duplicated in the object data structure and thus were
unnecessary. In the few cases that remained, a un_pager structure union
was created in the object to contain these items.
3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now
be removed. For instance, vm_object_enter(), vm_object_lookup(),
vm_object_remove(), and the associated object hash list were some of the
things that were removed.
4) simple_lock's removed. Discussion with several people reveals that the
SMP locking primitives used in the VM system aren't likely the mechanism
that we'll be adopting. Even if it were, the locking that was in the code
was very inadequate and would have to be mostly re-done anyway. The
locking in a uni-processor kernel was a no-op but went a long way toward
making the code difficult to read and debug.
5) Places that attempted to kludge-up the fact that we don't have kernel
thread support have been fixed to reflect the reality that we are really
dealing with processes, not threads. The VM system didn't have complete
thread support, so the comments and mis-named routines were just wrong.
We now use tsleep and wakeup directly in the lock routines, for instance.
6) Where appropriate, the pagers have been improved, especially in the
pager_alloc routines. Most of the pager_allocs have been rewritten and
are now faster and easier to maintain.
7) The pagedaemon pageout clustering algorithm has been rewritten and
now tries harder to output an even number of pages before and after
the requested page. This is sort of the reverse of the ideal pagein
algorithm and should provide better overall performance.
8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup
have been removed. Some other unnecessary casts have also been removed.
9) Some almost useless debugging code removed.
10) Terminology of shadow objects vs. backing objects straightened out.
The fact that the vm_object data structure escentially had this
backwards really confused things. The use of "shadow" and "backing
object" throughout the code is now internally consistent and correct
in the Mach terminology.
11) Several minor bug fixes, including one in the vm daemon that caused
0 RSS objects to not get purged as intended.
12) A "default pager" has now been created which cleans up the transition
of objects to the "swap" type. The previous checks throughout the code
for swp->pg_data != NULL were really ugly. This change also provides
the rudiments for future backing of "anonymous" memory by something
other than the swap pager (via the vnode pager, for example), and it
allows the decision about which of these pagers to use to be made
dynamically (although will need some additional decision code to do
this, of course).
13) (dyson) MAP_COPY has been deprecated and the corresponding "copy
object" code has been removed. MAP_COPY was undocumented and non-
standard. It was furthermore broken in several ways which caused its
behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will
continue to work correctly, but via the slightly different semantics
of MAP_PRIVATE.
14) (dyson) Sharing maps have been removed. It's marginal usefulness in a
threads design can be worked around in other ways. Both #12 and #13
were done to simplify the code and improve readability and maintain-
ability. (As were most all of these changes)
TODO:
1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing
this will reduce the vnode pager to a mere fraction of its current size.
2) Rewrite vm_fault and the swap/vnode pagers to use the clustering
information provided by the new haspage pager interface. This will
substantially reduce the overhead by eliminating a large number of
VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be
improved to provide both a "behind" and "ahead" indication of
contiguousness.
3) Implement the extended features of pager_haspage in swap_pager_haspage().
It currently just says 0 pages ahead/behind.
4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps
via a much more general mechanism that could also be used for disk
striping of regular filesystems.
5) Do something to improve the architecture of vm_object_collapse(). The
fact that it makes calls into the swap pager and knows too much about
how the swap pager operates really bothers me. It also doesn't allow
for collapsing of non-swap pager objects ("unnamed" objects backed by
other pagers).
include/signal.h:
There was massive namespace pollution from including <sys/types.h>.
POSIX functions were declared even when _ANSI_SOURCE is defined.
sys.sys/signal.h:
NSIG was declared even if _ANSI_SOURCE or _POSIX_SOURCE is defined.
sig_atomic_t wasn't declared if _POSIX_SOURCE is defined.
Declare a typedef for signal handling functions and use it to
unobfuscate declarations and to avoid half-baked function types
that cause unwanted compiler warnings at certain warning levels.
Fix confusing comment about SA_RESTART.
sys/i386/include/signal.h:
This has to be included to get the declaration of sig_atomic_t even
when _ANSI_SOURCE is defined, so be more careful about polluting
the ANSI namespace.
Uniformize idempotency ifdefs.
in machdep.c (it should use the global nmbclusters). Moved the calculation
of nmbclusters into conf/param.c (same place where nmbclusters has always
been assigned), and made the calculation include an extra amount based
on "maxusers". NMBCLUSTERS can still be overrided in the kernel config
file as always, but this change will make that generally unnecessary. This
fixes the "bug" reports from people who have misconfigured kernels seeing
the network hang when the mbuf cluster pool runs out.
Reviewed by: John Dyson
attempted to check for insecure and fatal eflags and segment
selectors, but missed many cases and got the IOPL check back to
front. The other syscalls didn't check at all.
sys_process.c, machdep.c:
Only allow PT_WRITE_U to write to the registers (ordinary and FP).
psl.h, locore.s, machdep.c:
Eliminate PSL_MBZ, PSL_MBO and PSL_USERCLR. We are not supposed
to assume anything about the reserved bits. Use PSL_USERCHANGE
and PSL_KERNEL instead. Rename PSL_USERSET to PSL_USER.
exception.s:
Define a private label for use by doreti when returning to user
mode fails.
machdep.c:
In syscalls, allow changing only the eflags that can be changed on
486's in user mode (no longer attempt to allow benign IOPL changes;
allow changing the nasty PSL_NT; don't allow changing the i586
bits).
Don't attempt to check all the cases involving invalid selectors
and %eip's. Just check for privilege violations and let the invalid
things cause a trap.
procfs_machdep.c:
Call the ptrace register functions to do all the work for reading
and writing ordinary registers and for single stepping.
trap.c:
Ignore traps caused by PSL_NT being set. Previously, users could
cause a fatal trap in user mode by setting PSL_NT and executing an
iret, and a fatal trap in kernel mode by setting PSL_NT and making
a syscall. PSL_NT was cleared too late and not in enough modes to
fix the problem.
Make all traps in user mode (except T_NMI) nonfatal.
Recover from traps caused by attempting to load invalid user
registers in doreti by restarting the traps so that they appear to
occur in user mode.
---
Fix bogons that I noticed while fixing the above:
psl.h:
Fix some comments.
Uniformize idempotency ifdef.
exception.s, machdep.c:
Remove rsvd[0-14]. rsvd0 hasn't been reserved since the 486 came
out. Replace rsvd0 by `align'. rsvd[0-11] used wrong (magic
non-unique) trap numbers. Replace rsvd[1-14] by rsvd.
locore.s:
Enable alignment check flag on 486's and 586's.
machdep.c:
Use a better type for kstack[].
Use TFREGP() to find the registers.
Reformat ptrace functions from SEF to something closer to KNF.
procfs_machdep.c:
The wrong pointer to the registers got fixed as a side effect.
Implement reading and writing of FP registers.
/proc/*/*regs now work (only) for processes that are in memory.
Clean up comments.
trap.c, trap.h:
Remove unused trap types.
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
Keep track of interrupt nesting level. It is normally 0
for syscalls and traps, but is fudged to 1 for their exit
processing in case they metamorphose into an interrupt
handler.
i386/genassym.c;
Remove support for the obsolete pcb_iml and pcb_cmap2.
Add support for pcb_inl.
i386/swtch.s:
Fudge the interrupt nesting level across context switches and in
the idle loop so that the work for preemptive context switches
gets counted as interrupt time, the work for voluntary context
switches gets counted mostly as system time (the part when
curproc == 0 gets counted as interrupt time), and only truly idle
time gets counted as idle time.
Remove obsolete support (commented out and otherwise) for pcb_iml.
Load curpcb just before curproc instead of just after so that
curpcb is always valid if curproc is. A few more changes like
this may fix tracing through context switches.
Remove obsolete function swtch_to_inactive().
include/cpu.h:
Use the new interrupt nesting level variable to implement a
non-fake CLF_INTR() so that accounting for the interrupt state
works.
You can use top, iostat or (best) an up to date systat to see
interrupt overheads. I see the expected huge interrupt overheads
for ISA devices (on a 486DX/33, about 55% for an IDE drive
transferring 1250K/sec and the same for a WD8013EBT network card
transferring 1100K/sec). The huge interrupt overheads for serial
devices are unfortunately normally invisible.
include/pcb.h:
Remove the obsolete pcb_iml and pcb_cmap2. Replace them by
padding to preserve binary compatibility.
Use part of the new padding for pcb_inl.
isa/icu.s:
isa/vector.s:
Keep track of interrupt nesting level.