page-zeroing code as well as from the general page-zeroing code and use a
lazy tlb page invalidation scheme based on a callback made at the end
of mi_switch.
A number of people came up with this idea at the same time so credit
belongs to Peter, John, and Jake as well.
Two-way SMP buildworld -j 5 tests (second run, after stabilization)
2282.76 real 2515.17 user 704.22 sys before peter's IPI commit
2266.69 real 2467.50 user 633.77 sys after peter's commit
2232.80 real 2468.99 user 615.89 sys after this commit
Reviewed by: peter, jhb
Approved by: peter
choosethread() in MI C code instead of doing it in in assembly in all the
various cpu_switch() functions. This fixes problems on ia64 and sparc64.
Reviewed by: julian, peter, benno
Tested on: i386, alpha, sparc64
- It actually works this time, honest!
- Fine grained TLB shootdowns for SMP on i386. IPI's are very expensive,
so try and optimize things where possible.
- Introduce ranged shootdowns that can be done as a single IPI.
- PG_G support for i386
- Specific-cpu targeted shootdowns. For example, there is no sense in
globally purging the TLB cache for where we are stealing a page from
the local unshared process on the local cpu. Use pm_active to track
this.
- Add some instrumentation for the tlb shootdown code.
- Rip out SMP code from <machine/cpufunc.h>
- Try and fix some very bogus PG_G and PG_PS interactions that were bad
enough to cause vm86 bios calls to break. vm86 depended on our existing
bugs and this was the cause of the VESA panics last time.
- Fix the silly one-line error that caused the 'panic: bad pte' last time.
- Fix a couple of other silly one-line errors that should have caused more
pain than they did.
Some more work is needed:
- pmap_{zero,copy}_page[_idle]. These can be done without IPI's if we
have a hook in cpu_switch.
- The IPI handlers need some cleanup. I have a bogus %ds load that can
be avoided.
- APTD handling is rather bogus and appears to be a large source of
global TLB IPI shootdowns for no really good reason.
I see speedups of between 1.5% and ~4% on buildworlds in a while 1 loop.
I expect to see a bigger difference when there is significant pageout
activity or the system otherwise has memory shortages.
I have backed out a few optimizations that I had been using over the last
few days in order to be a little more conservative. I'll revisit these
again over the next few days as the dust settles.
New option: DISABLE_PG_G - In case I missed something.
the default) is now the only method for i386.
Remove the paraphanalia that supported critmode. Remove td_critnest, clean
up the assembly, and clean up (mostly remove) the old junk from
cpu_critical_enter() and cpu_critical_exit().
ipl.s except doreti which really belongs in with the exceptions as it's
just the other side of the same coin. Will remove ipl.s in a separate commit.
Agreed by: several including bde@freebsd.org
hardly MD, since all our platforms share the same macro. It's not
really compiler dependent either, but this helps in reducing
<machine/ansi.h> to only type definitions.
threaded VM pagezero kthread outside of Giant. For some platforms, this
is really easy since it can just use the direct mapped region. For others,
IPI sending is involved or there are other issues, so grab Giant when
needed.
We still have preemption issues to deal with, but Alan Cox has an
interesting suggestion on how to minimize the problem on x86.
Use Luigi's hack for preserving the (lack of) priority.
Turn the idle zeroing back on since it can now actually do something useful
outside of Giant in many cases.
mappings from the page tables, which were mapped with PG_G! We could
reuse the page table entry for another mapping (pmap_mapdev) but it
would never have cleared any remaining PG_G TLB entries.
pmap_swapin_proc/pmap_swapout_proc functions from the MD pmap code
and use a single equivalent MI version. There are other cleanups
needed still.
While here, use the UMA zone hooks to keep a cache of preinitialized
proc structures handy, just like the thread system does. This eliminates
one dependency on 'struct proc' being persistent even after being freed.
There are some comments about things that can be factored out into
ctor/dtor functions if it is worth it. For now they are mostly just
doing statistics to get a feel of how it is working.
These functions are always called on new memory so they can
not already be set up, so don't bother testing for that.
(This was left over from before we used UMA (which is cool))
and function) with existing configuration choices. Arguably if
ALT_BREAK_TO_DEBUGGER was present, so should have been
BREAK_TO_DEBUGGER. Regardless, it broke the option sort order in
these kernel configuration files.
Requested by: bde
The ability to schedule multiple threads per process
(one one cpu) by making ALL system calls optionally asynchronous.
to come: ia64 and power-pc patches, patches for gdb, test program (in tools)
Reviewed by: Almost everyone who counts
(at various times, peter, jhb, matt, alfred, mini, bernd,
and a cast of thousands)
NOTE: this is still Beta code, and contains lots of debugging stuff.
expect slight instability in signals..
(from: http://docs.freebsd.org/cgi/getmsg.cgi?fetch=832566+0+ \
current/freebsd-current)
"Too many pages were prefaulted in pmap_object_init_pt, thus
the wrong physical page was entered in the pmap for the virtual
address where the .dynamic section was supposed to be."
Submitted by: tegge
Approved by: tegge's patches never fail
types are not required, as the overhead is unnecessary:
o In the i386 pmap_protect(), `sindex' and `eindex' represent page
indices within the 32-bit virtual address space.
o In swp_pager_meta_build() and swp_pager_meta_ctl(), use a temporary
variable to store the low few bits of a vm_pindex_t that gets used
as an array index.
o vm_uiomove() uses `osize' and `idx' for page offsets within a
map entry.
o In vm_object_split(), `idx' is a page offset within a map entry.
64-bit file sizes. This step simply addresses the remaining overflows,
and does attempt to optimise performance. The details are:
o Use a 64-bit type for the vm_object `size' and the size argument
to vm_object_allocate().
o Use the correct type for index variables in dev_pager_getpages(),
vm_object_page_clean() and vm_object_page_remove().
o Avoid an overflow in the i386 pmap_object_init_pt().
sysctl (machdep.cpu_idle_hlt) to off in the SMP case. This allows you to
turn it on if you wish and do not particularly care about the small window
where a cpu will remain halted even when a job is placed on the run queue
(until the next clock tick).
obtained, when all other scheduling activity is suspended. This is needed
on sparc64 to deactivate the vmspace of the exiting process on all cpus.
Otherwise if another unrelated process gets the exact same vmspace structure
allocated to it (same address), its address space will not be activated
properly. This seems to fix some spontaneous signal 11 problems with smp
on sparc64.
checksumming. These bugs could possibly cause bad code to be
generated at elevated optimization levels.
First, eliminate the use of preprocessor magic to form the address
fields of asm instructions. It hid the actual addresses being
referenced from the compiler. Without knowledge of all the data
dependencies, the compiler might possibly use optimizations which
would result in incorrect code.
Use "__asm __volatile" rather than "__asm" for instruction sequences
that pass information through the condition codes (the carry bit, in
this case). Without __volatile, the compiler might add unrelated
code between consecutive __asm instructions, modifying the condition
codes. I have seen GCC insert stack pointer adjustments in this
way, for example. Unfortunately, GCC doesn't provide a way to
specify dependencies on the condition codes. You can specify that
they are clobbered, but not that you are going to use them as input.
Finally, simplify the LOAD macro. This macro is used as a poor
man's prefetch. The simpler version gives the compiler more leeway
about just how it performs the prefetch.
MFC after: 1 week
when machdep.tsc_freq returned a negative number on a 2.2GHz Xeon.
Submitted by: Brian Harrison <bharrison@ironport.com>
Reviewed by: phk
MFC after: 1 week
filesystem expands the inode to 256 bytes to make space for 64-bit
block pointers. It also adds a file-creation time field, an ability
to use jumbo blocks per inode to allow extent like pointer density,
and space for extended attributes (up to twice the filesystem block
size worth of attributes, e.g., on a 16K filesystem, there is space
for 32K of attributes). UFS2 fully supports and runs existing UFS1
filesystems. New filesystems built using newfs can be built in either
UFS1 or UFS2 format using the -O option. In this commit UFS1 is
the default format, so if you want to build UFS2 format filesystems,
you must specify -O 2. This default will be changed to UFS2 when
UFS2 proves itself to be stable. In this commit the boot code for
reading UFS2 filesystems is not compiled (see /sys/boot/common/ufsread.c)
as there is insufficient space in the boot block. Once the size of the
boot block is increased, this code can be defined.
Things to note: the definition of SBSIZE has changed to SBLOCKSIZE.
The header file <ufs/ufs/dinode.h> must be included before
<ufs/ffs/fs.h> so as to get the definitions of ufs2_daddr_t and
ufs_lbn_t.
Still TODO:
Verify that the first level bootstraps work for all the architectures.
Convert the utility ffsinfo to understand UFS2 and test growfs.
Add support for the extended attribute storage. Update soft updates
to ensure integrity of extended attribute storage. Switch the
current extended attribute interfaces to use the extended attribute
storage. Add the extent like functionality (framework is there,
but is currently never used).
Sponsored by: DARPA & NAI Labs.
Reviewed by: Poul-Henning Kamp <phk@freebsd.org>
implementations can provide a base zero ffs function if they wish.
This changes
#define RQB_FFS(mask) (ffs64(mask))
foo = RQB_FFS(mask) - 1;
to
#define RQB_FFS(mask) (ffs64(mask) - 1)
foo = RQB_FFS(mask);
On some platforms we can get the "- 1" for free, eg: those that use the
C code for ffs64().
Reviewed by: jake (in principle)
allocated slabs and bucket caches for free items. It will not go ask the vm
for pages. This differs from M_NOWAIT in that it not only doesn't block, it
doesn't even ask.
- Add a new zcreate option ZONE_VM, that sets the BUCKETCACHE zflag. This
tells uma that it should only allocate buckets out of the bucket cache, and
not from the VM. It does this by using the M_NOVM option to zalloc when
getting a new bucket. This is so that the VM doesn't recursively enter
itself while trying to allocate buckets for vm_map_entry zones. If there
are already allocated buckets when we get here we'll still use them but
otherwise we'll skip it.
- Use the ZONE_VM flag on vm map entries and pv entries on x86.
do not blunder around enabling interrupts and running trap handlers.
trap_pfault() will normally pass control to ddb's fault handler which
will normally do the right thing.
This bug is very old. but in old versions of FreeBSD it is probably only
serious for trap handling that involves sleeping. In -current, attempting
to examine unmapped memory while stopped at a breakpoint at mi_switch()
was always fatal.
- ktrace no longer requires Giant so do ktrace syscall events before and
after acquiring and releasing Giant, respectively.
- For i386, ia32 syscalls on ia64, powerpc, and sparc64, get rid of the
goto bad hack and instead use the model on ia64 and alpha were we
skip the actual syscall invocation if error != 0. This fixes a bug
where if we the copyin() of the arguments failed for a syscall that
was not marked MP safe, we would try to release Giant when we had
not acquired it.
mask on both input and output to fpsetmask(), but this was only done for
input, so fpsetmask() returned the complement of the old mask (ANDed with
the mask bitfield).
PR: 38170
MFC after: 4 weeks
Don't require pin be non-zero before we map bogus intlines, always do it.
This fixes a number of problems on HP Omnibook computers.
Tested/Reviewed by: Brooks Davis
MI API with empty cpu_pause() functions on other arch's, but this
functionality is definitely unique to IA-32, so I decided to leave it
as i386-only and wrap it in #ifdef's. I should have dropped the cpu_
prefix when I made that decision.
Requested by: bde
make_dev() to create device nodes for each of the serial port channels
(ttym%d and cuam%d respectively, as borrowed from MAKEDEV). This allows
the rc driver to work in 5.0. I've tested it with only one card, but
will try sticking in a second card tomorrow and see what happens.
checking, followed by a lookup of the process. Do not call
ptrace() for permission checking, but do it inline.
Spotted by: rwatson
o While here, copy-in arguments before we lock. This fixes
a possible permanent lock.
Reviewed by: rwatson
with 16-bit ints, since u_short is promoted when it is passed to a
varargs function. gcc now warns about this. We always pass small
integers (this is well obuscated), so there are no conversion problems.
Fixed a related style bug (bogus cast).
the case of VM86 calls from the kernel was broken, so this bug was not
a security hole.
PR: 36710
Submitted by: David Xu <davidx@viasoft.com.cn> (version for RELENG_4)
MFC after: 3 days
i386/ia64/alpha - catch up to sparc64/ppc:
- replace pmap_kernel() with refs to kernel_pmap
- change kernel_pmap pointer to (&kernel_pmap_store)
(this is a speedup since ld can set these at compile/link time)
all platforms (as suggested by jake):
- gc unused pmap_reference
- gc unused pmap_destroy
- gc unused struct pmap.pm_count
(we never used pm_count - we track address space sharing at the vmspace)
ever connect a SCSI Cdrom/Tape/Jukebox/Scanner/Printer/kitty-litter-scooper
to your high-end RAID controller. The interface to the arrays is still
via the block interface; this merely provides a way to circumvent the
RAID functionality and access the SCSI buses directly. Note that for
somewhat obvious reasons, hard drives are not exposed to the da driver
through this interface, though you can still talk to them via the pass
driver. Be the first on your block to low-level format unsuspecting
drives that are part of an array!
To enable this, add the 'aacp' device to your kernel config.
MFC after: 3 days
timecounter will be used starting at the next second, which is
good enough for sysctl purposes. If better adjustment is needed
the NTP PLL should be used.
the symbol index defined by the relocation. The elf_lookup() support
function is to be used by elf_reloc() when symbol lookups need to be
done. The elf_lookup() function operates on the symbol index and
will do a symbol name based lookup when such is required, otherwise
it uses the symbol index directly. This solves the problem seen on
ia64 where the symbol hash table does not contain local symbols and
a symbol name based lookup would fail for those symbols.
Don't pass the symbol name to elf_reloc(), as it isn't used any more.
2, but that's not the case. This fixes the case where there were slots
in the PIR table that had no bits set, but we assumed they did and used
strange results as a result.
o Map invalid INTLINE registers to 255 in pci_cfgreg.c. This should allow
us to remove the bogus checks in MI code for non-255 values.
I put these changes out for review a while ago, but no one responded
to them, so into current they go.
This should help us work better on machines that don't route
interrupts in the traditional way.
MFC After: 4286 millifortnights
due to conditions that suggest the possible need for stack growth.
This has two beneficial effects: (1) we can
now remove calls to vm_map_growstack() from the MD trap handlers and (2)
simple page faults are faster because we no longer unnecessarily perform
vm_map_growstack() on every page fault.
o Remove vm_map_growstack() from the i386's trap_pfault().
o Remove the acquisition and release of Giant from i386's trap_pfault().
(vm_fault() still acquires it.)
environment needed at boot time to a dynamic subsystem when VM is
up. The dynamic kernel environment is protected by an sx lock.
This adds some new functions to manipulate the kernel environment :
freeenv(), setenv(), unsetenv() and testenv(). freeenv() has to be
called after every getenv() when you have finished using the string.
testenv() only tests if an environment variable is present, and
doesn't require a freeenv() call. setenv() and unsetenv() are self
explanatory.
The kenv(2) syscall exports these new functionalities to userland,
mainly for kenv(1).
Reviewed by: peter
and pmap_copy_page(). This gets rid of a couple more physical addresses
in upper layers, with the eventual aim of supporting PAE and dealing with
the physical addressing mostly within pmap. (We will need either 64 bit
physical addresses or page indexes, possibly both depending on the
circumstances. Leaving this to pmap itself gives more flexibilitly.)
Reviewed by: jake
Tested on: i386, ia64 and (I believe) sparc64. (my alpha was hosed)
trying to run X on some Athlon systems where the BIOS does odd things
(mines an ASUS A7A266, but it seems to also help on other systems).
Here's a description of the problem and my fix:
The problem with the old MTRR code is that it only expects
to find documented values in the bytes of MTRR registers.
To convert the MTRR byte into a FreeBSD "Memory Range Type"
(mrt) it uses the byte value and looks it up in an array.
If the value is not in range then the mrt value ends up
containing random junk.
This isn't an immediate problem. The mrt value is only used
later when rewriting the MTRR registers. When we finally
go to write a value back again, the function i686_mtrrtype()
searches for the junk value and returns -1 when it fails
to find it. This is converted to a byte (0xff) and written
back to the register, causing a GPF as 0xff is an illegal
value for a MTRR byte.
To work around this problem I've added a new mrt flag
MDF_UNKNOWN. We set this when we read a MTRR byte which
we do not understand. If we try to convert a MDF_UNKNOWN
back into a MTRR value, then the new function, i686_mrt2mtrr,
just returns the old value of the MTRR byte. This leaves
the memory range type unchanged.
I'd like to merge this before the 4.6 code freeze, so if people
can test this with XFree 4 that would be very useful.
PR: 28418, 25958
Tested by: jkh, Christopher Masto <chris@netmonger.net>
MFC after: 2 weeks
the indentation more like the other multi-line assembley in
this file.
Someone who understands gcc constraints could update the
constraints for do_cpuid.
o Recent changes to osigreturn() and sigreturn() have made them MPSAFE. Add
a comment to this effect.
Submitted by: bde (bullet #1)
Reviewed by: jhb (bullet #2)
_BYTE_ORDER. These are far more useful than their non-underscored
equivalents as these can be used in restricted namespace environments.
Mark the non-underscored variants as deprecated.
drivers with MI portions into the MI notes. Device drivers such as busses
like the isa, eisa, and pci devices are now in the MD NOTES section even
though they have some MI code. This will ensure that only the proper bits
of device drivers will be included due to the optional bits dependent on
the busses in sys/conf/files. This commit also takes the stance that since
hints are ignored in NOTES anyways, it is ok to include hints for a bus
that may not be present.
Advice from: bde
PCPU_LAZY_INC() which increments elements in it for cases where we
can afford the occassional inaccuracy. Use of per-cpu stats counters
avoids significant cache stalls in various critical paths that would
otherwise severely limit our cpu scaleability.
Adjust all sysctl's accessing cnt.* elements to now use a procedure
which aggregates the requested field for all cpus and for the global
vmmeter.
The global vmmeter is retained, since some stats counters, like v_free_min,
cannot be made per-cpu. Also, this allows us to convert counters from
the global vmmeter to the per-cpu vmmeter in a piecemeal fashion, so
have at it!
most cases NULL is passed, but in some cases such as network driver locks
(which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used.
Tested on: i386, alpha, sparc64
In the i386 case, options BOOTP requires options NFS_ROOT as well as
options NFSCLIENT. With *both* the NFS options, a bootpc_init()
prototype is brought in by nfsclient/nfsdiskless.h.
In the ia64 case, it just doesn't work and my change just pushes it
further away from working.
Suggested to be wrong by: bde
they aren't in the usual path of execution for syscalls and traps.
The main complication for this is that we have to set flags to control
ast() everywhere that changes the signal mask.
Avoid locking in userret() in most of the remaining cases.
Submitted by: luoqi (first part only, long ago, reorganized by me)
Reminded by: dillon
Unfortunately, this level doesn't really provide enough granularity. We
probably need several MI NOTES type files for things that are shared by
several architectures but not by all. For example, the PCI options could
live in a NOTES.pci.
This also updates the Makefile for i386 to generate LINT. The only changes
in the generated LINT are the order of various options.
Suggestions for improvement welcome.
in dump byte order (=network byte order). Swap blocksize and dumptime
to avoid extraneous padding on 64-bit architectures. Use CTASSERT
instead of runtime checks to make sure the header is 512 bytes large.
Various style(9) fixes.
Reviewed by: phk, bde, mike
various machdep.c's to being declared in kern_mutex.c.
- Add a new function mutex_init() used to perform early initialization
needed for mutexes such as setting up thread0's contested lock list
and initializing MI mutexes. Change the various MD startup routines
to call this function instead of duplicating all the code themselves.
Tested on: alpha, i386
and cpu_critical_exit() and moves associated critical prototypes into their
own header file, <arch>/<arch>/critical.h, which is only included by the
three MI source files that need it.
Backout and re-apply improperly comitted syntactical cleanups made to files
that were still under active development. Backout improperly comitted program
structure changes that moved localized declarations to the top of two
procedures. Partially re-apply one of the program structure changes to
move 'mask' into an intermediate block rather then in three separate
sub-blocks to make the code more readable. Re-integrate bug fixes that Jake
made to the sparc64 code.
Note: In general, developers should not gratuitously move declarations out
of sub-blocks. They are where they are for reasons of structure, grouping,
readability, compiler-localizability, and to avoid developer-introduced bugs
similar to several found in recent years in the VFS and VM code.
Reviewed by: jake
general cleanup of the API. The entire API now consists of two functions
similar to the pre-KSE API. The suser() function takes a thread pointer
as its only argument. The td_ucred member of this thread must be valid
so the only valid thread pointers are curthread and a few kernel threads
such as thread0. The suser_cred() function takes a pointer to a struct
ucred as its first argument and an integer flag as its second argument.
The flag is currently only used for the PRISON_ROOT flag.
Discussed on: smp@
Caveats:
The new savecore program is not complete in the sense that it emulates
enough of the old savecores features to do the job, but implements none
of the options yet.
I would appreciate if a userland hacker could help me out getting savecore
to do what we want it to do from a users point of view, compression,
email-notification, space reservation etc etc. (send me email if
you are interested).
Currently, savecore will scan all devices marked as "swap" or "dump" in
/etc/fstab _or_ any devices specified on the command-line.
All architectures but i386 lack an implementation of dumpsys(), but
looking at the i386 version it should be trivial for anybody familiar
with the platform(s) to provide this function.
Documentation is quite sparse at this time, more to come.
Details:
ATA and SCSI drivers should work as the dump formatting code has been
removed. The IDA, TWE and AAC have not yet been converted.
Dumpon now opens the device and uses ioctl(DIOCGKERNELDUMP) to set
the device as dumpdev. To implement the "off" argument, /dev/null
is used as the device.
Savecore will fail if handed any options since they are not (yet)
implemented. All devices marked "dump" or "swap" in /etc/fstab
will be scanned and dumps found will be saved to diskfiles
named from the MD5 hash of the header record. The header record
is dumped in readable format in the .info file. The kernel
is not saved. Only complete dumps will be saved.
All maintainer rights for this code are disclaimed: feel free to
improve and extend.
Sponsored by: DARPA, NAI Labs
the osigcontext or ucontext_t rather than useracc() followed by direct user-
space memory accesses. This reduces (o)sigreturn()'s execution time by 5-
50%.
Submitted by: bde
back into the calling MD code. The MD code must ensure no races between
checking the astpening flag and returning to usermode.
Submitted by: peter (ia64 bits)
Tested on: alpha (peter, jeff), i386, ia64 (peter), sparc64
with this flag. Remove the dup_list and dup_ok code from subr_witness. Now
we just check for the flag instead of doing string compares.
Also, switch the process lock, process group lock, and uma per cpu locks over
to this interface. The original mechanism did not work well for uma because
per cpu lock names are unique to each zone.
Approved by: jhb
disablement assumptions in kern_fork.c by adding another API call,
cpu_critical_fork_exit(). Cleanup the td_savecrit field by moving it
from MI to MD. Temporarily move cpu_critical*() from <arch>/include/cpufunc.h
to <arch>/<arch>/critical.c (stage-2 will clean this up).
Implement interrupt deferral for i386 that allows interrupts to remain
enabled inside critical sections. This also fixes an IPI interlock bug,
and requires uses of icu_lock to be enclosed in a true interrupt disablement.
This is the stage-1 commit. Stage-2 will occur after stage-1 has stabilized,
and will move cpu_critical*() into its own header file(s) + other things.
This commit may break non-i386 architectures in trivial ways. This should
be temporary.
Reviewed by: core
Approved by: core
not removing tabs before "__P((", and not outdenting continuation lines
to preserve non-KNF lining up of code with parentheses. Switch to KNF
formatting and/or rewrap the whole prototype in some cases.
not removing tabs before "__P((", and not outdenting continuation lines
to preserve non-KNF lining up of code with parentheses. Switch to KNF
formatting and/or rewrap the whole prototype in some cases.
not removing tabs before "__P((", and not outdenting continuation lines
to preserve non-KNF lining up of code with parentheses. Switch to KNF
formatting and/or rewrap the whole prototype in some cases.
Instead of caching the ucred reference, just go ahead and eat the
decerement and increment of the refcount. Now that Giant is pushed down
into crfree(), we no longer have to get Giant in the common case. In the
case when we are actually free'ing the ucred, we would normally free it on
the next kernel entry, so the cost there is not new, just in a different
place. This also removse td_cache_ucred from struct thread. This is
still only done #ifdef DIAGNOSTIC.
Tested on: i386, alpha
to copy the sigframe to the user's stack. Useracc() takes a non-trivial
amount of time. Eliminating it speeds up signal delivery by 15% or more.
o Update some comments.
Submitted by: bde
older PCI BIOSes hate this and this leads to panics when it is done. Also,
assume that a uniquely routed interrupt is already routed. This also
seems to help some older laptops with feable BIOSes cope.
Problem:
selwakeup required calling pfind which would cause lock order
reversals with the allproc_lock and the per-process filedesc lock.
Solution:
Instead of recording the pid of the select()'ing process into the
selinfo structure, actually record a pointer to the thread. To
avoid dereferencing a bad address all the selinfo structures that
are in use by a thread are kept in a list hung off the thread
(protected by sellock). When a selwakeup occurs the selinfo is
removed from that threads list, it is also removed on the way out
of select or poll where the thread will traverse its list removing
all the selinfos from its own list.
Problem:
Previously the PROC_LOCK was used to provide the mutual exclusion
needed to ensure proper locking, this couldn't work because there
was a single condvar used for select and poll and condvars can
only be used with a single mutex.
Solution:
Introduce a global mutex 'sellock' which is used to provide mutual
exclusion when recording events to wait on as well as performing
notification when an event occurs.
Interesting note:
schedlock is required to manipulate the per-thread TDF_SELECT
flag, however if given its own field it would not need schedlock,
also because TDF_SELECT is only manipulated under sellock one
doesn't actually use schedlock for syncronization, only to protect
against corruption.
Proc locks are no longer used in select/poll.
Portions contributed by: davidc
machdep.guessed_bootdev, and add code to sysctl to parse its value
and give a (not necessarily correct) name to the device we booted
from (the main motivation for this code is to use the info in the
PicoBSD boot scripts, and the impact on the kernel is minimal).
NOTE: the information available in bootdev is not always reliable,
so you should not trust it too much. The parsing code is the same
as in boot2.c, and cannot cover all cases -- as it is, it seems to
work fine with floppies and IDE disks recognised by the BIOS. It
_should_ work as well with SCSI disks recognised by the BIOS.
Booting from a CDROM in floppy emulation will return /dev/fd0 (because
this is what the BIOS tells us).
Booting off the network (e.g. with etherboot) leaves bootdev unset so
the value will be printed as "invalid (0xffffffff)".
Finally, this feature might go away at some point, hopefully when we
have a more reliable way to get the same information.
MFC-after: 5 days
o In i386's <machine/endian.h>, macros have some advantages over
inlines, so change some inlines to macros.
o In i386's <machine/endian.h>, ungarbage collect word_swap_int()
(previously __uint16_swap_uint32), it has some uses on i386's with
PDP endianness.
Submitted by: bde
o Move a comment up in <machine/endian.h> that was accidentially moved
down a few revisions ago.
o Reenable userland's use of optimized inline-asm versions of
byteorder(3) functions.
o Fix ordering of prototypes vs. redefinition of byteorder(3)
functions, so that the non-GCC (libc asm) case has proper
prototypes.
o Add proper prototypes for byteorder(3) functions in <sys/param.h>.
o Prevent redundant duplicate prototypes by making use of the
_BYTEORDER_PROTOTYPED define.
o Move the bswap16(), bswap32(), bswap64() C functions into MD space
for platforms in which asm versions don't exist. This significantly
reduces the complexity of some things at the cost of duplicate code.
Reviewed by: bde
and commenting of NETSMB, NETWMBCRYPTO, and SMBFS. In NOTES, they
had all floated to the bottom of the file with the list of seemingly
random and unclassified kernel options. This change moves them back
up to the network protocol and file system areas, and also documents
the dependencies.
be allocated as arrays indexed by the cpu id. Previously the only reliable
way to know the max cpu id was through MAXCPU. mp_ncpus isn't useful here
because cpu ids may be sparsely mapped, although x86 and alpha do not do this.
Also, call cpu_mp_probe much earlier so the max cpu id is known before the VM
starts up. This is intended to help support per cpu queues for the new
allocator, but may be useful elsewhere.
Reviewed by: jake
Approved by: jake
This makes other power-management system (APM for now) to be able to
generate power profile change events (ie. AC-line status changes), and
other kernel components, not only the ACPI components, can be notified
the events.
- move subroutines in acpi_powerprofile.c (removed) to kern/subr_power.c
- call power_profile_set_state() also from APM driver when AC-line
status changes
- add call-back function for Crusoe LongRun controlling on power
profile changes for a example
Previously, the UPAGES/KSTACK area of processes/threads would leak memory
at the time that a previously swapped process was terminated. Lukcily, the
leak was only 12K/proc, so it was unlikely to be a major problem unless you
had an undersized swap partition.
Submitted by: dillon
Reviewed by: silby
MFC after: 1 week
In order to determine what to page out, the vm_daemon checks
reference bits on all pages belonging to all processes. Unfortunately,
the algorithm used reacted badly with shared pages; each shared page
would be checked once per process sharing it; this caused an O(N^2)
growth of tlb invalidations. The algorithm has been changed so that
each page will be checked only 16 times.
Prior to this change, a fork/sleepbomb of 1300 processes could cause
the vm_daemon to take over 60 seconds to complete, effectively
freezing the system for that time period. With this change
in place, the vm_daemon completes in less than a second. Any system
with hundreds of processes sharing pages should benefit from this change.
Note that the vm_daemon is only run when the system is under extreme
memory pressure. It is likely that many people with loaded systems saw
no symptoms of this problem until they reached the point where swapping
began.
Special thanks go to dillon, peter, and Chuck Cranor, who helped me
get up to speed with vm internals.
PR: 33542, 20393
Reviewed by: dillon
MFC after: 1 week
device drivers for bus system with other endinesses than the CPU (using
interfaces compatible to NetBSD):
- bwap16() and bswap32(). These have optimized implementations on some
architectures; for those that don't, there exist generic implementations.
- macros to convert from a certain byte order to host byte order and vice
versa, using a naming scheme like le16toh(), htole16().
These are implemented using the bswap functions.
- stream bus space access functions, which do not perform a byte order
conversion (while the normal access functions would if the bus endianess
differs from the CPU endianess).
htons(), htonl(), ntohs() and ntohl() are implemented using the new
functions above for kernel usage. None of the above interfaces is currently
exported to user land.
Make use of the new functions in a few places where local implementations
of the same functionality existed.
Reviewed by: mike, bde
Tested on alpha by: mike
There is some unresolved badness that has been eluding me, particularly
affecting uniprocessor kernels. Turning off PG_G helped (which is a bad
sign) but didn't solve it entirely. Userland programs still crashed.
boot and run (and indeed I am committing from it) instead of exploding
during the int 0x15 call from inside the atkbd driver to get the keyboard
repeat rates.
shootdowns in a couple of key places. Do the same for i386. This also
hides some physical addresses from higher levels and has it use the
generic vm_page_t's instead. This will help for PAE down the road.
Obtained from: jake (MI code, suggestions for MD part)
enabled in critical sections and streamline critical_enter() and
critical_exit().
This commit allows an architecture to leave interrupts enabled inside
critical sections if it so wishes. Architectures that do not wish to do
this are not effected by this change.
This commit implements the feature for the I386 architecture and provides
a sysctl, debug.critical_mode, which defaults to 1 (use the feature). For
now you can turn the sysctl on and off at any time in order to test the
architectural changes or track down bugs.
This commit is just the first stage. Some areas of the code, specifically
the MACHINE_CRITICAL_ENTER #ifdef'd code, is strictly temporary and will
be cleaned up in the STAGE-2 commit when the critical_*() functions are
moved entirely into MD files.
The following changes have been made:
* critical_enter() and critical_exit() for I386 now simply increment
and decrement curthread->td_critnest. They no longer disable
hard interrupts. When critical_exit() decrements the counter to
0 it effectively calls a routine to deal with whatever interrupts
were deferred during the time the code was operating in a critical
section.
Other architectures are unaffected.
* fork_exit() has been conditionalized to remove MD assumptions for
the new code. Old code will still use the old MD assumptions
in regards to hard interrupt disablement. In STAGE-2 this will
be turned into a subroutine call into MD code rather then hardcoded
in MI code.
The new code places the burden of entering the critical section
in the trampoline code where it belongs.
* I386: interrupts are now enabled while we are in a critical section.
The interrupt vector code has been adjusted to deal with the fact.
If it detects that we are in a critical section it currently defers
the interrupt by adding the appropriate bit to an interrupt mask.
* In order to accomplish the deferral, icu_lock is required. This
is i386-specific. Thus icu_lock can only be obtained by mainline
i386 code while interrupts are hard disabled. This change has been
made.
* Because interrupts may or may not be hard disabled during a
context switch, cpu_switch() can no longer simply assume that
PSL_I will be in a consistent state. Therefore, it now saves and
restores eflags.
* FAST INTERRUPT PROVISION. Fast interrupts are currently deferred.
The intention is to eventually allow them to operate either while
we are in a critical section or, if we are able to restrict the
use of sched_lock, while we are not holding the sched_lock.
* ICU and APIC vector assembly for I386 cleaned up. The ICU code
has been cleaned up to match the APIC code in regards to format
and macro availability. Additionally, the code has been adjusted
to deal with deferred interrupts.
* Deferred interrupts use a per-cpu boolean int_pending, and
masks ipending, spending, and fpending. Being per-cpu variables
it is not currently necessary to lock; bus cycles modifying them.
Note that the same mechanism will enable preemption to be
incorporated as a true software interrupt without having to
further hack up the critical nesting code.
* Note: the old critical_enter() code in kern/kern_switch.c is
currently #ifdef to be compatible with both the old and new
methodology. In STAGE-2 it will be moved entirely to MD code.
Performance issues:
One of the purposes of this commit is to enhance critical section
performance, specifically to greatly reduce bus overhead to allow
the critical section code to be used to protect per-cpu caches.
These caches, such as Jeff's slab allocator work, can potentially
operate very quickly making the effective savings of the new
critical section code's performance very significant.
The second purpose of this commit is to allow architectures to
enable certain interrupts while in a critical section. Specifically,
the intention is to eventually allow certain FAST interrupts to
operate rather then defer.
The third purpose of this commit is to begin to clean up the
critical_enter()/critical_exit()/cpu_critical_enter()/
cpu_critical_exit() API which currently has serious cross pollution
in MI code (in fork_exit() and ast() for example).
The fourth purpose of this commit is to provide a framework that
allows kernel-preempting software interrupts to be implemented
cleanly. This is currently used for two forward interrupts in I386.
Other architectures will have the choice of using this infrastructure
or building the functionality directly into critical_enter()/
critical_exit().
Finally, this commit is designed to greatly improve the flexibility
of various architectures to manage critical section handling,
software interrupts, preemption, and other highly integrated
architecture-specific details.
on for a while:
- fine grained TLB shootdown for SMP on i386
- ranged TLB shootdowns.. eg: specify a range of pages to shoot down with
a single IPI, since the IPI is very expensive. Adjust some callers
that used to trigger this inside tight loops to do a ranged shootdown
at the end instead.
- PG_G support for SMP on i386 (options ENABLE_PG_G)
- defer PG_G activation till after we decide what we are going to do with
PSE and the 4MB pages at the start of the kernel. This should solve
some rumored strangeness about stale PG_G entries getting stuck
underneath the 4MB pages.
- add some instrumentation for the fine TLB shootdown
- convert some asm instruction wrappers from functions to inlines. gcc
seems to do a fair bit better with this.
- [temporarily!] pessimize the tlb shootdown IPI handlers. I will fix
this again shortly.
This has been working fairly well for me for a while, but I have tweaked
it again prior to commit since my last major testing round. The only
outstanding problem that I know of is PG_G related, which is why there
is an option for it (not on by default for SMP). I have seen a world
speedups by a few percent (as much as 4 or 5% in one case) but I have
*not* accurately measured this - I am a bit sceptical of these numbers.
New locks are:
- pgrpsess_lock which locks the whole pgrps and sessions,
- pg_mtx which protects the pgrp members, and
- s_mtx which protects the session members.
Please refer to sys/proc.h for the coverage of these locks.
Changes on the pgrp/session interface:
- pgfind() needs the pgrpsess_lock held.
- The caller of enterpgrp() is responsible to allocate a new pgrp and
session.
- Call enterthispgrp() in order to enter an existing pgrp.
- pgsignal() requires a pgrp lock held.
Reviewed by: jhb, alfred
Tested on: cvsup.jp.FreeBSD.org
(which is a quad-CPU machine running -current)
While in userland, keep the thread's ucred reference in a shadow
field so that the usual place to store it is NULL.
If DIAGNOSTIC is not set, the thread ucred is kept valid until the next
kernel entry, at which time it is checked against the process cred
and possibly corrected. Produces a BIG speedup in
kernels with INVARIANTS set. (A previous commit corrected it
for the non INVARIANTS case already)
Reviewed by: dillon@freebsd.org