Includes some minor whitespace changes, and re-ordering to be able to document
properly (e.g, grouping of variables and the SYSCTL macro calls for them, where
the documentation has been added.)
Reviewed by: phk (but all errors are mine)
be allocated as arrays indexed by the cpu id. Previously the only reliable
way to know the max cpu id was through MAXCPU. mp_ncpus isn't useful here
because cpu ids may be sparsely mapped, although x86 and alpha do not do this.
Also, call cpu_mp_probe much earlier so the max cpu id is known before the VM
starts up. This is intended to help support per cpu queues for the new
allocator, but may be useful elsewhere.
Reviewed by: jake
Approved by: jake
This makes other power-management system (APM for now) to be able to
generate power profile change events (ie. AC-line status changes), and
other kernel components, not only the ACPI components, can be notified
the events.
- move subroutines in acpi_powerprofile.c (removed) to kern/subr_power.c
- call power_profile_set_state() also from APM driver when AC-line
status changes
- add call-back function for Crusoe LongRun controlling on power
profile changes for a example
PAGE_SIZE / MCLBYTES == 1 crash. Fix them by changing the
appropriate "allocate new page and bucket" code in mb_alloc to use
the macro for properly grabbing an allocated object from a bucket,
the one that checks whether the bucket is empty.
This should allow ken to continue testing zero-copy stuff on -CURRENT.
Noticed and provided debug info: ken
fill out netc_anon (a `struct ucred'), and add an XXX around the
entire operation since it isn't clear whether it's doing the right
thing with things like cr_uidinfo and cr_prison.
the data was supplied as a uio or an mbuf. Previously the limit was
ignored for mbuf data, and NFS could run the kernel out of mbufs
when an ipfw rule blocked retransmissions.
the pipe is locked and shouldn't be.
initialize pipe->pipe_mtxp to NULL when creating pipes in order not
to trip the above assertions.
swap pipe lock with giant around calls to pipe_destroy_write_buffer()
pipe_destroy_write_buffer issue noticed by: jhb
fully protect p_ucred yet so Giant is needed until all the p_ucred
locking is done. This is the original reason td_ucred was not used
immediately after its addition. Unfortunately, not using td_ucred is
not enough to avoid problems. Since p_ucred could be stale, we could
actually be dereferencing a stale pointer to dink with the refcount, so
we really need Giant to avoid foot-shooting. This allows td_ucred to
be safely used as well.
as arguments. The correct hostname is copied into the buffer
while having the prison's lock acquired in a jailed process'
case.
Reviewed by: jhb, rwatson
There is some unresolved badness that has been eluding me, particularly
affecting uniprocessor kernels. Turning off PG_G helped (which is a bad
sign) but didn't solve it entirely. Userland programs still crashed.
Both ends of the pipe share a pool_mutex, this makes allocation
and deadlock avoidance easy.
Remove some un-needed FILE_LOCK ops while I'm here.
There are some issues wrt to select and the f{s,g}etown code that
we'll have to deal with, I think we may also need to move the calls
to vfs_timestamp outside of the sections covered by PIPE_LOCK.
spares (the size of the field was changed from u_short to u_int to
reflect what it really ends up being). Accordingly, change users of
xucred to set and check this field as appropriate. In the kernel,
this is being done inside the new cru2x() routine which takes a
`struct ucred' and fills out a `struct xucred' according to the
former. This also has the pleasant sideaffect of removing some
duplicate code.
Reviewed by: rwatson
shootdowns in a couple of key places. Do the same for i386. This also
hides some physical addresses from higher levels and has it use the
generic vm_page_t's instead. This will help for PAE down the road.
Obtained from: jake (MI code, suggestions for MD part)
enabled in critical sections and streamline critical_enter() and
critical_exit().
This commit allows an architecture to leave interrupts enabled inside
critical sections if it so wishes. Architectures that do not wish to do
this are not effected by this change.
This commit implements the feature for the I386 architecture and provides
a sysctl, debug.critical_mode, which defaults to 1 (use the feature). For
now you can turn the sysctl on and off at any time in order to test the
architectural changes or track down bugs.
This commit is just the first stage. Some areas of the code, specifically
the MACHINE_CRITICAL_ENTER #ifdef'd code, is strictly temporary and will
be cleaned up in the STAGE-2 commit when the critical_*() functions are
moved entirely into MD files.
The following changes have been made:
* critical_enter() and critical_exit() for I386 now simply increment
and decrement curthread->td_critnest. They no longer disable
hard interrupts. When critical_exit() decrements the counter to
0 it effectively calls a routine to deal with whatever interrupts
were deferred during the time the code was operating in a critical
section.
Other architectures are unaffected.
* fork_exit() has been conditionalized to remove MD assumptions for
the new code. Old code will still use the old MD assumptions
in regards to hard interrupt disablement. In STAGE-2 this will
be turned into a subroutine call into MD code rather then hardcoded
in MI code.
The new code places the burden of entering the critical section
in the trampoline code where it belongs.
* I386: interrupts are now enabled while we are in a critical section.
The interrupt vector code has been adjusted to deal with the fact.
If it detects that we are in a critical section it currently defers
the interrupt by adding the appropriate bit to an interrupt mask.
* In order to accomplish the deferral, icu_lock is required. This
is i386-specific. Thus icu_lock can only be obtained by mainline
i386 code while interrupts are hard disabled. This change has been
made.
* Because interrupts may or may not be hard disabled during a
context switch, cpu_switch() can no longer simply assume that
PSL_I will be in a consistent state. Therefore, it now saves and
restores eflags.
* FAST INTERRUPT PROVISION. Fast interrupts are currently deferred.
The intention is to eventually allow them to operate either while
we are in a critical section or, if we are able to restrict the
use of sched_lock, while we are not holding the sched_lock.
* ICU and APIC vector assembly for I386 cleaned up. The ICU code
has been cleaned up to match the APIC code in regards to format
and macro availability. Additionally, the code has been adjusted
to deal with deferred interrupts.
* Deferred interrupts use a per-cpu boolean int_pending, and
masks ipending, spending, and fpending. Being per-cpu variables
it is not currently necessary to lock; bus cycles modifying them.
Note that the same mechanism will enable preemption to be
incorporated as a true software interrupt without having to
further hack up the critical nesting code.
* Note: the old critical_enter() code in kern/kern_switch.c is
currently #ifdef to be compatible with both the old and new
methodology. In STAGE-2 it will be moved entirely to MD code.
Performance issues:
One of the purposes of this commit is to enhance critical section
performance, specifically to greatly reduce bus overhead to allow
the critical section code to be used to protect per-cpu caches.
These caches, such as Jeff's slab allocator work, can potentially
operate very quickly making the effective savings of the new
critical section code's performance very significant.
The second purpose of this commit is to allow architectures to
enable certain interrupts while in a critical section. Specifically,
the intention is to eventually allow certain FAST interrupts to
operate rather then defer.
The third purpose of this commit is to begin to clean up the
critical_enter()/critical_exit()/cpu_critical_enter()/
cpu_critical_exit() API which currently has serious cross pollution
in MI code (in fork_exit() and ast() for example).
The fourth purpose of this commit is to provide a framework that
allows kernel-preempting software interrupts to be implemented
cleanly. This is currently used for two forward interrupts in I386.
Other architectures will have the choice of using this infrastructure
or building the functionality directly into critical_enter()/
critical_exit().
Finally, this commit is designed to greatly improve the flexibility
of various architectures to manage critical section handling,
software interrupts, preemption, and other highly integrated
architecture-specific details.
on for a while:
- fine grained TLB shootdown for SMP on i386
- ranged TLB shootdowns.. eg: specify a range of pages to shoot down with
a single IPI, since the IPI is very expensive. Adjust some callers
that used to trigger this inside tight loops to do a ranged shootdown
at the end instead.
- PG_G support for SMP on i386 (options ENABLE_PG_G)
- defer PG_G activation till after we decide what we are going to do with
PSE and the 4MB pages at the start of the kernel. This should solve
some rumored strangeness about stale PG_G entries getting stuck
underneath the 4MB pages.
- add some instrumentation for the fine TLB shootdown
- convert some asm instruction wrappers from functions to inlines. gcc
seems to do a fair bit better with this.
- [temporarily!] pessimize the tlb shootdown IPI handlers. I will fix
this again shortly.
This has been working fairly well for me for a while, but I have tweaked
it again prior to commit since my last major testing round. The only
outstanding problem that I know of is PG_G related, which is why there
is an option for it (not on by default for SMP). I have seen a world
speedups by a few percent (as much as 4 or 5% in one case) but I have
*not* accurately measured this - I am a bit sceptical of these numbers.
but never accept'ed, so they must be destroyed. Originally, unp_drop()
detected this situation by checking if so->so_head is non-NULL.
However, since revision 1.54 of uipc_socket.c (Feb 1999), so->so_head
is set to NULL before calling soabort(), so any unix-domain sockets
waiting to be accept'ed are leaked if the server socket is closed.
Resolve this by moving the socket destruction code into uipc_abort()
itself, and making it unconditional (the other caller of unp_drop()
never needs the socket to be destroyed). Use unp_detach() to avoid
the original code duplication when destroying the socket.
PR: kern/17895
Reviewed by: dwmalone (an earlier version of the patch)
MFC after: 1 week
our feet when we look inside timecounter structures.
Make the "sync_other" code more robust by never overwriting the
tc_next field.
Add counters for the bin[up]time functions.
Call tc_windup() in tc_init() and switch_timecounter() to make sure
we all the fields set right.
New locks are:
- pgrpsess_lock which locks the whole pgrps and sessions,
- pg_mtx which protects the pgrp members, and
- s_mtx which protects the session members.
Please refer to sys/proc.h for the coverage of these locks.
Changes on the pgrp/session interface:
- pgfind() needs the pgrpsess_lock held.
- The caller of enterpgrp() is responsible to allocate a new pgrp and
session.
- Call enterthispgrp() in order to enter an existing pgrp.
- pgsignal() requires a pgrp lock held.
Reviewed by: jhb, alfred
Tested on: cvsup.jp.FreeBSD.org
(which is a quad-CPU machine running -current)
While in userland, keep the thread's ucred reference in a shadow
field so that the usual place to store it is NULL.
If DIAGNOSTIC is not set, the thread ucred is kept valid until the next
kernel entry, at which time it is checked against the process cred
and possibly corrected. Produces a BIG speedup in
kernels with INVARIANTS set. (A previous commit corrected it
for the non INVARIANTS case already)
Reviewed by: dillon@freebsd.org
Remove bowrite(), it is now unused.
This is the first step in getting entirely rid of BIO_ORDERED which is
a generally accepted evil thing.
Approved by: mckusick
- P_INMEM checks in all the functions. P_INMEM must be checked because
PHOLD() is broken. The old bits had bogus locking (using sched_lock)
to lock P_INMEM. After removing the P_INMEM checks, we were left with
just the bogus locking.
- large comments. They were too large, but better than nothing.
Remove obfuscations that were gained in transition in rev.1.76:
- PROC_REG_ACTION() is even more of an obfuscation than PROC_ACTION().
The change copies procfs_machdep.c rev.1.22 of i386/procfs_machdep.c
verbatim except for "fixing" the old-style function headers and adjusting
function names and comments. It doesn't remove the bogus locking.
Approved by: des
the next commit actually doing the:
return val; -> return (val);
changes. This commit was done in preparation for getting ``struct
modules'' locked down.
Reviewed by: bde
Approved by: dfr
call VOP_CLOSE() with vp unlocked; clean up the return path a little,
in as much as our namei/vnode operation return paths can be cleared
up. For a return case that was apparently never taken, this sure
is ugly.
Reviewed by: jeffr
- Leave 10 processes for root-only use, the previous
value of 1 was insufficient to run ps ax | more.
- Remove the printing of "proc: table full". When the table
really is full, this would flood the screen/logs, making
the problem tougher to deal with.
- Force any process trying to fork beyond its user's maximum
number of processes to sleep for .5 seconds before returning
failure. This turns 2000 rampaging fork monsters into 2000
harmlessly snoozing fork monsters.
Reviewed by: dillon, peter
MFC after: 1 week
step and the others are reservations for coming code.
All will be stubbed in this kernel in the next commit.
This will allow people to easily make KSE binaries for userland testing
(the syscalls will be in libc) but they will still need a real KSE kernel
to test it. (libc looks in /sys to decide what it should add stubs for).
VOP_CLOSE() on the vnode, so that VOP_OPEN() and VOP_CLOSE() calls
are symmetric in all failure cases. This prevents an 'open' reference
from being leaked in that unlikely failure scenario.
should require a shared lock, rather than an exclusive lock, which can
improve performance. No actual code change here, since a number of
VFS locking fixes are in the works.
The use of the zone allocator may or may not be overkill.
There is an XXX: over in ufs/ufs/ufs_vnops.c that jlemon may need
to revisit.
This shaves about 60 bytes of struct vnode which on my laptop means
600k less RAM used for vnodes.
reaquiring it. In the same vein, don't bother dropping the thread cred
when goinf ot userland. We are guaranteed to nned it when we come back,
(which we are guaranteed to do).
Reviewed by: jhb@freebsd.org, bde@freebsd.org (slightly different version)
This escaped because DEVICE_POLLING is disabled in LINT being
not compatible with SMP. In fact, it is only a runtime problem,
so if we could recognize that we are building a LINT kernel
we could as well disable the check for SMP being defined.
Reported-by: Joe Clarke
to perform an ownership test in revoke(). This is also required for
MAC hooks so that the vnode lock is held during a call to the MAC
framework. Release the lock before calling VOP_REVOKE().
Discussed with: phk, mckusick
o Modify the system call syntax for extattr_{get,set}_{fd,file}() so
as not to use the scatter gather API (which appeared not to be used
by any consumers, and be less portable), rather, accepts 'data'
and 'nbytes' in the style of other simple read/write interfaces.
This changes the API and ABI.
o Modify system call semantics so that extattr_get_{fd,file}() return
a size_t. When performing a read, the number of bytes read will
be returned, unless the data pointer is NULL, in which case the
number of bytes of data are returned. This changes the API only.
o Modify the VOP_GETEXTATTR() vnode operation to accept a *size_t
argument so as to return the size, if desirable. If set to NULL,
the size will not be returned.
o Update various filesystems (pseodofs, ufs) to DTRT.
These changes should make extended attributes more useful and more
portable. More commits to rebuild the system call files, as well
as update userland utilities to follow.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
- Fix null-pointer dereference introduced when snapshotting
was introduced. This occured because unlike the previous code,
vn_start_write() doesn't always return a non-NULL mp, as
filesystems may not support the VOP_GETWRITEMOUNT() call. For
now, rely on two pointers, so that vn_finished_write() works
properly.
- Fix locking problems on exit, introduced at some past time,
some when snapshots came in, where a vnode might not be
unlocked before being vrele'd in various error situations.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
The binary format "bintime" is a 32.64 format, it will go to 64.64
when time_t does.
The bintime format is available to consumers of time in the kernel,
and is preferable where timeintervals needs to be accumulated.
This change simplifies much of the magic math inside the timecounters
and improves the frequency and time precision by a couple of bits.
I have not been able to measure a performance difference which was not
a tiny fraction of the standard deviation on the measurements.
this is a low-functionality change that changes the kernel to access the main
thread of a process via the linked list of threads rather than
assuming that it is embedded in the process. It IS still embeded there
but remove all teh code that assumes that in preparation for the next commit
which will actually move it out.
Reviewed by: peter@freebsd.org, gallatin@cs.duke.edu, benno rice,
- Create a private list of active pmaps rather than abusing the list of all
processes when we need to look up pmaps. The process list needs a sx lock
and we can't be getting sx locks in the middle of cpu_switch()
(pmap_activate() can call pmap_get_asn() from cpu_switch()). Instead, we
protect the list with a spinlock. This also means the list is shorter
since a pmap can be used by more than one process and we could (at least
in thoery) dink with pmap's more than once, but now we only touch each
pmap once when we have to update all of them.
- Wrap pmap_activate()'s code to get a new ASN in an explicit critical section
so that when it is called while doing an exec() we can't get preempted.
- Replace splhigh() in pmap_growkernel() with a critical section to prevent
preemption while we are adjusting the kernel page tables.
- Fixes abuse of PCPU_GET(), which doesn't return an L-value.
- Also adds some slight cleanups to the ASN handling by adding some macros
instead of magic numbers in relation to the ASN and ASN generations.
Reviewed by: dfr
shared.
Also introduce vm_endcopy instead of using pointer tricks when
initializing new vmspaces.
The race occured because of how the reference was utilized:
test vmspace reference,
possibly block,
decrement reference
When sharing a vmspace between multiple processes it was possible
for two processes exiting at the same time to test the reference
count, possibly block and neither one free because they wouldn't
see the other's update.
Submitted by: green
HZ=BIGNUM will strain the assumptions behind timecounters to the
point where they break.
This may or may not help people seeing microuptime() backwards messages.
Make the global timecounter variable volatile, it makes no difference in
the code GCC generates, but it makes represents the intent correctly.
Thanks to: jdp
MFC after: 2 weeks
call VOP_INACTIVE before placing the vnode back on the free list.
Otherwise there is a race condition on SMP machines between
getnewvnode() locking the vnode to reclaim it and vrele()
locking the vnode to inactivate it. This window of vulnerability
becomes exaggerated in the presence of filesystems that have
been suspended as the inactive routine may need to temporarily
release the lock on the vnode to avoid deadlock with the syncer
process.
threads race for a file slot.
dup2(2) incorrectly assumes that if it needs to grow the ofiles
array that it will get what it wants. This assertion was valid
before we allowed shared filedescriptor tables but is now incorrect.
The assertion can trigger superfolous panics if the thread doing a
dup2 looses a race with another thread while possibly blocked in
the MALLOC call in fdalloc. Another thread may grab the slot we
are requesting which makes fdalloc return something other than what
we asked for, this will triggering the bogus assertion.
MFC after: 2 weeks
Reviewed by: phk
signal trampoline for old signals. The arches that support old signals
currently abuse sigreturn(2) instead. This mainly complicates things
and slightly breaks the the new sigreturn(2).
COMPAT is too limited to support the correct configuration of osigreturn,
and this commit doesn't attempt to fix it; it just moves the bogusness:
osigreturn() must now be provided unconditionally even on arches that
don't really need it; previously it had to be provided under the bogus
condition defined(COMPAT_43).
other threads as well as speed up the interfaces.
To fix the race and accomplish the speedup, remove selholddrop and
pollholddrop. The entire concept is somewhat bogus because holding
the individual struct file pointers offers us no guarantees that
another thread context won't close it on us thereby removing our
access to our own reference.
Selholddrop and pollholddrop also would do multiple locks and unlocks
of mutexes _per-file_ in the fd arrays to be scanned, this needed to
be sped up.
Instead of using selholddrop and pollholddrop, simply hold the
filedesc lock over the selscan and pollscan functions. This should
protect us against close(2)'s on the files as reduce the multiple
lock/unlock pairs per fd into a single lock over the filedesc.
from 1 megabyte of ram per user to 2 megabytes of ram per user, and
reduce the cap from 512 to 384. 512 leaves around 240 MB of KVM available
while 384 leaves 270 MB of KVM available. Available KVM is important
in order to deal with zalloc and kernel malloc area growth.
Reviewed by: mckusick
MFC: either before 4.5 if re's agree, or after 4.5
This allows obtaining crash dumps from the panics occured during late stages
of kernel initialisation before system enters into single-user mode.
MFC after: 2 weeks
replace mutex_lock calls on uidinfo with macro calls:
mtx_lock(&uidp->ui_mtx) -> UIDINFO_LOCK(uidp)
Terry Lambert <tlambert2@mindspring.com> helped with this.
sleeping on a process object but changed the corresponding
wakeup()s to the thread object. The result was that non-raw
aio ops waited for an aio daemon to timeout before action
was taken. Now, we sleep on the thread object.
PR: kern/34016
operation. The vgonel() code has always called vclean() but until we
started proactively freeing vnodes it would never actually be called with
a dirty vnode, so this situation did not occur prior to the vnlru() code.
Now that we proactively free vnodes when kern.maxvnodes is hit, however,
vclean() winds up with work to do and improperly generates the warnings.
Reviewed by: peter
Approved by: re (for MFC)
MFC after: 1 day