in various extattr_*() calls to match the rest of the file. Originally,
these bits at the end looked more like style(9). This patch was submitted
by green by way of the TrustedBSD MAC tree, and I fixed a few problems
with it on the way through. Someone with more time on their hands should
convert the entire file to style(9); this commit is for diff reduction
purposes.
Submitted by: green
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
constructing a struct uio and invoking VOP_READ() directly. This cleans
up the code a little, but also has the advantage of making sure almost
all vnode read/write access in the kernel goes through the helper
function, meaning that instrumentation of that helper function can impact
almost all relevant read/write operations. In this case, it permits us
to put MAC hooks into vn_rdwr() and not modify uipc_syscalls.c (yet).
In general, if helper vn_*() functions exist, they should be used in
preference to direct VOP's in system call service code.
Submitted by: green
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
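For illustration, a minimal sketch of the preferred pattern (the vn_rdwr()
argument list shown is approximate and has varied over time; 'vp', 'buf',
'len', 'offset' and 'td' are assumed to be in scope):

    int error, resid;

    /* One helper call replaces hand-rolled uio setup plus VOP_READ(). */
    error = vn_rdwr(UIO_READ, vp, (caddr_t)buf, len, offset, UIO_SYSSPACE,
        IO_NODELOCKED, td->td_ucred, &resid, td);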
needed in the current code, in the MAC tree, create_init() relies on the
ability to modify the credentials present for initproc, and should not
perform that modification on a shared credential. Pro-active diff
reduction against MAC changes that are in the queue; also facilitates
other work, including the capabilities implementation.
Submitted by: green
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
environment needed at boot time to a dynamic subsystem when VM is
up. The dynamic kernel environment is protected by an sx lock.
This adds some new functions to manipulate the kernel environment:
freeenv(), setenv(), unsetenv() and testenv(). freeenv() has to be
called after every getenv() when you have finished using the string.
testenv() only tests if an environment variable is present, and
doesn't require a freeenv() call. setenv() and unsetenv() are self
explanatory.
The kenv(2) syscall exports this new functionality to userland,
mainly for kenv(1).
Reviewed by: peter
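A sketch of the in-kernel usage described above (the "hw.foo.*" variable
names are made up for the example):

    char *val;

    if (testenv("hw.foo.disable"))      /* presence test; no freeenv() */
        printf("hw.foo.disable is set\n");

    val = getenv("hw.foo.mode");
    if (val != NULL) {
        printf("hw.foo.mode=%s\n", val);
        freeenv(val);                   /* required after every getenv() */
    }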
where some client operations might be unexpectedly cancelled during
an unsuccessful non-forced unmount attempt. This causes problems
for amd(8), because it periodically attempts a non-forced unmount
to check if the filesystem is still in use.
Fix this by adding a new mountpoint flag MNTK_UNMOUNTF that is set
only during the operation of a forced unmount. Use this instead of
MNTK_UNMOUNT to trigger the cancellation of hung NFS operations.
Also correct a problem where dounmount() might inadvertently clear
the MNTK_UNMOUNT flag.
Reported by: simokawa
MFC after: 1 week
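In the NFS client this plausibly reduces to a test along these lines
(a sketch, not the literal patch; 'mp' is the relevant struct mount *):

    /* Cancel hung operations only when the unmount is forced. */
    if ((mp->mnt_kern_flag & MNTK_UNMOUNTF) != 0)
        return (EINTR);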
- Use temporary variables to hold a pointer to a pgrp while we dink with it
without holding either the associated proc lock or proctree_lock. It
is in theory possible that p->p_pgrp could change out from under us.
sx lock. Trying to get the lock order between these locks was getting
too complicated as the locking in wait1() was being fixed.
- leavepgrp() now requires an exclusive lock of proctree_lock to be held
when it is called.
- fixjobc() no longer gets a shared lock of proctree_lock now that it
requires an xlock be held by the caller.
- Locking notes in sys/proc.h are adjusted to note that everything that
used to be protected by the pgrpsess_lock is now protected by the
proctree_lock.
Apply the change as a continuous slew rather than as a series of
discrete steps and make it possible to adjust arbitrarily huge
amounts of time in either direction.
In practice this is done by hooking into the same once-per-second
loop as the NTP PLL and setting a suitable frequency offset, deducting
the amount slewed from the remainder. If the remaining delta is
larger than 1 second we slew at 5000PPM (5msec/sec), for a delta
less than a second we slew at 500PPM (500usec/sec) and for the last
one second period we will slew at whatever rate (less than 500PPM)
it takes to eliminate the delta entirely.
The old implementation stepped the clock a number of microseconds
every HZ to achieve the same effect, using the same rates of change.
Eliminate the global variables tickadj, tickdelta and timedelta and
their various uses and initializations.
This removes the most significant obstacle to running timecounter and
NTP housekeeping from a timeout rather than hardclock.
information related to bucket size efficiency. Three things are printed on
each row:
Size is the size the user actually asked for rounded to 16 bytes.
Requests is the number of times this size was asked for.
Real Size is the size we actually handed out.
At the end the total memory used and total waste are displayed. Currently my
system displays about 33% wasted memory.
The intent of this code is to gather statistics for tuning the malloc bucket
sizes. It is not intended to be run with INVARIANTS and it is not entirely
mp safe. It can be enabled via 'options MALLOC_PROFILE' which was committed
earlier.
Updated the kmemzones logic such that the ks_size bitmap can be used as an
index into it to report the size of the zone used.
Create the kern.malloc sysctl which replaces the kvm mechanism to report
similar data. This will provide an easy place for statistics aggregation if
malloc_type statistics become per cpu data.
Add some code ifdef'd under MALLOC_PROFILING to facilitate a tool for sizing
the malloc buckets.
we can use td_ucred.
- In killpg1(), the proc lock is sufficient to check if p_stat is SZOMB
or not. We don't need sched_lock.
- Close some races in psignal(). In psignal() there is a big switch
statement based on p_stat. All the different cases are assuming that
the process (or thread) isn't going to change state out from under it.
To ensure this is true, just lock sched_lock for the entire switch. We
practically held it the entire time already anyway. This also
simplifies the locking somewhat and actually results in fewer lock
operations.
- Allow signotify() to be called with the sched_lock held since psignal()
now does that.
- Use td_ucred in a couple of places.
process so it can use td_ucred.
- Require the target process of donice() to be locked when donice() is
called.
- Use td_ucred.
- Lock the target process across p_cansee() and while reading the
credentials of a process.
- Change the logic of rtprio() slightly so it does its copyin() if needed
prior to locking the target process.
- rtprio() no longer needs Giant. In theory with full KSE it would still
need Giant to protect p_ucred of curproc for the p_canfoo() functions
but p_canfoo() will be changing to using td_ucred of curthread before
full KSE hits the tree.
allocate a blank cred first, lock the process, perform checks on the
old process credential, copy the old process credential into the new
blank credential, modify the new credential, update the process
credential pointer, unlock the process, and cleanup rather than trying
to allocate a new credential after performing the checks on the old
credential.
- Cleanup _setugid() a little bit.
- setlogin() doesn't need Giant thanks to pgrp/session locking and
td_ucred.
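The allocate-then-swap pattern described above, roughly sketched below;
change_euid() stands in for whatever modification a particular syscall
performs, and 'p' and 'euid' are assumed to be in scope:

    newcred = crget();          /* may sleep, so do it before locking */
    PROC_LOCK(p);
    oldcred = p->p_ucred;
    /* ... permission checks against oldcred ... */
    crcopy(newcred, oldcred);
    change_euid(newcred, euid); /* modify only the private copy */
    p->p_ucred = newcred;
    PROC_UNLOCK(p);
    crfree(oldcred);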
and acquire the proctree_lock if needed first. Then we lock the process
if necessary and fiddle with it as appropriate. Finally we drop locks and
do any needed copyout's. This greatly simplifies the locking.
belong to a user virtual address; while this happens to work on some
architectures, it can't on sparc64, since user and kernel virtual
address spaces overlap there (the distinction between them is done via
separate address space identifiers).
Instead, look up the page in the vm_map of the process in question.
Reviewed by: jake
so it can use td_ucred.
- Push Giant down into the end of settime() where we actually set the time
on the timecounter and time of day clock.
- Remove Giant from clock_settime().
- Push Giant down in settimeofday() to just protect the 'tz' global
variable.
linker_search_module().
Without this, modules loaded from loader.conf that then try to load
in additional modules (such as digi.ko loading a card's BIOS) die
badly in the vn_open() called from linker_search_module().
It may be worth checking (KASSERTing?) that rootdev != NODEV in
vn_open() too.
mod_depend * (which may be NULL). The only consumer of this
function at the moment is digi_loadmoduledata(), and that passes
a NULL mod_depend *.
In linker_reference_module(), check to see if we've already got
the required module loaded. If we have, bump the reference count
and return that, otherwise continue the module search as normal.
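Callers might then look like this sketch (the module name is illustrative):

    linker_file_t lf;
    int error;

    /* A NULL mod_depend * accepts any version of the module. */
    error = linker_reference_module("digi_BIOS", NULL, &lf);
    if (error != 0)
        return (error);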
is called.
- Change sysctl_out_proc() to require that the process is locked when it
is called and to drop the lock before it returns. If this proves too
complex we can change sysctl_out_proc() to simply acquire the lock at
the very end and have the calling code drop the lock right after it
returns.
- Lock the process we are going to export before the p_cansee() in the
loop in sysctl_kern_proc() and hold the lock until we call
sysctl_out_proc().
- Don't call p_cansee() on the process about to be exported twice in
the aforementioned loop.
p_pgrp since the pgrp locking went in. We also don't need it to check for
invalid values in the options argument to wait1(), so push Giant down
slightly.
behavior by default. Also, change the options line to reflect this.
If there are no problems reported this will become the only behavior and the
knob will be removed in a month or so.
Demanded by: obrien
separate strings instead of passing "foo=bar".
o Don't forget to clear the VMOUNT flag on the vnode when vfs_nmount()
fails because the fs doesn't implement VFS_NMOUNT (and in vfs_mount()
when the fs doesn't implement VFS_MOUNT); also decrement the vfs
refcount in the !MNT_UPDATE case.
that we can compile gcc. This is a hack because it adds a fixed 2MB to
each process's VSIZE regardless of how much is really being used since
there is no grow-up stack support. At least it isn't physical memory.
Sigh.
Add a sysctl to enable tweaking it for new processes.
a set of helper routines to deal with real-time clocks. The generic
functions access the clock driver using a kobj interface. This is intended
to reduce code duplication and make it easy to support more than one
clock model on a single architecture.
This code is currently only used on sparc64, but it is planned to convert
the code of the other architectures to it later.
most cases NULL is passed, but in some cases such as network driver locks
(which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used.
Tested on: i386, alpha, sparc64
the generic lock type for use with witness. If this argument is NULL then
the lock name is used as the lock type. Add a macro for a lock type name
for network driver locks.
point to a more generic name for a lock that is more suitable for use by
witness when grouping locks. For example, although network driver locks
use the interface name for the name of each lock, they should all share a
single lock type and be treated the same by witness. Another example is that
all UMA zone locks should be treated the same. The witness code has also
been updated to print out the lock type in addition to the lock name in a
few places where it is relevant.
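Under this scheme a network driver might initialize its lock as in the
sketch below ('sc' and 'dev' are assumed driver softc and device_t): a
unique per-instance name, but one shared type so witness can group them:

    mtx_init(&sc->sc_mtx, device_get_nameunit(dev), MTX_NETWORK_LOCK,
        MTX_DEF);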
This shrinks the size by 4 bytes on alpha, down to the same 276 bytes
as all other platforms.
Construct a hack to make old ioctls work on new kernels.
Once world is recompiled only the new and correct sysctls will be
used.
This hack will become annoying around the 1st of May to make people
rebuild their worlds and it will be gone before 5.0.
they aren't in the usual path of execution for syscalls and traps.
The main complication for this is that we have to set flags to control
ast() everywhere that changes the signal mask.
Avoid locking in userret() in most of the remaining cases.
Submitted by: luoqi (first part only, long ago, reorganized by me)
Reminded by: dillon
inline function sigsetmasked() and a new macro SIGPENDING(). CURSIG()
will soon be moved out of the normal path of execution for syscalls and
traps. Then its efficiency will be less important but the new interfaces
will be useful for checking for unmasked pending signals in more places.
Submitted by: luoqi (long ago, in a slightly different form)
Assert that sched_lock is not held in CURSIG().
We get enough protection from the lock on the individual lists that we
acquire later.
Noticed/Tested by: Steven G. Kargl <kargl@troutmask.apl.washington.edu>
Submitted by: Jonathan Mini <mini@haikugeek.com>
securelevel_*() to be NULL for a while now.
- Use KASSERT() instead of if (foo) panic(); to optimize the
!INVARIANTS case.
Submitted by: Martin Faxer <gmh003532@brfmasthugget.se>
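The transformation is mechanical; for example (the condition shown is
generic):

    /* Before: compiled in even without INVARIANTS. */
    if (cred == NULL)
        panic("securelevel_gt: cred is NULL");

    /* After: compiles away entirely in the !INVARIANTS case. */
    KASSERT(cred != NULL, ("securelevel_gt: cred is NULL"));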
without removing the buffer from the vnode's dirty buffer list, which
can result in a panic in NFS. Replaced the code with a call to bundirty()
which deals with it properly.
PR: kern/36108, kern/36174
Submitted by: various people
Special mention: to Danny Schales <dan@coes.LaTech.edu> for providing a core dump that helped me track this down.
MFC after: 1 day
even when the number of records approaches the size of the hash table.
Besides, the previous implementation (using linear probing) was broken :)
Also, use the newly introduced MTX_SYSINIT.
various machdep.c's to being declared in kern_mutex.c.
- Add a new function mutex_init() used to perform early initialization
needed for mutexes such as setting up thread0's contested lock list
and initializing MI mutexes. Change the various MD startup routines
to call this function instead of duplicating all the code themselves.
Tested on: alpha, i386
locks to be able to set up a SYSINIT call. This helps in places where
a lock is needed to protect some data, but the data is not truly
associated with a subsystem that can properly initialize its lock.
The macros use the mtx_sysinit() and sx_sysinit() functions,
respectively, as the handler argument to SYSINIT().
Reviewed by: alfred, jhb, smp@
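A sketch of the resulting idiom for a file-scope lock with no obvious
owning subsystem (the "foo" names are placeholders):

    static struct mtx foo_mtx;
    MTX_SYSINIT(foo_mtx, &foo_mtx, "foo mutex", MTX_DEF);

    static struct sx foo_sx;
    SX_SYSINIT(foo_sx, &foo_sx, "foo sx");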
release times. Measurements are made and stored in nanoseconds but
presented in microseconds, which should be sufficient for the locks for
which we actually want this (those that are held long and / or often).
Also, rename some variables and structure members to unit-agnostic names.
Rename memlock to sysctllock, and MEMLOCK()/MEMUNLOCK() to SYSCTL_LOCK()/
SYSCTL_UNLOCK(), with related changes to make the lock names make more
sense.
Submitted by: Jonathan Mini <mini@haikugeek.com>
following sysctl variables:
debug.mutex.prof.enable enable / disable profiling
debug.mutex.prof.acquisitions number of mutex acquisitions recorded
debug.mutex.prof.records number of acquisition points recorded
debug.mutex.prof.maxrecords max number of acquisition points
debug.mutex.prof.rejected number of rejections (due to full table)
debug.mutex.prof.hashsize hash size
debug.mutex.prof.collisions number of hash collisions
debug.mutex.prof.stats profiling statistics
The code records four numbers for each acquisition point (identified by
source file name and line number): longest time held, total time held,
number of non-recursive acquisitions, average time held. The measurements
are in clock cycles (as returned by get_cyclecount(9)); this may cause
measurements on some SMP systems to be unreliable. This can probably be
worked around by replacing get_cyclecount(9) by some incarnation of
nanotime(9).
This work was derived from initial patches by eivind.
and cpu_critical_exit() and moves associated critical prototypes into their
own header file, <arch>/<arch>/critical.h, which is only included by the
three MI source files that need it.
Backout and re-apply improperly committed syntactical cleanups made to files
that were still under active development. Backout improperly committed program
structure changes that moved localized declarations to the top of two
procedures. Partially re-apply one of the program structure changes to
move 'mask' into an intermediate block rather than in three separate
sub-blocks to make the code more readable. Re-integrate bug fixes that Jake
made to the sparc64 code.
Note: In general, developers should not gratuitously move declarations out
of sub-blocks. They are where they are for reasons of structure, grouping,
readability, compiler-localizability, and to avoid developer-introduced bugs
similar to several found in recent years in the VFS and VM code.
Reviewed by: jake
general cleanup of the API. The entire API now consists of two functions
similar to the pre-KSE API. The suser() function takes a thread pointer
as its only argument. The td_ucred member of this thread must be valid
so the only valid thread pointers are curthread and a few kernel threads
such as thread0. The suser_cred() function takes a pointer to a struct
ucred as its first argument and an integer flag as its second argument.
The flag is currently only used for the PRISON_ROOT flag.
Discussed on: smp@
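In use, the two entry points look roughly like this sketch ('td' and 'p'
are assumed to be in scope):

    /* Common case: check the current thread's credential. */
    if (suser(td) != 0)
        return (EPERM);

    /* Explicit credential, honoring a jail's limited root. */
    error = suser_cred(p->p_ucred, PRISON_ROOT);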
dump the trace buffer feasible.
- Remove KTR_EXTEND. This changes the format of the trace entries when
activated, making writing a userland tool which is not tied to a specific
kernel configuration difficult.
- Use get_cyclecount() for timestamps. nanotime() is much too heavyweight
and requires recursion protection due to ktr traces occurring as a result
of ktr traces. KTR_VERBOSE may still require recursion protection, which
is now conditional on it.
- Allow KTR_CPU to be overridden by MD code. This is so that it is possible
to trace early in startup before pcpu and/or curthread are setup.
- Add a version number for the ktr interface. A userland tool can check this
to detect mismatches.
- Use an array for the parameters to make decoding in userland easier.
- Add file and line recording to the non-extended traces now that the extended
version is no more.
These changes will break the gdb macros floating around that decode the
extended version of the trace buffer. Users of these macros should either
use the show ktr command in ddb, or use the userland utility which can be run
on a core dump.
Approved by: jhb
Tested on: i386, sparc64
Caveats:
The new savecore program is not complete in the sense that it emulates
enough of the old savecore's features to do the job, but implements none
of the options yet.
I would appreciate if a userland hacker could help me out getting savecore
to do what we want it to do from a users point of view, compression,
email-notification, space reservation etc etc. (send me email if
you are interested).
Currently, savecore will scan all devices marked as "swap" or "dump" in
/etc/fstab _or_ any devices specified on the command-line.
All architectures but i386 lack an implementation of dumpsys(), but
looking at the i386 version it should be trivial for anybody familiar
with the platform(s) to provide this function.
Documentation is quite sparse at this time, more to come.
Details:
ATA and SCSI drivers should work as the dump formatting code has been
removed. The IDA, TWE and AAC have not yet been converted.
Dumpon now opens the device and uses ioctl(DIOCGKERNELDUMP) to set
the device as dumpdev. To implement the "off" argument, /dev/null
is used as the device.
Savecore will fail if handed any options since they are not (yet)
implemented. All devices marked "dump" or "swap" in /etc/fstab
will be scanned and dumps found will be saved to diskfiles
named from the MD5 hash of the header record. The header record
is dumped in readable format in the .info file. The kernel
is not saved. Only complete dumps will be saved.
All maintainer rights for this code are disclaimed: feel free to
improve and extend.
Sponsored by: DARPA, NAI Labs
the non-GEOM code as well. This simplifies the kernel-dumping
and disk-management tools as less compatibility cruft will be needed.
Sponsored by: DARPA and NAI Labs.
while holding the proc lock, and by holding the pargs structure when
accessing it from outside of the owner.
Submitted by: Jonathan Mini <mini@haikugeek.com>
These functions use DEV_STRATEGY() which can easily return a short
count (with no error) for reads near EOF. EOF happens for "disks" too
small to contain a label sector (mainly for empty slices). The functions
didn't understand this at all, and looked for labels in the garbage
in the buffer beyond what DEV_STRATEGY() returned. The recent UMA
changes combined with my local changes and configuration resulted in
the garbage often containing a valid but garbage label left over from
a previous call.
Bugs in EOF handling in -current limited the problem to "disks" with
size precisely LABELSECTOR sectors. LABELSECTOR happens to be a very
unusual "disk" size since it is only 0 for non-i386 arches that don't
usually have disks with DOS MBRs.
back into the calling MD code. The MD code must ensure no races between
checking the astpending flag and returning to usermode.
Submitted by: peter (ia64 bits)
Tested on: alpha (peter, jeff), i386, ia64 (peter), sparc64
in vfs_mount(), in particular revisions 1.215, 1.227 and 1.240.
- flag2 is a low quality variable name, change it to kern_flag.
- strncpy NUL-terminates f_fstypename and f_mntonname since the strings
have length <= <buffer length> - 1, so the explicit NUL-termination is
bogus.
- M_ZERO'ing space for fstype and fspath is stupid since we never use the
space beyond the end of the string.
- Do various style(9) cleanups in both functions.
Submitted by: bde
Reviewed by: phk
can be called both with and without the pipe mutex held. (For example,
if called by pipeselwakeup(), it is held. Whereas, if called by kqueue_scan(),
it is not.)
Reviewed by: alfred
There are still some locations where the PROC lock should be held
in order to prevent inconsistent views from outside (like the
proc->p_fd fix for kern/vfs_syscalls.c:checkdirs()) that can be
fixed later.
Submitted by: Jonathan Mini <mini@haikugeek.com>
with this flag. Remove the dup_list and dup_ok code from subr_witness. Now
we just check for the flag instead of doing string compares.
Also, switch the process lock, process group lock, and uma per cpu locks over
to this interface. The original mechanism did not work well for uma because
per cpu lock names are unique to each zone.
Approved by: jhb
disablement assumptions in kern_fork.c by adding another API call,
cpu_critical_fork_exit(). Cleanup the td_savecrit field by moving it
from MI to MD. Temporarily move cpu_critical*() from <arch>/include/cpufunc.h
to <arch>/<arch>/critical.c (stage-2 will clean this up).
Implement interrupt deferral for i386 that allows interrupts to remain
enabled inside critical sections. This also fixes an IPI interlock bug,
and requires uses of icu_lock to be enclosed in a true interrupt disablement.
This is the stage-1 commit. Stage-2 will occur after stage-1 has stabilized,
and will move cpu_critical*() into its own header file(s) + other things.
This commit may break non-i386 architectures in trivial ways. This should
be temporary.
Reviewed by: core
Approved by: core
- return error -> return (error);
- move a declaration to the top of the function.
- become bug for bug compatible with if (error) lines.
Submitted by: bde
new vfs_getopt()/vfs_copyopt() API. This is intended to be used
later, when there are filesystems implementing the VFS_NMOUNT
operation. The mount(2) system call will disappear when all
filesystems have been converted to the new API. Documentation will
be committed in a while.
Reviewed by: phk
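Retrieval inside a filesystem might look like the following sketch ('opts'
is the struct vfsoptlist * handed to the filesystem; the "from" option is
conventionally the device path):

    void *from;
    int error, len;

    error = vfs_getopt(opts, "from", &from, &len);
    if (error != 0)
        return (error);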
modules (i.e. procfs.ko).
When the kernel loads a dynamic filesystem module, it looks for any of the
VOP operations specified by the new filesystem that have not already been
registered by the currently known filesystems. If any such operations exist,
the vfs_add_vnops function calls vfs_opv_recalc, which rebuilds the vop_t
vectors for each filesystem and sets all global pointers like ufs_vnops_p,
devfs_specop_p, etc. to the new values and then frees the old pointers. This
behavior is bad because there might already be active vnodes whose v_op
fields are left pointing at random garbage, leading to an inevitable crash
soon.
Submitted by: Alexander Kabaev <ak03@gte.com>
kern_linker.c and rev. 1.237 of vfs_syscalls.c since these are not the
source of the recent panics occurring around kldloading file system
support modules.
Requested by: rwatson
not removing tabs before "__P((", and not outdenting continuation lines
to preserve non-KNF lining up of code with parentheses. Switch to KNF
formatting and/or rewrap the whole prototype in some cases.
code that is still not safe. suser() reads p_ucred so it still needs
Giant for the time being. This should allow kern.giant.proc to be set
to 0 for the time being.
Move the network code from using cr_cansee() to check whether a
socket is visible to a requesting credential to using a new
function, cr_canseesocket(), which accepts a subject credential
and object socket. Implement cr_canseesocket() so that it does a
prison check, a uid check, and add a comment where shortly a MAC
hook will go. This will allow MAC policies to separately
instrument the visibility of sockets from the visibility of
processes.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
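Callers that walk socket lists would then filter roughly as in this sketch
('req' and 'so' are assumed sysctl request and socket pointers):

    /* Skip sockets the requesting credential may not see. */
    if (cr_canseesocket(req->td->td_ucred, so) != 0)
        continue;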
to test req->td for NULL values and then do somewhat more bizarre things
relating to securelevel special-casing and suser checks. Remove the
testing and conditional security checks based on req->td!=NULL, and insert
a KASSERT that td != NULL. Callers to sysctl must always specify the
thread (be it kernel or otherwise) requesting the operation, or a
number of current sysctls will fail due to assumptions that the thread
exists.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
Discussed with: bde
NULL, turn warning printf's into panic's, since this call has been
restructured such that a NULL cred would result in a page fault anyway.
There appears to be one case where NULL is explicitly passed in in the
sysctl code, and this is believed to be in error, so will be modified.
Securelevels now always require a credential context so that per-jail
securelevels are properly implemented.
Obtained from: TrustedBSD Project
Sponsored by: NAI Labs
Discussed with: bde
made aware in jail environments. Supposedly something is broken, so
this should be backed out until further investigation proves otherwise,
or a proper fix can be provided.
method-based inter-process security checks. To do this, introduce
a new cr_seeotheruids(u1, u2) function, which encapsulates the
"see_other_uids" logic. Call out to this policy following the
jail security check for all of {debug,sched,see,signal} inter-process
checks. This more consistently enforces the check, and makes the
check easy to modify. Eventually, it may be that this check should
become a MAC policy, loaded via a module.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
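The encapsulated check plausibly reduces to something like the sketch
below (details may differ; see_other_uids is the tunable named above):

    static int
    cr_seeotheruids(struct ucred *u1, struct ucred *u2)
    {

        /* With see_other_uids off, require matching real uids. */
        if (!see_other_uids && u1->cr_ruid != u2->cr_ruid)
            return (ESRCH);
        return (0);
    }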
Instead of caching the ucred reference, just go ahead and eat the
decrement and increment of the refcount. Now that Giant is pushed down
into crfree(), we no longer have to get Giant in the common case. In the
case when we are actually free'ing the ucred, we would normally free it on
the next kernel entry, so the cost there is not new, just in a different
place. This also removes td_cache_ucred from struct thread. This is
still only done #ifdef DIAGNOSTIC.
[ missed this file in the previous commit ]
Tested on: i386, alpha
- Add a cred_free_thread() function (conditional on DIAGNOSTIC) that drops
a per-thread ucred reference to be used in debugging code when leaving
the kernel.
against users within a jail attempting to load kernel modules.
- Add a check of securelevel_gt() to vfs_mount() in order to chop some
low hanging fruit for the repair of securelevel checking of linking and
unlinking files from within jails. There is more to be done here.
Reviewed by: rwatson
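The added mount check is presumably of this shape (a sketch; 'td' is the
calling thread):

    /* Deny mounts once the securelevel exceeds 0. */
    error = securelevel_gt(td->td_ucred, 0);
    if (error != 0)
        return (error);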
all the global bits of ``module'' data. This commit adds a few generic
macros, MOD_SLOCK, MOD_XLOCK, etc., that are meant to be used as ways
of accessing the SX lock. It is also the first step in helping to lock
down the kernel linker and module systems.
Reviewed by: jhb, jake, smp@
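Usage is then a conventional reader/writer pattern, e.g. (a sketch; the
*_UNLOCK names are assumed counterparts of the macros named above):

    MOD_SLOCK;                  /* shared: walking the module list */
    /* ... look up a module ... */
    MOD_SUNLOCK;

    MOD_XLOCK;                  /* exclusive: (un)registering a module */
    /* ... modify the module list ... */
    MOD_XUNLOCK;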
locking flags when acquiring a vnode. The immediate purpose is
to allow polling lock requests (LK_NOWAIT) needed by soft updates
to avoid deadlock when enlisting other processes to help with
the background cleanup. For the future it will allow the use of
shared locks for read access to vnodes. This change touches a
lot of files as it affects most filesystems within the system.
It has been well tested on FFS, loopback, and CD-ROM filesystems, and
only lightly on the others, so if you find a problem there, please
let me (mckusick@mckusick.com) know.
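For the soft updates case this permits a polling acquisition along these
lines (sketch; 'vp' and 'td' assumed in scope):

    /* Poll for the vnode lock; back off instead of deadlocking. */
    error = vget(vp, LK_EXCLUSIVE | LK_NOWAIT, td);
    if (error != 0)
        return (0);     /* busy; skip and retry this vnode later */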
pmap_qremove. pmap_kenter is not safe to use in MI code because it is not
guaranteed to flush the mapping from the tlb on all cpus. If the process
in question is preempted and migrates cpus between the call to pmap_kenter
and pmap_kremove, the original cpu will be left with stale mappings in its
tlb. This is currently not a problem for i386 because we do not use PG_G on
SMP, and thus all mappings are flushed from the tlb on context switches, not
just user mappings. This is not the case on all architectures, and if PG_G
is to be used with SMP on i386 it will be a problem. This was committed by
peter earlier as part of his fine grained tlb shootdown work for i386, which
was backed out for other reasons.
Reviewed by: peter
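The safe MI idiom is therefore the pmap_qenter()/pmap_qremove() pair,
sketched here ('kva', 'pages' and 'npages' assumed in scope):

    pmap_qenter(kva, pages, npages);    /* shoots down stale TLB entries */
    /* ... use the mapping at 'kva' ... */
    pmap_qremove(kva, npages);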
the bio and buffer structures to have daddr64_t bio_pblkno,
b_blkno, and b_lblkno fields which allows access to disks
larger than a Terabyte in size. This change also requires
that the VOP_BMAP vnode operation accept and return daddr64_t
blocks. This delta should not affect system operation in
any way. It merely sets up the necessary interfaces to allow
the development of disk drivers that work with these larger
disk block addresses. It also allows for the development of
UFS2 which will use 64-bit block addresses.
kern/kern_descrip.c:
Acquire Giant in fdrop_locked when the file refcount hits zero; this removes
the requirement for the caller to own Giant for the most part.
kern/kern_ktrace.c:
Acquire Giant in ktrgenio; simplifies locking in upper read/write syscalls.
kern/vfs_bio.c:
Acquire Giant in bwillwrite if needed.
kern/sys_generic.c
Giant pushdown, remove Giant for:
read, pread, write and pwrite.
readv and writev aren't done yet because of the possible malloc calls
for iov to uio processing.
kern/sys_socket.c
Grab giant in the socket fo_read/write functions.
kern/vfs_vnops.c
Grab giant in the vnode fo_read/write functions.
Missed a place where the pipe sleep lock was needed in order to safely grab
Giant, fix it and add an assertion to make sure this doesn't happen again.
Fix typos in the PIPE_GET_GIANT/PIPE_DROP_GIANT that could cause the
wrong mutex to get passed to PIPE_LOCK/PIPE_UNLOCK.
Fix a location where the wrong pipe was being passed to
PIPE_GET_GIANT/PIPE_DROP_GIANT.
Problem:
selwakeup required calling pfind which would cause lock order
reversals with the allproc_lock and the per-process filedesc lock.
Solution:
Instead of recording the pid of the select()'ing process into the
selinfo structure, actually record a pointer to the thread. To
avoid dereferencing a bad address all the selinfo structures that
are in use by a thread are kept in a list hung off the thread
(protected by sellock). When a selwakeup occurs the selinfo is
removed from that thread's list, it is also removed on the way out
of select or poll where the thread will traverse its list removing
all the selinfos from its own list.
Problem:
Previously the PROC_LOCK was used to provide the mutual exclusion
needed to ensure proper locking, this couldn't work because there
was a single condvar used for select and poll and condvars can
only be used with a single mutex.
Solution:
Introduce a global mutex 'sellock' which is used to provide mutual
exclusion when recording events to wait on as well as performing
notification when an event occurs.
Interesting note:
schedlock is required to manipulate the per-thread TDF_SELECT
flag, however if given its own field it would not need schedlock,
also because TDF_SELECT is only manipulated under sellock one
doesn't actually use schedlock for synchronization, only to protect
against corruption.
Proc locks are no longer used in select/poll.
Portions contributed by: davidc
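In outline, the wakeup side becomes something like the sketch below
(field names are approximate):

    void
    selwakeup(struct selinfo *sip)
    {
        struct thread *td;

        mtx_lock(&sellock);
        td = sip->si_thread;    /* a thread pointer now, not a pid */
        if (td != NULL) {
            /* Unhook this selinfo from the owning thread's list. */
            TAILQ_REMOVE(&td->td_selq, sip, si_thrlist);
            sip->si_thread = NULL;
            /* ... clear TDF_SELECT under sched_lock, wake the thread ... */
        }
        mtx_unlock(&sellock);
    }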
While doing this, move it earlier in the sysinit boot process so that the
VM system can use it.
After that, the system is now able to use sx locks instead of lockmgr
locks in the VM system. To accomplish this, some of the more
questionable uses of the locks (such as testing whether they are
owned or not, as well as allowing shared+exclusive recursion) are
removed, and simpler logic throughout is used so locks should also be
easier to understand.
This has been tested on my laptop for months, and has not shown any
problems on SMP systems, either, so appears quite safe. One more
user of lockmgr down, many more to go :)
The stat() and open() calls have been changed to make use of this new
functionality. Using shared locks in these cases is sufficient and can
significantly reduce their latency if IO is pending to these vnodes. Also,
this reduces the number of exclusive locks that are floating around in the
system, which helps reduce the number of deadlocks that occur.
A new kernel option "LOOKUP_SHARED" has been added. It defaults to off so
this patch can be turned on for testing, and should eventually go away once
it is proven to be stable. I have personally been running this patch for
over a year now, so it is believed to be fully stable.
Reviewed by: jake, obrien
Approved by: jake
in "missing dependencies" error when loading some kld modules. It is sad to
see how often these days style cleanus break doesn't broken things. Perhaps
people should recall good old principle: "don't fix it if it isn't broken".
fully instantiated.
Revert the logic in pipeclose so that we don't have the entire function
pretty much under a single if() statement; instead, invert the test and
just return if it fails.
Submitted (in different form) by: bde
Don't use pool mutexes for pipes. We can not use pool mutexes
because we will need to grab the select lock while holding a pipe
lock which is not allowed because you may not acquire additional
mutexes when holding a pool mutex.
Instead malloc(9) space for the mutex that is shared between the
pipes.
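Concretely, both endpoints end up pointing at a single malloc(9)'d mutex,
roughly as sketched here (the malloc type is illustrative; 'rpipe' and
'wpipe' are the two pipe endpoints):

    struct mtx *pmtx;

    pmtx = malloc(sizeof(*pmtx), M_TEMP, M_WAITOK | M_ZERO);
    mtx_init(pmtx, "pipe mutex", NULL, MTX_DEF);
    /* Both ends of the pipe share the one lock. */
    rpipe->pipe_mtxp = wpipe->pipe_mtxp = pmtx;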
simply need to prevent switching from another CPU and do not need
interrupts disabled.
- Add a comment to witness_list() about why displaying spin locks for
threads on other CPU's really is just a bad idea and probably shouldn't
be done.
to exhaust all kmaps. The only reward for setting maxproc
to a value which will cause kmap exhaustion is a panic
during a forkbomb attack.
MFC after: 3 days
- Move jail checks and some other checks involving constants and stack
variables out from under Giant. This isn't perfectly safe atm because
jail_sysvipc_allowed is read w/o a lock meaning that its value could be
stale. This global variable will soon become a per-jail flag, however,
at which time it will either not need a lock or will use the prison lock.
people working on the MAC tree from getting toasted whenever system call
numbers are allocated in the main tree (for example, for KSE :-).
Calls allocated: __mac_{get,set}_proc, __mac_{get,set}_{fd,file}().
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
Includes some minor whitespace changes, and re-ordering to be able to document
properly (e.g., grouping of variables and the SYSCTL macro calls for them, where
the documentation has been added.)
Reviewed by: phk (but all errors are mine)
be allocated as arrays indexed by the cpu id. Previously the only reliable
way to know the max cpu id was through MAXCPU. mp_ncpus isn't useful here
because cpu ids may be sparsely mapped, although x86 and alpha do not do this.
Also, call cpu_mp_probe much earlier so the max cpu id is known before the VM
starts up. This is intended to help support per cpu queues for the new
allocator, but may be useful elsewhere.
Reviewed by: jake
Approved by: jake
This allows other power-management systems (APM for now) to generate
power profile change events (i.e. AC-line status changes), and other
kernel components, not only the ACPI components, can be notified of
the events.
- move subroutines in acpi_powerprofile.c (removed) to kern/subr_power.c
- call power_profile_set_state() also from APM driver when AC-line
status changes
- add a call-back function for Crusoe LongRun control on power
profile changes, as an example
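Consumers subscribe through the eventhandler mechanism; the LongRun hook
mentioned above might be registered as in this sketch:

    static void
    longrun_power_profile(void *arg)
    {

        switch (power_profile_get_state()) {
        case POWER_PROFILE_PERFORMANCE:
            /* ... select the full-speed LongRun window ... */
            break;
        case POWER_PROFILE_ECONOMY:
            /* ... select the power-saving LongRun window ... */
            break;
        }
    }
    EVENTHANDLER_REGISTER(power_profile_change, longrun_power_profile,
        NULL, 0);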
PAGE_SIZE / MCLBYTES == 1 crash. Fix them by changing the
appropriate "allocate new page and bucket" code in mb_alloc to use
the macro for properly grabbing an allocated object from a bucket,
the one that checks whether the bucket is empty.
This should allow ken to continue testing zero-copy stuff on -CURRENT.
Noticed and provided debug info: ken
fill out netc_anon (a `struct ucred'), and add an XXX around the
entire operation since it isn't clear whether it's doing the right
thing with things like cr_uidinfo and cr_prison.
the data was supplied as a uio or an mbuf. Previously the limit was
ignored for mbuf data, and NFS could run the kernel out of mbufs
when an ipfw rule blocked retransmissions.
the pipe is locked and shouldn't be.
initialize pipe->pipe_mtxp to NULL when creating pipes in order not
to trip the above assertions.
swap pipe lock with giant around calls to pipe_destroy_write_buffer()
pipe_destroy_write_buffer issue noticed by: jhb
fully protect p_ucred yet so Giant is needed until all the p_ucred
locking is done. This is the original reason td_ucred was not used
immediately after its addition. Unfortunately, not using td_ucred is
not enough to avoid problems. Since p_ucred could be stale, we could
actually be dereferencing a stale pointer to dink with the refcount, so
we really need Giant to avoid foot-shooting. This allows td_ucred to
be safely used as well.
as arguments. The correct hostname is copied into the buffer
while having the prison's lock acquired in a jailed process'
case.
Reviewed by: jhb, rwatson
There is some unresolved badness that has been eluding me, particularly
affecting uniprocessor kernels. Turning off PG_G helped (which is a bad
sign) but didn't solve it entirely. Userland programs still crashed.
Both ends of the pipe share a pool_mutex, this makes allocation
and deadlock avoidance easy.
Remove some un-needed FILE_LOCK ops while I'm here.
There are some issues wrt select and the f{s,g}etown code that
we'll have to deal with, I think we may also need to move the calls
to vfs_timestamp outside of the sections covered by PIPE_LOCK.
spares (the size of the field was changed from u_short to u_int to
reflect what it really ends up being). Accordingly, change users of
xucred to set and check this field as appropriate. In the kernel,
this is being done inside the new cru2x() routine which takes a
`struct ucred' and fills out a `struct xucred' according to the
former. This also has the pleasant side effect of removing some
duplicate code.
Reviewed by: rwatson
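cru2x() itself is essentially a field-by-field copy plus the version
stamp; a sketch:

    void
    cru2x(struct ucred *cr, struct xucred *xcr)
    {

        bzero(xcr, sizeof(*xcr));
        xcr->cr_version = XUCRED_VERSION;
        xcr->cr_uid = cr->cr_uid;
        xcr->cr_ngroups = cr->cr_ngroups;
        bcopy(cr->cr_groups, xcr->cr_groups, sizeof(xcr->cr_groups));
    }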
shootdowns in a couple of key places. Do the same for i386. This also
hides some physical addresses from higher levels and has it use the
generic vm_page_t's instead. This will help for PAE down the road.
Obtained from: jake (MI code, suggestions for MD part)
enabled in critical sections and streamline critical_enter() and
critical_exit().
This commit allows an architecture to leave interrupts enabled inside
critical sections if it so wishes. Architectures that do not wish to do
this are not affected by this change.
This commit implements the feature for the I386 architecture and provides
a sysctl, debug.critical_mode, which defaults to 1 (use the feature). For
now you can turn the sysctl on and off at any time in order to test the
architectural changes or track down bugs.
This commit is just the first stage. Some areas of the code, specifically
the MACHINE_CRITICAL_ENTER #ifdef'd code, is strictly temporary and will
be cleaned up in the STAGE-2 commit when the critical_*() functions are
moved entirely into MD files.
The following changes have been made:
* critical_enter() and critical_exit() for I386 now simply increment
and decrement curthread->td_critnest. They no longer disable
hard interrupts. When critical_exit() decrements the counter to
0 it effectively calls a routine to deal with whatever interrupts
were deferred during the time the code was operating in a critical
section.
Other architectures are unaffected.
* fork_exit() has been conditionalized to remove MD assumptions for
the new code. Old code will still use the old MD assumptions
in regards to hard interrupt disablement. In STAGE-2 this will
be turned into a subroutine call into MD code rather than hardcoded
in MI code.
The new code places the burden of entering the critical section
in the trampoline code where it belongs.
* I386: interrupts are now enabled while we are in a critical section.
The interrupt vector code has been adjusted to deal with the fact.
If it detects that we are in a critical section it currently defers
the interrupt by adding the appropriate bit to an interrupt mask.
* In order to accomplish the deferral, icu_lock is required. This
is i386-specific. Thus icu_lock can only be obtained by mainline
i386 code while interrupts are hard disabled. This change has been
made.
* Because interrupts may or may not be hard disabled during a
context switch, cpu_switch() can no longer simply assume that
PSL_I will be in a consistent state. Therefore, it now saves and
restores eflags.
* FAST INTERRUPT PROVISION. Fast interrupts are currently deferred.
The intention is to eventually allow them to operate either while
we are in a critical section or, if we are able to restrict the
use of sched_lock, while we are not holding the sched_lock.
* ICU and APIC vector assembly for I386 cleaned up. The ICU code
has been cleaned up to match the APIC code in regards to format
and macro availability. Additionally, the code has been adjusted
to deal with deferred interrupts.
* Deferred interrupts use a per-cpu boolean int_pending, and
masks ipending, spending, and fpending. Being per-cpu variables
it is not currently necessary to lock; bus cycles modifying them.
Note that the same mechanism will enable preemption to be
incorporated as a true software interrupt without having to
further hack up the critical nesting code.
* Note: the old critical_enter() code in kern/kern_switch.c is
currently #ifdef to be compatible with both the old and new
methodology. In STAGE-2 it will be moved entirely to MD code.
Performance issues:
One of the purposes of this commit is to enhance critical section
performance, specifically to greatly reduce bus overhead to allow
the critical section code to be used to protect per-cpu caches.
These caches, such as Jeff's slab allocator work, can potentially
operate very quickly making the effective savings of the new
critical section code's performance very significant.
The second purpose of this commit is to allow architectures to
enable certain interrupts while in a critical section. Specifically,
the intention is to eventually allow certain FAST interrupts to
operate rather than defer.
The third purpose of this commit is to begin to clean up the
critical_enter()/critical_exit()/cpu_critical_enter()/
cpu_critical_exit() API which currently has serious cross pollution
in MI code (in fork_exit() and ast() for example).
The fourth purpose of this commit is to provide a framework that
allows kernel-preempting software interrupts to be implemented
cleanly. This is currently used for two forward interrupts in I386.
Other architectures will have the choice of using this infrastructure
or building the functionality directly into critical_enter()/
critical_exit().
Finally, this commit is designed to greatly improve the flexibility
of various architectures to manage critical section handling,
software interrupts, preemption, and other highly integrated
architecture-specific details.
on for a while:
- fine grained TLB shootdown for SMP on i386
- ranged TLB shootdowns.. eg: specify a range of pages to shoot down with
a single IPI, since the IPI is very expensive. Adjust some callers
that used to trigger this inside tight loops to do a ranged shootdown
at the end instead.
- PG_G support for SMP on i386 (options ENABLE_PG_G)
- defer PG_G activation till after we decide what we are going to do with
PSE and the 4MB pages at the start of the kernel. This should solve
some rumored strangeness about stale PG_G entries getting stuck
underneath the 4MB pages.
- add some instrumentation for the fine TLB shootdown
- convert some asm instruction wrappers from functions to inlines. gcc
seems to do a fair bit better with this.
- [temporarily!] pessimize the tlb shootdown IPI handlers. I will fix
this again shortly.
This has been working fairly well for me for a while, but I have tweaked
it again prior to commit since my last major testing round. The only
outstanding problem that I know of is PG_G related, which is why there
is an option for it (not on by default for SMP). I have seen world
speedups of a few percent (as much as 4 or 5% in one case) but I have
*not* accurately measured this - I am a bit sceptical of these numbers.
but never accept'ed, so they must be destroyed. Originally, unp_drop()
detected this situation by checking if so->so_head is non-NULL.
However, since revision 1.54 of uipc_socket.c (Feb 1999), so->so_head
is set to NULL before calling soabort(), so any unix-domain sockets
waiting to be accept'ed are leaked if the server socket is closed.
Resolve this by moving the socket destruction code into uipc_abort()
itself, and making it unconditional (the other caller of unp_drop()
never needs the socket to be destroyed). Use unp_detach() to avoid
the original code duplication when destroying the socket.
PR: kern/17895
Reviewed by: dwmalone (an earlier version of the patch)
MFC after: 1 week
our feet when we look inside timecounter structures.
Make the "sync_other" code more robust by never overwriting the
tc_next field.
Add counters for the bin[up]time functions.
Call tc_windup() in tc_init() and switch_timecounter() to make sure
we get all the fields set right.
New locks are:
- pgrpsess_lock which locks the whole pgrps and sessions,
- pg_mtx which protects the pgrp members, and
- s_mtx which protects the session members.
Please refer to sys/proc.h for the coverage of these locks.
Changes on the pgrp/session interface:
- pgfind() needs the pgrpsess_lock held.
- The caller of enterpgrp() is responsible to allocate a new pgrp and
session.
- Call enterthispgrp() in order to enter an existing pgrp.
- pgsignal() requires a pgrp lock held.
Reviewed by: jhb, alfred
Tested on: cvsup.jp.FreeBSD.org
(which is a quad-CPU machine running -current)
While in userland, keep the thread's ucred reference in a shadow
field so that the usual place to store it is NULL.
If DIAGNOSTIC is not set, the thread ucred is kept valid until the next
kernel entry, at which time it is checked against the process cred
and possibly corrected. Produces a BIG speedup in
kernels with INVARIANTS set. (A previous commit corrected it
for the non INVARIANTS case already)
Reviewed by: dillon@freebsd.org