timeout. If DIAGNOSTIC is turned on, then display a message to the console
with a map of which CPUs failed to stop or restart. This gives an SMP box
at least a fighting chance of getting into DDB if one of the other CPUs has
interrupts disabled.
Alter consumers of this method to conform to the new convention.
Minor cosmetic adjustments to bus.h.
This isn't of concern as this interface isn't in use yet.
1) mpsafe (protect the refcount with a mutex).
2) reduce duplicated code by removing the inlined crdup() from crcopy()
and make crcopy() call crdup().
3) use M_ZERO flag when allocating initial structs instead of calling bzero
after allocation.
4) expand the size of the refcount from a u_short to an u_int, by using
shorts we might have an overflow.
Glanced at by: jake
use a mutex lock when looking up/deleting entries on the hashlist
use a mutex lock on each uidinfo when updating fields
make uifree() a void function rather than 'int' since no one cares
allocate uidinfo structs with the M_ZERO flag and don't explicitly initialize
them
Assisted by: eivind, jhb, jakeb
a kevent upon completion of the I/O. Specifically, introduce a new type
of sigevent notification, SIGEV_EVENT. If sigev_notify is SIGEV_EVENT,
then sigev_notify_kqueue names the kqueue that should receive the event
and sigev_value contains the "void *" is copied into the kevent's udata
field.
In contrast to the existing interface, this one: 1) works on
the Alpha 2) avoids the extra copyin() call for the kevent because all
of the information needed is in the sigevent and 3) could be
applied to request a single kevent upon completion of an entire lio_listio().
Reviewed by: jlemon
could not compress into clusters. This could result in lots of
wasted clusters while recieving small packets from an interface
that uses clusters for all it's packets.
Patch is partially from BSDi (limiting the size of the copy) and
based on a patch for 4.1 by Ian Dowse <iedowse@maths.tcd.ie> and
myself.
Reviewed by: bmilekic
Obtained From: BSDi
Submitted by: iedowse
- Use the mutex in hardclock to ensure no races between it and
softclock.
- Make softclock be INTR_MPSAFE and provide a flag,
CALLOUT_MPSAFE, which specifies that a callout handler does not
need giant. There is still no way to set this flag when
regstering a callout.
Reviewed by: -smp@, jlemon
Removed most of the hacks that were trying to deal with low-memory
situations prior to now.
The new code is based on the concept that I/O must be able to function in
a low memory situation. All major modules related to I/O (except
networking) have been adjusted to allow allocation out of the system
reserve memory pool. These modules now detect a low memory situation but
rather then block they instead continue to operate, then return resources
to the memory pool instead of cache them or leave them wired.
Code has been added to stall in a low-memory situation prior to a vnode
being locked.
Thus situations where a process blocks in a low-memory condition while
holding a locked vnode have been reduced to near nothing. Not only will
I/O continue to operate, but many prior deadlock conditions simply no
longer exist.
Implement a number of VFS/BIO fixes
(found by Ian): in biodone(), bogus-page replacement code, the loop
was not properly incrementing loop variables prior to a continue
statement. We do not believe this code can be hit anyway but we
aren't taking any chances. We'll turn the whole section into a
panic (as it already is in brelse()) after the release is rolled.
In biodone(), the foff calculation was incorrectly
clamped to the iosize, causing the wrong foff to be calculated
for pages in the case of an I/O error or biodone() called without
initiating I/O. The problem always caused a panic before. Now it
doesn't. The problem is mainly an issue with NFS.
Fixed casts for ~PAGE_MASK. This code worked properly before only
because the calculations use signed arithmatic. Better to properly
extend PAGE_MASK first before inverting it for the 64 bit masking
op.
In brelse(), the bogus_page fixup code was improperly throwing
away the original contents of 'm' when it did the j-loop to
fix the bogus pages. The result was that it would potentially
invalidate parts of the *WRONG* page(!), leading to corruption.
There may still be cases where a background bitmap write is
being duplicated, causing potential corruption. We have identified
a potentially serious bug related to this but the fix is still TBD.
So instead this patch contains a KASSERT to detect the problem
and panic the machine rather then continue to corrupt the filesystem.
The problem does not occur very often.. it is very hard to
reproduce, and it may or may not be the cause of the corruption
people have reported.
Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>
Pre-rfork code assumed inherent locking of a process's file descriptor
array. However, with the advent of rfork() the file descriptor table
could be shared between processes. This patch closes over a dozen
serious race conditions related to one thread manipulating the table
(e.g. closing or dup()ing a descriptor) while another is blocked in
an open(), close(), fcntl(), read(), write(), etc...
PR: kern/11629
Discussed with: Alexander Viro <viro@math.psu.edu>
are in softclock() for a long time. The old code already did an
splx()/slphigh() pair here, I just missed adding in the equivalent mutex
operations on sched_lock earlier.
may block on a mutex while on the sleep queue without corrupting
it.
- Move dropping of Giant to after the acquire of sched_lock.
Tested by: John Hay <jhay@icomtek.csir.co.za>
jhb
untimeout() not being called with Giant in those functions. For now,
use the sched_lock to protect the callout wheel in softclock() and in
the various timeout and callout functions.
Noticed by: tegge
acquire Giant as needed in functions that call mi_switch(). The releases
need to be done outside of the sched_lock to avoid potential deadlocks
from trying to acquire Giant while interrupts are disabled.
Submitted by: witness
idea to be holding the sched_lock while we are calling it. As such,
release sched_lock before calling CURSIG() in msleep() and mawait() and
reacquire it after CURSIG() returns.
Submitted by: witness
tsleep(). Namely, mawait() takes an extra argument which is a mutex
to drop when going to sleep. Just as with msleep(), if the priority
argument includes the PDROP flag, then the mutex will be dropped and will
not be reacquired when the process wakes up.
- Add in a backwards compatible macro await() that passes in NULL as the
mutex argument to mawait().
macros which provide the same functionality and are a bit more
efficient, convert use of CIRCLEQ's in resource manager to TAILQ's.
Approved by: Garrett Wollman <wollman@khavrinen.lcs.mit.edu>
whether the given ext_buf is shared.
* Have the sf_bufs be setup with the mbuf subsystem using MEXTADD() with the
two new arguments.
Note: m_pulldown() is somewhat crotchy; the added comment explains the
situation.
Reviewed by: jlemon
could debug processes after the commit that introduced the typo.
Security is good, but security is not always the same as turning things
off :-).
PR: kern/22711
Obtained from: brooks@one-eyed-alien.net
device tree and resource manager contents. This is the kernel side of
the upcoming libdevinfo, which will expose this information to userspace
applications in a trivial fashion.
Remove the now-obsolete DEVICE_SYSCTLS code.
syscall compare against a variable sv_minsigstksz in struct
sysentvec as to properly take the size of the machine- and
ABI dependent struct sigframe into account.
The SVR4 and iBCS2 modules continue to have a minsigstksz of
8192 to preserve behavior. The real values (if different) are
not known at this time. Other ABI modules use the real
values.
The native MINSIGSTKSZ is now defined as follows:
Arch MINSIGSTKSZ
---- -----------
alpha 4096
i386 2048
ia64 12288
Reviewed by: mjacob
Suggested by: bde
function declared in kern_ktr.c. The only inline checks left are the
checks that compare KTR_COMPILE with the supplied mask and thus should
be optimized away into either nothing or a direct call to ktr_tracepoint().
- Move several KTR-related options to opt_ktr.h now that they are only
needed by kern_ktr.c and not by ktr.h.
- Add in the ktr_verbose functionality if KTR_EXTEND is turned on. If the
global variable 'ktr_verbose' is non-zero, then KTR messages will be
dumped to the console. This variable can be set by either kernel code
or via the 'debug.ktr_verbose' sysctl. It defaults to off unless the
KTR_VERBOSE kernel option is specified in which case it defaults to on.
This can be useful when the machine locks up spinning in a loop with
interrupts disabled as you might be able to see what it is doing when it
locks up.
Requested by: phk
wakeup all of the sleeping threads when we free only one buffer. This
avoids us having to needlessly try again (and fail, and go back to
sleep) for all the threads sleeping. We will now only wakeup the
thread we know will succeed.
Reviewed by: green
freelist. Should now be thread-friendly, in part.
Note: More work is needed in uipc_syscalls.c, but it will have to wait until
the socket locking issues are at least 80% implemented and committed.
getnewvnode(). Otherwise routines called from VOP_INACTIVE() might
attempt to remove the vnode from a free list the vnode isn't on,
causing corruption.
PR: 18012
This removes support for booting current kernels with very old bootblocks.
Device driver writers: Please remove initializations for the d_bmaj
field in your cdevsw{}.
current implementation, jail neither virtualizes the Sys V IPC namespace,
nor provides inter-jail protections on IPC objects.
o Support for System V IPC can be enabled by setting jail.sysvipc_allowed=1
using sysctl.
o This is not the "real fix" which involves virtualizing the System V
IPC namespace, but prevents processes within jail from influencing those
outside of jail when not approved by the administrator.
Reported by: Paulo Fragoso <paulo@nlink.com.br>
in the p_candebug() function. Synchronize with sef's CHECKIO()
macro from the old procfs, which seems to be a good source of security
checks.
Obtained from: TrustedBSD Project
This allows writing to DVD-RAM, PD and similar drives that probe as CD
devices. Note that these are randomly writeable devices, not
sequential-only devices like CD-R drives, which are supported by cdrecord.
Add a new flag value for dsopen(), DSO_COMPATLABEL. The cd(4) driver now
uses this flag instead of the DSO_NOLABELS flag. The DSO_NOLABELS always
used a "fake" disklabel for the entire disk, provided by the caller.
With the DSO_COMPATLABEL flag, dsopen() will first search the media for a
label, and if it finds a label, it will use that label. Otherwise it will
use the fake disklabel provided by the caller. This provides backwards
compatibility, since we will still have labels for ISO9660 media.
It also provides new functionality, since you can now have a regular BSD
disklabel on read-only media, or on writeable media (e.g. DVD-RAM).
Bruce and I both think that we should eventually (in a few years) get
away from using disklabels for ISO9660 media, and just use the whole disk
device (/dev/cd0). At that point disklabel handling in the cd(4) driver
could follow the "normal" model, as used in the da(4) driver.
Also, clean up the path in a couple of places in cdregister(). (Thanks to
Nick Hibma for catching that bug.)
Reviewed by: bde
Otherwise, aio_read() and aio_write() on sockets are broken if a kevent is
registered. (The code after kevent registration for handling sockets assumes
that the struct file pointer "fp" still refers to the socket, not the kqueue.)
<sys/proc.h> to <sys/systm.h>.
Correctly document the #includes needed in the manpage.
Add one now needed #include of <sys/systm.h>.
Remove the consequent 48 unused #includes of <sys/proc.h>.
because it only takes a struct tag which makes it impossible to
use unions, typedefs etc.
Define __offsetof() in <machine/ansi.h>
Define offsetof() in terms of __offsetof() in <stddef.h> and <sys/types.h>
Remove myriad of local offsetof() definitions.
Remove includes of <stddef.h> in kernel code.
NB: Kernelcode should *never* include from /usr/include !
Make <sys/queue.h> include <machine/ansi.h> to avoid polluting the API.
Deprecate <struct.h> with a warning. The warning turns into an error on
01-12-2000 and the file gets removed entirely on 01-01-2001.
Paritials reviews by: various.
Significant brucifications by: bde
it can function before malloc(9) is up and running.
- Add two new options WITNESS_DDB and WITNESS_SKIPSPIN. If WITNESS_SKIPSPIN
is enabled, then spin mutexes are ignored by the WITNESS code. If
WITNESS_DDB is turned on and DDB is compiled into the kernel, then the
kernel will drop into DDB when either a lock hierarchy violation occurs
or mutexes are held when going to sleep.
- Add some new sysctls:
debug.witness_ddb is a read-write sysctl that corresponds to WITNESS_DDB.
The kernel option merely changes the default value to on at boot.
debug.witness_skipspin is a read-only sysctl that one can use to determine
if the kernel was compiled with WITNESS_SKIPSPIN.
- Wipe out the BSD/OS-specific lock order lists. We get to build our own
lists now as we add mutexes to the kernel.
type of software interrupt. Roughly, what used to be a bit in spending
now maps to a swi thread. Each thread can have multiple handlers, just
like a hardware interrupt thread.
- Instead of using a bitmask of pending interrupts, we schedule the specific
software interrupt thread to run, so spending, NSWI, and the shandlers
array are no longer needed. We can now have an arbitrary number of
software interrupt threads. When you register a software interrupt
thread via sinthand_add(), you get back a struct intrhand that you pass
to sched_swi() when you wish to schedule your swi thread to run.
- Convert the name of 'struct intrec' to 'struct intrhand' as it is a bit
more intuitive. Also, prefix all the members of struct intrhand with
'ih_'.
- Make swi_net() a MI function since there is now no point in it being
MD.
Submitted by: cp
Instead of:
foo = malloc(sizeof(foo), M_WAIT);
bzero(foo, sizeof(foo));
You can now (and please do) use:
foo = malloc(sizeof(foo), M_WAIT | M_ZERO);
In the future this will enable us to do idle-time pre-zeroing of
malloc-space.
reducues the maintenance load for the mutex code. The only MD portions
of the mutex code are in machine/mutex.h now, which include the assembly
macros for handling mutexes as well as optionally overriding the mutex
micro-operations. For example, we use optimized micro-ops on the x86
platform #ifndef I386_CPU.
- Change the behavior of the SMP_DEBUG kernel option. In the new code,
mtx_assert() only depends on INVARIANTS, allowing other kernel developers
to have working mutex assertiions without having to include all of the
mutex debugging code. The SMP_DEBUG kernel option has been renamed to
MUTEX_DEBUG and now just controls extra mutex debugging code.
- Abolish the ugly mtx_f hack. Instead, we dynamically allocate
seperate mtx_debug structures on the fly in mtx_init, except for mutexes
that are initiated very early in the boot process. These mutexes
are declared using a special MUTEX_DECLARE() macro, and use a new
flag MTX_COLD when calling mtx_init. This is still somewhat hackish,
but it is less evil than the mtx_f filler struct, and the mtx struct is
now the same size with and without mutex debugging code.
- Add some micro-micro-operation macros for doing the actual atomic
operations on the mutex mtx_lock field to make it easier for other archs
to override/optimize mutex ops if needed. These new tiny ops also clean
up the code in some places by replacing long atomic operation function
calls that spanned 2-3 lines with a short 1-line macro call.
- Don't call mi_switch() from mtx_enter_hard() when we block while trying
to obtain a sleep mutex. Calling mi_switch() would bogusly release
Giant before switching to the next process. Instead, inline most of the
code from mi_switch() in the mtx_enter_hard() function. Note that when
we finally kill Giant we can back this out and go back to calling
mi_switch().
"administrative" authorization checks. In most cases, the VADMIN test
checks to make sure the credential effective uid is the same as the file
owner.
o Modify vaccess() to set VADMIN as an available right if the uid is
appropriate.
o Modify references to uid-based access control operations such that they
now always invoke VOP_ACCESS() instead of using hard-coded policy checks.
o This allows alternative UFS policies to be implemented by replacing only
ufs_access() (such as mandatory system policies).
o VOP_ACCESS() requires the caller to hold an exclusive vnode lock on the
vnode: I believe that new invocations of VOP_ACCESS() are always called
with the lock held.
o Some direct checks of the uid remain, largely associated with the QUOTA
and SUIDDIR code.
Reviewed by: eivind
Obtained from: TrustedBSD Project