Commit Graph

9300 Commits

Author SHA1 Message Date
John Baldwin
6b81555744 Axe KTR_ALQ_MASK now that KTR_WITNESS is off unless you hack an #ifdef
in subr_witness.c.  I did add a comment in subr_witness.c noting that
KTR_WITNESS is incompatible with KTR_ALQ.
2006-01-25 14:57:23 +00:00
Stephan Uphoff
6807424d19 Back out changes made in rev. 1.151.
They were bogus.

Cluebat applied by: jhb@
2006-01-25 02:05:47 +00:00
Don Lewis
f4af687a3b Touch all the pages wired by sysctl_wire_old_buffer() to avoid PTE
modified bit emulation traps on Alpha while holding locks in the
sysctl handler.

A better solution would be to pass a hint to the Alpha pmap code to
tell mark these pages as modified when they as they are being wired,
but that appears to be more difficult to implement.

Suggested by: jhb
MFC after:	3 days
2006-01-25 01:03:34 +00:00
John Baldwin
67f7fe8c01 Whitespace fix. 2006-01-24 22:24:05 +00:00
John Baldwin
2b604e82b2 - Add a new KTR_SUBSYS in place of KTR_SPARE1 to serve as a subsystem
placeholder similar to KTR_DEV.  Explain the use of KTR_DEV and
  KTR_SUBSYS in a comment as well.
- Retire KTR_WITNESS and instead have KTR_WITNESS default to off but use
  KTR_SUBSYS if it is enabled.
2006-01-24 22:23:45 +00:00
David Xu
1aa4c324ee Add locking annotation and comments about socket, pipe, fifo problem.
Temporarily fix a locking problem for socket I/O.
2006-01-24 07:24:24 +00:00
David Xu
e6bdc05ff7 Er, rescure a deleted comment line. 2006-01-24 02:50:42 +00:00
David Xu
bd793be3c6 More cleanup for aio code:
1) unregsiter kqueue filter for EVFILT_LIO.
2) free uma_zones.
3) call setsid directly to enter another session rather than
   implementing by itself.

Submitted by: jhb
2006-01-24 02:46:15 +00:00
David Xu
7f34b521c7 Add bracket. 2006-01-23 23:46:30 +00:00
John Baldwin
704c9f00fb Fix a vnode reference leak in the ktrace code. We always grab a reference
to the vnode at the start of ktr_writerequest() but were missing the
corresponding vrele() after we finished the write operation.

Reported by:	jasone
2006-01-23 21:45:32 +00:00
Stephan Uphoff
03001f59c8 Hopefully fix the "calcru: runtime went backwards from ..." problem by
keeping the resource values locked (where needed) while we use them
for calculations.

MFC after:	3 days
2006-01-23 19:15:13 +00:00
Andre Oppermann
fd2413398c In mb_zinit_pack() explicitly ignore the return value of uma_zalloc_arg().
The success of the cluster allocation is checked through a field in the
mbuf structure.  This change is non-functional but properly silences code
inspection tools.

Found by:	Coverity Prevent(tm)
Coverity ID:	CID807
Sponsored by:	TCP/IP Optimization Fundraise 2005
2006-01-23 15:49:01 +00:00
David Xu
68d7111884 Verify all supported notification types. 2006-01-23 10:27:15 +00:00
David Xu
a9bf5e37ae 1) Merge _aio_aqueue and aio_aqueue, check quota in aio_aqueue,
so that lio_listio won't exceed the quota.
2) Remove lio_ref_count, it is no longer used.
2006-01-23 02:49:34 +00:00
Alan Cox
bb53e2bf27 Remove an unnecessary call to pmap_remove_all(). The given page is not
mapped because its contents are invalid.

Reviewed by: tegge
2006-01-23 00:00:45 +00:00
Don Lewis
1dd5fc0fde Tweak previous vfs_lookup.c commit to return an EINVAL error from
lookup() instead of EPERM when a DELETE or RENAME operation is
attempted on "..".

In kern_unlink(), remap EINVAL errors returned from namei() to EPERM
to match existing (and POSIX required) behaviour.

Discussed with:	bde
MFC after:	3 days
2006-01-22 19:37:02 +00:00
David Xu
8c0d9af5bf Fix a bogus panic. 2006-01-22 09:39:59 +00:00
David Xu
9b84335c84 Decrease kaio_active_count first, because user process may go away
after we notified it.
2006-01-22 09:25:52 +00:00
David Xu
4ca4c9ee68 Regen. 2006-01-22 06:01:48 +00:00
David Xu
1ce9182407 Make aio code MP safe. 2006-01-22 05:59:27 +00:00
Nate Lawson
a2b31c5b4f Add a devd(8) event that is sent after the system resumes. This can be
used by utilities to reset moused(8), for example.  The syntax is:

    !system=kern subsystem=power type=resume

Note that it would be nice to have notification of suspend, but it's more
difficult since there would have to be a method of doing request/ack
to userland and automatically timing out if no response.  apm(4) has a
similar mechanism.

MFC after:	2 weeks
2006-01-22 01:06:25 +00:00
Robert Watson
c1250af683 Convert remaining functions to ANSI C function declarations.
MFC after:	1 week
2006-01-22 00:30:46 +00:00
Alan Cox
e5e6093ba9 Avoid a vm object reference leak in a rarely used code path.
An executable contains at most one PT_INTERP program header.  Therefore,
the loop that searches for it can terminate after it is found rather than
iterating over the entire set of program headers.

Eliminate an unneeded initialization.

Reviewed by: tegge
2006-01-21 20:11:49 +00:00
Don Lewis
bea7a8d75c Return EPERM from lookup() if cn_nameiop is DELETE or RENAME and
the last component of the path name is "..".  This keeps VOP_LOOKUP()
from locking vnodes in reverse order.

Tested by:	Denis Shaposhnikov <dsh AT vlink DOT ru>
MFC after:	3 days
2006-01-21 19:57:56 +00:00
Robert Watson
6be2c41a22 Convert remaining functions in vfs_subr.c from K&R prototypes to ANSI C
prototypes, as the majority of new functions added have been in this
style.  Changing prototype style now results in gcc noticing that the
implementation of vn_pollrecord() has a 'short' argument instead of
'int' as prototyped in vnode.h, so correct that definition.  In practice
this didn't matter as only poll flags in the lower 16 bits are used.

MFC after:	1 week
2006-01-21 19:42:10 +00:00
John Baldwin
267ec43593 When loading a driver that is a subclass of another driver don't set the
devclass's parent pointer if the two drivers share the same devclass.  This
can happen if the drivers use the same new-bus name.  For example, we
currently have 3 drivers that use the name "pci": the generic PCI bus
driver, the ACPI PCI bus driver, and the OpenFirmware PCI bus driver.  If
the ACPI PCI bus driver was defined as a subclass of the generic PCI bus
driver, then without this check the "pci" devclass would point to itself
as its parent and device_probe_child() would spin forever when it
encountered the first PCI device that did have a matching driver.

Reviewed by:	dfr, imp, new-bus@
2006-01-20 21:59:13 +00:00
Julian Elischer
11f4763dd4 Return the thread name in the kinfo_proc structure.
Also correct the comment describing what the value is.
2006-01-18 20:27:43 +00:00
John Baldwin
25e498b4b0 Always include the lock_classes[] array in the kernel. The
"is it a spinlock" test in mtx_destroy() needs it even in non-debug
kernels.

Reported by:	danfe
2006-01-18 18:02:50 +00:00
Juli Mallett
b241b0a239 Since p_cansee will end up dereferencing p_ucred, don't check for p_ucred
equal to NULL several times later.  p_ucred "should probably not" be NULL
if the process isn't PRS_NEW anyway.  This is strongly reinforced by the fact
that we don't see frequent crashes here.  Remove the checks after p_cansee and
add a KASSERT right before it.

Found by:	Coverity Prevent (tm)

Also trim one nearby trailing space.
2006-01-17 20:25:01 +00:00
John Baldwin
6ef970a972 Bah. Fix 'show lock' to actually be compiled in. I had just fixed this in
p4 but had an older subr_lock.c on the machine I committed to CVS from.
2006-01-17 16:58:32 +00:00
John Baldwin
83a81bcb14 Add a new file (kern/subr_lock.c) for holding code related to struct
lock_obj objects:
- Add new lock_init() and lock_destroy() functions to setup and teardown
  lock_object objects including KTR logging and registering with WITNESS.
- Move all the handling of LO_INITIALIZED out of witness and the various
  lock init functions into lock_init() and lock_destroy().
- Remove the constants for static indices into the lock_classes[] array
  and change the code outside of subr_lock.c to use LOCK_CLASS to compare
  against a known lock class.
- Move the 'show lock' ddb function and lock_classes[] array out of
  kern_mutex.c over to subr_lock.c.
2006-01-17 16:55:17 +00:00
John Baldwin
550d1c9392 Initialize thread0.td_contested in init_turnstiles() rather than
mutex_init() as it is used by the turnstile code and is not mutex-specific.
2006-01-17 16:47:42 +00:00
John Baldwin
3eb9cab0c6 Garbage collect turnstile_empty() since it is unused. 2006-01-17 16:40:20 +00:00
Poul-Henning Kamp
25a14196dd Fix an 11 year old mistake: Let the hash functions take a void* instead
of unsigned char* argument.
2006-01-17 15:35:57 +00:00
Tor Egge
dffaf91aa3 Set flag in needsbuffer while still holding bqlock to avoid lost wakeup. 2006-01-16 22:09:47 +00:00
Christian S.J. Peron
323203d389 vfs_busy can only return something useful if MNTK_UNMOUNT has been set.
Since we are using vfs_busy() on a freshly allocated mount structure, use
(void) to show that we do not care about the return value.

Found with:	Coverity Prevent (tm)
MFC after:	2 weeks
2006-01-15 20:14:11 +00:00
Robert Watson
6994eebcab Cast VFS_STATFS() in vfs_domount() to (void) to indicate that ignoring the
return value is intentional: this is simply an attempt to pre-cache the
statfs state.

Found with:	Coverity Prevent (tm)
MFC after:	3 days
2006-01-15 20:01:05 +00:00
Christian S.J. Peron
8213baf002 Initialize ki to p->p_aioinfo after we know it's going to be referencing
a valid kaioinfo structure. This avoids a potential NULL pointer dereference.

Found with:	Coverity Prevent(tm)
MFC after:	2 weeks
2006-01-15 01:55:45 +00:00
Ruslan Ermilov
6a61c14ee1 AMD64 also supports disk slices. 2006-01-14 20:47:11 +00:00
Poul-Henning Kamp
c9df826b0a Correct STAILQ usage in purge of resourcelist.
Found with:   Coverity Prevent(tm)
2006-01-14 09:41:35 +00:00
Scott Long
0f92108d32 Add the following to the taskqueue api:
taskqueue_start_threads(struct taskqueue **, int count, int pri,
			const char *name, ...);

This allows the creation of 1 or more threads that will service a single
taskqueue.  Also rework the taskqueue_create() API to remove the API change
that was introduced a while back.  Creating a taskqueue doesn't rely on
the presence of a process structure, and the proc mechanics are much better
encapsulated in taskqueue_start_threads().  Also clean up the
taskqueue_terminate() and taskqueue_free() functions to safely drain
pending tasks and remove all associated threads.

The TASKQUEUE_DEFINE and TASKQUEUE_DEFINE_THREAD macros have been changed
to use the new API, but drivers compiled against the old definitions will
still work.  Thus, recompiling drivers is not a strict requirement.
2006-01-14 01:55:24 +00:00
Robert Watson
bc03ea7f49 When calling bioq_first() to see if a queue is empty in bioq_disksort(),
don't save the return value as we won't use it.

Noticed by:	Coverity Prevent analysis tool
MFC after:	3 days
2006-01-13 23:27:12 +00:00
Robert Watson
b8ae1cd619 Add sosend_dgram(), a greatly reduced and simplified version of sosend()
intended for use solely with atomic datagram socket types, and relies
on the previous break-out of sosend_copyin().  Changes to allow UDP to
optionally use this instead of sosend() will be committed as a
follow-up.
2006-01-13 10:22:01 +00:00
Robert Watson
d7dca9034c XXX a comment in uipc_usrreq.c that requires updating. 2006-01-13 00:00:32 +00:00
Alfred Perlstein
7d7e053c21 Novel idea, don't print a string if it is NULL!
This protects people from loading _really_ old modules, like say from
5.x to a 6.x or 7.x system, like for instance right after an upgrade.
2006-01-12 19:15:14 +00:00
Scott Long
1c3a3b0bd0 The interlock in taskqueue_terminate() is completely wrong for taskqueues
that use spinlocks.  Remove it for now.
2006-01-11 00:37:13 +00:00
Poul-Henning Kamp
d3e64681d6 Move the old BSD4.3 tty compatibility from (!BURN_BRIDGES && COMPAT_43)
to COMPAT_43TTY.

Add COMPAT_43TTY to NOTES and */conf/GENERIC

Compile tty_compat.c only under the new option.

Spit out
	#warning "Old BSD tty API used, please upgrade."
if ioctl_compat.h gets #included from userland.
2006-01-10 09:19:10 +00:00
Scott Long
9df1a6dd61 Add functions and macros and refactor code to make it easier to manage
fast taskqueues.  The following have been added:

TASKQUEUE_FAST_DEFINE() - create a global task queue.
    an arbitrary execution context.
TASKQUEUE_FAST_DEFINE_THREAD() - create a global taskqueue that uses a
    dedicated kthread.
taskqueue_create_fast() - create a local/private taskqueue.

These are all complimentary of the standard taskqueue functions.  They are
primarily useful for fast interrupt handlers that can only use spinlock for
synchronization.

I personally think that the taskqueue API is starting to get too narrow and
hairy, but fixing it will require a major redesign on the API.  Such a
redesign would be good but would break compatibility with FreeBSD 6.x, so
it really isn't desirable at this time.

Submitted by: sam
2006-01-10 06:31:12 +00:00
Tor Egge
82be0a5a24 Add marker vnodes to ensure that all vnodes associated with the mount point are
iterated over when using MNT_VNODE_FOREACH.

Reviewed by:	truckman
2006-01-09 20:42:19 +00:00
Scott Long
861a23087b If destroying a spinlock, make sure that it is exited properly.
Submitted by: jhb
MFC After: 3 days
2006-01-08 00:18:34 +00:00
John Baldwin
3b783acd2a Revert an untested local change that crept in with the lo_class changes
and subsequently broke the build.  This change is supposed to fix the
case where doing a mtx_destroy() off a spin mutex while you hold it fails.
If it had been tested I would just leave it in, but it hasn't been tested
yet, so it will have to wait until later.
2006-01-07 14:03:15 +00:00
David Xu
0a5cd498bb Add a new feature to thr_kill, if thread ID argument is -1, send
signals to all threads except current sender.
2006-01-07 03:15:21 +00:00
Tai-hwa Liang
75d6a87fa3 Trying to fix compilation bustage introduced in rev1.160 by converting
a missing lo_class to LO_CLASSINDEX().
2006-01-07 02:07:08 +00:00
John Baldwin
3c6decc327 Trim another pointer from struct lock_object (and thus from struct mtx and
struct sx).  Instead of storing a direct pointer to a our lock_class
struct in lock_object, reserve 4 bits in the lo_flags field to serve as an
index into a global lock_classes array that contains pointers to the lock
classes.  Only debugging code such as WITNESS or INVARIANTS checks and KTR
logging need to access the lock_class member, so this shouldn't add any
overhead to production kernels.  It might add some slight overhead to
kernels using those debug options however.

As with the previous set of changes to lock_object, this is going to
completely obliterate the kernel ABI, so be sure to recompile all your
modules.
2006-01-06 18:07:32 +00:00
John Baldwin
af56abaab5 Return error from fget_write() rather than hardcoding EBADF now that
fget_write() DTRT.

Requested by:	bde
2006-01-06 16:34:22 +00:00
John Baldwin
38f63f7e47 Return EBADF rather than EINVAL for FWRITE failure as per POSIX.
MFC after:	1 week
2006-01-06 16:30:30 +00:00
John Baldwin
e730167f16 Remove XXX comments complaining that write(2) on a read-only descriptor
returns EBADF.  That errno is correct and is mandated by POSIX.  It also
goes back to revision 1.1 of our CVS history (i.e. 4.4BSD).

The _fget() function should probably also be upated as it currently returns
EINVAL in that case rather than EBADF.  (It does return EBADF for reads
on a write-only descriptor without any XXX comments oddly enough.)

Discussed with:	scottl, grog, mjacob, bde
2006-01-05 22:20:31 +00:00
Bjoern A. Zeeb
ba0b6851b4 Minor whitespace cleanup. 2006-01-04 17:40:54 +00:00
Poul-Henning Kamp
d5f1e0d1ef Deorbit ttymalloc() in preference for ttyalloc() 2006-01-04 09:59:07 +00:00
Poul-Henning Kamp
8607f52b66 Use ttyalloc() instead of ttymalloc() 2006-01-04 09:09:46 +00:00
Poul-Henning Kamp
246b8d448a Use MTX_SYSINIT to set up the tty list mutex. 2006-01-04 08:22:39 +00:00
Diomidis Spinellis
c3d78136c9 Fix style bug.
Prompted by:	bde
2006-01-04 07:50:54 +00:00
Diomidis Spinellis
f8ccc6ceb9 Replace tv_usec normalization with the return of EINVAL.
This addresses two objections to the previous behavior,
and unbreaks the alpha tinderbox build.

TODO: update the utimes(2) man page.
2006-01-04 00:47:13 +00:00
Diomidis Spinellis
51339e8593 Normalize the tv_usec part of the utimes(2) arguments to ensure
that a file's atime and mtime are only set to correct fractional
second values (0-999999000ns with the current interface).
Prior to this change users could create files with values outside
that range.  Moreover, on 32-bit machines tv_usec offsets larger than
4.3s would result in an unnormalized AND wrong timestamp value,
due to overflow.

MFC after:	1 week
2006-01-03 21:58:21 +00:00
Alexander Leidinger
ef39c05baa MI changes:
- provide an interface (macros) to the page coloring part of the VM system,
   this allows to try different coloring algorithms without the need to
   touch every file [1]
 - make the page queue tuning values readable: sysctl vm.stats.pagequeue
 - autotuning of the page coloring values based upon the cache size instead
   of options in the kernel config (disabling of the page coloring as a
   kernel option is still possible)

MD changes:
 - detection of the cache size: only IA32 and AMD64 (untested) contains
   cache size detection code, every other arch just comes with a dummy
   function (this results in the use of default values like it was the
   case without the autotuning of the page coloring)
 - print some more info on Intel CPU's (like we do on AMD and Transmeta
   CPU's)

Note to AMD owners (IA32 and AMD64): please run "sysctl vm.stats.pagequeue"
and report if the cache* values are zero (= bug in the cache detection code)
or not.

Based upon work by:	Chad David <davidc@acns.ab.ca> [1]
Reviewed by:		alc, arch (in 2004)
Discussed with:		alc, Chad David, arch (in 2004)
2005-12-31 14:39:20 +00:00
Pawel Jakub Dawidek
d362c40d3a Improve memguard a bit:
- Provide tunable vm.memguard.desc, so one can specify memory type without
  changing the code and recompiling the kernel.
- Allow to use memguard for kernel modules by providing sysctl
  vm.memguard.desc, which can be changed to short description of memory
  type before module is loaded.
- Move as much memguard code as possible to memguard.c.
- Add sysctl node vm.memguard. and move memguard-specific sysctl there.
- Add malloc_desc2type() function for finding memory type based on its
  short description (ks_shortdesc field).
- Memory type can be changed (via vm.memguard.desc sysctl) only if it
  doesn't exist (will be loaded later) or when no memory is allocated yet.
  If there is allocated memory for the given memory type, return EBUSY.
- Implement two ways of memory types comparsion and make safer/slower the
  default.
2005-12-30 11:45:07 +00:00
Pawel Jakub Dawidek
e7736557d6 Print a warning when we miss vinactive() call, because of race in vget().
The race is very real, but conditions needed for triggering it are rather
hard to meet now.
When gjournal will be committed (where it is quite easy to trigger) we need
to fix it.

For now, verify if it is really hard to trigger.

Discussed with:	kan
2005-12-29 22:52:09 +00:00
John Baldwin
8963150678 patch(1) and I aren't friends today. Axe a duplicate copy of
the msleep_spin() function definition.

Spotted by:	pjd
2005-12-29 21:15:32 +00:00
John Baldwin
0cb7e6aec8 Add a new function msleep_spin() which is a slightly stripped down version
of msleep().  msleep_spin() doesn't support changing the priority of the
thread while it is asleep nor does it support interruptible sleeps (PCATCH)
or the PDROP flag.  It does support timeouts however.  It differs from
msleep() in that the passed in mutex is a spin mutex.  This means one can
use msleep_spin() and wakeup() with a spin mutex similar to msleep() and
wakeup() with a regular mutex.  Note that the spin mutex in question needs
to come before sched_lock and the sleepq locks in lock order.
2005-12-29 20:57:45 +00:00
John Baldwin
b0e9883e2f Teach WITNESS_SAVE() and WITNESS_RESTORE() to work with spin locks instead
of only sleep locks.
2005-12-29 20:54:25 +00:00
John Baldwin
0a46ed7d56 Fix a deadlock I introduced with the recently added printf to warn about
spin locks that are not in the static order list.  It is not safe to call
printf while holding the witness spin mutex since the console drivers that
back printf may need to use their own spin locks which would try to talk
to witness when they were locked.  Given this, it is possible for one
CPU to lock a console driver lock (such as sio) which then tries to lock
the witness lock while another CPU is doing the printf while holding the
witness lock.  Fix this by moving the printf outside of the witness lock.
All other printf's in witness are already correct.

MFC after:	3 days
2005-12-29 20:53:01 +00:00
John Baldwin
42b6a681bc Increment kobj_lookup_misses on a miss rather than decrementing it.
Otherwise, the miss count is actually -kobj_lookup_misses.  Mostly a
pedantic change as KOBJ_STATS isn't on by default.
2005-12-29 18:00:42 +00:00
David Xu
3357835a46 Add code to report zombie state.
PR: threads/91044
MFC after: 3 days
2005-12-29 13:00:42 +00:00
Alexander Kabaev
3f34977614 Trim trailing whitespace. 2005-12-28 17:13:31 +00:00
Pawel Jakub Dawidek
619f284195 In realloc(9), determine size of the original block based on
UMA_SLAB_MALLOC flag.
In some circumstances (I observed it when I was doing a lot of reallocs)
UMA_SLAB_MALLOC can be set even if us_keg != NULL.

If this is the case we have wonderful, silent data corruption, because less
data is copied to the newly allocated region than should be.

I'm not sure when this bug was introduced, it could be there undetected
for years now, as we don't have a lot of realloc(9) consumers and it was
hard to reproduce it...
...but what I know for sure, is that I don't want to know who introduce
the bug:) It took me two/three days to track it down (of course most of
the time I was looking for the bug in my own code).
2005-12-28 01:53:13 +00:00
David Xu
9f8eb3cb52 Use variable i instead of variable cpus as an index to get correct kseq. 2005-12-27 12:02:03 +00:00
Maxim Sobolev
d49b21093c Fix breakage introduced in the previous commit. 2005-12-26 22:32:52 +00:00
Maxim Sobolev
900b28f9f6 Remove kern.elf32.can_exec_dyn sysctl. Instead extend Brandinfo structure
with flags bitfield and set BI_CAN_EXEC_DYN flag for all brands that usually
allow executing elf dynamic binaries (aka shared libraries). When it is
requested to execute ET_DYN elf image check if this flag is on after we
know the elf brand allowing execution if so.

PR:		kern/87615
Submitted by:	Marcin Koziej <creep@desk.pl>
2005-12-26 21:23:57 +00:00
Alan Cox
60bb39431a Maintain the lock on the vnode for most of exec_elfN_imgact().
Specifically, it is required for the I/O that may be performed by
elfN_load_section().

Avoid an obscure deadlock in the a.out, elf, and gzip image
activators.  Add a comment describing why the deadlock does not occur
in the common case and how it might occur in less usual circumstances.

Eliminate an unused variable from exec_aout_imgact().

In collaboration with: tegge
2005-12-24 04:57:50 +00:00
David Xu
d7bc12b096 Avoid kernel panic when attaching a process which may not be stopped
by debugger, e.g process is dumping core. Only access p_xthread if
P_STOPPED_TRACE is set, this means thread is ready to exchange signal
with debugger, print a warning if P_STOPPED_TRACE is not set due to
some bugs in other code, if there is.

The patch has been tested by Anish Mistry mistry.7 at osu dot edu, and
is slightly adjusted.
2005-12-24 02:59:29 +00:00
Jeff Roberson
49bdcff518 - Remove and unused include.
Submitted by:	Antoine Brodin <antoine.brodin@laposte.net>
2005-12-23 21:32:40 +00:00
Poul-Henning Kamp
25f6e35a05 Regenerate sysent with new abort2 system call.
Implement abort2(const char *reason, int narg, void **args);

Submitted by:	"Wojciech A. Koszek" <dunstan@freebsd.czest.pl>
2005-12-23 11:58:42 +00:00
Poul-Henning Kamp
5a56b437ec Add abort2() systemcall. 2005-12-23 11:54:11 +00:00
Poul-Henning Kamp
49091c48d5 Make sbuf_copyin() return the number of bytes copied on success.
Submitted by:	"Wojciech A. Koszek" <dunstan@freebsd.czest.pl>
2005-12-23 11:49:53 +00:00
Scott Long
d2a401cb70 Create the taskqueue_fast handler with INTR_MPSAFE so that it doesn't run
with Giant.

MFC After: 3 days
2005-12-23 06:18:33 +00:00
John Baldwin
b439e431bf Tweak how the MD code calls the fooclock() methods some. Instead of
passing a pointer to an opaque clockframe structure and requiring the
MD code to supply CLKF_FOO() macros to extract needed values out of the
opaque structure, just pass the needed values directly.  In practice this
means passing the pair (usermode, pc) to hardclock() and profclock() and
passing the boolean (usermode) to hardclock_cpu() and hardclock_process().
Other details:
- Axe clockframe and CLKF_FOO() macros on all architectures.  Basically,
  all the archs were taking a trapframe and converting it into a clockframe
  one way or another.  Now they can just extract the PC and usermode values
  directly out of the trapframe and pass it to fooclock().
- Renamed hardclock_process() to hardclock_cpu() as the latter is more
  accurate.
- On Alpha, we now run profclock() at hz (profhz == hz) rather than at
  the slower stathz.
- On Alpha, for the TurboLaser machines that don't have an 8254
  timecounter, call hardclock() directly.  This removes an extra
  conditional check from every clock interrupt on Alpha on the BSP.
  There is probably room for even further pruning here by changing Alpha
  to use the simplified timecounter we use on x86 with the lapic timer
  since we don't get interrupts from the 8254 on Alpha anyway.
- On x86, clkintr() shouldn't ever be called now unless using_lapic_timer
  is false, so add a KASSERT() to that affect and remove a condition
  to slightly optimize the non-lapic case.
- Change prototypeof  arm_handler_execute() so that it's first arg is a
  trapframe pointer rather than a void pointer for clarity.
- Use KCOUNT macro in profclock() to lookup the kernel profiling bucket.

Tested on:	alpha, amd64, arm, i386, ia64, sparc64
Reviewed by:	bde (mostly)
2005-12-22 22:16:09 +00:00
Alan Cox
373d1a3f8c Maintain the vnode lock throughout elfN_load_file() rather than releasing
it and reacquiring it in vrele().  Consequently, there is no reason to
increase the reference count on the vm object caching the file's pages.
Reviewed by: tegge

Eliminate unused parameters to elfN_load_file().
2005-12-21 18:58:40 +00:00
Alan Cox
ff6f03c7cd Eliminate an unneeded (vm_prot_t) parameter from two functions. Eliminate
unnecessary uses of a local variable.

Reviewed by: tegge
2005-12-20 23:42:18 +00:00
Pawel Jakub Dawidek
c505fe7a0f Reduce Giant scope a bit, as fdrop() is believed to be MPSAFE.
The purpose of this change is consistency (not performance improvement:)),
as it was hard to tell if fdrop() is MPSAFE or not when I saw it sometimes
under the Giant and sometimes without it.

Glanced at by:	ssouhlal, kan
2005-12-20 00:49:59 +00:00
Pawel Jakub Dawidek
ade9b797a0 vfs_mount_alloc() always returns 0, but what we really want is newly
allocated 'struct mount *' pointer, so simplify code a bit and return
the pointer directly.

Reviewed by:	ssouhlal
2005-12-20 00:43:51 +00:00
Pawel Jakub Dawidek
003ba8a000 Use 'td' instead of 'curthread'. 2005-12-19 16:27:13 +00:00
David Xu
a1d4fe69d2 Fix a bug in slice calculation code, current code uses hz but
sched_clock() is called by state clock.

Submitted by: taku at tackymt dot homeip dot net
2005-12-19 08:26:09 +00:00
Nate Lawson
bd6b217753 Remove the KTR for hardclock completely. It seems to not be useful.
Requested by:	jhb
2005-12-18 18:11:55 +00:00
Nate Lawson
1335c4df32 Restore KTR_CRITICAL but conditionally compile it in as KTR_SCHED.
Requested by:	scottl, jhb
2005-12-18 18:10:57 +00:00
Marcel Moolenaar
757686b115 Make our ELF64 type definitions match standards. In particular this
means:
o  Remove Elf64_Quarter,
o  Redefine Elf64_Half to be 16-bit,
o  Redefine Elf64_Word to be 32-bit,
o  Add Elf64_Xword and Elf64_Sxword for 64-bit entities,
o  Use Elf_Size in MI code to abstract the difference between
   Elf32_Word and Elf64_Word.
o  Add Elf_Ssize as the signed counterpart of Elf_Size.

MFC after: 2 weeks
2005-12-18 04:52:37 +00:00
Alan Cox
044bbbb523 Correct a long-standing problem in elfN_map_insert(): In order to copy a
page to user space, the user space mapping must allow write access.

In collaboration with: tegge@
MFC after: 3 weeks
2005-12-17 19:40:47 +00:00
Nate Lawson
8615fd8696 Clean up unused or poorly utilized KTR values. Remove KTR_FS, KTR_KGDB,
and KTR_IO as they were never used.  Remove KTR_CLK since it was only
used for hardclock firing and use KTR_INTR there instead.  Remove
KTR_CRITICAL since it was only used for crit enter/exit and use
KTR_CONTENTION instead.
2005-12-17 03:57:10 +00:00
John Baldwin
5c8b444153 - Use uintfptr_t rather than int for the kernel profiling index (though it
really should be a fptrdiff_t if we had that) in profclock().
- Don't try to profile kernel pc's that are >= the kernel lowpc to avoid
  underflows when computing a profiling index.
- Use the PC_TO_I() macro to compute the kernel profiling index rather than
  doing it inline.

Discussed with:	bde
2005-12-16 22:11:52 +00:00
John Baldwin
cb49fcd145 Change the addupc_*() functions to use the uintfptr_t type for pc rather
than uintptr_t as that is technically more correct.
2005-12-16 22:08:32 +00:00
Alan Cox
584716b08a Style: The second argument to vm_map_find() should be NULL instead of 0. 2005-12-16 19:14:25 +00:00
Alan Cox
da61b9a69e Use sf_buf_alloc() instead of vm_map_find() on exec_map to create the
ephemeral mappings that are used as the source for three copy
operations from kernel space to user space.  There are two reasons for
making this change: (1) Under heavy load exec_map can fill up causing
vm_map_find() to fail.  When it fails, the nascent process is aborted
(SIGABRT).  Whereas, this reimplementation using sf_buf_alloc()
sleeps.  (2) Although it is possible to sleep on vm_map_find()'s
failure until address space becomes available (see kmem_alloc_wait()),
using sf_buf_alloc() is faster.  Furthermore, the reimplementation
uses a CPU private mapping, avoiding a TLB shootdown on
multiprocessors.

Problem uncovered by: kris@
Reviewed by: tegge@
MFC after: 3 weeks
2005-12-16 18:34:14 +00:00
Xin LI
6ba9ec2d09 In pipe_write(): when uiomove() fails, do not spin on it forever.
Submitted by:	Kostik Belousov <kostikbel at gmail.com> on -current@
Message-ID:	<20051216151016.GE84442@deviant.zoral.local>
MFC After:	3 weeks
2005-12-16 18:32:39 +00:00
David Xu
03f70aec67 Replace selwakeuppri with selwakeup, let scheduler figure out
appropriate thread priority.
2005-12-16 15:01:16 +00:00
Ed Maste
63e6f39011 When using m_dup(9) to copy more than MHLEN bytes of data, don't create an
mbuf chain that starts with a cluster containing just MHLEN bytes.  This
happened because m_dup called m_get or m_getcl depending on the amount of
data to copy, but then always set the size available in the first mbuf to
MHLEN.

Submitted by:	Matt Koivisto <mkoivisto at sandvine dot com>
Approved by:	jmg
Silence from:	rwatson (mentor)
2005-12-14 23:34:26 +00:00
Maxime Henrion
e59898ff36 Fix a bunch of SYSCTL_INT() that should have been SYSCTL_ULONG() to
match the type of the variable they are exporting.

Spotted by:	Thomas Hurst <tom@hur.st>
MFC after:	3 days
2005-12-14 22:27:48 +00:00
Dag-Erling Smørgrav
0430a5e289 Eradicate caddr_t from the VFS API. 2005-12-14 00:49:52 +00:00
John Baldwin
d272fe53a4 Add a new 'show lock' command to ddb. If the argument has a valid lock
class, then it displays various information about the lock and calls a
new function pointer in lock_class (lc_ddb_show) to dump class-specific
information about the lock as well (such as the owner of a mutex or
xlock'ed sx lock).  This is easier than staring at hex dumps of locks to
figure out who owns the lock, etc.  Note that extending lock_class doesn't
affect the ABI for any kernel modules as the only code that deals with
lock_class structures directly is kern_mutex.c, kern_sx.c, and witness.

MFC after:	1 week
2005-12-13 23:14:35 +00:00
David Xu
dd1a6f53ac Stop fiddling thread priority with msleep, eliminating unnecessary
context switching. This improves performance about 30% on UP machine.
2005-12-12 05:04:56 +00:00
Craig Rodrigues
92f44a3f3c Contributions from XFS for FreeBSD project:
- Implement cv_wait_unlock() method which has semantics compatible
  with the sv_wait() method in IRIX.  For cv_wait_unlock(), the lock
  must be held before entering the function, but is not held when the
  function is exited.

- Implement the existing cv_wait() function in terms of cv_wait_unlock().

Submitted by:	kan
Feedback from:	jhb, trhodes, Christoph Hellwig <hch at infradead dot org>
2005-12-12 00:02:22 +00:00
Alan Cox
05406e6f33 Remove unneeded calls to pmap_remove_all(). The given page is not mapped.
Reviewed by: tegge
2005-12-11 22:06:57 +00:00
Andre Oppermann
36ae3fd3c3 Hide the 4k mbuf clusters if the normal clusters are defined to be
4k already.

This unbreaks tinderbox.

Submitted by:	ru
2005-12-10 15:21:04 +00:00
David Xu
3e70c6f047 Fix compiling warning on 64 bits system. 2005-12-09 13:16:48 +00:00
David Xu
f71a882f15 Add a sysctl to force a process to sigexit if a trap signal is
being hold by current thread or ignored by current process,
otherwise, it is very possible the thread will enter an infinite loop
and lead to an administrator's nightmare.
2005-12-09 08:29:29 +00:00
David Xu
d26b1a1fb9 Register itimers_event_hook as a kernel event handler, so I don't
have to duplicate code to call it in exec() and exit1().
2005-12-09 05:43:26 +00:00
David Xu
102178d0df Comment out mqfs_create_link. Inline some small functions. 2005-12-09 02:38:29 +00:00
David Xu
5c4745177f Now SIGCHLD is always queued. 2005-12-09 02:27:55 +00:00
David Xu
761a4d9423 Cleanup sigqueue sysctl. 2005-12-09 02:26:44 +00:00
Andre Oppermann
d5269a636b Add an API for jumbo mbuf cluster allocation and also provide
4k clusters in addition to 9k and 16k ones.

 struct mbuf *m_getjcl(int how, short type, int flags, int size)
 void *m_cljget(struct mbuf *m, int how, int size)

m_getjcl() returns an mbuf with a cluster of the specified size attached
like m_getcl() does for 2k clusters.

m_cljget() is different from m_clget() as it can allocate clusters
without attaching them to an mbuf.  In that case the return value
is the pointer to the cluster of the requested size.  If an mbuf was
specified, it gets the cluster attached to it and the return value
can be safely ignored.

For size both take MCLBYTES, MJUM4BYTES, MJUM9BYTES, MJUM16BYTES.

Reviewed by:	glebius
Tested by:	glebius
Sponsored by:	TCP/IP Optimization Fundraise 2005
2005-12-08 13:13:06 +00:00
Craig Rodrigues
d5989f64cf In devfs_first(), set mp->mnt_opt to a valid empty list of mount options
instead of leaving it NULL.  This eliminates a kernel panic
when trying to do a mount -o update of /dev.

Noticed by:	cjsp
Reviewed by:	phk
2005-12-08 04:27:53 +00:00
Craig Rodrigues
8539ca4cde Add "errmsg" to list of global mount options. 2005-12-08 04:09:29 +00:00
Craig Rodrigues
6951bea6c8 Changes imported from XFS for FreeBSD project:
- add fields to struct buf (needed by XFS)
    - 3 private fields: b_fsprivate1, b_fsprivate2, b_fsprivate3
    - b_pin_count, count of pinned buffer

- add new B_MANAGED flag
- add breada() function to initiate asynchronous I/O on read-ahead blocks.
- add bufdone_finish(), bpin(), bunpin_wait() functions

Patches provided by:	kan
Reviewed by:		phk
Silence on:		arch@
2005-12-07 03:39:08 +00:00
Alan Cox
8ad398d089 Reduce the scope of the page queues lock in exec_map_first_page(). The vm
object lock is sufficient for reading a page's PG_BUSY and busy flags.

MFC after: 1 week
2005-12-06 07:39:36 +00:00
David Xu
9da8a32aae o Turn on MPSAFE flag for mqueuefs.
o Reuse si_mqd field in siginfo_t, this also gives userland
  information about which descriptor is notified.
2005-12-06 06:22:12 +00:00
David Xu
027f760408 Fix a lock leak in childproc_continued(). 2005-12-06 05:30:13 +00:00
John Baldwin
5d2162b2f8 Tweak witness handling of lock object to shave 2 pointers off of each
lock object (and thus off of each mutex and sx lock):
- Rename the all_locks list to pending_locks and only put locks initialized
  before SI_SUB_WITNESS on the list so that the SI_SUB_WITNESS can add them
  to witness once it starts up.
- Now that pending_locks is only used during early startup, change it from
  a TAILQ to an STAILQ.  This removes a pointer from the STAILQ_ENTRY in
  struct lock_object.
- Since the pending_locks list is only used during the single-threaded
  early boot it no longer needs to be protected by a mutex, so remove
  all_mtx.
- Since the lo_list member of struct lock_object is now only used during
  early boot before witness is running, collapse lo_list and lo_witness
  into a union.  This shaves the second pointer off of struct lock_object.
- Axe lock_cur_cnt and lock_max_cnt.

With these changes, struct mtx shrinks from 36 to 28 bytes on 32-bit
platforms and from 72 to 56 bytes on 64-bit platforms.  Note that this
commit will completely and utterly destroy the kernel ABI, so no MFC.

Tested on:	alpha, amd64, i386, sparc64
2005-12-05 20:45:24 +00:00
David Xu
052ea11c71 After reading some documents, I realized SIGEV_NONE != NULL, also
fix code in mqueue_send_notification to handle SIGEV_NONE.
2005-12-05 04:41:32 +00:00
David Xu
9947b45978 Handle SIGEV_NONE, if notification is SIGEV_NONE, error status and
return status will be set, but no notification will be registered.
Increase hard limit of maxmsg to 100, so posixtestsuite ports can run.
2005-12-05 03:23:27 +00:00
Ruslan Ermilov
f4e9888107 Fix -Wundef. 2005-12-04 02:12:43 +00:00
Craig Rodrigues
1245b3433e Add "rdonly" to global_opts, and parse it in vfs_donmount().
Requested by:	rwatson
2005-12-03 12:04:20 +00:00
Craig Rodrigues
ec528a3472 - Add "rw" mount option to global_opts.
- In vfs_donmount(), parse "ro", "noro", and "rw", in order to set or
  unset the MNT_RDONLY filesystem flag.
2005-12-03 01:26:27 +00:00
David Xu
5ee2d4ac5a 1. Cleanup including.
2. Set configuration value for CTL_P1003_1B_MESSAGE_PASSING.
2005-12-02 14:09:32 +00:00
David Xu
a6de716d7e 1. Check if message priority is less than MQ_PRIO_MAX.
2. Use getnanotime instead of getnanouptime.
3. Don't free message in _mqueue_send, mqueue_send will free it.
2005-12-02 08:23:49 +00:00
David Xu
77e718f773 1. Set timer configuration values for sysconf().
2. Set overrun limit to INT_MAX, report ERANGE error if overrun will be
   greater than INT_MAX.
2005-12-01 07:56:15 +00:00
David Xu
b51d237a67 set signal queue values for sysconf(). 2005-12-01 00:25:50 +00:00
David Xu
b2f92ef96b Last step to make mq_notify conform to POSIX standard, If the process
has successfully attached a notification request to the message queue
via a queue descriptor, file closing should remove the attachment.
2005-11-30 05:12:03 +00:00
John Baldwin
398293a8de Fix snderr() to not leak the socket buffer lock if an error occurs in
sosend().  Robert accidentally changed the snderr() macro to jump to the
out label which assumes the lock is already released rather than the
release label which drops the lock in his previous change to sosend().
This should fix the recent panics about returning from write(2) with the
socket lock held and the most recent LOR on current@.
2005-11-29 23:07:14 +00:00
Robert Watson
66dd8a6f99 Move zero copy statistics structure before sosend_copyin().
MFC after:	1 month
Reported by:	tinderbox, sam
2005-11-28 21:45:36 +00:00
John Baldwin
ef627e7da0 When checking to see if a process has exceeded its time limit, flag the
process as over the limit when its time is >= to the limit rather than >
the limit.  Technically, if p->p_rux.rux_runtime.sec == p->p_pcpulimit
and p->p_rux.rux_runtime.frac == 0, the process hasn't exceeded the limit
yet.  However, having the fraction exactly equal to 0 is rather rare, and
it is not worth the overhead to handle that edge case.  With just the >
comparison, the process would have to exceed its limit by almost a second
before it was killed.

PR:		kern/83192
Submitted by:	Maciej Zawadzinski mzawadzinski at gmail dot com
Reviewed by:	bde
MFC after:	1 week
2005-11-28 19:09:08 +00:00
Robert Watson
a725629cf8 Break out functionality in sosend() responsible for building mbuf
chains and copying in mbufs from the body of the send logic, creating
a new function sosend_copyin().  This changes makes sosend() almost
readable, and will allow the same logic to be used by tailored socket
send routines.

MFC after:	1 month
Reviewed by:	andre, glebius
2005-11-28 18:09:03 +00:00
David Xu
f72b11a40c Fix a stupid compiler warining, remove a redundant line. 2005-11-27 22:59:47 +00:00
David Xu
47bf2cf9fe Change filesystem name from mqueue to mqueuefs for style consistent.
Suggested by: rwatson
2005-11-27 08:30:12 +00:00
David Xu
6829585c43 Regen. 2005-11-27 01:23:31 +00:00
David Xu
94e1294b06 Don't use OpenBSD syscall numbers, instead, use new syscall numbers
for POSIX message queue.

Suggested by: rwatson
2005-11-27 01:13:00 +00:00
Robert Watson
5e758b9561 Add several aliases for existing clockid_t names to indicate that the
application wishes to request high precision time stamps be returned:

Alias                           Existing

CLOCK_REALTIME_PRECISE          CLOCK_REALTIME
CLOCK_MONOTONIC_PRECISE         CLOCK_MONOTONIC
CLOCK_UPTIME_PRECISE            CLOCK_UPTIME

Add experimental low-precision clockid_t names corresponding to these
clocks, but implemented using cached timestamps in kernel rather than
a full time counter query.  This offers a minimum update rate of 1/HZ,
but in practice will often be more frequent due to the frequency of
time stamping in the kernel:

New clockid_t name              Approximates existing clockid_t

CLOCK_REALTIME_FAST             CLOCK_REALTIME
CLOCK_MONOTONIC_FAST            CLOCK_MONOTONIC
CLOCK_UPTIME_FAST               CLOCK_UPTIME

Add one additional new clockid_t, CLOCK_SECOND, which returns the
current second without performing a full time counter query or cache
lookup overhead to make sure the cached timestamp is stable.  This is
intended to support very low granularity consumers, such as time(3).

The names, visibility, and implementation of the above are subject
to change, and will not be MFC'd any time soon.  The goal is to
expose lower quality time measurement to applications willing to
sacrifice accuracy in performance critical paths, such as when taking
time stamps for the purpose of rescheduling select() and poll()
timeouts.  Future changes might include retrofitting the time counter
infrastructure to allow the "fast" time query mechanisms to use a
different time counter, rather than a cached time counter (i.e.,
TSC).

NOTE: With different underlying time mechanisms exposed, using
different time query mechanisms in the same application may result in
relative non-monoticity or the appearance of clock stalling for a
single clockid_t, as a cached time stamp queried after a precision
time stamp lookup may be "before" the time returned by the earlier
live time counter query.
2005-11-27 00:55:18 +00:00
David Xu
7023331e59 Regen. 2005-11-26 12:45:22 +00:00
David Xu
655291f2ae Bring in experimental kernel support for POSIX message queue. 2005-11-26 12:42:35 +00:00
Craig Rodrigues
5e6b93a014 In nmount() and vfs_donmount(), do not strcmp() the options in the iovec
directly.  We need to copyin() the strings in the iovec before
we can strcmp() them.  Also, when we want to send the errmsg back
to userspace, we need to copyout()/copystr() the string.

Add a small helper function vfs_getopt_pos() which takes in the
name of an option, and returns the array index of the name in the iovec,
or -1 if not found.  This allows us to locate an option in
the iovec without actually manipulating the iovec members. directly via
strcmp().

Noticed by:	kris on sparc64
2005-11-23 20:51:15 +00:00
John Polstra
ba3612cd5c Fix a bug in the loop in sonewconn that makes room on the incomplete
connection queue for a new connection.  It was removing connections
from the wrong list.

Submitted by:	Paul Mikesell
Sponsored by:	Isilon Systems
MFC after:	1 week
2005-11-22 01:55:29 +00:00
Marcel Moolenaar
60b7823989 Fix bug introduced in revision 1.186:
When all file systems have a time stamp of zero, which is the case
for example when the root file system is on a read-only medium, we
ended up not calling inittodr() at all.  A potential uncleanliness
existed as well. If multiple file systems had a non-zero time stamp,
we would call inittodr() multiple times. While this should not be
harmful, it's definitely not ideal.
Fix both issues by iterating over the mounted file systems to find
the largest time stamp and call inittodr() exactly once with that
time stamp. This could of course be a zero time stamp if none of the
mounted file systems have a non-zero time stamp. In that case the
annoying errors mentioned in the commit log for revision 1.186 still
haven't been avoided. The bottom line is that inittodr() should not
complain when it gets a time base of zero. At the time of this
commit only alpha seems to have that problem.

Reported by: Dario Freni (saturnero at freesbie dot org)
MFC after: 1 week
2005-11-19 21:51:45 +00:00
Craig Rodrigues
425e5b6268 Parse more mount options in vfs_donmount(), before vfs_domount()
is called.  It looks like there are lots of different mount flags checked
in vfs_domount(), so we need to do the parsing for these particular
mount flags earlier on.  The new flags parsed are:
async, force, multilabel, noasync, noatime, noclusterr, noclusterw,
noexec, nosuid, nosymfollow, snapshot, suiddir, sync, union.

Existing code which uses mount() to mount UFS filesystems is not
affected, but new code which uses nmount() to mount UFS filesystems
should behave better.
2005-11-19 21:22:21 +00:00
Andre Oppermann
5eefd88949 Add CLOCK_UPTIME to clock_gettime(2) reporting the current
uptime measured in SI seconds.

Sponsored by:	TCP/IP Optimization Fundraise 2005
2005-11-18 16:51:13 +00:00
Craig Rodrigues
8fd860cfa1 In vfs_nmount(), check to see if "update" mount option was passed
in, and if so, set MNT_UPDATE filesystem flag.
vfs_nmount() calls vfs_domount(), and there is special logic
inside vfs_domount() if MNT_UPDATE is set.  This is very important
when we want to do an update mount of the root filesystem, using nmount().
2005-11-18 01:31:10 +00:00
Pyun YongHyeon
dc22aef2ae Prefer NULL to 0.
Add missing lock/unlock in sysctl handler.
Protect accessing NULL pointer when resource allocation was failed.
style(9)

Reviewed by:	scottl
MFC after:	1 week
2005-11-17 08:56:21 +00:00
Olivier Houchard
481a1fe19e Add a new sysctl, kern.elf[32|64].can_exec_dyn. When set to 1, one can
execute a ET_DYN binary (shared object).
This does not make much sense, but some linux scripts expect to be able to
execute /lib/ld-linux.so.2 (ldd comes to mind).
The sysctl defaults to 0.

MFC after:	3 days
2005-11-14 22:24:00 +00:00
Robert Watson
c5c9bd5b72 In ktr_getrequest(), acquire ktrace_mtx earlier -- while the race
currently present is minor and offers no real semantic issues, it also
doesn't make sense since an earlier lockless check has already
occurred.  Also hold the mutex longer, over a manipulation of
per-process ktrace state, which requires synchronization.

MFC after:	1 month
Pointed out by:	jhb
2005-11-14 19:30:09 +00:00
Robert Watson
2c255e9df6 Moderate rewrite of kernel ktrace code to attempt to generally improve
reliability when tracing fast-moving processes or writing traces to
slow file systems by avoiding unbounded queueuing and dropped records.
Record loss was previously possible when the global pool of records
become depleted as a result of record generation outstripping record
commit, which occurred quickly in many common situations.

These changes partially restore the 4.x model of committing ktrace
records at the point of trace generation (synchronous), but maintain
the 5.x deferred record commit behavior (asynchronous) for situations
where entering VFS and sleeping is not possible (i.e., in the
scheduler).  Records are now queued per-process as opposed to
globally, with processes responsible for committing records from their
own context as required.

- Eliminate the ktrace worker thread and global record queue, as they
  are no longer used.  Keep the global free record list, as records
  are still used.

- Add a per-process record queue, which will hold any asynchronously
  generated records, such as from context switches.  This replaces the
  global queue as the place to submit asynchronous records to.

- When a record is committed asynchronously, simply queue it to the
  process.

- When a record is committed synchronously, first drain any pending
  per-process records in order to maintain ordering as best we can.
  Currently ordering between competing threads is provided via a global
  ktrace_sx, but a per-process flag or lock may be desirable in the
  future.

- When a process returns to user space following a system call, trap,
  signal delivery, etc, flush any pending records.

- When a process exits, flush any pending records.

- Assert on process tear-down that there are no pending records.

- Slightly abstract the notion of being "in ktrace", which is used to
  prevent the recursive generation of records, as well as generating
  traces for ktrace events.

Future work here might look at changing the set of events marked for
synchronous and asynchronous record generation, re-balancing queue
depth, timeliness of commit to disk, and so on.  I.e., performing a
drain every (n) records.

MFC after:	1 month
Discussed with:	jhb
Requested by:	Marc Olzheim <marcolz at stack dot nl>
2005-11-13 13:27:44 +00:00
Craig Rodrigues
d5328381f1 style(9) cleanups.
Spotted by:	njl, bde
2005-11-12 14:41:44 +00:00
Robert Watson
71909edec8 Significant refactoring of the accounting code to improve locking and VFS
happiness, as well as correct other bugs:

- Replace notion of current and saved accounting credential/vnode with a
  single credential/vnode and an acct_suspended flag.  This simplifies the
  accounting logic substantially.

- Replace acct_mtx with acct_sx, a sleepable lock held exclusively during
  reconfiguration and space polling, but shared during log entry
  generation.  This avoids holding a mutex over sleepable VFS operations.

- Hold the sx lock over the duration of the I/O so that the vnode I/O
  cannot occur after vnode close, which could occur previously if
  accounting was disabled as a process exited.

- Write the accounting log entry with Giant conditionally acquired based
  on the file system where the log is stored.  Previously, the accounting
  code relied on the caller acquiring Giant.

- Acquire Giant conditionally in the accounting callout based on the file
  system where the accounting log is stored.  Run the callout MPSAFE.

- Expose acct_suspended via a read-only sysctl so it is possibly to
  programmatically determine whether accounting is suspended or not without
  attempting to parse logs.

- Check both acct_vp and acct_suspended lock-free before entering the
  accounting sx lock in acct().

- When accounting is disabled due to a VBAD vnode (i.e., forceable unmount),
  generate a log message indicating accounting has been disabled.

- Correct a long-standing bug in how free space is calculated and compared
  to the required space: generate and compare signed results, not unsigned
  results, or negative free space will cause accounting to not be suspended
  when required, or worse, incorrectly resumed once negative free space is
  reached.

MFC after:	2 weeks
2005-11-12 10:45:13 +00:00
David Xu
413cf3bbe1 Make sure only remove one signal by debugger. 2005-11-12 04:22:16 +00:00
Robert Watson
a0ec558af0 Correct a number of serious and closely related bugs in the UNIX domain
socket file descriptor garbage collection code, which is intended to
detect and clear cycles of orphaned file descriptors that are "in-flight"
in a socket when that socket is closed before they are received.  The
algorithm present was both run at poor times (resulting in recursion and
reentrance), and also buggy in the presence of parallelism.  In order to
fix these problems, make the following changes:

- When there are in-flight sockets and a UNIX domain socket is destroyed,
  asynchronously schedule the garbage collector, rather than running it
  synchronously in the current context.  This avoids lock order issues
  when the garbage collection code reenters the UNIX domain socket code,
  avoiding lock order reversals, deadlocks, etc.  Run the code
  asynchronously in a task queue.

- In the garbage collector, when skipping file descriptors that have
  entered a closing state (i.e., have f_count == 0), re-test the FDEFER
  flag, and decrement unp_defer.  As file descriptors can now transition
  to a closed state, while the garbage collector is running, it is no
  longer the case that unp_defer will remain an accurate count of
  deferred sockets in the mark portion of the GC algorithm.  Otherwise,
  the garbage collector will loop waiting waiting for unp_defer to reach
  zero, which it will never do as it is skipping file descriptors that
  were marked in an earlier pass, but now closed.

- Acquire the UNIX domain socket subsystem lock in unp_discard() when
  modifying the unp_rights counter, or a read/write race is risked with
  other threads also manipulating the counter.

While here:

- Remove #if 0'd code regarding acquiring the socket buffer sleep lock in
  the garbage collector, this is not required as we are able to use the
  socket buffer receive lock to protect scanning the receive buffer for
  in-flight file descriptors on the socket buffer.

- Annotate that the description of the garbage collector implementation
  is increasingly inaccurate and needs to be updated.

- Add counters of the number of deferred garbage collections and recycled
  file descriptors.  This will be removed and is here temporarily for
  debugging purposes.

With these changes in place, the unp_passfd regression test now appears
to be passed consistently on UP and SMP systems for extended runs,
whereas before it hung quickly or panicked, depending on which bug was
triggered.

Reported by:	Philip Kizer <pckizer at nostrum dot com>
MFC after:	2 weeks
2005-11-10 16:06:04 +00:00
Robert Watson
742be7821c Add the f_msgcount field to the set of struct file fields printed in show
files.

MFC after:	1 week
2005-11-10 13:26:29 +00:00
Robert Watson
2be165c93e Expanet of details printed for each file descriptor to include it's
garbage collection flags.  Reformat generally to make this fit and
leave some room for future expansion.

MFC after:	1 week
2005-11-10 11:35:59 +00:00
Robert Watson
b4e507aafa Add a DDB "show files" command to list the current open file list, some
state about each open file, and identify the first process in the process
table that references the file.  This is helpful in debugging leaks of
file descriptors.

MFC after:	1 week
2005-11-10 10:42:50 +00:00
Doug White
16e35dcc39 This is a workaround for a complicated issue involving VFS cookies and devfs.
The PR and patch have the details. The ultimate fix requires architectural
changes and clarifications to the VFS API, but this will prevent the system
from panicking when someone does "ls /dev" while running in a shell under the
linuxulator.

This issue affects HEAD and RELENG_6 only.

PR:		88249
Submitted by:	"Devon H. O'Dell" <dodell@ixsystems.com>
MFC after:	3 days
2005-11-09 22:03:50 +00:00
Robert Watson
f8a9ed1fa7 Fix typo in recent comment tweak.
Submitted by:	jkim
MFC after:	1 week
2005-11-09 22:02:02 +00:00
Robert Watson
923633b4b5 In closef(), remove the assumption that there is a thread associated
with the file descriptor.  When a file descriptor is closed as a result
of garbage collecting a UNIX domain socket, the file descriptor will
not have any associated thread, so the logic to identify advisory locks
held by that thread is not appropriate.  Check the thread for NULL to
avoid this scenario.  Expand an existing comment to say a bit more about
this.

MFC after:	1 week
2005-11-09 20:54:25 +00:00
Warner Losh
5d56add2ba General consensus is that it would be even better to run this in a
thread context.  While it doesn't matter too much at the moment, in
the future we could be back in the same boat if/when more restrictions
are placed (or enforced) in a SWI.

Suggested by: njl, bde, jhb, scottl
2005-11-09 16:22:56 +00:00
John Baldwin
26b7bd707c Use intptr_t casts to convert void * <--> int to make 64-bit archs happy. 2005-11-09 15:15:59 +00:00
Ruslan Ermilov
303989a2f3 Use sparse initializers for "struct domain" and "struct protosw",
so they are easier to follow for the human being.
2005-11-09 13:29:16 +00:00
David Xu
f4d8522334 WIFxxx macros requires an int type but p_xstat is short, convert it
to int before using the macros.

Bug reported by : Pyun YongHyeon pyunyh at gmail dot com
2005-11-09 07:58:16 +00:00
Warner Losh
161604d863 Kick off the suspend sequence from the keyboard in a SWI rather than
in the hardware interrupt context (even if it is likely just an
ithread).  We don't document that suspend/resume routines are run from
such a context and some of the things that happen in those routines
aren't interrupt safe.  Since there's no real need to run from that
context, this restores assumptions that suspend routines have made.

This fixes Thierry Herbelot's 'Trying to sleep while sleeping is
prohibited' problem.
2005-11-09 07:32:01 +00:00
Warner Losh
2002eaadb7 Clarify panic message, I parsed the old one 'trying to sleep while sleeping' 2005-11-09 07:28:52 +00:00
Craig Rodrigues
4560dfb5b1 For nmount(), allow a text string error message to be propagated back
to user-space if a parameter named "errmsg" is passed into the iovec.
Used in conjunction with vfs_mount_error(), more useful error messages
than errno can be passed back to userspace when mounting a filesystem
fails.

Discussed with:		phk, pjd
2005-11-09 02:26:38 +00:00
David Xu
323fe56580 In aio_waitcomplete, do not return EAGAIN if no other threads
have started aio, instead, initialize aio management structure
if it hasn't been done, the reason to adjust this behavior is
to make it a bit friendly for threaded program, consider two
threads, one submits aio_write, and another just calls
aio_waitcomplete to wait any I/O to be completed and recycle the
aio requests, before submitter doing any I/O, the recycler wants
to wait in kernel. This also fixes inconsistency with other aio
syscalls.
2005-11-08 23:48:32 +00:00
David Xu
c20cedbfc9 Make sure pending SIGCHLD is removed from previous parent when process
is attached or detached.
2005-11-08 23:28:12 +00:00
John Baldwin
2a522eb9d3 Various and sundry cleanups:
- Use curthread for calls to knlist_delete() and add a big comment
  explaining why as well as appropriate assertions.
- Use TAILQ_FOREACH and TAILQ_FOREACH_SAFE instead of handrolling them.
- Use fget() family of functions to lookup file objects instead of
  grovelling around in file descriptor tables.
- Destroy the aio_freeproc mutex if we are unloaded.

Tested on:	i386
2005-11-08 17:43:05 +00:00
Christian S.J. Peron
576068804d Giant clean up for exit(2)
-Change unconditional aquisition of Giant to only pickup Giant if the vnode
 for the controlling tty resides on a non-mpsafe file system.
-Pickup Giant around executable vnode reference counting operations only if
 the executable resides on a non-mpsafe file system.
-If this process is being traced, pickup Giant for trace file reference count
 operations only if it resides on a non-mpsafe file system.

Discussed with:	jhb
Tested by:	kris
2005-11-08 17:11:03 +00:00
David Xu
ebceaf6dc7 Add support for queueing SIGCHLD same as other UNIX systems did.
For each child process whose status has been changed, a SIGCHLD instance
is queued, if the signal is stilling pending, and process changed status
several times, signal information is updated to reflect latest process
status. If wait() returns because the status of a child process is
available, pending SIGCHLD signal associated with the child process is
discarded. Any other pending SIGCHLD signals remain pending.

The signal information is allocated at the same time when proc structure
is allocated, if process signal queue is fully filled or there is a memory
shortage, it can still send the signal to process.

There is a booting time tunable kern.sigqueue.queue_sigchild which
can control the behavior, setting it to zero disables the SIGCHLD queueing
feature, the tunable will be removed if the function is proved that it is
stable enough.

Tested on: i386 (SMP and UP)
2005-11-08 09:09:26 +00:00
Craig Rodrigues
84e69560b6 Add utility function to propagate mount errors as text string messages.
Discussed with:		phk
2005-11-08 04:13:39 +00:00
Gleb Smirnoff
49d46b616e Fix panic string in last revision. 2005-11-06 16:47:59 +00:00
Andre Oppermann
cd5bb63b3d Free only those mbuf+clusters back to the packet zone that were allocated
from there.  All others get broken up and free'd individually to the mbuf
and cluster zones.

The packet zone is a secondary zone to the mbuf zone.  There is currently
a limitation in UMA which prevents decreasing the packet zone stock when
the mbuf and cluster zone are drained and all their members are part of
packets.  When this is fixed this change may be reverted.
2005-11-05 19:43:55 +00:00
Andre Oppermann
a5f7708723 Fix a logic error introduced with mandatory mbuf cluster refcounting and
freeing of mbufs+clusters back to the packet zone.
2005-11-04 17:20:53 +00:00
David Xu
8f0371f19d Fix name compatible problem with POSIX standard. the sigval_ptr and
sigval_int really should be sival_ptr and sival_int.
Also sigev_notify_function accepts a union sigval value but not a
pointer.
2005-11-04 09:41:00 +00:00
John Baldwin
091e8307d0 Add stoppcbs[] arrays on Alpha and sparc64 and have each CPU save its
current context in the IPI_STOP handler so that we can get accurate stack
traces of threads on other CPUs on these two archs like we do now on i386
and amd64.

Tested on:	alpha, sparc64
2005-11-03 21:08:20 +00:00
John Baldwin
55de4dcab6 Fix 'show allpcpu' ddb command on non-x86. CPU IDs are in the range 0 ..
mp_maxid, not 0 .. mp_maxid - 1.  The result was that the highest numbered
CPU was skipped on Alpha and sparc64.

MFC after:	1 week
2005-11-03 21:06:29 +00:00
Pawel Jakub Dawidek
2a143d5bf5 Detect memory leaks when memory type is being destroyed.
This is very helpful for detecting kernel modules memory leaks on unload.

Discussed and reviewed by:	rwatson
2005-11-03 13:48:59 +00:00
David Xu
4c0fb2cfff Support sending realtime signal information via signal queue, realtime
signal memory is pre-allocated, so kernel can always notify user code.
2005-11-03 05:25:26 +00:00
David Xu
6d7b314b14 Cleanup some signal interfaces. Now the tdsignal function accepts
both proc pointer and thread pointer, if thread pointer is NULL,
tdsignal automatically finds a thread, otherwise it sends signal
to given thread.
Add utility function psignal_event to send a realtime sigevent
to a process according to the delivery requirement specified in
struct sigevent.
2005-11-03 04:49:16 +00:00
David Xu
d1f16b4d2e Oops, don't change tdsignal call. 2005-11-03 01:38:49 +00:00
David Xu
44355392b4 Add thread_find() function to search a thread by lwpid. 2005-11-03 01:34:08 +00:00
Paul Saab
1471f287e1 Calling setrlimit from 32bit apps could potentially increase certain
limits beyond what should be capiable in a 32bit process, so we
must fixup the limits.

Reviewed by:	jhb
2005-11-02 21:18:07 +00:00
Andre Oppermann
56a4e45ab3 Mandatory mbuf cluster reference counting and groundwork for UMA
based jumbo 9k and jumbo 16k cluster support.

All mbuf's with external storage attached are mandatory reference
counted.  For clusters and jumbo clusters UMA provides the refcnt
storage directly.  It does not have to be separatly allocated.  Any
other type of external storage gets its own refcnt allocated from
an UMA mbuf refcnt zone instead of normal kernel malloc.

The refcount API MEXT_ADD_REF() and MEXT_REM_REF() is no longer
publically accessible.  The proper m_* functions have to be used.

mb_ctor_clust() and mb_dtor_clust() both handle normal 2K as well
as 9k and 16k clusters.

Clusters and jumbo clusters may be obtained without attaching it
immideatly to an mbuf.  This is for high performance cluster
allocation in network drivers where mbufs are attached after the
cluster has been filled.

Tested by:	rwatson
Sponsored by:	TCP/IP Optimizations Fundraise 2005
2005-11-02 16:20:36 +00:00
Andre Oppermann
34333b16cd Retire MT_HEADER mbuf type and change its users to use MT_DATA.
Having an additional MT_HEADER mbuf type is superfluous and redundant
as nothing depends on it.  It only adds a layer of confusion.  The
distinction between header mbuf's and data mbuf's is solely done
through the m->m_flags M_PKTHDR flag.

Non-native code is not changed in this commit.  For compatibility
MT_HEADER is mapped to MT_DATA.

Sponsored by:	TCP/IP Optimization Fundraise 2005
2005-11-02 13:46:32 +00:00
John Baldwin
68a17869c1 Push down Giant into fdfree() and remove it from two of the callers.
Other callers such as some rfork() cases weren't locking Giant anyway.

Reviewed by:	csjp
MFC after:	1 week
2005-11-01 17:13:05 +00:00
Robert Watson
2bdeb3f9f2 Reuse ktr_unused field in ktr_header structure as ktr_tid; populate
ktr_tid as part of gathering of ktr header data for new ktrace
records.  The continued use of intptr_t is required for file layout
reasons, and cannot be changed to lwpid_t at this point.

MFC after:	1 month
Reviewed by:	davidxu
2005-11-01 14:46:37 +00:00
Robert Watson
d977a58307 Replace ktr_buffer pointer in struct ktr_header with a ktr_unused
intptr_t.  The buffer length needs to be written to disk as part
of the trace log, but the kernel pointer for the buffer does not.
Add a new ktr_buffer pointer to the kernel-only ktrace request
structure to hold that pointer.  This frees up an integer in the
ktrace record format that can be used to hold the threadid,
although older ktrace files will have a garbage ktr_buffer field
(or more accurately, a kernel pointer value).

MFC after:		2 weeks
Space requested by:	davidxu
2005-11-01 12:36:19 +00:00
Paul Saab
ecc44de7a2 Reformat socket control messages on input/output for 32bit compatibility
on 64bit systems.

Submitted by:	ps, ups
Reviewed by:	jhb
2005-10-31 21:09:56 +00:00
John Baldwin
f6494f2e13 Check to see if the hash table is present in link_elf_lookup_symbol()
before dereferencing it.  Certain corrupt kernel modules might not have
a valid hash table, and would cause a kernel panic when they were loaded.
Instead of panic'ing, the kernel now prints out a warning that it is
missing the symbol hash table.

Tested by:	Benjamin Close Benjamin dot Close at clearchain dot com
MFC after:	1 week
2005-10-31 19:17:32 +00:00
Robert Watson
5bb84bc84b Normalize a significant number of kernel malloc type names:
- Prefer '_' to ' ', as it results in more easily parsed results in
  memory monitoring tools such as vmstat.

- Remove punctuation that is incompatible with using memory type names
  as file names, such as '/' characters.

- Disambiguate some collisions by adding subsystem prefixes to some
  memory types.

- Generally prefer lower case to upper case.

- If the same type is defined in multiple architecture directories,
  attempt to use the same name in additional cases.

Not all instances were caught in this change, so more work is required to
finish this conversion.  Similar changes are required for UMA zone names.
2005-10-31 15:41:29 +00:00
Robert Watson
d374e81efd Push the assignment of a new or updated so_qlimit from solisten()
following the protocol pru_listen() call to solisten_proto(), so
that it occurs under the socket lock acquisition that also sets
SO_ACCEPTCONN.  This requires passing the new backlog parameter
to the protocol, which also allows the protocol to be aware of
changes in queue limit should it wish to do something about the
new queue limit.  This continues a move towards the socket layer
acting as a library for the protocol.

Bump __FreeBSD_version due to a change in the in-kernel protocol
interface.  This change has been tested with IPv4 and UNIX domain
sockets, but not other protocols.
2005-10-30 19:44:40 +00:00
David Xu
56c06c4b67 Let itimer store itimerspec instead of itimerval, so I don't have to
convert to or from timeval frequently.

Introduce function itimer_accept() to ack a timer signal in signal
acceptance code, this allows us to return more fresh overrun counter
than at signal generating time. while POSIX says:
"the value returned by timer_getoverrun() shall apply to the most
recent expiration signal delivery or acceptance for the timer,.."
I prefer returning it at acceptance time.

Introduce SIGEV_THREAD_ID notification mode, it is used by thread
libary to request kernel to deliver signal to a specified thread,
and in turn, the thread library may use the mechanism to implement
SIGEV_THREAD which is required by POSIX.

Timer signal is managed by timer code, so it can not fail even if
signal queue is full filled by sigqueue syscall.
2005-10-30 02:56:08 +00:00
David Xu
01790d850d Regen. 2005-10-30 02:14:37 +00:00
David Xu
0972628aff Fix sigevent's POSIX incompatible problem by adding member fields
sigev_notify_function and sigev_notify_attributes. AIO syscalls
use sigevent, so they have to be adjusted.

Reviewed by:	alc
2005-10-30 02:12:49 +00:00
Ed Maste
5d89e1d0af In watchdog_config enable the software watchdog iff the WD_ACTIVE flag is
set.  When watchdogd(1) is terminated intentionally it clears the bit,
which should then disable it in the kernel.

PR:		kern/74386
Submitted by:	Alex Hoff <ahoff at sandvine dot com>
Approved by:	phk, rwatson (mentor)
2005-10-27 17:22:47 +00:00
John Baldwin
2851f51eb1 Revert most of revision 1.235 and fix the problem a different way. We
can't acquire an sx lock in ttyinfo() because ttyinfo() can be called
from interrupt handlers (such as atkbd_intr()).  Instead, go back to
locking the process group while we pick a thread to display information for
and hold that lock until after we drop sched_lock to make sure the
process doesn't exit out from under us.  sched_lock ensures that the
specific thread from that process doesn't go away.  To protect against
the process exiting after we drop the proc lock but before we dereference
it to lookup the pid and p_comm in the call to ttyprintf(), we now copy
the pid and p_comm to local variables while holding the proc lock.

This problem was found by the recently added TD_NO_SLEEPING assertions for
interrupt handlers.

Tested by:	emaste
MFC after:	1 week
2005-10-27 16:47:28 +00:00
Paul Saab
53f5742d33 Allow 32bit get/setsockopt with SO_SNDTIMEO or SO_RECVTIMEO to work. 2005-10-27 04:26:35 +00:00
Peter Wemm
40c9966a37 Commit something we found useful at work at one point. Add sysctls for
debug.kdb.panic and debug.kdb.trap alongside the existing debug.kdb.enter
sysctl.  'panic' causes a panic, and 'trap' causes a page fault.  We used
these to ensure that crash dumps succeed from those two common failure
modes.  This avoids the need for creating a 'panic' kld module.
2005-10-26 22:40:07 +00:00
John Baldwin
fe486a370a Add a swi_remove() function to teardown software interrupt handlers. For
now it just calls intr_event_remove_handler(), but at some point it might
also be responsible for tearing down interrupt events created via swi_add.
2005-10-26 15:51:05 +00:00
Gleb Smirnoff
c0bc2867c1 - Fix leak of struct nlminfo on process exit.
- Fix malloc type collision, that made the above problem
  difficult to understand.

Reported by:	Vladimir Sharun <sharun ukr.net>
2005-10-26 07:18:37 +00:00
David Xu
4938faa635 do umtx_wake at userland thread exit address, so that others userland
threads can wait for a thread to exit, and safely assume that the thread
has left userland and is no longer using its userland stack, this is
necessary for pthread_join when a thread is waiting for another thread
to exit which has user customized stack, after pthread_join returns,
the userland stack can be reused for other purposes, without this change,
the joiner thread has to spin at the address to ensure the thread is really
exited.
2005-10-26 06:55:46 +00:00
John Baldwin
e0f66ef861 Reorganize the interrupt handling code a bit to make a few things cleaner
and increase flexibility to allow various different approaches to be tried
in the future.
- Split struct ithd up into two pieces.  struct intr_event holds the list
  of interrupt handlers associated with interrupt sources.
  struct intr_thread contains the data relative to an interrupt thread.
  Currently we still provide a 1:1 relationship of events to threads
  with the exception that events only have an associated thread if there
  is at least one threaded interrupt handler attached to the event.  This
  means that on x86 we no longer have 4 bazillion interrupt threads with
  no handlers.  It also means that interrupt events with only INTR_FAST
  handlers no longer have an associated thread either.
- Renamed struct intrhand to struct intr_handler to follow the struct
  intr_foo naming convention.  This did require renaming the powerpc
  MD struct intr_handler to struct ppc_intr_handler.
- INTR_FAST no longer implies INTR_EXCL on all architectures except for
  powerpc.  This means that multiple INTR_FAST handlers can attach to the
  same interrupt and that INTR_FAST and non-INTR_FAST handlers can attach
  to the same interrupt.  Sharing INTR_FAST handlers may not always be
  desirable, but having sio(4) and uhci(4) fight over an IRQ isn't fun
  either.  Drivers can always still use INTR_EXCL to ask for an interrupt
  exclusively.  The way this sharing works is that when an interrupt
  comes in, all the INTR_FAST handlers are executed first, and if any
  threaded handlers exist, the interrupt thread is scheduled afterwards.
  This type of layout also makes it possible to investigate using interrupt
  filters ala OS X where the filter determines whether or not its companion
  threaded handler should run.
- Aside from the INTR_FAST changes above, the impact on MD interrupt code
  is mostly just 's/ithread/intr_event/'.
- A new MI ddb command 'show intrs' walks the list of interrupt events
  dumping their state.  It also has a '/v' verbose switch which dumps
  info about all of the handlers attached to each event.
- We currently don't destroy an interrupt thread when the last threaded
  handler is removed because it would suck for things like ppbus(8)'s
  braindead behavior.  The code is present, though, it is just under
  #if 0 for now.
- Move the code to actually execute the threaded handlers for an interrrupt
  event into a separate function so that ithread_loop() becomes more
  readable.  Previously this code was all in the middle of ithread_loop()
  and indented halfway across the screen.
- Made struct intr_thread private to kern_intr.c and replaced td_ithd
  with a thread private flag TDP_ITHREAD.
- In statclock, check curthread against idlethread directly rather than
  curthread's proc against idlethread's proc. (Not really related to intr
  changes)

Tested on:	alpha, amd64, i386, sparc64
Tested on:	arm, ia64 (older version of patch by cognet and marcel)
2005-10-25 19:48:48 +00:00
John Baldwin
6caf758e7c Use shorter names for the Giant and fast taskqueues so that their names
actually fit.
2005-10-25 19:29:02 +00:00
John Baldwin
58553b9925 Rename the KDB_STOP_NMI kernel option to STOP_NMI and make it apply to all
IPI_STOP IPIs.
- Change the i386 and amd64 MD IPI code to send an NMI if STOP_NMI is
  enabled if an attempt is made to send an IPI_STOP IPI.  If the kernel
  option is enabled, there is also a sysctl to change the behavior at
  runtime (debug.stop_cpus_with_nmi which defaults to enabled).  This
  includes removing stop_cpus_nmi() and making ipi_nmi_selected() a
  private function for i386 and amd64.
- Fix ipi_all(), ipi_all_but_self(), and ipi_self() on i386 and amd64 to
  properly handle bitmapped IPIs as well as IPI_STOP IPIs when STOP_NMI is
  enabled.
- Fix ipi_nmi_handler() to execute the restart function on the first CPU
  that is restarted making use of atomic_readandclear() rather than
  assuming that the BSP is always included in the set of restarted CPUs.
  Also, the NMI handler didn't clear the function pointer meaning that
  subsequent stop and restarts could execute the function again.
- Define a new macro HAVE_STOPPEDPCBS on i386 and amd64 to control the use
  of stoppedpcbs[] and always enable it for i386 and amd64 instead of
  being dependent on KDB_STOP_NMI.  It works fine in both the NMI and
  non-NMI cases.
2005-10-24 21:04:19 +00:00
John Baldwin
6b1e0d75b0 - Various small whitespace and style nits.
- Use PCPU_GET(cpumask) in preference to 1 << PCPU_GET(cpuid) in a few
  places.
2005-10-24 20:31:04 +00:00
John Baldwin
f55ab99409 Document in #ifdef notnow code the actions that proc_fini would need to
take if struct procs were actually freed.
2005-10-24 20:15:23 +00:00
John Baldwin
cf23efc12a Don't panic if a spin lock is initialized that isn't in our static order
list.  Just warn about it instead.

Requested by:	scottl
MFC after:	1 day
2005-10-24 20:14:24 +00:00
John Baldwin
8d2a5b8c34 Revert previous change to this file. I accidentally committed while
fixing spelling in a comment.
2005-10-24 15:58:21 +00:00
John Baldwin
971d0ad835 Spell hierarchy correctly in comments.
Submitted by:	Wojciech A. Koszek dunstan at freebsd dot czest dot pl
2005-10-24 15:57:27 +00:00
Stephan Uphoff
198b0a3b71 Only set B_RAM (Read ahead mark) on an incore buffers if we can lock it.
This fixes a race condition caused by the unlocked write access to the
b_flags field.

MFC after:	3 days
2005-10-24 14:23:04 +00:00
David Xu
fe80a39034 Don't touch last overrun if signal was already on queue. 2005-10-23 22:59:33 +00:00
David Xu
60354683d9 Make p_itimers as a pointer, so file sys/proc.h does not need to include
sys/timers.h.
2005-10-23 12:19:08 +00:00
Alan Cox
5b5908005a Previously, nothing prevented the page that was returned by pmap_extract()
from being reclaimed before it was wired.  Use pmap_extract_and_hold()
instead of pmap_extract() and retain the hold on the page until it has been
wired.
2005-10-23 07:41:56 +00:00
David Xu
e706ee8a16 Regen for POSIX timer syscalls. 2005-10-23 04:26:10 +00:00
David Xu
86857b368d Implement POSIX timers. Current only CLOCK_REALTIME and CLOCK_MONOTONIC
clock are supported. I have plan to merge XSI timer ITIMER_REAL and other
two CPU timers into the new code, current three slots are available for
the XSI timers.
The SIGEV_THREAD notification type is not supported yet because our
sigevent struct lacks of two member fields:
sigev_notify_function
sigev_notify_attributes
I have found the sigevent is used in AIO, so I won't add the two members
unless the AIO code is adjusted.
2005-10-23 04:22:56 +00:00
David Xu
5da49fcb8a 1. Make ksiginfo_alloc and ksiginfo_free public.
2. Introduce flags KSI_EXT and KSI_INS. The flag KSI_EXT allows a ksiginfo
   to be managed by outside code, the KSI_INS indicates sigqueue_add should
   directly insert passed ksiginfo into queue other than copy it.
2005-10-23 04:12:26 +00:00
Alan Cox
0a5a219830 Verify that access to the given address is allowed from user-space.
Discussed with: rwatson@
2005-10-22 20:02:59 +00:00
Alan Cox
52ad48b69f Eliminate spl* calls. 2005-10-21 05:48:38 +00:00
Robert Watson
64a266f9e8 Change format string for u_int64_t to %ju from %llu, in order to use the
correct format string on 64-bit systems.

Pointed out by:	pjd
2005-10-20 21:28:31 +00:00
Robert Watson
909ed16c2b Add a "show malloc" command to DDB, which prints out the current stats for
available kernel malloc types.  Quite useful for post-mortem debugging of
memory leaks without a dump device configured on a panicked box.

MFC after:	2 weeks
2005-10-20 17:41:47 +00:00
John Baldwin
2a2b58faa4 Add entry for the spin mutex used by the hptmv(4) driver.
MFC after: 	1 day
Tested by:	Philip Kizer pckizer at nostrum dot com
2005-10-20 14:49:59 +00:00
John Polstra
135c43dc52 Fix a bug in the kernel module runtime linker that made it impossible
to unload the usb.ko module after boot if it was originally preloaded
from "/boot/loader.conf".  When processing preloaded modules, the
linker erroneously added self-dependencies the each module's reference
count.  That prevented usb.ko's reference count from ever going to 0,
so it could not be unloaded.

Sponsored by Isilon Systems.

Reviewed by:	pjd, peter
MFC after:	1 week
2005-10-19 20:40:30 +00:00
John Baldwin
8c4b6380c7 Move the initialization of the devmtx into the mutex_init() function
called during early init before cninit().

Tested on:	i386, alpha, sparc64
Reviewed by:	phk, imp
Reported by:	Divacky Roman xdivac02 at stud dot fit dot vutbr dot cz
MFC after:	1 week
2005-10-18 18:27:44 +00:00
Stefan Farfeleder
d60e86c86e Const-qualify ksem_timedwait's parameter abstime as it's only passed in. 2005-10-18 11:46:24 +00:00
Peter Wemm
1a330eb01d Add support for kernel modules with a single PT_LOAD section.
While here, support up to four sections because it was trivial to do
and cheap. (One pointer per section).

For amd64 with "-fpic -shared" format .ko files, using a single PT_LOAD
section is important to avoid wasting about 1MB of KVM and physical ram
for the 'gap' between the two PT_LOAD sections.  amd64 normally uses
.o format kld files and isn't affected normally.  But -fpic -shared modules
are actually possible to produce and load...  (And with a bugfix to
binutils, we can build and use plain -shared .ko files without -fpic)

i386 only wastes 4K per .ko file, so that isn't such a big deal there.
2005-10-17 23:21:55 +00:00
Poul-Henning Kamp
5ef5ee7b62 Use new functions to call into drivers methods. 2005-10-16 21:07:31 +00:00
Poul-Henning Kamp
7423b2b40c Make ttyconsolemode() call ttsetwater() so that drivers don't have to. 2005-10-16 20:58:22 +00:00
Poul-Henning Kamp
51514bc484 Make ttsetcompat() static 2005-10-16 20:40:40 +00:00
Poul-Henning Kamp
733634738e Eliminate two unused arguments to ttycreate(). 2005-10-16 20:22:56 +00:00
Paul Saab
a372f8224c Implement the 32bit versions of recvmsg, recvfrom, sendmsg
Partially obtained from:	jhb
2005-10-15 05:57:06 +00:00
Paul Saab
f0b479cd75 Implement 32bit wrappers for clock_gettime, clock_settime, and
clock_getres.
2005-10-15 02:54:18 +00:00
Kris Kennaway
14cdc36456 mpsafevm has been stable and defaulted to 1 on sparc64 for over 6 months,
so we are ready for mpsafevfs=1 by default on sparc64 too.  I have been
running this on all my sparc64 machines for over 6 months, and have not
encountered MD problems.

MFC after:	1 week
2005-10-14 23:56:13 +00:00
Kris Kennaway
f098dcded5 Partially revert revision 1.66, which contained a change that did not
correspond to the commit log.  It changed the maxswzone and maxbcache
parameters from int to long, without changing the extern definitions
in <sys/buf.h>.

In fact it's a good thing it did not, because other parts of the system
are not yet ready for this, and on large-memory sparc machines it causes
severe filesystem damage if you try.

The worst effect of the change was that the tunables controlling the
above variables stopped working.  These were necessary to allow such
large sparc64 machines (with >12GB RAM) to boot, since sparc64 did not
set a hard-coded upper limit on these parameters and they ended
up overflowing an int, causing an infinite loop at boot in bufinit().

Reviewed by:	mlaier
2005-10-14 19:15:10 +00:00
David Xu
823acd70b6 Regen for sigqueue syscall. 2005-10-14 12:56:28 +00:00
David Xu
9104847f21 1. Change prototype of trapsignal and sendsig to use ksiginfo_t *, most
changes in MD code are trivial, before this change, trapsignal and
   sendsig use discrete parameters, now they uses member fields of
   ksiginfo_t structure. For sendsig, this change allows us to pass
   POSIX realtime signal value to user code.

2. Remove cpu_thread_siginfo, it is no longer needed because we now always
   generate ksiginfo_t data and feed it to libpthread.

3. Add p_sigqueue to proc structure to hold shared signals which were
   blocked by all threads in the proc.

4. Add td_sigqueue to thread structure to hold all signals delivered to
   thread.

5. i386 and amd64 now return POSIX standard si_code, other arches will
   be fixed.

6. In this sigqueue implementation, pending signal set is kept as before,
   an extra siginfo list holds additional siginfo_t data for signals.
   kernel code uses psignal() still behavior as before, it won't be failed
   even under memory pressure, only exception is when deleting a signal,
   we should call sigqueue_delete to remove signal from sigqueue but
   not SIGDELSET. Current there is no kernel code will deliver a signal
   with additional data, so kernel should be as stable as before,
   a ksiginfo can carry more information, for example, allow signal to
   be delivered but throw away siginfo data if memory is not enough.
   SIGKILL and SIGSTOP have fast path in sigqueue_add, because they can
   not be caught or masked.
   The sigqueue() syscall allows user code to queue a signal to target
   process, if resource is unavailable, EAGAIN will be returned as
   specification said.
   Just before thread exits, signal queue memory will be freed by
   sigqueue_flush.
   Current, all signals are allowed to be queued, not only realtime signals.

Earlier patch reviewed by: jhb, deischen
Tested on: i386, amd64
2005-10-14 12:43:47 +00:00
Doug Ambrisko
db43cd0417 Fix tinderbox box by removing incomplete/bad spl usage. Proper giant free
locking is required in for aio.

Pointed out by:	imp
2005-10-12 22:33:22 +00:00
Doug Ambrisko
69cd28dacb Add in kqueue support to LIO event notification and fix how it handled
notifications when LIO operations completed.  These were the problems
with LIO event complete notification:
      -	Move all LIO/AIO event notification into one general function
	so we don't have bugs in different data paths.  This unification
	got rid of several notification bugs one of which if kqueue was
	used a SIGILL could get sent to the process.
      -	Change the LIO event accounting to count all AIO request that
	could have been split across the fast path and daemon mode.
	The prior accounting only kept track of AIO op's in that
	mode and not the entire list of operations.  This could cause
	a bogus LIO event complete notification to occur when all of
	the fast path AIO op's completed and not the AIO op's that
	ended up queued for the daemon.

Suggestions from:	alc
2005-10-12 17:51:31 +00:00
Diomidis Spinellis
9f5c1d1955 Move execve's access time update functionality into a new
vfs_mark_atime() function, and use the new function for
performing efficient atime updates in mmap().

Reviewed by:	bde
MFC after:	2 weeks
2005-10-12 06:56:00 +00:00
Tor Egge
8272da3106 Release clean buffer with wrong size and no dependencies also for non-VMIO
case.
2005-10-09 22:41:25 +00:00
Marcel Moolenaar
125fbd3cdc Add parse_uuid() that creates a binary representation of an UUID from
a string representation.
2005-10-07 13:37:10 +00:00
Poul-Henning Kamp
0694506637 Eliminate __RMAN_RESOURCE_VISIBLE hack entirely by moving the struct
resource_ to subr_rman.c where it belongs.
2005-10-06 21:49:31 +00:00
Gleb Smirnoff
f0796cd26c - Don't pollute opt_global.h with DEVICE_POLLING and introduce
opt_device_polling.h
- Include opt_device_polling.h into appropriate files.
- Embrace with HAVE_KERNEL_OPTION_HEADERS the include in the files that
  can be compiled as loadable modules.

Reviewed by:	bde
2005-10-05 10:09:17 +00:00
Warner Losh
2f624c21c5 When data passed into devctl_notify is NULL, don't print (null). Instead
don't print anything at all.

# this fixes a problem that I noticed with devd.pipe not terminating lines
# with \n correctly sometimes.
2005-10-04 22:25:14 +00:00
Robert Watson
7723d5ed12 Re-order MAC and DAC checks in shmget() in order to give precedence to
the MAC result, as well as avoid losing the DAC check result when MAC
is enabled.

MFC after:	3 days
Reported by:	Patrick LeBlanc <Patrick dot LeBlanc at sparta dot com>
2005-10-04 16:40:20 +00:00
Roman Kurakin
826cf005ed Use FILEDESC_UNLOCK(fdp) after FILE_UNLOCK(p), not before to avoid LOR.
Slightly discussed on current@.

LOR #055

MFC after:	14 days
2005-10-04 16:27:54 +00:00
Christian S.J. Peron
9eea3d85cc Standard Giant push down operations for the Mandatory Access Control (MAC)
framework. This makes Giant protection around MAC operations which inter-
act with VFS conditional, based on the MPSAFE status of the file system.

Affected the following syscalls:

o __mac_get_fd
o __mac_get_file
o __mac_get_link
o __mac_set_fd
o __mac_set_file
o __mac_set_link

-Drop Giant all together in __mac_set_proc because the
 mac_cred_mmapped_drop_perms_recurse routine no longer requires it.
-Move conditional Giant aquisitions to after label allocation routines.
-Move the conditional release of Giant to before label de-allocation
 routines.

Discussed with:	rwatson
2005-10-04 14:32:58 +00:00
Don Lewis
34ea500bea Add missing word to comment. 2005-10-04 04:02:33 +00:00
Gleb Smirnoff
e113edf30a o Move a lot of parameter checking from netisr_poll() to
dedicated sysctl handlers. Protect manipulations with
  poll_mtx. The affected sysctls are:
  - kern.polling.burst_max
  - kern.polling.each_burst
  - kern.polling.user_frac
  - kern.polling.reg_frac
o Use CTLFLAG_RD on MIBs that supposed to be read-only.
o u_int32t -> uint32_t
o Remove unneeded locking from poll_switch().
2005-10-03 14:15:26 +00:00
Colin Percival
33812c066d If sufficiently bad things happen during a call to kern_execve(), it is
possible for do_execve() to call exit1() rather than returning.  As a
result, the sequence "allocate memory; call kern_execve; free memory"
can end up leaking memory.

This commit documents this astonishing behaviour and adds a call to
exec_free_args() before the exit1() call in do_execve().  Since all
the users of kern_execve() in the tree use exec_free_args() to free
the command-line arguments after kern_execve() returns, this should
be safe, and it fixes the memory leak which can otherwise occur.

Submitted by:	Peter Holm
MFC after:	3 days
Security:	Local denial of service
2005-10-03 12:49:54 +00:00
Hajimu UMEMOTO
56e5a87a55 make saved cpu level stackable. 2005-10-03 06:57:29 +00:00
Don Lewis
5032ff8197 Always wire the sysctl output buffer in sysctl_kern_proc() before
calling sysctl_out_proc().  -- fix from jhb

Move the code in fill_kinfo_thread() that gathers data from struct proc
into the new function fill_kinfo_proc_only().

Change all callers of fill_kinfo_thread() to call both
fill_kinfo_proc_only() and fill_kinfo() thread.  When gathering
data from a multi-threaded process, fill_kinfo_proc_only() only needs
to be called once.

Grab sched_lock before accessing the process thread list or calling
fill_kinfo_thread().

PR:		kern/84684
MFC after:	3 days
2005-10-02 23:27:56 +00:00
Robert Watson
c30bf5c317 Include kdb.h so that kdb_active is declared regardless of KDB being
included in the kernel.

MFC after:	0 days
2005-10-02 10:03:51 +00:00
Poul-Henning Kamp
7bbb3a2690 Make sure the clone lists are sorted in the right order.
Explosion triggered by:	pjd
MFC:	3 days
2005-10-01 19:21:03 +00:00
Gleb Smirnoff
4092996774 Big polling(4) cleanup.
o Axe poll in trap.

o Axe IFF_POLLING flag from if_flags.

o Rework revision 1.21 (Giant removal), in such a way that
  poll_mtx is not dropped during call to polling handler.
  This fixes problem with idle polling.

o Make registration and deregistration from polling in a
  functional way, insted of next tick/interrupt.

o Obsolete kern.polling.enable. Polling is turned on/off
  with ifconfig.

Detailed kern_poll.c changes:
  - Remove polling handler flags, introduced in 1.21. The are not
    needed now.
  - Forget and do not check if_flags, if_capenable and if_drv_flags.
  - Call all registered polling handlers unconditionally.
  - Do not drop poll_mtx, when entering polling handlers.
  - In ether_poll() NET_LOCK_GIANT prior to locking poll_mtx.
  - In netisr_poll() axe the block, where polling code asks drivers
    to unregister.
  - In netisr_poll() and ether_poll() do polling always, if any
    handlers are present.
  - In ether_poll_[de]register() remove a lot of error hiding code. Assert
    that arguments are correct, instead.
  - In ether_poll_[de]register() use standard return values in case of
    error or success.
  - Introduce poll_switch() that is a sysctl handler for kern.polling.enable.
    poll_switch() goes through interface list and enabled/disables polling.
    A message that kern.polling.enable is deprecated is printed.

Detailed driver changes:
  - On attach driver announces IFCAP_POLLING in if_capabilities, but
    not in if_capenable.
  - On detach driver calls ether_poll_deregister() if polling is enabled.
  - In polling handler driver obtains its lock and checks IFF_DRV_RUNNING
    flag. If there is no, then unlocks and returns.
  - In ioctl handler driver checks for IFCAP_POLLING flag requested to
    be set or cleared. Driver first calls ether_poll_[de]register(), then
    obtains driver lock and [dis/en]ables interrupts.
  - In interrupt handler driver checks IFCAP_POLLING flag in if_capenable.
    If present, then returns.This is important to protect from spurious
    interrupts.

Reviewed by:	ru, sam, jhb
2005-10-01 18:56:19 +00:00
Don Lewis
5997cae9a4 Copy new process argument list in do_execve() before grabbing PROC_LOCK
to avoid touching pageable memory while holding a mutex.

Simplify argument list replacement logic.

PR:		kern/84935
Submitted by:	"Antoine Pelisse" apelisse AT gmail.com (in a different form)
MFC after:	3 days
2005-10-01 08:33:56 +00:00
Don Lewis
bd3c2d867d Un-staticize waitrunningbufspace() and call it before returning from
ffs_copyonwrite() if any async writes were launched.

Restore the threads previous TDP_NORUNNINGBUF state before returning
from ffs_copyonwrite().
2005-09-30 18:07:41 +00:00
David Xu
763a429571 Fox a LOR of sleep and sched_lock by using a timeout wait
when process reaches maximum number of threads.

MFC after: 3 days
2005-09-30 06:09:41 +00:00
Don Lewis
6c8b634f1d Un-staticize runningbufwakeup() and staticize updateproc.
Add a new private thread flag to indicate that the thread should
not sleep if runningbufspace is too large.

Set this flag on the bufdaemon and syncer threads so that they skip
the waitrunningbufspace() call in bufwrite() rather than than
checking the proc pointer vs. the known proc pointers for these two
threads.  A way of preventing these threads from being starved for
I/O but still placing limits on their outstanding I/O would be
desirable.

Set this flag in ffs_copyonwrite() to prevent bufwrite() calls from
blocking on the runningbufspace check while holding snaplk.  This
prevents snaplk from being held for an arbitrarily long period of
time if runningbufspace is high and greatly reduces the contention
for snaplk.  The disadvantage is that ffs_copyonwrite() can start
a large amount of I/O if there are a large number of snapshots,
which could cause a deadlock in other parts of the code.

Call runningbufwakeup() in ffs_copyonwrite() to decrement runningbufspace
before attempting to grab snaplk so that I/O requests waiting on
snaplk are not counted in runningbufspace as being in-progress.
Increment runningbufspace again before actually launching the
original I/O request.

Prior to the above two changes, the system could deadlock if enough
I/O requests were blocked by snaplk to prevent runningbufspace from
falling below lorunningspace and one of the bawrite() calls in
ffs_copyonwrite() blocked in waitrunningbufspace() while holding
snaplk.

See <http://www.holm.cc/stress/log/cons143.html>
2005-09-30 01:30:01 +00:00
John Baldwin
b65089ccb5 Trim a couple of unneeded includes. 2005-09-29 19:13:52 +00:00
Peter Edwards
d41c4674c2 Close a race in biodone(), whereby the bio_done field of the passed
bio may have been freed and reassigned by the wakeup before being
tested after releasing the bdonelock.

There's a non-zero chance this is the cause of a few of the crashes
knocking around with biodone() sitting in the stack backtrace.

Reviewed By: phk@
2005-09-29 10:37:20 +00:00
Poul-Henning Kamp
64fd97df54 puc(4) does strange things to resources in order to fool the
subdrivers to hook up.

It should probably be rewritten to implement a simple bus to which
the sub drivers attach using some kind of hint.

Until then, provide a couple of crutch functions with big warning
signs so it can survive the recent changes to struct resource.
2005-09-28 18:06:25 +00:00
Robert Watson
5f419982c2 Back out alpha/alpha/trap.c:1.124, osf1_ioctl.c:1.14, osf1_misc.c:1.57,
osf1_signal.c:1.41, amd64/amd64/trap.c:1.291, linux_socket.c:1.60,
svr4_fcntl.c:1.36, svr4_ioctl.c:1.23, svr4_ipc.c:1.18, svr4_misc.c:1.81,
svr4_signal.c:1.34, svr4_stat.c:1.21, svr4_stream.c:1.55,
svr4_termios.c:1.13, svr4_ttold.c:1.15, svr4_util.h:1.10,
ext2_alloc.c:1.43, i386/i386/trap.c:1.279, vm86.c:1.58,
unaligned.c:1.12, imgact_elf.c:1.164, ffs_alloc.c:1.133:

Now that Giant is acquired in uprintf() and tprintf(), the caller no
longer leads to acquire Giant unless it also holds another mutex that
would generate a lock order reversal when calling into these functions.
Specifically not backed out is the acquisition of Giant in nfs_socket.c
and rpcclnt.c, where local mutexes are held and would otherwise violate
the lock order with Giant.

This aligns this code more with the eventual locking of ttys.

Suggested by:	bde
2005-09-28 07:03:03 +00:00
Christian S.J. Peron
453f7d5369 Push Giant down in jails. Pass the MPSAFE flag to NDINIT, and keep track
of whether or not Giant was picked up by the filesystem. Add VFS_LOCK_GIANT
macros around vrele as it's possible that this can call in the VOP_INACTIVE
filesystem specific code. Also while we are here, remove the Giant assertion.
from the sysctl handler,  we do not actually require Giant here so we
shouldn't assert it. Doing so will just complicate things when Giant is removed
from the sysctl framework.
2005-09-28 00:30:56 +00:00
Robert Watson
667285c4e3 If KDB_STOP_NMI is compiled into the kernel, default
debug.kdb.stop_cpus_with_nmi to 1 rather than 0.

MFC after:	3 days
2005-09-27 21:12:05 +00:00
Robert Watson
2b59d50cfb In lockstatus(), don't lock and unlock the interlock when testing the
sleep lock status while kdb_active, or we risk contending with the
mutex on another CPU, resulting in a panic when using "show
lockedvnods" while in DDB.

MFC after:	3 days
Reviewed by:	jhb
Reported by:	kris
2005-09-27 21:02:59 +00:00
Robert Watson
32a6bd9510 No longer maintain mbstat statistics for the mbuf allocator, UMA
statistics and libmemstat(3) are now used to track mbuf statistics.

MFC after:	1 month
2005-09-27 20:28:43 +00:00
John Baldwin
7e9e371f2d Use the refcount API to manage the reference count for user credentials
rather than using pool mutexes.

Tested on:	i386, alpha, sparc64
2005-09-27 18:09:42 +00:00
John Baldwin
b2149bde1f Use the reference count API to manage the reference counts for process
limit structures rather than using pool mutexes to protect the reference
counts.

Tested on:	i386, alpha, sparc64
2005-09-27 18:07:05 +00:00
John Baldwin
55b4a5ae0d Use the refcount API to implement reference counts on process argument
structures rather than using a global mutex to protect the reference
counts.

Tested on:	i386, alpha, sparc64
2005-09-27 18:03:15 +00:00
Christian S.J. Peron
6acd4b6189 Update the "created from" section to reflect the most recent version of
syscalls.master

Requested by:	jhb
2005-09-27 14:36:59 +00:00
Christian S.J. Peron
7f300b47dd Mark the extended attribute syscalls as being MP safe.
Requested by:	jhb
2005-09-27 14:32:04 +00:00
John Baldwin
d27acf445e Add the spin lock used by the binary nvidia driver to the static lock
order list so that WITNESS and the driver play together nicely.

Tested by:	Harald Schmalzbauer
MFC after:	3 days
2005-09-26 18:30:12 +00:00
Robert Watson
9b7915859d Add "show allpcpu" to DDB, which prints the current CPU id followed by
the per-cpu data for all CPUs.  This is easier to ask users to do than
"figure out how many CPUs you have, now run show pcpu, then run it
once for each CPU you have".

MFC after:	3 days
2005-09-26 16:55:11 +00:00
David Xu
2b7182c6b7 Reorder statements to avoid accessing unknown memory.
In theory, invoking kenv with very long string can panic
kernel.
2005-09-26 14:14:55 +00:00
Robert Watson
329c75a730 Acquire Giant in uprintf() and tprintf() rather than asserting it. In
the vast majority of cases, these functions are called without mutexes
held, meaning that in all but two cases, there will be no ordering
issues with doing this, and it will eliminate the need for changes in
the caller.  In two cases, mutexes are held, so Giant must be acquired
before those mutexes such that uprintf() and tprintf() recurse Giant
rather than generating a lock order reversal.

Suggested by:	bde
2005-09-26 08:02:24 +00:00
Poul-Henning Kamp
2b35175c8a Add rman_is_region_manager() for the benefit of an alpha hack. 2005-09-25 20:10:10 +00:00
Christian S.J. Peron
c47a4d1c9f Implement new world order in VFS locking for extended attributes. This will
remove the unconditional acquisition of Giant for extended attribute related
operations. If the file system is set as being MP safe and debug.mpsafevfs is
1, do not pickup Giant.

Mark the following system calls as being MP safe so we no longer pickup Giant
in the system call handler:

o extattrctl
o extattr_set_file
o extattr_get_file
o extattr_delete_file
o extattr_set_fd
o extattr_get_fd
o extattr_delete_fd
o extattr_set_link
o extattr_get_link
o extattr_delete_link
o extattr_list_file
o extattr_list_link
o extattr_list_fd

-Pass MPSAFE flags to namei(9) lookup and introduce vfslocked variable which
 will keep track of any Giant acquisitions.
-Wrap any fd operations which manipulate vnodes in VFS_{UN}LOCK_GIANT
-Drop VFS_ASSERT_GIANT into function which operate on vnodes to ensure that
 we are sufficiently protected.

I've tested these changes with various TrustedBSD MAC policies which use
extended attribute a lot on SMP and UP systems (thanks to Scott Long for
making some SMP hardware available to me for testing).

Discussed with:	jeff
Requested by:	jhb, rwatson
2005-09-24 23:47:04 +00:00
Poul-Henning Kamp
ae7ff71f63 Split struct resource in an external and internal part.
The external part is still called 'struct resource' but the contents
is now visible to drivers etc.  This makes it part of the device
driver ABI so it not be changed lightly.  A comment to this effect
is in place.

The internal part is called 'struct resource_i' and contain its external
counterpart as one field.

Move the bus_space tag+handle into the external struct resource, this
removes the need for device drivers to even know about these fields
in order to use bus_space to access hardware. (More in following commit).
2005-09-24 20:07:03 +00:00
Poul-Henning Kamp
a778923149 Add two convenience functions for device drivers: bus_alloc_resources()
and bus_free_resources().  These functions take a list of resources
and handle them all in one go.  A flag makes it possible to mark
a resource as optional.

A typical device driver can save 10-30 lines of code by using these.

Usage examples will follow RSN.

MFC:	A good idea, eventually.
2005-09-24 19:31:10 +00:00
Robert Watson
e1ac28e239 Canonicalize the UNIX domain socket copyright layout: original holders
before more recent holders.

MFC after:	3 days
2005-09-23 12:41:06 +00:00
Stephan Uphoff
3fafa27b27 Don't pretend to be thread0 when calling sync().
It confuses the lock manager since in some places thread0 is
then used for vnode locking while curthread is used for vnode unlocking.

Found by:	Yahoo!
Reviewed by:	ps@,jhb@
MFC after:	3 days
2005-09-22 15:34:15 +00:00
David Xu
a861574011 Temporarily disable nice threshold detection code, as it can starve
a thread holding critical resource, e.g mutex or other implicit
synchronous flags. Give thread which exceeds nice threshold a minimum
time slice.

PR: kern/86087
2005-09-22 01:19:37 +00:00
John Baldwin
e12560dd4b Use correct VFS locking rather than unconditionally grabbing Giant around
namei() calls in kern_alternate_path().

Reviewed by:	csjp
MFC after:	1 week
2005-09-21 19:49:42 +00:00
Robert Watson
87328e07e0 Pass 'curthread' into VFS_STATFS() from acctwatch(), rather than passing
NULL.  The NFS client expects that a thread will always be present for a
VOP so that it can check for signal conditions, and will dereference a
NULL pointer if one isn't present.

MFC after:	3 days
2005-09-21 15:28:07 +00:00
Robert Watson
5580b0b157 Correct an incorrect comment from the dawn of time: neither tprintf()
nor uprintf() is believed to perform tsleep() or msleep() as written,
as ttycheckoutq() is called with '0' as its sleep argument.

Remove recently added WITNESS warnings for sleep as the comment was
incorrect.  This should silence a warning from the nfs_timer() code.

Discussed with:	bde
2005-09-20 09:55:36 +00:00
Andre Oppermann
e452573df7 Start time_uptime with 1 instead of 0.
Discussed with:		phk
2005-09-19 22:16:31 +00:00
Poul-Henning Kamp
e606a3c63e Rewamp DEVFS internals pretty severely [1].
Give DEVFS a proper inode called struct cdev_priv.  It is important
to keep in mind that this "inode" is shared between all DEVFS
mountpoints, therefore it is protected by the global device mutex.

Link the cdev_priv's into a list, protected by the global device
mutex.  Keep track of each cdev_priv's state with a flag bit and
of references from mountpoints with a dedicated usecount.

Reap the benefits of much improved kernel memory allocator and the
generally better defined device driver APIs to get rid of the tables
of pointers + serial numbers, their overflow tables,  the atomics
to muck about in them and all the trouble that resulted in.

This makes RAM the only limit on how many devices we can have.

The cdev_priv is actually a super struct containing the normal cdev
as the "public" part, and therefore allocation and freeing has moved
to devfs_devs.c from kern_conf.c.

The overall responsibility is (to be) split such that kern/kern_conf.c
is the stuff that deals with drivers and struct cdev and fs/devfs
handles filesystems and struct cdev_priv and their private liason
exposed only in devfs_int.h.

Move the inode number from cdev to cdev_priv and allocate inode
numbers properly with unr.  Local dirents in the mountpoints
(directories, symlinks) allocate inodes from the same pool to
guarantee against overlaps.

Various other fields are going to migrate from cdev to cdev_priv
in the future in order to hide them.  A few fields may migrate
from devfs_dirent to cdev_priv as well.

Protect the DEVFS mountpoint with an sx lock instead of lockmgr,
this lock also protects the directory tree of the mountpoint.

Give each mountpoint a unique integer index, allocated with unr.
Use it into an array of devfs_dirent pointers in each cdev_priv.
Initially the array points to a single element also inside cdev_priv,
but as more devfs instances are mounted, the array is extended with
malloc(9) as necessary when the filesystem populates its directory
tree.

Retire the cdev alias lists, the cdev_priv now know about all the
relevant devfs_dirents (and their vnodes) and devfs_revoke() will
pick them up from there.  We still spelunk into other mountpoints
and fondle their data without 100% good locking.  It may make better
sense to vector the revoke event into the tty code and there do a
destroy_dev/make_dev on the tty's devices, but that's for further
study.

Lots of shuffling of stuff and churn of bits for no good reason[2].

XXX: There is still nothing preventing the dev_clone EVENTHANDLER
from being invoked at the same time in two devfs mountpoints.  It
is not obvious what the best course of action is here.

XXX: comment out an if statement that lost its body, until I can
find out what should go there so it doesn't do damage in the meantime.

XXX: Leave in a few extra malloc types and KASSERTS to help track
down any remaining issues.

Much testing provided by:		Kris
Much confusion caused by (races in):	md(4)

[1] You are not supposed to understand anything past this point.

[2] This line should simplify life for the peanut gallery.
2005-09-19 19:56:48 +00:00
Robert Watson
84d2b7df26 Add GIANT_REQUIRED and WITNESS sleep warnings to uprintf() and tprintf(),
as they both interact with the tty code (!MPSAFE) and may sleep if the
tty buffer is full (per comment).

Modify all consumers of uprintf() and tprintf() to hold Giant around
calls into these functions.  In most cases, this means adding an
acquisition of Giant immediately around the function.  In some cases
(nfs_timer()), it means acquiring Giant higher up in the callout.

With these changes, UFS no longer panics on SMP when either blocks are
exhausted or inodes are exhausted under load due to races in the tty
code when running without Giant.

NB: Some reduction in calls to uprintf() in the svr4 code is probably
desirable.

NB: In the case of nfs_timer(), calling uprintf() while holding a mutex,
or even in a callout at all, is a bad idea, and will generate warnings
and potential upset.  This needs to be fixed, but was a problem before
this change.

NB: uprintf()/tprintf() sleeping is generally a bad ideas, as is having
non-MPSAFE tty code.

MFC after:	1 week
2005-09-19 16:51:43 +00:00
Robert Watson
223aaaecb0 Remove mac_create_root_mount() and mpo_create_root_mount(), which
provided access to the root file system before the start of the
init process.  This was used briefly by SEBSD before it knew about
preloading data in the loader, and using that method to gain
access to data earlier results in fewer inconsistencies in the
approach.  Policy modules still have access to the root file system
creation event through the mac_create_mount() entry point.

Removed now, and will be removed from RELENG_6, in order to gain
third party policy dependencies on the entry point for the lifetime
of the 6.x branch.

MFC after:	3 days
Submitted by:	Chris Vance <Christopher dot Vance at SPARTA dot com>
Sponsored by:	SPARTA
2005-09-19 13:59:57 +00:00
Marcel Moolenaar
73130b2224 Move the UUID generator into its own function, called kern_uuidgen(),
so that UUIDs can be generated from within the kernel. The uuidgen(2)
syscall now allocates kernel memory, calls the generator, and does a
copyout() for the whole UUID store. This change is in support of GPT.
2005-09-18 21:40:15 +00:00
Robert Watson
8434c29b28 Add three new read-only socket options, which allow regression tests
and other applications to query the state of the stack regarding the
accept queue on a listen socket:

SO_LISTENQLIMIT    Return the value of so_qlimit (socket backlog)
SO_LISTENQLEN      Return the value of so_qlen (complete sockets)
SO_LISTENINCQLEN   Return the value of so_incqlen (incomplete sockets)

Minor white space tweaks to existing socket options to make them
consistent.

Discussed with:	andre
MFC after:	1 week
2005-09-18 21:08:03 +00:00
Robert Watson
bc6b8b5d64 Fix spelling in a comment.
MFC after:	3 days
2005-09-18 10:46:34 +00:00
Robert Watson
7da7362b95 Re-comment sbcompress() to explain what it is it does; it took me
quite a bit of reading to figure it out, and I want to avoid figuring
it out again.

Convert an if (foo) else printf("this is almost a panic") into a
KASSERT.

MFC after:	3 days
2005-09-18 10:30:10 +00:00
Warner Losh
fe0519b171 MFp4: Expose device_probe_child() 2005-09-18 01:32:09 +00:00
Christian S.J. Peron
42e7197fba Implement new world order in VFS locking for ACLs. This will remove the
unconditional acquisition of Giant for ACL related operations. If the file
system is set as being MP safe and debug.mpsafevfs is 1, do not pickup
giant.

For any operations which require namei(9) lookups:

__acl_get_file
__acl_get_link
__acl_set_file
__acl_set_link
__acl_delete_file
__acl_delete_link
__acl_aclcheck_file
__acl_aclcheck_link

-Set the MPSAFE flag in NDINIT
-Initialize vfslocked variable using the NDHASGIANT macro

For functions which operate on fds, make sure the operations are locked:

__acl_get_fd
__acl_set_fd
__acl_delete_fd
__acl_aclcheck_fd

-Initialize vfslocked using VFS_LOCK_GIANT before we manipulate the vnode

Discussed with:	jeff
2005-09-17 22:01:14 +00:00
Tor Egge
61ac14dab6 Break out of loop if next buffer pointer has become invalid while flushing
current buffer.

Reviewed by:	kan
2005-09-16 18:28:12 +00:00
Stephan Uphoff
19b2dff7b0 Fix race condition that caused activation of an event to
be ignored immediately after it was deactivated.

Found by: 	Yahoo!
MFC after:	3 days
2005-09-15 21:10:12 +00:00
John Baldwin
21f9e816cd Oops, missed adding the required include.
Pointy hat to:	jhb
2005-09-15 20:20:36 +00:00
John Baldwin
53c0e1ff7d Replace the dont_sleep_in_callout mutex hack (similar to g_x{up,down})
with the disallow sleeping facility.
2005-09-15 20:09:08 +00:00
John Baldwin
10f508d9a3 Don't disallow sleeping for handlers on swi's since some swi handlers
(like CAM) do sleep in their handlers.

Requested by:	scottl
2005-09-15 20:08:21 +00:00
John Baldwin
b27dbfbf4a - Enforce an implicit lock order that Giant cannot be locked while holding
any other non-sleepable lock.  In plain English: Giant comes before all
  other mutexes.
- Add some extra description to the lock order reversal printf's to indicate
  when a reversal is triggered by a hard-coded implicit rule.

Requested by:	truckman (2)
MFC after:	1 week
2005-09-15 19:07:14 +00:00
John Baldwin
51460da87f - Add a new simple facility for marking the current thread as being in a
state where sleeping on a sleep queue is not allowed.  The facility
  doesn't support recursion but uses a simple private per-thread flag
  (TDP_NOSLEEPING).  The sleepq_add() function will panic if the flag is
  set and INVARIANTS is enabled.
- Use this new facility to replace the g_xup and g_xdown mutexes that were
  (ab)used to achieve similar behavior.
- Disallow sleeping in interrupt threads when invoking interrupt handlers.

MFC after:	1 week
Reviewed by:	phk
2005-09-15 19:05:37 +00:00
Christian S.J. Peron
68ff2a4397 Improve the MP safeness associated with the creation of symbolic
links and the execution of ELF binaries. Two problems were found:

1) The link path wasn't tagged as being MP safe and thus was not properly
   protected.
2) The ELF interpreter vnode wasnt being locked in namei(9) and thus was
   insufficiently protected.

This commit makes the following changes:

-Sets the MPSAFE flag in NDINIT for symbolic link paths
-Sets the MPSAFE flag in NDINIT and introduce a vfslocked variable which
 will be used to instruct VFS_UNLOCK_GIANT to unlock Giant if it has been
 picked up.
-Drop in an assertion into vfs_lookup which ensures that if the MPSAFE
 flag is NOT set, that we have picked up giant. If not panic (if WITNESS
 compiled into the kernel). This should help us find conditions where vnode
 operations are in-sufficiently protected.

This is a RELENG_6 candidate.

Discussed with:	jeff
MFC after:	4 days
2005-09-15 15:03:48 +00:00
Maxim Konovalov
aada5cccd8 Backout rev. 1.246, it breaks code uses shutdown(2) on non-connected
sockets.

Pointed out by:	rwatson
2005-09-15 13:18:05 +00:00
Ralf S. Engelschall
724447ac41 Fix system shutdown timeout handling by again supporting longer running
shutdown procedures (which have a duration of more than 120 seconds).

We have two user-space affecting shutdown timeouts: a "soft" one in
/etc/rc.shutdown and a "hard" one in init(8). The first one can be
configured via /etc/rc.conf variable "rcshutdown_timeout" and defaults
to 30 seconds. The second one was originally (in 1998) intended to be
configured via sysctl(8) variable "kern.shutdown_timeout" and defaults
to 120 seconds.

Unfortunately, the "kern.shutdown_timeout" was declared "unused" in 1999
(as it obviously is actually not used within the kernel itself) and
hence was intentionally but misleadingly removed in revision 1.107 from
init_main.c. Kernel sysctl(8) variables are certainly a wrong way to
control user-space processes in general, but in this particular case the
sysctl(8) variable should have remained as it supports init(8), which
isn't passed command line flags (which in turn could have been set via
/etc/rc.conf), etc.

As there is already a similar "kern.init_path" sysctl(8) variable which
directly affects init(8), resurrect the init(8) shutdown timeout under
sysctl(8) variable "kern.init_shutdown_timeout". But this time document
it as being intentionally unused within the kernel and used by init(8).
Also document it in the manpages init(8) and rc.conf(5).

Reviewed by: phk
MFC after: 2 weeks
2005-09-15 13:16:07 +00:00
Maxim Konovalov
c5cff17017 o Return ENOTCONN when shutdown(2) on non-connected socket.
PR:		kern/84761
Submitted by:	James Juran
R-test:		tools/regression/sockets/shutdown
MFC after:	1 month
2005-09-15 11:45:36 +00:00
Poul-Henning Kamp
74f46f19aa Retire unused dev_named() function. 2005-09-15 08:01:57 +00:00
Robert Watson
fd1a469ba5 In vfs_kqfilter(), return EINVAL instead of 1 (EPERM) when an unsupported
kqueue filter type is requested on a vnode.

MFC after:	3 days
2005-09-12 19:22:37 +00:00
Jung-uk Kim
9ed448b20c use monotonic time_uptime' instead of time_second'
Approved by:	anholt (mentor)
Discussed on:	arch
2005-09-12 15:31:28 +00:00
Poul-Henning Kamp
2883ba6668 Introduce vfs_read_dirent() which can help VOP_READDIR() implementations
by handling all the cookie stuff.
2005-09-12 08:46:07 +00:00
Tor Egge
6ff5e2db45 Don't retry when vget() returns ENOENT in the nonblocking case due to the
vnode being doomed.  It causes a livelock.
2005-09-12 01:48:57 +00:00
Don Lewis
908b3deb2b Relocate witness_levelall(), witness_leveldescendents(), and
witness_displaydescendants() so that they are protected by
"#ifdef DDB/#endif" to unbreak kernels not using "option DDB".

MFC after:	3 weeks
2005-09-11 07:57:06 +00:00
Gleb Smirnoff
d04304d155 Make callout_reset() return a non-zero value if a pending callout
was rescheduled. If there was no pending callout, then return 0.

Reviewed by:	iedowse, cperciva
2005-09-08 14:20:39 +00:00
Don Lewis
d07f87a218 Add a new struct buf flag bit, B_PERSISTENT, and use it to tag
struct bufs that are persistently held by ext2fs.  Ignore any buffers
with this flag in the code in boot() that counts "busy" and dirty
buffers and attempts to sync the dirty buffers, which is done before
attempting to unmount all the file systems during shutdown.

This fixes the problem caused by any ext2fs file systems that are
mounted at system shutdown time, which caused boot() to give up on
a non-zero number of buffers and skip the call to vfs_unmountall().
This left all the mounted file systems in a dirty state and caused
them to all require cleanup by fsck on reboot.

Move the two separate copies of the "busy" buffer test in boot()
to a separate function.

Nuke the useless spl() stuff in the ext2fs ULCK_BUF() macro.

Bring the PRINT_BUF_FLAGS definition in sys/buf.h up to date with
this and previous flag changes.

PR:		kern/56675, kern/85163
Tested by:	"Matthias Andree" matthias.andree at gmx.de
Reviewed by:	bde
MFC after:	3 days
2005-09-08 06:30:05 +00:00
David E. O'Brien
5b1c0294e4 Forward declaring static variables as extern is invalid ISO-C. Now that
GCC can properly handle forward static declarations, do this properly.
2005-09-07 10:06:14 +00:00
Gleb Smirnoff
016e62123a In soreceive(), when a first mbuf is removed from socket buffer use
sockbuf_pushsync(). Previous manipulation could lead to an inconsistent
mbuf.

Reviewed by:	rwatson
2005-09-06 17:05:11 +00:00
Gleb Smirnoff
f46ab10c02 Document flags of a pollrec. 2005-09-06 11:09:18 +00:00
Christian S.J. Peron
d1dfd92177 Convert the primary ACL allocator from malloc(9) to using a UMA zone instead.
Also introduce an aclinit function which will be used to create the UMA zone
for use by file systems at system start up.

MFC after:	1 month
Discussed with:	rwatson
2005-09-06 00:06:30 +00:00
Gleb Smirnoff
16901c0186 Remove Giant mutex from polling(4) and use a separate poll_mtx(4)
instead. Detailed changelist:

o Add flags field to struct pollrec, to indicate that
  are particular entry is being worked on.
o Define a macro PR_VALID() to check that a pollrec
  is valid and pollable.
o Mark ISRs as mpsafe.

o ether_poll()
  - Acquire poll_mtx while traversing pollrec array.
  - Skip pollrecs, that are being worked on.
  - Conditionally acquire Giant when entering handler.

o netisr_pollmore()
  - Conditionally assert Giant.
  - Acquire poll_mtx while working with statistics.

o netisr_poll()
  - Conditionally assert Giant.
  - Acquire poll_mtx while working with statistics
    and traversing pollrec array.

o ether_poll_register(), ether_poll_deregister()
  - Conditionally assert Giant.
  - Acquire poll_mtx while working with pollrec array.

o poll_idle()
  - Remove all strange manipulations with Giant.

In collaboration with:	ru, pjd
In collaboration with:	Oleg Bulyzhin <oleg rinet.ru>
In collaboration with:	dima <_pppp mail.ru>
2005-09-05 16:02:11 +00:00
Xin LI
5248ef8a3c When padding with zero, do pad after prefixes rather than padding
before prefixes.

Use cases:
	printf("%05d", -42);   -->   "00-42"   (should be "-0042")
	printf("%#05x", 12);   -->   "000xc"   (should be "0x00c")

Submitted by:	Oliver Fromme
PR:		kern/85520
MFC After:	1 week
2005-09-04 18:03:45 +00:00
Poul-Henning Kamp
1e7d2c4763 If we ignore an unknown % sequence, we must stop interpreting the
remaining % arguments because the varargs are now out of sync and
there is a risk that we might for instance dereference an integer
in a %s argument.

Sponsored by: Napatech.com
2005-09-03 10:28:08 +00:00
John Baldwin
acc0265cc2 - Add some comments to some of the static lock orders. Don't explicitly
link proctree and allproc to Giant since that order is already implicitly
  enforced.
- Use a goto to handle the case where we want to enforce a reversal before
  calling isitmydescendant() in witness_checkorder() so that the logic is
  easier to follow and so that it is easier to add more forced-reversal
  cases in the future.

MFC after:	 3 days
2005-09-02 20:23:49 +00:00
John Baldwin
83cece6fa1 - Add an assertion to panic if one tries to call mtx_trylock() on a spin
mutex.
- Don't panic if a spin lock is held too long inside _mtx_lock_spin() if
  panicstr is set (meaning that we are already in a panic).  Just keep
  spinning forever instead.
2005-09-02 20:21:49 +00:00
John Baldwin
83de502d59 Add witness warnings to panic if a thread tries to exit while holding any
locks.

Requested by:	jeff
MFC after:	3 days
2005-09-02 20:20:01 +00:00
Nate Lawson
9000b91eb9 Break out the checks for duplicates and absolute settings being too high
instead of trying to do them all at once.  This should fix the level sorting
problems from the previous revision.

Testing help:	ume
2005-09-02 16:32:43 +00:00
Suleiman Souhlal
1f71de49e1 Print out a warning and a backtrace if we try to unlock a lockmgr that
we do not hold.

Glanced at by:	phk
MFC after:	3 days
2005-09-02 15:56:01 +00:00
Suleiman Souhlal
2611e5a6a9 Don't unbusy the devfs mount in vfs_mountroot_try() as it gets accessed
and unbusied in devfs_fixup(), which assumes that the devfs mount is
still locked.

Granced at by:	phk
MFC after:	3 days
2005-09-02 13:37:54 +00:00
Pawel Jakub Dawidek
d8b464e51e In case of mac_check_vnode_rename_from() or vn_start_write() failure,
vn_finished_write() should not be called.

Reviewed by:	ssouhlal
MFC after:	3 days
2005-09-01 21:46:33 +00:00
Andre Oppermann
fdcc028d11 Changes and cleanups to m_sanity():
o for() instead of while() looping  over mbuf chain
o paren's around all flag checks
o more verbose function and purpose description
o some more style changes

Based on feedback from:	sam
2005-08-30 21:31:42 +00:00
Andre Oppermann
e0068c3a69 Unbreak m_demote() and put back the 'all' flag. Without it we cannot
correctly test for m_nextpkt in an mbuf chain.
2005-08-30 21:14:30 +00:00
Andre Oppermann
fbe816384a o Remove the 'all' flag from m_demote(). Users can simply call it with
m_demote(m->m_next) if they wish to start at the second mbuf in chain.
o Test m_type with == instead of &.
o Check m_nextpkt against NULL instead of implicit 0.

Based on feedback from:	sam
2005-08-30 20:07:49 +00:00
Nate Lawson
5308b2a64e Eliminate cpufreq levels for two cases that are less than optimal:
1. Walk the absolute list in reverse to prefer duplicated levels that have
a lower absolute setting, i.e. 800 Mhz/50% is better than 1600 Mhz/25% even
though both have the same actual frequency.  This also removes the need to
check for already-modified levels since by definition, those will be added
later in the sorted list.

2. Compare the absolute settings for derived levels and don't use the new
level if it's higher.  For example, a level of 800 Mhz/75% is preferable to
1600 Mhz/25% even though the latter has a lower total frequency.

This work is based on a patch from the submitter but reworked by myself.

Submitted by:	Tijl Coosemans (tijl/ulyssis.org)
2005-08-30 04:45:32 +00:00
Andre Oppermann
4da8443133 Add m_copymdata(struct mbuf *m, struct mbuf *n, int off, int len,
int prep, int how).

Copies the data portion of mbuf (chain) n starting from offset off
for length len to mbuf (chain) m.  Depending on prep the copied
data will be appended or prepended.  The function ensures that the
mbuf (chain) m will be fully writeable by making real (not refcnt)
copies of mbuf clusters.  For the prepending the function returns
a pointer to the new start of mbuf chain m and leaves as much
leading space as possible in the new first mbuf.

Reviewed by:	glebius
2005-08-29 20:15:33 +00:00
Andre Oppermann
a048affba5 Add m_sanity(struct mbuf *m, int sanitize) to do some heavy sanity
checking on mbuf's and mbuf chains.  Set sanitize to 1 to garble
illegal things and have them blow up later when used/accessed.

m_sanity()'s main purpose is for KASSERT()'s and debugging of non-
kosher mbuf manipulation (of which we have a number of).

Reviewed by:	glebius
2005-08-29 19:58:56 +00:00
Andre Oppermann
ed111688e9 Add m_demote(struct mbuf *m, int all) to clean up mbuf (chain) from
any tags and packet headers.  If "all" is set then the first mbuf
in the chain will be cleaned too.

This function is used before an mbuf, that arrived as packet with
m->flags & M_PKTHDR, is appended to an mbuf chain using m->m_next
(not m->m_nextpkt).

Reviewed by:	glebius
2005-08-29 19:45:39 +00:00
Pawel Jakub Dawidek
e37a499443 Add 'depth' argument to CTRSTACK() macro, which allows to reduce number
of ktr slots used. If 'depth' is equal to 0, the whole stack will be
logged, just like before.
2005-08-29 11:34:08 +00:00
Suleiman Souhlal
a6c109d658 Fix a typo in vop_rename_pre() where we ended up using vholdl()
instead of vhold(), even though the vnode interlock is unlocked.

MFC after:	3 days
2005-08-28 23:00:11 +00:00
Alan Cox
7f1ef325d7 Handle vm_map_wire()'s failure. 2005-08-28 05:38:40 +00:00
Alan Cox
5d3043ce9a Correctly handle vm_map_wire()'s failure. (See also revisions 1.81 and
1.82.)

Reviewed by:	tegge
2005-08-28 04:50:11 +00:00
Alan Cox
45e31b6034 Eliminate an unneeded reference on a vm object. If, in fact, the nearby
vm_map_find() fails, then the excess reference causes the vm object to be
leaked.

Reviewed by:	tegge
2005-08-28 00:24:58 +00:00
Alan Cox
4167396552 Revert the previous change for two reasons: (1) If vm_map_find() succeeds
but vm_map_wire() fails, then a vm object, vm map entries, and kernel_map
free space is leaked and (2) unwiring is handled automatically by
vm_map_remove().

Suggested by:   tegge
2005-08-28 00:19:54 +00:00