7538 Commits

Author SHA1 Message Date
Alfred Perlstein
81d16e2d64 do the vfsstd thing instead of messing up our VFS_SYSCTL macro. 2004-07-07 06:58:29 +00:00
Peter Edwards
0f01586867 Fix bug introduced in rev 1.434:
When avoiding the zeroing of "bogus_page" when it appears in a buf,
be sure to advance the pointers into the data for successive pages.

The bug caused file corruption when read(2)ing from a "hole" in a
file where a previous page of the read block had already been faulted
in: fsx tripped up on this pretty quickly. The particular access
pattern is probably pretty unusual, so other applications probably
wouldn't have had problems, but you'd never know.

Reviewed By: alc@
2004-07-06 23:40:40 +00:00
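The pointer-advance rule in the fix above can be illustrated with a small userland sketch. It is not the kernel code: the function names, the skip[] array standing in for bogus_page entries, and the tiny page size are all hypothetical. The point is only that the data pointer moves forward on every iteration, including the ones where zeroing is skipped.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4	/* hypothetical tiny page size for the demo */

/*
 * Zero every page of buf except those flagged in skip[], which stand in
 * for bogus_page entries.  The data pointer is advanced in the for
 * clause, so it moves forward even when a page is skipped.
 */
static void
zero_pages_sketch(unsigned char *buf, const int *skip, size_t npages)
{
	unsigned char *p = buf;
	size_t i;

	for (i = 0; i < npages; i++, p += SKETCH_PAGE_SIZE) {
		if (skip[i])
			continue;	/* p still advances via the for clause */
		memset(p, 0, SKETCH_PAGE_SIZE);
	}
}

/* Returns 1 if page 1 was left alone while pages 0 and 2 were zeroed. */
static int
zero_pages_demo(void)
{
	unsigned char buf[3 * SKETCH_PAGE_SIZE];
	const int skip[3] = { 0, 1, 0 };
	size_t i;

	memset(buf, 0xff, sizeof(buf));
	zero_pages_sketch(buf, skip, 3);
	for (i = 0; i < sizeof(buf); i++) {
		unsigned char want = (i / SKETCH_PAGE_SIZE == 1) ? 0xff : 0;

		if (buf[i] != want)
			return (0);
	}
	return (1);
}
```

If the pointer were advanced only inside the zeroing branch, the third page's zeroes would land on the second page, which is the kind of misplaced write the commit message describes.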
Alfred Perlstein
1ea6061793 Use vfs_suser() where appropriate. 2004-07-06 09:39:32 +00:00
Alfred Perlstein
ea0104b032 Introduce vfs_suser(), used to test if a user should have special privs
for a mount.
2004-07-06 09:37:43 +00:00
Alfred Perlstein
c713aaaeca NFS mobility PHASE I, II & III (phases IV and V pending):
Rebind the client socket when we experience a timeout.  This fixes
the case where our IP changes for some reason.

Signal a VFS event when NFS transitions from up to down and vice
versa.

Add a placeholder vfs_sysctl where we will put status reporting
shortly.

Also:
Make down NFS mounts return EIO instead of EINTR when there is a
soft timeout or force unmount in progress.
2004-07-06 09:12:03 +00:00
Robert Watson
df623e3c2f Temporarily disable preemption in SCHED_ULE because of reported panics and
hangs caused by the recent preemption changes.  This change appears to remove
the panic that I was running into, but at the cost of increasing
ithread scheduling latency, and as such is a temporary band-aid until
jhb has a chance to resolve the ule<->preemption interaction that is
the source of the problem.  If it doesn't fix the problem for others--
sorry!
2004-07-06 05:57:29 +00:00
Don Lewis
27875d9c88 Unconditionally set last_work_seen while in the SYNCER_RUNNING state
so that last_work_seen has a reasonable value at the transition
to the SYNCER_SHUTTING_DOWN state, even if net_worklist_len happened
to be zero at the time.

Initialize last_work_seen to zero as a safety measure in case the
syncer never ran in the SYNCER_RUNNING state.

Tested by:	phk
2004-07-05 21:32:01 +00:00
Robert Watson
6a72b225b7 Drop the socket buffer lock around a call to m_copym() with M_TRYWAIT.
A subset of locking changes to soreceive() in the queue for merging.

Bumped into by:	Willem Jan Withagen <wjw@withagen.nl>
2004-07-05 19:29:33 +00:00
Don Lewis
faf1b66d1d Rework syncer termination code:
Speed up the syncer when shutting down by sleeping for a shorter
    period of time instead of cranking up rushjob and using the
    normal one second sleep.

    Skip empty worklist slots when shutting down to avoid lengthy
    intervals of inactivity.

    Give I/O more time to complete between steps by not speeding the
    syncer quite as much.

    Terminate the syncer after one full pass through the worklist
    plus one second with the worklist containing nothing but syncer
    vnodes.

    Print an indication of shutdown progress to the console.

Add a sysctl, vfs.worklist_len, to allow the size of the syncer worklist
to be monitored.
2004-07-05 01:07:33 +00:00
Poul-Henning Kamp
c555963fd1 Give synthetic root filesystem device vnodes a v_bsize of DEV_BSIZE. 2004-07-04 22:33:22 +00:00
Alfred Perlstein
2d1dca73ee Pass the operation in with the fsidctl.
Remove some fsidctls that we will not be using.
Correct prototypes for fs sysctls.
2004-07-04 20:21:58 +00:00
Poul-Henning Kamp
7f6599fec6 Make the last commit handle non-phk root devices better. 2004-07-04 19:42:25 +00:00
Stefan Farfeleder
5908d366fb Consistently use __inline instead of __inline__ as the former is an empty macro
in <sys/cdefs.h> for compilers without support for inline.
2004-07-04 16:11:03 +00:00
Poul-Henning Kamp
1cbb1e02c4 Blocksize for I/O should be a property of the vnode and not found by groping
around in the vnode's surroundings when we allocate a block.

Assign a blocksize when we create a vnode, and yell a warning (and ignore it)
if we got the wrong size.

Please email all such warnings to me.
2004-07-04 12:49:04 +00:00
Alfred Perlstein
94ed9c8af5 Introduce a new kevent filter, EVFILT_FS, that will be used to signal
generic filesystem events to userspace.  Currently only mount and unmount
of filesystems are signalled.  Soon to be added, up/down status of NFS.

Introduce a sysctl node used to route requests to/from filesystems
based on filesystem ids.

Introduce a new vfsop, vfs_sysctl(mp, req) that is used as the callback/
entrypoint by the sysctl code to change individual filesystems.
2004-07-04 10:52:54 +00:00
Alfred Perlstein
903ac7c219 Revision 1.496 would not boot on my system due to
ffs_mount -> bdevvp -> getnewvnode(..., mp = NULL, ...) ->
 insmntque(vp, mp = NULL) -> KASSERT -> panic

Make getnewvnode() only call insmntque() if the mountpoint parameter
is not NULL.
2004-07-04 10:19:15 +00:00
Poul-Henning Kamp
e3c5a7a4dd When we traverse the vnodes on a mountpoint we need to look out for
our cached 'next vnode' being removed from this mountpoint.  If we
find that it was recycled, we restart our traversal from the start
of the list.

Code to do that is in all local disk filesystems (and a few other
places) and looks roughly like this:

		MNT_ILOCK(mp);
	loop:
		for (vp = TAILQ_FIRST(&mp...);
		    (vp = nvp) != NULL;
		    nvp = TAILQ_NEXT(vp,...)) {
			if (vp->v_mount != mp)
				goto loop;
			MNT_IUNLOCK(mp);
			...
			MNT_ILOCK(mp);
		}
		MNT_IUNLOCK(mp);

The code which takes vnodes off a mountpoint looks like this:

	MNT_ILOCK(vp->v_mount);
	...
	TAILQ_REMOVE(&vp->v_mount->mnt_nvnodelist, vp, v_nmntvnodes);
	...
	MNT_IUNLOCK(vp->v_mount);
	...
	vp->v_mount = something;

(Take a moment and try to spot the locking error before you read on.)

On an SMP system, one CPU could have removed nvp from our mountlist
but not yet gotten to assign a new value to vp->v_mount while another
CPU simultaneously gets to the top of the traversal loop where it
finds that (vp->v_mount != mp) is not true despite the fact that
the vnode has indeed been removed from our mountpoint.

Fix:

Introduce the macro MNT_VNODE_FOREACH() to traverse the list of
vnodes on a mountpoint while taking into account that vnodes may
be removed from the list as we go.  This saves approx 65 lines of
duplicated code.

Split insmntque(), which potentially moves a vnode from one mount
point to another, into delmntque() and insmntque(), which do just
what the names say.

Fix delmntque() to set vp->v_mount to NULL while holding the
mountpoint lock.
2004-07-04 08:52:35 +00:00
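The delmntque() part of the fix above can be sketched in single-threaded userland C. This is a simplified model, not the kernel implementation: the structures, the singly linked list, and the integer standing in for the mountpoint lock are all hypothetical. It shows only the ordering that matters, namely that v_mount is cleared before the lock is released, so a traverser that holds the lock and checks (vp->v_mount != mp) cannot see a stale value.

```c
#include <assert.h>
#include <stddef.h>

struct mount_sketch;

struct vnode_sketch {
	struct mount_sketch *v_mount;
	struct vnode_sketch *v_next;
};

struct mount_sketch {
	int locked;			/* stand-in for the mountpoint lock */
	struct vnode_sketch *head;	/* vnodes on this mountpoint */
};

static void
mnt_ilock_sketch(struct mount_sketch *mp)
{
	assert(!mp->locked);
	mp->locked = 1;
}

static void
mnt_iunlock_sketch(struct mount_sketch *mp)
{
	assert(mp->locked);
	mp->locked = 0;
}

/* Unlink vp from its mountpoint and clear v_mount under the same lock. */
static void
delmntque_sketch(struct vnode_sketch *vp)
{
	struct mount_sketch *mp = vp->v_mount;
	struct vnode_sketch **pp;

	if (mp == NULL)
		return;
	mnt_ilock_sketch(mp);
	for (pp = &mp->head; *pp != NULL; pp = &(*pp)->v_next) {
		if (*pp == vp) {
			*pp = vp->v_next;
			break;
		}
	}
	vp->v_next = NULL;
	vp->v_mount = NULL;		/* cleared before the lock is dropped */
	mnt_iunlock_sketch(mp);
}

/* Returns 1 if the middle vnode was removed and detached cleanly. */
static int
delmntque_demo(void)
{
	struct mount_sketch mp = { 0, NULL };
	struct vnode_sketch v[3];
	struct vnode_sketch *vp;
	int i, n = 0;

	for (i = 2; i >= 0; i--) {
		v[i].v_mount = &mp;
		v[i].v_next = mp.head;
		mp.head = &v[i];
	}
	delmntque_sketch(&v[1]);
	for (vp = mp.head; vp != NULL; vp = vp->v_next)
		n++;
	return (n == 2 && v[1].v_mount == NULL && !mp.locked);
}
```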
Poul-Henning Kamp
cfa5e80af8 Remove stale comment 2004-07-03 19:37:06 +00:00
Poul-Henning Kamp
279f949ee5 Add NULL arg to mi_switch() call to stop kernel compiles from breaking. 2004-07-03 16:57:51 +00:00
John Baldwin
b5cbda5055 Add a NULL param to an mi_switch() that I missed.
Reported by:	Jung-uk Kim jkim at niksun dot com
2004-07-03 02:38:03 +00:00
Bosko Milekic
abdb4e5d01 Fix SCHED_ULE build on SMP. The previous revision (1.110)
introduced a KSE_CAN_MIGRATE() invocation with one argument
missing (class).  Either this is a genuine oversight or it crept
in from JHB's repo where he may have modified it.  If it's
the latter then it may require more attention.  For now fix
the make depend.
2004-07-03 01:19:46 +00:00
Marcel Moolenaar
8b44a2e2c9 Unbreak build for the !PREEMPTION case: don't define variables
that aren't used in that case.
2004-07-03 00:57:43 +00:00
John Baldwin
0c0b25ae91 Implement preemption of kernel threads natively in the scheduler rather
than as one-off hacks in various other parts of the kernel:
- Add a function maybe_preempt() that is called from sched_add() to
  determine if a thread about to be added to a run queue should be
  preempted to directly.  If it is not safe to preempt or if the new
  thread does not have a high enough priority, then the function returns
  false and sched_add() adds the thread to the run queue.  If the thread
  should be preempted to but the current thread is in a nested critical
  section, then the flag TDF_OWEPREEMPT is set and the thread is added
  to the run queue.  Otherwise, mi_switch() is called immediately and the
  thread is never added to the run queue since it is switched to directly.
  When exiting an outermost critical section, if TDF_OWEPREEMPT is set,
  then clear it and call mi_switch() to perform the deferred preemption.
- Remove explicit preemption from ithread_schedule() as calling
  setrunqueue() now does all the correct work.  This also removes the
  do_switch argument from ithread_schedule().
- Do not use the manual preemption code in mtx_unlock if the architecture
  supports native preemption.
- Don't call mi_switch() in a loop during shutdown to give ithreads a
  chance to run if the architecture supports native preemption since
  the ithreads will just preempt DELAY().
- Don't call mi_switch() from the page zeroing idle thread for
  architectures that support native preemption as it is unnecessary.
- Native preemption is enabled on the same archs that supported ithread
  preemption, namely alpha, i386, and amd64.

This change should largely be a NOP for the default case as committed
except that we will do fewer context switches in a few cases and will
avoid the run queues completely when preempting.

Approved by:	scottl (with his re@ hat)
2004-07-02 20:21:44 +00:00
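The decision described in the first bullet of the commit above can be sketched as a small pure function. This is an illustration under stated assumptions, not the scheduler code: the struct, the enum, and the function names are hypothetical, lower numeric values are taken to mean higher priority, and any nonzero nesting count is treated as "inside a critical section", which is a simplification of the nested case the commit describes.

```c
#include <assert.h>

enum preempt_action_sketch {
	PREEMPT_ENQUEUE,	/* caller adds the thread to the run queue */
	PREEMPT_DEFER,		/* owe-preempt flag set, thread still enqueued */
	PREEMPT_SWITCH		/* switch to the new thread immediately */
};

struct cur_thread_sketch {
	int priority;		/* lower value means higher priority here */
	int critnest;		/* critical section nesting count */
	int owepreempt;		/* stand-in for TDF_OWEPREEMPT */
};

/* Decide what to do with a thread that is about to become runnable. */
static enum preempt_action_sketch
maybe_preempt_sketch(struct cur_thread_sketch *cur, int new_pri, int safe)
{
	if (!safe || new_pri >= cur->priority)
		return (PREEMPT_ENQUEUE);
	if (cur->critnest > 0) {
		cur->owepreempt = 1;	/* remember the deferred preemption */
		return (PREEMPT_DEFER);
	}
	return (PREEMPT_SWITCH);
}

/*
 * Leave one level of critical section.  Returns 1 when the outermost
 * section was exited and a deferred preemption should be performed now.
 */
static int
critical_exit_sketch(struct cur_thread_sketch *cur)
{
	if (--cur->critnest == 0 && cur->owepreempt) {
		cur->owepreempt = 0;
		return (1);
	}
	return (0);
}
```

In this model the caller would switch threads on PREEMPT_SWITCH and whenever critical_exit_sketch() returns 1, and would place the thread on the run queue in the other two cases.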
John Baldwin
bf0acc273a - Change mi_switch() and sched_switch() to accept an optional thread to
  switch to.  If a non-NULL thread pointer is passed in, then the CPU will
  switch to that thread directly rather than calling choosethread() to pick
  a thread to switch to.
- Make sched_switch() aware of idle threads and know to do
  TD_SET_CAN_RUN() instead of sticking them on the run queue rather than
  requiring all callers of mi_switch() to know to do this if they can be
  called from an idlethread.
- Move constants for arguments to mi_switch() and thread_single() out of
  the middle of the function prototypes and up above into their own
  section.
2004-07-02 19:09:50 +00:00
David Xu
f3b929bf42 Allow ptrace to deal with lwpid.
Reviewed by: marcel
2004-07-02 09:19:22 +00:00
Alfred Perlstein
95f004dccd We allocate an array of pointers to the global file table while
not holding the filelist_lock.  This means the filelist can change
size while allocating.  Detect this race and retry the allocation.
2004-07-02 07:40:10 +00:00
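The allocate-then-recheck pattern in the commit above can be sketched roughly as follows. Everything here is hypothetical (the table struct, the integer lock flag, and the race_hook used to simulate another thread growing the table), and the sketch assumes that only growth of the table forces a retry. It is meant to show the shape of the loop, not the actual kernel change.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct table_sketch {
	int locked;		/* stand-in for filelist_lock */
	size_t nfiles;		/* current number of entries */
};

/* Hypothetical hook: lets the demo grow the table in the unlocked window. */
static void (*race_hook)(struct table_sketch *);
static int grew_already;

/*
 * Allocate an array sized for the table without holding the lock, then
 * take the lock and check that the table still fits.  Retry if it grew.
 */
static void **
alloc_snapshot_sketch(struct table_sketch *t, size_t *outn, int *retries)
{
	void **arr;
	size_t n;

	for (;;) {
		n = t->nfiles;			/* unlocked size read */
		arr = calloc(n != 0 ? n : 1, sizeof(*arr));
		if (arr == NULL)
			return (NULL);
		if (race_hook != NULL)
			race_hook(t);		/* simulated concurrent growth */
		t->locked = 1;
		if (t->nfiles <= n)
			break;			/* array is big enough */
		t->locked = 0;
		free(arr);
		(*retries)++;
	}
	*outn = t->nfiles;			/* entries would be copied here */
	t->locked = 0;
	return (arr);
}

static void
grow_once(struct table_sketch *t)
{
	if (!grew_already) {
		t->nfiles += 4;
		grew_already = 1;
	}
}

/* Returns 1 if exactly one retry happened and the final size was seen. */
static int
alloc_snapshot_demo(void)
{
	struct table_sketch t = { 0, 8 };
	size_t n = 0;
	int retries = 0;
	void **arr;

	grew_already = 0;
	race_hook = grow_once;
	arr = alloc_snapshot_sketch(&t, &n, &retries);
	if (arr == NULL)
		return (-1);
	free(arr);
	return (retries == 1 && n == 12);
}
```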
John Baldwin
a3a7017895 Tidy up uprof locking. Mostly the fields are protected by both the proc
lock and sched_lock so they can be read with either lock held.  Document
the locking as well.  The one remaining bogosity is that pr_addr and
pr_ticks should be per-thread but profiling of multithreaded apps is
currently undefined.
2004-07-02 03:50:48 +00:00
John Baldwin
16f9f20579 - Assert that any process that has statclock called on it has both a
  stats structure and a vmspace as this should always be true rather
  than checking the always true condition in an if statement.
- Remove never-false check: if ((ru = &pstats->p_ru) != NULL)
- Remove pstats variable that is only used once and inline its one use
  instead.
2004-07-02 03:48:09 +00:00
Marcel Moolenaar
cd28f17da2 Change the thread ID (thr_id_t) used for 1:1 threading from being a
pointer to the corresponding struct thread to the thread ID (lwpid_t)
assigned to that thread. The primary reason for this change is that
libthr now internally uses the same ID as the debugger and the kernel
when referring to a kernel thread. This allows us to implement the
support for debugging without additional translations and/or mappings.

To preserve the ABI, the 1:1 threading syscalls, including the umtx
locking API have not been changed to work on a lwpid_t. Instead the
1:1 threading syscalls operate on long and the umtx locking API has
not been changed except for the contested bit. Previously this was
the least significant bit. Now it's the most significant bit. Since
the contested bit should not be tested by userland, this change is
not expected to be visible. Just to be sure, UMTX_CONTESTED has been
removed from <sys/umtx.h>.

Reviewed by: mtm@
ABI preservation tested on: i386, ia64
2004-07-02 00:40:07 +00:00
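The move of the contested bit from the least significant to the most significant bit can be illustrated with a short sketch. The macro and function names below are hypothetical and are not the contents of <sys/umtx.h>; an unsigned long is used for the lock word purely to keep the bit arithmetic simple. One plausible reading, assumed here, is that a numeric thread ID may be odd, so a flag in the low bit could collide with the ID itself, whereas the top bit stays free.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical flag values for the sketch. */
#define SKETCH_CONTESTED_OLD	1UL			/* least significant bit */
#define SKETCH_CONTESTED_NEW	(~(ULONG_MAX >> 1))	/* most significant bit */

/* Extract the owner ID from a lock word that uses the new flag position. */
static unsigned long
umtx_owner_sketch(unsigned long word)
{
	return (word & ~SKETCH_CONTESTED_NEW);
}

/* Report whether the new-style contested flag is set in the lock word. */
static int
umtx_contested_sketch(unsigned long word)
{
	return ((word & SKETCH_CONTESTED_NEW) != 0);
}
```

With an odd ID such as 100001, the old flag position would already read as set on an uncontested lock word, while the new position leaves the ID bits untouched.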
Marcel Moolenaar
c2589102b0 Regen. 2004-07-02 00:38:56 +00:00
Don Lewis
e06500dde5 When shutting down the syncer kernel thread, first tell it to run
faster and iterate over its work list a few times in an attempt
to empty the work list before the syncer terminates.  This leaves
fewer dirty blocks to be written at the "syncing disks" stage and
keeps the "giving up on N buffers" problem from being triggered
by the presence of a large soft updates work list at system shutdown
time.  The downside is that the syncer takes noticeably longer to
terminate.

Tested by:	"Arjan van Leeuwen" <avleeuwen AT piwebs DOT com>
Approved by:	mckusick
2004-07-01 23:59:19 +00:00
Warner Losh
da35daffaf Add ability to set start/end for rman 2004-07-01 16:22:10 +00:00
John Baldwin
39981fed82 Trim a few things from the dmesg output and stick them under bootverbose to
cut down on the clutter including PCI interrupt routing, MTRR, pcibios,
etc.

Discussed with:	USENIX Cabal
2004-07-01 07:46:29 +00:00
Warner Losh
0363a12688 Hide struct resource and struct rman. You must define
__RMAN_RESOURCE_VISIBLE to see inside these now.

Reviewed by: dfr, njl (not njr)
2004-06-30 16:54:10 +00:00
Warner Losh
37b4e4f471 Include more information about the device in the devadded and
devremoved events.  This reduces the races around these events.  We
now include the pnp info in both.  This lets one do more interesting
things with devd on device insertion.

Submitted by: Bernd Walter
2004-06-30 02:46:25 +00:00
John Baldwin
01bd10e163 Oops, this didn't make it into my submit before I committed: Defer
creation of the sysctl tree for the turnstile profiling stats until a
SI_SUB_LOCK sysinit.  Doing it in init_turnstiles() is too early as it is
called before mi_startup().
2004-06-29 03:48:49 +00:00
Peter Wemm
5b201fdcaa Wrap long line. 2004-06-29 03:13:54 +00:00
John Baldwin
ef0ebfc351 Add two new kernel options to allow rudimentary profiling of the internal
hash tables used in the sleep queue and turnstile code.  Each option adds
a sysctl tree under debug containing the maximum depth of any bucket in
the hash table as well as a separate node for each bucket (or chain)
containing the current depth and maximum depth for that bucket.
2004-06-29 02:30:12 +00:00
John Baldwin
a5471e4ef4 Remove the signal_caught argument from sleepq_timedwait() as it was
effectively always zero.
2004-06-28 18:57:06 +00:00
John Baldwin
bd83e879fd - Execute all of the tasks on the taskqueue during taskqueue_free() after
  the queue has been removed from the global taskqueue_queues list.  This
  removes the need for the draining queue hack.
- Allow taskqueue_run() to be called with the taskqueue mutex held.  It
  can still be called without the lock for API compatibility.  In that case
  it will acquire the lock internally.
- Don't lock the individual queue mutex in taskqueue_find() until after the
  strcmp as the global queues mutex is sufficient for the strcmp.
- Simplify taskqueue_thread_loop() now that it can hold the lock across
  taskqueue_run().

Submitted by:	bde (mostly)
2004-06-28 16:28:23 +00:00
John Baldwin
c086588f32 Adjust the priority of the idle threads to be the lowest possible
priority.  This is just a cosmetic nit as the idle thread priorities aren't
used by the schedulers.

Reported by:	bde
2004-06-28 16:19:50 +00:00
Warner Losh
29b95d5a7e Turns out that jhb didn't really like this. And nate pointed out that
it wasn't a good idea to have the test for NULL on only a limited
subset.  Go back because I'm not sure adding NULL to all the others is
a good idea.
2004-06-28 03:40:23 +00:00
Warner Losh
d5ca7f4f2b Allow dev to be NULL and assume that a device is not alive or not
attached.

Reviewed by: njl(?) and jhb
2004-06-28 02:24:04 +00:00
Pawel Jakub Dawidek
46e3b1cbe7 Add two missing includes and remove two unneeded ones.
This is quite a serious fix, because even with MAC framework compiled in,
MAC entry points in those two files were simply ignored.
2004-06-27 09:03:22 +00:00
Robert Watson
7717cf07f8 Acquire the socket buffer lock when calling unp_scan() on
so->so_rcv.sb_mb to prevent the mbuf chain from changing during the
scan.
2004-06-27 03:29:25 +00:00
Robert Watson
a290574663 Add a new global mutex, so_global_mtx, which protects the global variables
so_gencnt, numopensockets, and the per-socket field so_gencnt.  Annotate
that this might be better done with atomic operations.

Annotate what accept_mtx protects.
2004-06-27 03:22:15 +00:00
Robert Watson
1e4d7da707 Reduce the number of unnecessary unlock-relocks on socket buffer mutexes
associated with performing a wakeup on the socket buffer:

- When performing an sbappend*() followed by a so[rw]wakeup(), explicitly
  acquire the socket buffer lock and use the _locked() variants of both
  calls.  Note that the _locked() sowakeup() versions unlock the mutex on
  return.  This is done in uipc_send(), divert_packet(), mroute
  socket_send(), raw_append(), tcp_reass(), tcp_input(), and udp_append().

- When the socket buffer lock is dropped before a sowakeup(), remove the
  explicit unlock and use the _locked() sowakeup() variant.  This is done
  in soisdisconnecting(), soisdisconnected() when setting the can't send/
  receive flags and dropping data, and in uipc_rcvd() when adjusting
  back-pressure on the sockets.

For UNIX domain sockets running mpsafe with a contention-intensive SMP
mysql benchmark, this results in a 1.6% query rate improvement due to
reduced mutex costs.
2004-06-26 19:10:39 +00:00
Marcel Moolenaar
247aba2474 Allocate TIDs in thread_init() and deallocate them in thread_fini().
The overhead of unconditionally allocating TIDs (and likewise,
unconditionally deallocating them), is amortized across multiple
thread creations by the way UMA makes it possible to have type-stable
storage.
Previously the cost was kept down by having threads created as part
of a fork operation use the process' PID as the TID. While this had
some nice properties, it also introduced complexity in the way TIDs
were allocated. Most importantly, by using the type-stable storage
that UMA gives us this was also unnecessary.

This change affects how core dumps are created and in particular how
the PRSTATUS notes are dumped. Since we don't have a thread with a
TID equalling the PID, we now need a different way to preserve the
previous behavior. We do this by having the given thread (i.e.
the thread passed to the core dump code in td) dump its state first
and fill in pr_pid with the actual PID. All other threads will have
pr_pid contain their TIDs. The upshot of all this is that the debugger
will now likely select the right LWP (=TID) as the initial thread.

Credits to: julian@ for spotting how we can utilize UMA.
Thanks to: all who provided julian@ with test results.
2004-06-26 18:58:22 +00:00
Robert Watson
11c40a39b6 Replace comment on spl state when calling soabort() with a comment on
locking state.  No socket locks should be held when calling soabort()
as it will call into protocol code that may acquire socket locks.
2004-06-26 17:12:29 +00:00
Poul-Henning Kamp
cb9ea5f4cb Pick the hotchar out of the tty structure instead of caching private
copies.

No current line disciplines have a dynamically changing hotchar, and
expecting to receive anything sensible during a change in ldisc is
insane so no locking of the hotchar field is necessary.
2004-06-26 09:20:07 +00:00
Poul-Henning Kamp
4776c07426 Fix line discipline switching issues: If opening a new ldisc fails,
we have to revert to TTYDISC which we know will successfully open
rather than try the previous ldisc which might also fail to open.

Do not let ldisc implementations muck about with ->t_line, and remove
code which checks for reopens, it should never happen.

Move ldisc->l_hotchar to tty->t_hotchar and have ldisc implementation
initialize it in their open routines.  Reset to zero when we enter
TTYDISC.  ("no" should really be -1 since zero could be a valid
hotchar for certain old european mainframe protocols.)
2004-06-26 08:44:04 +00:00
Poul-Henning Kamp
ccfac9e40e Gah! commit from wrong tree.
Remove now unused variables from last commit.
2004-06-25 22:10:20 +00:00
Poul-Henning Kamp
950cce9b30 Retire the TIOC_REMOTE ioctl.
It was added 22 years ago for emacs to use, but emacs gave up on it
17 years ago.
2004-06-25 21:54:49 +00:00
Robert Watson
a5993a9778 Release UNIX domain socket subsystem lock earlier -- don't need to
hold it over free of unp_addr if we've already removed all references
to unp.
2004-06-25 20:12:06 +00:00
Poul-Henning Kamp
e77b206f0e Add two new methods to struct tty: One for manipulating BREAK condition
and one for fiddling modem-control signals.

Add generic code to deal with the relevant ioctls if these methods are
present.
2004-06-25 10:24:10 +00:00
Robert Watson
4f3bf9b9b4 Don't cuddle else's so much as we removed additional parts of each
block.
2004-06-24 17:22:29 +00:00
Robert Watson
5e11031e05 Remove temporary API bandage that allowed applications speaking the
older API to list attributes on a file (zero-length attribute name)
to function.  extattr_list_*() are now the only available APIs to
use when listing attributes.
2004-06-24 17:14:28 +00:00
Poul-Henning Kamp
075ef10234 #include <sys/serial.h> 2004-06-24 10:32:30 +00:00
Poul-Henning Kamp
98de21b633 Use CTASSERT to enforce the relationship between the new serial port
modem definitions and the old definitions from ioctls.
2004-06-24 10:06:55 +00:00
Robert Watson
c6b93bf29a Lock socket buffers when processing the socket options SO_SNDLOWAT
or SO_RCVLOWAT for read-modify-write.
2004-06-24 04:28:30 +00:00
Robert Watson
ad6b0efff5 Acquire socket lock in the "waiting for connection" loop in
kern_connect(), replacing tsleep() with msleep() with the socket
mutex.
2004-06-24 01:43:23 +00:00
Robert Watson
3f11a2f374 Introduce sbreserve_locked(), which asserts the socket buffer lock on
the socket buffer having its limits adjusted.  sbreserve() now acquires
the lock before calling sbreserve_locked().  In soreserve(), acquire
socket buffer locks across read-modify-writes of socket buffer fields,
and calls into sbreserve/sbrelease; make sure to acquire in keeping
with the socket buffer lock order.  In tcp_mss(), acquire the socket
buffer lock in the calling context so that we have atomic
read-modify-write on buffer sizes.
2004-06-24 01:37:04 +00:00
Robert Watson
adb4cf0fbc Slide socket buffer lock earlier in sopoll() to cover the call into
selrecord(), so that setting up select and flagging the socket buffers as
SB_SEL both happen under the lock.
2004-06-24 00:54:26 +00:00
Bruce M Simpson
a3146ff925 Fix an inconsistency in socket option propagation on accept(). Propagate
the SS_NBIO flag from the parent socket to the child socket during an
accept() operation.

The file descriptor O_NONBLOCK flag would have been propagated already
by the fflag assignment, and therefore would have been inconsistent
with the underlying socket's so_state member.

This makes accept() more closely adhere to the API contract we effectively
outline in the manual page. Note also that Linux continues to differ here;
O_NONBLOCK is not propagated. The other BSDs do propagate the flag, as
does Solaris. The Single UNIX Specification does not offer specific
advice on this issue.

PR:		kern/45733
Requested by:	Jayanth Vijayaraghavan
Reviewed by:	rwatson
2004-06-22 23:58:09 +00:00
Lukas Ertl
9a98ae94ba Fix a few spelling mistakes in comments and clean them up a bit. 2004-06-22 20:22:24 +00:00
Robert Watson
5282c61738 Regenerate after updating syscalls.master. 2004-06-22 04:36:25 +00:00
Robert Watson
2ed57081a7 Mark unlink() as MPSAFE as we now acquire Giant in the unlink()
system call.
2004-06-22 04:34:55 +00:00
Robert Watson
9260798fd7 Acquire Giant in link() so that the system call can be marked
MPSAFE.  Don't want to acquire Giant in kern_link() since Linux
compat code performs actions requiring Giant prior to calling
kern_link().
2004-06-22 04:34:05 +00:00
Robert Watson
7af72ad7b6 Rebuild following marking link() as MPSAFE. 2004-06-22 04:29:59 +00:00
Robert Watson
61d87ffdc0 Mark link() system call as MPSAFE. 2004-06-22 04:29:27 +00:00
Robert Watson
694b21cf7b Acquire Giant in link() so that we can mark it as MSTD in
syscalls.master.  Don't want to do it in kern_link() since the
Linux emulation code calls kern_link() after performing other
actions requiring Giant.
2004-06-22 04:29:07 +00:00
Robert Watson
fea24c0a71 Remove spl's from uipc_socket to ease in merging. 2004-06-22 03:49:22 +00:00
Scott Long
36c6fd1c0f Fix another typo in the previous commit. 2004-06-21 23:47:47 +00:00
Poul-Henning Kamp
ec66f15d14 Put the pre FreeBSD-2.x tty compat code under BURN_BRIDGES. 2004-06-21 22:57:16 +00:00
Scott Long
c38dd4b6bd Fix typo that somehow crept into the previous commit 2004-06-21 22:42:46 +00:00
Kelly Yancey
de0a924120 Update previous commit to:
  * Obtain/release schedlock around calls to calcru.
  * Sort switch cases which do not cascade per style(9).
  * Sort local variables per style(9).
  * Remove "superfluous" whitespace.
  * Cleanup handling of NULL uap->tp in clock_getres().  It would probably
    be better to return EFAULT like clock_gettime() does by passing the
    pointer to copyout(), but I presume it was written to not fail on
    purpose in the original code.  I'll defer to -standards on this one.

Reported by:	bde
2004-06-21 22:34:57 +00:00
Scott Long
dc09579417 Add the sysctl node 'kern.sched.name' that has the name of the scheduler
currently in use.  Move the 4bsd kern.quantum node to kern.sched.quantum
for consistency.
2004-06-21 22:05:46 +00:00
Julian Elischer
dcc9954eb9 Mark the thread in an exiting program as inactive.
This is not really used by the process but it's confusing to some
status readers to see zombie processes with "running" threads.

Pointed out by: Don Lewis <truckman@FreeBSD.org>
2004-06-21 20:44:02 +00:00
Bruce Evans
ba39a1c5a4 Turned off the "calcru: negative time" warning for certain SMP cases
where it is known to detect a problem but the problem is not very easy
to fix.  The warning became very common recently after a call to calcru()
was added to fill_kinfo_thread().

Another (much older) cause of "negative times" (actually non-monotonic
times) was fixed in rev.1.237 of kern_exit.c.

Print separate messages for non-monotonic and negative times.
2004-06-21 17:46:27 +00:00
Bruce Evans
40a3fa2d59 (1) Removed the bogus condition "p->p_pid != 1" on calling sched_exit()
    from exit1().  sched_exit() must be called unconditionally from exit1().
    It was called almost unconditionally because init only exits on system
    shutdown, if at all.

(2) Removed the comment that presumed to know what sched_exit() does.
    sched_exit() does different things for the ULE case.  The call became
    essential when it started doing load average stuff, but its caller
    should not know that.

(3) Didn't fix bugs caused by bitrot in the condition.  The condition was
    last correct in rev.1.208 when it was in wait1().  There p was spelled
    curthread->td_proc and was for the waiting parent; now p is for the
    exiting child.  The condition was to avoid lowering init's priority.
    It should be in sched_exit() itself.  Lowering of priorities is broken
    in other ways in at least the 4BSD scheduler, and doing it for init
    causes less noticeable problems than doing it for shells.

Noticed by:	julian (1)
2004-06-21 14:49:50 +00:00
Bruce Evans
871684b822 Update p_runtime on exit. This fixes calcru() on zombies, and prepares
for not calling calcru() on exit.  calcru() on a zombie can happen if
ttyinfo() (^T) picks one.

PR:		52490
2004-06-21 14:03:38 +00:00
Poul-Henning Kamp
55dbc267cb New style functions, kill register keyword. 2004-06-21 12:28:56 +00:00
Robert Watson
a34b704666 Merge next step in socket buffer locking:
- sowakeup() now asserts the socket buffer lock on entry.  Move
  the call to KNOTE higher in sowakeup() so that it is made with
  the socket buffer lock held for consistency with other calls.
  Release the socket buffer lock prior to calling into pgsigio(),
  so_upcall(), or aio_swake().  Locking for this event management
  will need revisiting in the future, but this model avoids lock
  order reversals when upcalls into other subsystems result in
  socket/socket buffer operations.  Assert that the socket buffer
  lock is not held at the end of the function.

- Wrapper macros for sowakeup(), sorwakeup() and sowwakeup(), now
  have _locked versions which assert the socket buffer lock on
  entry.  If a wakeup is required by sb_notify(), invoke
  sowakeup(); otherwise, unconditionally release the socket buffer
  lock.  This results in the socket buffer lock being released
  whether a wakeup is required or not.

- Break out socantsendmore() into socantsendmore_locked() that
  asserts the socket buffer lock.  socantsendmore()
  unconditionally locks the socket buffer before calling
  socantsendmore_locked().  Note that both functions return with
  the socket buffer unlocked as socantsendmore_locked() calls
  sowwakeup_locked() which has the same properties.  Assert that
  the socket buffer is unlocked on return.

- Break out socantrcvmore() into socantrcvmore_locked() that
  asserts the socket buffer lock.  socantrcvmore() unconditionally
  locks the socket buffer before calling socantrcvmore_locked().
  Note that both functions return with the socket buffer unlocked
  as socantrcvmore_locked() calls sorwakeup_locked() which has
  similar properties.  Assert that the socket buffer is unlocked
  on return.

- Break out sbrelease() into a sbrelease_locked() that asserts the
  socket buffer lock.  sbrelease() unconditionally locks the
  socket buffer before calling sbrelease_locked().
  sbrelease_locked() now invokes sbflush_locked() instead of
  sbflush().

- Assert the socket buffer lock in socket buffer sanity check
  functions sblastrecordchk(), sblastmbufchk().

- Assert the socket buffer lock in SBLINKRECORD().

- Break out various sbappend() functions into sbappend_locked()
  (and variations on that name) that assert the socket buffer
  lock.  The !_locked() variations unconditionally lock the socket
  buffer before calling their _locked counterparts.  Internally,
  make sure to call _locked() support routines, etc, if already
  holding the socket buffer lock.

- Break out sbinsertoob() into sbinsertoob_locked() that asserts
  the socket buffer lock.  sbinsertoob() unconditionally locks the
  socket buffer before calling sbinsertoob_locked().

- Break out sbflush() into sbflush_locked() that asserts the
  socket buffer lock.  sbflush() unconditionally locks the socket
  buffer before calling sbflush_locked().  Update panic strings
  for new function names.

- Break out sbdrop() into sbdrop_locked() that asserts the socket
  buffer lock.  sbdrop() unconditionally locks the socket buffer
  before calling sbdrop_locked().

- Break out sbdroprecord() into sbdroprecord_locked() that asserts
  the socket buffer lock.  sbdroprecord() unconditionally locks
  the socket buffer before calling sbdroprecord_locked().

- sofree() now calls socantsendmore_locked() and re-acquires the
  socket buffer lock on return.  It also now calls
  sbrelease_locked().

- sorflush() now calls socantrcvmore_locked() and re-acquires the
  socket buffer lock on return.  Clean up/mess up other behavior
  in sorflush() relating to the temporary stack copy of the socket
  buffer used with dom_dispose by more properly initializing the
  temporary copy, and selectively bzeroing/copying more carefully
  to prevent WITNESS from getting confused by improperly
  initialized mutexes.  Annotate why that's necessary, or at
  least, needed.

- soisconnected() now calls sbdrop_locked() before unlocking the
  socket buffer to avoid locking overhead.

Some parts of this change were:

Submitted by:	sam
Sponsored by:	FreeBSD Foundation
Obtained from:	BSD/OS
2004-06-21 00:20:43 +00:00
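Note on the entry above: the recurring "break out foo() into foo_locked()" pattern can be sketched in user space roughly as follows. This is a minimal illustration only, assuming a pthread mutex in place of the kernel socket buffer mutex; struct buf, the held flag, and the buf_*() names are hypothetical and are not the kernel's struct sockbuf or sb*() routines.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a socket buffer; not the kernel's struct sockbuf. */
struct buf {
	pthread_mutex_t	mtx;
	int		held;	/* set while mtx is held; used for assertions only */
	size_t		cc;	/* bytes currently queued */
};

static void
buf_init(struct buf *b)
{
	pthread_mutex_init(&b->mtx, NULL);
	b->held = 0;
	b->cc = 0;
}

static void
buf_lock(struct buf *b)
{
	pthread_mutex_lock(&b->mtx);
	b->held = 1;
}

static void
buf_unlock(struct buf *b)
{
	b->held = 0;
	pthread_mutex_unlock(&b->mtx);
}

/* _locked variant: asserts that the caller already holds the lock. */
static void
buf_drop_locked(struct buf *b, size_t len)
{
	assert(b->held);
	b->cc -= (len < b->cc) ? len : b->cc;
}

/* Plain variant: unconditionally lock, call the _locked routine, unlock. */
static void
buf_drop(struct buf *b, size_t len)
{
	buf_lock(b);
	buf_drop_locked(b, len);
	buf_unlock(b);
}
```

Callers that already hold the lock use buf_drop_locked() directly and avoid a second acquire; everyone else calls buf_drop().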
Garance A Drosehn
7638fa19a7 Fill in the values for the ki_tid and ki_numthreads which have been
added to kproc_info.

PR:		bin/65803  (a tiny part...)
Submitted by:	Cyrille Lefevre
2004-06-20 22:17:22 +00:00
Robert Watson
c9f69064af In uipc_rcvd(), lock the socket buffers at either end of the UNIX
domain socket when updating fields at both ends.

Submitted by:	sam
Sponsored by:	FreeBSD Foundation
2004-06-20 21:43:13 +00:00
Robert Watson
1b2e3b4b46 Hold SOCK_LOCK(so) when frobbing so_state when disconnecting a
connected UNIX domain datagram socket.
2004-06-20 21:29:56 +00:00
Robert Watson
fa8368a8fe When retrieving the SO_LINGER socket option for user space, hold the
socket lock over pulling so_options and so_linger out of the socket
structure in order to retrieve a consistent snapshot.  This may be
overkill if user space doesn't require a consistent snapshot.
2004-06-20 17:50:42 +00:00
Robert Watson
6f4b1b5578 Convert an if->panic in soclose() into a call to KASSERT(). 2004-06-20 17:47:51 +00:00
Robert Watson
ed2f7766b0 Annotate some ordering-related issues in solisten() which are not yet
resolved by socket locking: in particular, that we test the connection
state at the socket layer without locking, request that the protocol
begin listening, and then set the listen state on the socket
non-atomically, resulting in a non-atomic cross-layer test-and-set.
2004-06-20 17:38:19 +00:00
Robert Watson
d43c1f67cc Annotate two intentionally unlocked reads with comments.
Annotate a potentially inconsistent result returned to user space when
performing fstat() on a socket due to not using socket buffer locking.
2004-06-20 17:35:50 +00:00
Thomas Moestl
3971dcfa4b Initialize ni_cnd.cn_cred before calling lookup() (this is normally done
by namei(), which cannot easily be used here however). This fixes boot
time crashes on sparc64 and probably other platforms.

Reviewed by:	phk
2004-06-20 17:31:01 +00:00
Garance A Drosehn
99d2ecbc7d Add a call to calcru() to update the kproc_info fields of ki_rusage.ru_utime
and ki_rusage.ru_stime.  This greatly improves the accuracy of those fields.

Suggested by:	bde
2004-06-20 02:03:33 +00:00
Marcel Moolenaar
0068114dd5 Define __lwpid_t as an int32_t in <sys/_types.h> and define lwpid_t
as an __lwpid_t in <sys/types.h>. Retype td_tid from an int to a
lwpid_t and change related definitions accordingly.
2004-06-19 17:58:32 +00:00
Tim J. Robbins
68ba7a1d57 When no fixed address is given in a shmat() request, pass a hint address
to vm_map_find() that is less likely to be outside of addressable memory
for 32-bit processes: just past the end of the largest possible heap.
This is the same hint that mmap() uses.
2004-06-19 14:46:13 +00:00
Garance A Drosehn
078842c5c9 Fill in some new fields in 'struct kinfo_proc', namely ki_childstime,
ki_childutime, and ki_emul.  Also use the timevaladd() routine to
correct the calculation of ki_childtime.  That will correct the value
returned when ki_childtime.tv_usec > 1,000,000.

This also implements a new KERN_PROC_GID option for kvm_getprocs().
(there will be a similar update to lib/libkvm/kvm_proc.c)

Submitted by:	Cyrille Lefevre
2004-06-19 14:03:00 +00:00
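Note on the entry above: the normalization that a timevaladd()-style routine performs can be sketched as below. This is a user-space illustration under the assumption that both inputs are already normalized; tv_add() is a hypothetical name, not the kernel routine itself.

```c
#include <assert.h>
#include <sys/time.h>

/* Add *t2 into *t1, carrying microseconds so that 0 <= tv_usec < 1000000. */
static void
tv_add(struct timeval *t1, const struct timeval *t2)
{
	assert(t1->tv_usec >= 0 && t1->tv_usec < 1000000);
	assert(t2->tv_usec >= 0 && t2->tv_usec < 1000000);

	t1->tv_sec += t2->tv_sec;
	t1->tv_usec += t2->tv_usec;
	if (t1->tv_usec >= 1000000) {
		/* At most one carry is needed for two normalized inputs. */
		t1->tv_sec++;
		t1->tv_usec -= 1000000;
	}
}
```

Summing the two fields independently without the carry step is what leaves tv_usec above one million.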
Poul-Henning Kamp
d7086f313a Only initialize f_data and f_ops if nobody else did so already. 2004-06-19 11:41:45 +00:00
Poul-Henning Kamp
a769355f9b Explicitly initialize f_data and f_vnode to NULL.
Report f_vnode to userland in struct xfile.
2004-06-19 11:40:08 +00:00
Robert Watson
31f555a1c5 Assert socket buffer lock in sb_lock() to protect socket buffer sleep
lock state.  Convert tsleep() into msleep() with socket buffer mutex
as argument.  Hold socket buffer lock over sbunlock() to protect sleep
lock state.

Assert socket buffer lock in sbwait() to protect the socket buffer
wait state.  Convert tsleep() into msleep() with socket buffer mutex
as argument.

Modify sofree(), sosend(), and soreceive() to acquire SOCKBUF_LOCK()
in order to call into these functions with the lock, as well as to
start protecting other socket buffer use in their implementation.  Drop
the socket buffer mutexes around calls into the protocol layer, around
potentially blocking operations, for copying to/from user space, and
VM operations relating to zero-copy.  Assert the socket buffer mutex
strategically after code sections or at the beginning of loops.  In
some cases, modify return code to ensure locks are properly dropped.

Convert the potentially blocking allocation of storage for the remote
address in soreceive() into a non-blocking allocation; we may wish to
move the allocation earlier so that it can block prior to acquisition
of the socket buffer lock.

Drop some spl use.

NOTE: Some races exist in the current structuring of sosend() and
soreceive().  This commit only merges basic socket locking in this
code; follow-up commits will close additional races.  As merged,
these changes are not sufficient to run without Giant safely.

Reviewed by:	juli, tjr
2004-06-19 03:23:14 +00:00
Brian Feldman
8e1b797456 Add a sysctl/tunable, "kern.always_console_output", that lets you set
output to permanently (not ephemerally) go to the console.  It is also
sent to any other console specified by TIOCCONS as normal.

While I'm here, document the kern.log_console_output sysctl.
2004-06-18 20:12:42 +00:00
David Xu
b370279ef8 Add comment to reflect that we should retry after thread singling failed. 2004-06-18 11:13:49 +00:00
David Xu
0aabef657e Remove a bogus panic. It is possible that more than one thread will
be suspended in thread_suspend_check; after they are resumed, all
threads will call thread_single, but only one can succeed.  The
others should retry and will exit in thread_suspend_check.
2004-06-18 06:21:09 +00:00
David Xu
ec008e96a8 If thread singler wants to terminate other threads, make sure it includes
all threads except itself.

Obtained from: julian
2004-06-18 06:15:21 +00:00
Robert Watson
7b574f2e45 Hold SOCK_LOCK(so) while frobbing so_options. Note that while the
local race is corrected, there's still a global race in sosend()
relating to so_options and the SO_DONTROUTE flag.
2004-06-18 04:02:56 +00:00
Robert Watson
c012260726 Merge some additional leaf node socket buffer locking from
rwatson_netperf:

Introduce conditional locking of the socket buffer in fifofs kqueue
filters; KNOTE() will be called holding the socket buffer locks in
fifofs, but sometimes the kqueue() system call will poll using the
same entry point without holding the socket buffer lock.

Introduce conditional locking of the socket buffer in the socket
kqueue filters; KNOTE() will be called holding the socket buffer
locks in the socket code, but sometimes the kqueue() system call
will poll using the same entry points without holding the socket
buffer lock.

Simplify the logic in sodisconnect() since we no longer need spls.

NOTE: To remove conditional locking in the kqueue filters, it would
make sense to use a separate kqueue API entry into the socket/fifo
code when calling from the kqueue() system call.
2004-06-18 02:57:55 +00:00
Kelly Yancey
b8817154c3 Implement CLOCK_VIRTUAL and CLOCK_PROF for clock_gettime(2) and
clock_getres(2).

Reviewed by:	phk
PR:		23304
2004-06-17 23:12:12 +00:00
Robert Watson
9535efc00d Merge additional socket buffer locking from rwatson_netperf:
- Lock down low hanging fruit use of sb_flags with socket buffer
  lock.

- Lock down low hanging fruit use of so_state with socket lock.

- Lock down low hanging fruit use of so_options.

- Lock down low-hanging fruit use of sb_lowwat and sb_hiwat with
  socket buffer lock.

- Annotate situations in which we unlock the socket lock and then
  grab the receive socket buffer lock, which are currently actually
  the same lock.  Depending on how we want to play our cards, we
  may want to coalesce these lock uses to reduce overhead.

- Convert an if()->panic() into a KASSERT relating to so_state in
  soaccept().

- Remove a number of splnet()/splx() references.

More complex merging of socket and socket buffer locking to
follow.
2004-06-17 22:48:11 +00:00
Poul-Henning Kamp
b90c855961 Reduce the thaumaturgical level of root filesystem mounts: Instead of using
an otherwise redundant clone routine in geom_disk.c, mount a temporary
DEVFS and do a proper lookup.

Submitted by:	thomas
2004-06-17 21:24:13 +00:00
Poul-Henning Kamp
f3732fd15b Second half of the dev_t cleanup.
The big lines are:
	NODEV -> NULL
	NOUDEV -> NODEV
	udev_t -> dev_t
	udev2dev() -> findcdev()

Various minor adjustments including handling of userland access to kernel
space struct cdev etc.
2004-06-17 17:16:53 +00:00
Poul-Henning Kamp
89c9c53da0 Do the dreaded s/dev_t/struct cdev */
Bump __FreeBSD_version accordingly.
2004-06-16 09:47:26 +00:00
Julian Elischer
fa88511615 Nice is a property of a process as a whole.
I mistakenly moved it to the ksegroup when breaking up the process
structure. Put it back in the proc structure.
2004-06-16 00:26:31 +00:00
Peter Wemm
a8774e396e Change strategy based on a suggestion from Ian Dowse. Instead of trying
to keep track of different section base addresses at a symbol-by-symbol
level, just set the symbol values at load time.
2004-06-15 23:57:02 +00:00
Robert Watson
7721f5d760 Grab the socket buffer send or receive mutex when performing a
read-modify-write on the sb_state field.  This commit catches only
the "easy" ones where it doesn't interact with as yet unmerged
locking.
2004-06-15 03:51:44 +00:00
Peter Wemm
1cab0c857e Fix symbol lookups between modules. This caused modules that depend on
other modules to explode.  eg: snd_ich->snd_pcm and umass->usb.
The problem was that I was using the unified base address of the module
instead of finding the start address of the section in question.
2004-06-15 01:35:57 +00:00
Peter Wemm
add21e178f Insurance: cause a proper symbol lookup failure for symbol entries that
reference unknown sections.. rather than returning a small value.
2004-06-15 01:33:39 +00:00
John Polstra
4717d22a7c Change the return value of sema_timedwait() so it returns 0 on
success and a proper errno value on failure.  This makes it
consistent with cv_timedwait(), and paves the way for the
introduction of functions such as sema_timedwait_sig() which can
fail in multiple ways.

Bump __FreeBSD_version and add a note to UPDATING.

Approved by:	scottl (ips driver), arch
2004-06-14 18:19:05 +00:00
Robert Watson
c0b99ffa02 The socket field so_state is used to hold a variety of socket related
flags relating to several aspects of socket functionality.  This change
breaks out several bits relating to send and receive operation into a
new per-socket buffer field, sb_state, in order to facilitate locking.
This is required because, in order to provide more granular locking of
sockets, different state fields have different locking properties.  The
following fields are moved to sb_state:

  SS_CANTRCVMORE            (so_state)
  SS_CANTSENDMORE           (so_state)
  SS_RCVATMARK              (so_state)

Rename respectively to:

  SBS_CANTRCVMORE           (so_rcv.sb_state)
  SBS_CANTSENDMORE          (so_snd.sb_state)
  SBS_RCVATMARK             (so_rcv.sb_state)

This facilitates locking by isolating fields to be located with other
identically locked fields, and permits greater granularity in socket
locking by avoiding storing fields with different locking semantics in
the same short (avoiding locking conflicts).  In the future, we may
wish to coalesce sb_state and sb_flags; for the time being I leave
them separate and there is no additional memory overhead due to the
packing/alignment of shorts in the socket buffer structure.
2004-06-14 18:16:22 +00:00
Poul-Henning Kamp
170593a9b5 Remove a left over from userland buffer-cache access to disks. 2004-06-14 14:25:03 +00:00
Robert Watson
310e7ceb94 Socket MAC labels so_label and so_peerlabel are now protected by
SOCK_LOCK(so):

- Hold socket lock over calls to MAC entry points reading or
  manipulating socket labels.

- Assert socket lock in MAC entry point implementations.

- When externalizing the socket label, first make a thread-local
  copy while holding the socket lock, then release the socket lock
  to externalize to userspace.
2004-06-13 02:50:07 +00:00
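Note on the entry above: the "copy under the lock, externalize after dropping it" step might look roughly like this in user space. It is only a sketch, assuming a pthread mutex in place of SOCK_LOCK(so); struct label, struct obj, and obj_label_externalize() are hypothetical names, not the MAC framework's interfaces.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical label and owning object; not the MAC framework's types. */
struct label {
	int	level;
	char	tag[16];
};

struct obj {
	pthread_mutex_t	mtx;
	struct label	label;
};

/* Format the object's label as text; returns the snprintf() result. */
static int
obj_label_externalize(struct obj *o, char *out, size_t outlen)
{
	struct label copy;

	assert(outlen > 0);

	pthread_mutex_lock(&o->mtx);
	copy = o->label;	/* short critical section: take a snapshot */
	pthread_mutex_unlock(&o->mtx);

	/* The slow part runs on the private copy with the lock released. */
	return (snprintf(out, outlen, "%d/%s", copy.level, copy.tag));
}
```

The snapshot keeps the result internally consistent without holding the lock across an operation that may block, such as a copy out to user space.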
Robert Watson
cce9e3f104 Introduce socket and UNIX domain socket locks into hard-coded lock
order definition for witness.  Send lock before receive lock, and
socket locks after accept but before select:

  filedesc -> accept -> so_snd -> so_rcv -> sellck

All routing locks after send lock:

  so_rcv -> radix node head

All protocol locks before socket locks:

  unp -> so_snd
  udp -> udpinp -> so_snd
  tcp -> tcpinp -> so_snd
2004-06-13 00:23:03 +00:00
Robert Watson
3e87b34a25 Correct whitespace errors in merge from rwatson_netperf: tabs instead of
spaces, no trailing tab at the end of line.

Pointed out by:	csjp
2004-06-12 23:36:59 +00:00
Robert Watson
395a08c904 Extend coverage of SOCK_LOCK(so) to include so_count, the socket
reference count:

- Assert SOCK_LOCK(so) in macros that directly manipulate so_count:
  soref(), sorele().

- Assert SOCK_LOCK(so) in macros/functions that rely on the state of
  so_count: sofree(), sotryfree().

- Acquire SOCK_LOCK(so) before calling these functions or macros in
  various contexts in the stack, both at the socket and protocol
  layers.

- In some cases, perform soisdisconnected() before sotryfree(), as
  this could result in frobbing of a non-present socket if
  sotryfree() actually frees the socket.

- Note that sofree()/sotryfree() will release the socket lock even if
  they don't free the socket.

Submitted by:	sam
Sponsored by:	FreeBSD Foundation
Obtained from:	BSD/OS
2004-06-12 20:47:32 +00:00
Robert Watson
f6c0cce6d9 Introduce a mutex into struct sockbuf, sb_mtx, which will be used to
protect fields in the socket buffer.  Add accessor macros to use the
mutex (SOCKBUF_*()).  Initialize the mutex in soalloc(), and destroy
it in sodealloc().  In addition, add SOCK_*() access macros which
will protect most remaining fields in the socket; for the time being,
use the receive socket buffer mutex to implement socket level locking
to reduce memory overhead.

Submitted by:	sam
Sponsored by:	FreeBSD Foundation
Obtained from:	BSD/OS
2004-06-12 16:08:41 +00:00
Poul-Henning Kamp
2653139fd2 Fix registration of loadable line disciplines.
This should make watch(8)/snp(4) work again.
2004-06-12 12:31:42 +00:00
Bosko Milekic
96e124135b Gah! Plug a mbuf leak I introduced in the last commit.
I don the pointy-hat.

Problem reported by: Peter Holm <pho@>
2004-06-11 18:17:25 +00:00
Julian Elischer
94e0a4cdf3 Shuffle some code around. 2004-06-11 17:48:20 +00:00
Poul-Henning Kamp
1930e303cf Deorbit COMPAT_SUNOS.
We inherited this from the sparc32 port of BSD4.4-Lite1.  We have neither
a sparc32 port nor a SunOS4.x compatibility desire these days.
2004-06-11 11:16:26 +00:00
Brian Feldman
b4adfcf2f4 Make sysctl_wire_old_buffer() respect ENOMEM from vslock() by marking
the valid length as 0.  This prevents vsunlock() from removing a system
wire from memory that was not successfully wired (by us).

Submitted by:	tegge
2004-06-11 02:20:37 +00:00
Robert Watson
0d9ce3a1ac Introduce a subsystem lock around UNIX domain sockets in order to protect
global and allocated variables.  This strategy is derived from work
originally developed by BSDi for BSD/OS, and applied to FreeBSD by Sam
Leffler:

- Add unp_mtx, a global mutex which will protect all UNIX domain socket
  related variables, structures, etc.

- Add UNP_LOCK(), UNP_UNLOCK(), UNP_LOCK_ASSERT() macros.

- Acquire unp_mtx on entering most UNIX domain socket code,
  drop/re-acquire around calls into VFS, and release it on return.

- Avoid performing sodupsockaddr() while holding the mutex, so in general
  move to allocating storage before acquiring the mutex to copy the data.

- Make a stack copy of the xucred rather than copying out while holding
  unp_mtx.  Copy the peer credential out after releasing the mutex.

- Add additional assertions of vnode locks following VOP_CREATE().

A few notes:

- Use of an sx lock for the file list mutex may cause problems with regard
  to unp_mtx when garbage collecting passed file descriptors.

- The locking in unp_pcblist() for sysctl monitoring is correct subject to
  the unpcb zone not returning memory for reuse by other subsystems
  (consistent with similar existing concerns).

- Sam's version of this change, as with the BSD/OS version, made use of
  both a global lock and per-unpcb locks.  However, in practice, the
  global lock covered all accesses, so I have simplified out the unpcb
  locks in the interest of getting this merged faster (reducing the
  overhead but not sacrificing granularity in most cases).  We will want
  to explore possibilities for improving lock granularity in this code in
  the future.

Submitted by:	sam
Sponsored by:	FreeBSD Foundation
Obtained from:	BSD/OS 5 snapshot provided by BSDi
2004-06-10 21:34:38 +00:00
Bosko Milekic
b5b2ea9a46 Plug a race where upon free this scenario could occur:
(time grows downward)
thread 1         thread 2
------------|------------
dec ref_cnt |
            | dec ref_cnt  <-- ref_cnt now zero
cmpset      |
free all    |
return      |
            |
alloc again,|
reuse prev  |
ref_cnt     |
            | cmpset, read
            | already freed
            | ref_cnt
------------|------------

This should fix that by performing only a single
atomic test-and-set that will serve to decrement
the ref_cnt, only if it hasn't changed since the
earlier read, otherwise it'll loop and re-read.
This forces ordering of decrements so that truly
the thread which did the LAST decrement is the
one that frees.

This is how atomic-instruction-based refcnting
should probably be handled.

Submitted by: Julian Elischer
2004-06-10 00:04:27 +00:00
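Note on the entry above: the "single atomic test-and-set that decrements only if the value is unchanged since the earlier read" can be sketched with C11 atomics as below. This is an illustration of the general compare-and-swap loop, not the mbuf code from the commit; ref_release() is a hypothetical name.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Drop one reference.  Returns true only for the caller whose decrement
 * took the count from 1 to 0, so exactly one thread frees the object.
 */
static bool
ref_release(atomic_uint *cnt)
{
	unsigned int old = atomic_load(cnt);

	do {
		/* Releasing an object with no references left is a bug. */
		assert(old > 0);
		/*
		 * On failure the compare-exchange reloads 'old' with the
		 * current value, so the loop retries with a fresh read.
		 */
	} while (!atomic_compare_exchange_weak(cnt, &old, old - 1));

	return (old == 1);
}
```

Because the read and the decrement succeed or fail as one step, no thread can act on a count that another thread has already taken to zero and recycled.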
Maxime Henrion
931f76ab48 Fix a panic happening when m_getm() is called with len < MCLBYTES.
Reported by:	ale
Tested by:	ale
Reviewed by:	bosko
2004-06-09 14:53:35 +00:00
Juli Mallett
6c27c6039b Add a comment explaining td_critnest's initial state and its life from that
point on, as it happens relatively indirectly, and in a codepath the casual
reader may not be acquainted with or find obvious.

Glanced at by:	jhb
2004-06-09 14:06:44 +00:00
Poul-Henning Kamp
b7b4b455b5 Rename struct pt_ioctl to "ptsc" and pointers to it from "pti" to "pt" 2004-06-09 10:21:53 +00:00
Poul-Henning Kamp
b7ffba0afc Ditch K&R function style 2004-06-09 10:16:14 +00:00
Poul-Henning Kamp
2195e4207a Reference count struct tty.
Add two new functions: ttyref() and ttyrel().  ttymalloc() creates a struct
tty with a reference count of one.  When ttyrel() sees the count go to zero,
struct tty is freed.

Hold references for open ttys and for ttys which are controlling terminal
for sessions.

Until drivers start using ttyrel(), this commit will make no difference.
2004-06-09 09:41:30 +00:00
Poul-Henning Kamp
a59df4e1ee Fix a race in destruction of sessions. 2004-06-09 09:29:08 +00:00
Poul-Henning Kamp
c0afc00670 Move PTY private defines into PTY private files. 2004-06-09 09:09:54 +00:00
Stefan Farfeleder
1a5ff9285a Avoid assignments to cast expressions.
Reviewed by:	md5
Approved by:	das (mentor)
2004-06-08 13:08:19 +00:00
Tim J. Robbins
f55530b436 Remove remnants of PGINPROF. 2004-06-08 10:37:30 +00:00
Robert Watson
aa57bb0424 Correct a resource leak introduced in recent accept locking changes:
when I reordered events in accept1() to allocate a file descriptor
earlier, I didn't properly update use of goto on exit to unwind for
cases where the file descriptor is now held, but wasn't previously.
The result was that, in the event of accept() on a non-blocking socket,
or in the event of a socket error, a file descriptor would be leaked.

This ended up being non-fatal in many cases, as the file descriptor
would be properly GC'd on process exit, so only showed up for processes
that do a lot of non-blocking accept() calls, and also live for a long
time (such as qmail).

This change updates the use of goto targets to do additional unwinding.

Eyes provided by:	Brian Feldman <green@freebsd.org>
Feet, hands provided by:	Stefan Ehmann <shoesoft@gmx.net>,
				Dimitry Andric <dimitry@andric.com>
				Arjan van Leeuwen <avleeuwen@piwebs.com>
2004-06-07 21:45:44 +00:00
Poul-Henning Kamp
5df76176f7 Make linesw[] an array of pointers to linedesc instead of an array of
linedisc.
2004-06-07 20:45:45 +00:00
Julian Elischer
345ad86692 Split kern_thread.c into 2 parts: kern_kse.c and kern_thread.c.
kern_kse.c has already been committed.
This separates out the KSE threading ABI from generic thread support.
2004-06-07 19:00:57 +00:00
David Xu
36939a0a5c According to SUSv3, sigwait differs from sigwaitinfo: sigwait
returns the error code in its return value, not in errno.
2004-06-07 13:35:02 +00:00
Pawel Jakub Dawidek
79db0f1cbf Remove unused code.
Submitted by:	Bjoern A. Zeeb
2004-06-07 12:19:55 +00:00
Hajimu UMEMOTO
7a1a900c65 allow more than MLEN bytes for ancillary data to meet the
requirement of Section 20.1 of RFC3542.

Obtained from:	KAME
MFC after:	1 week
2004-06-07 09:59:50 +00:00
Tim J. Robbins
be5318b2ca Remove a stale and misleading comment. 2004-06-07 09:35:00 +00:00
Julian Elischer
30276dc9f8 Move the KSE ABI specific code here and separate it from code that
is generic to any threading system. This commit does not link this
file to the build yet, nor does it remove these functions from their
current location in kern_thread.c. (that commit coming up after further review)
2004-06-07 07:25:03 +00:00
Poul-Henning Kamp
9a6dc4b647 Remove filename+line number from panic messages. 2004-06-06 21:26:49 +00:00
Bruce Evans
05b2c96fd3 Detect interrupt storms better. The storm detection didn't work at all
with an ASUS A7N8X-E motherboard in APIC mode, since storming interrupts
don't repeat immediately.  Use DELAY(1) to wait a bit for them to repeat.
This affects all systems.  Only delay for the first
(10 * intr_storm_threshold) interrupts (per interrupt handler) so that
this is only a pessimization while warming up.  Throttle after calling
the sub-handlers instead of before so that the long delay given by
throttling can be used instead of the DELAY(1) to detect storms after
warming up.

Reduced the throttling period from 1/10 second to 1/hz seconds so that
throttling doesn't destroy performance so much.  Interrupts that are
detected as storming are effectively handled by polling at a frequency
of hz Hz.  On A7N8X-E's there is another hardware or configuration bug
that makes the throttled frequency closer to 2*hz Hz.
2004-06-05 18:27:28 +00:00
Maxime Henrion
bd304417e1 When we don't have any meaningful value to print for the device sysctl
tree, output an empty string instead of "?".  This is already what
happened with DEVICE_SYSCTL_LOCATION and DEVICE_SYSCTL_PNPINFO.  This
makes the output of "sysctl dev" much nicer (it won't display those
empty sysctls).

Reviewed by:	des
2004-06-05 11:39:05 +00:00
Tim J. Robbins
f99619a0dc Change the types of vn_rdwr_inchunks()'s len and aresid arguments to
size_t and size_t *, respectively. Update callers for the new interface.
This is a better fix for overflows that occurred when dumping segments
larger than 2GB to core files.
2004-06-05 02:18:28 +00:00
Tim J. Robbins
2b471bc616 Back out workaround for vn_rdwr_inchunks()'s INT_MAX length limitation
after discussions with bde; vn_rdwr_inchunks() itself should be fixed.
2004-06-05 02:00:12 +00:00
Poul-Henning Kamp
13e84a71e0 Centralize the line discipline optimization determination in a function
called ttyldoptim().

Use this function from all the relevant drivers.

I believe no drivers finger linesw[] directly anymore, paving the way for
locking and refcounting.
2004-06-04 21:55:55 +00:00
Poul-Henning Kamp
fe3ec6224a Manual edits to change linesw[]-frobbing to ttyld_*() calls. 2004-06-04 20:04:52 +00:00
Poul-Henning Kamp
2140d01b27 Machine generated patch which changes linedisc calls from accessing
linesw[] directly to using the ttyld...() functions

The ttyld...() functions are inline so there is no performance hit.
2004-06-04 16:02:56 +00:00
Tim J. Robbins
c4d85674d5 Remove a stale comment. 2004-06-04 11:00:22 +00:00
Dag-Erling Smørgrav
35e32fd8a3 Add a devclass level to the dev sysctl tree, in order to support per-
class variables in addition to per-device variables.  In plain English,
this means that dev.foo0.bar is now called dev.foo.0.bar, and it is
possible to have dev.foo.bar as well.
2004-06-04 10:23:00 +00:00
Poul-Henning Kamp
d1afdc6644 Get rid of ttyregister(). All drivers now use ttymalloc() for struct
tty, so now we stand a chance of implementing refcounting and getting
rid of the damn things again.
2004-06-04 07:17:03 +00:00
Poul-Henning Kamp
214ef22684 Use ttymalloc() instead of ttyregister(). Use ttyioctl() instead of
direct calls to the linedisc.
2004-06-04 06:50:35 +00:00
Tim J. Robbins
16e6d16299 Write segments to core dump files in maximally-sized chunks that neither
exceed vn_rdwr_inchunks()'s INT_MAX length limitation nor span a block
boundary. This fixes dumping segments larger than 2GB.

PR:	67546
2004-06-04 06:30:16 +00:00
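Note on the entry above: one possible reading of "maximally-sized chunks that neither exceed INT_MAX nor span a block boundary" is to cap each write at the largest block multiple that fits in an int. The sketch below assumes the segment starts block-aligned; next_chunk() and its parameters are hypothetical, and this is not the code from the commit (which the log shows was later backed out by 2b471bc616 in favor of f99619a0dc).

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/*
 * Return the size of the next chunk to write: everything that remains,
 * capped at the largest multiple of blksize that does not exceed INT_MAX.
 */
static size_t
next_chunk(size_t remaining, size_t blksize)
{
	size_t max;

	assert(blksize > 0 && blksize <= INT_MAX);

	/* Round INT_MAX down to a block multiple. */
	max = (size_t)INT_MAX - (size_t)INT_MAX % blksize;
	return (remaining < max ? remaining : max);
}
```

A caller would loop, writing next_chunk(remaining, blksize) bytes and subtracting that amount from remaining until it reaches zero.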
Robert Watson
e7dd9a1001 Mark sun_noname as const since it's immutable. Update definitions
of functions that potentially accept &sun_noname (sbappendaddr(),
et al) to accept a const sockaddr pointer.
2004-06-04 04:07:08 +00:00
Alan Cox
62326de742 Move the definitions of SWAPBLK_NONE and SWAPBLK_MASK from vm_page.h to
blist.h, enabling the removal of numerous #includes from subr_blist.c.
(subr_blist.c and swap_pager.c are the only users of these definitions.)
2004-06-04 04:03:26 +00:00
John Baldwin
ba8b26f960 - Comment out NULL, NULL barrier for Unix domain sockets section as the
  double NULL entries signal Witness to stop processing the array of
  order entries meaning none of the spin locks are added resulting in
  panics on boot.
- Add a missing NULL, NULL terminator to the Slip locks list to keep them
  separate from the spin locks.
2004-06-03 20:07:44 +00:00
Tim J. Robbins
cc05397ffc Remove checks for curthread == NULL - it can't happen. 2004-06-03 10:22:47 +00:00
Tim J. Robbins
fa2a4d0595 Move TDF_DEADLKTREAT into td_pflags (and rename it accordingly) to avoid
having to acquire sched_lock when manipulating it in lockmgr(), uiomove(),
and uiomove_fromphys().

Reviewed by:	jhb
2004-06-03 01:47:37 +00:00
Robert Watson
d97e0534fa Expand the hard-coded WITNESS lock order to include the following
relationships:

Sockets:    filedesc->accept->sellck
Routing:    radix node head->rtentry->ifaddr
UDP:        udp->udpinp
TCP:        tcp->tcpinp
SLIP:       slip_mtx->slip sc_mtx

Drop in a place holder section for UNIX domain sockets.  Various
sections to be expanded over the next few days.
2004-06-02 23:28:06 +00:00
Maxime Henrion
2e34ae7a26 As discussed on arch@, flatten the device sysctl tree to make it
more convenient to deal with.  The notion of hierarchy is however
preserved by adding a new %parent node.
2004-06-02 22:43:35 +00:00
Tim J. Robbins
e4e815db72 Remove a redundant "td = curthread" statement from profclock(). 2004-06-02 12:05:06 +00:00
Tim J. Robbins
aa0aa7a113 Move TDF_SA from td_flags to td_pflags (and rename it accordingly)
so that it is no longer necessary to hold sched_lock while
manipulating it.

Reviewed by:	davidxu
2004-06-02 07:52:36 +00:00
Jeff Roberson
dc03363dd8 - Run sched_balance() and sched_balance_groups() from hardclock via
  sched_clock() rather than using callouts.  This means we no longer have to
  take the load of the callout thread into consideration while balancing and
  should make the balancing decisions simpler and more accurate.

Tested on:	x86/UP, amd64/SMP
2004-06-02 05:46:48 +00:00
Robert Watson
2658b3bb8e Integrate accept locking from rwatson_netperf, introducing a new
global mutex, accept_mtx, which serializes access to the following
fields across all sockets:

          so_qlen          so_incqlen         so_qstate
          so_comp          so_incomp          so_list
          so_head

While providing only coarse granularity, this approach avoids lock
order issues between sockets by avoiding ownership of the fields
by a specific socket and its per-socket mutexes.

While here, rewrite soclose(), sofree(), soaccept(), and
sonewconn() to add assertions, close additional races and address
lock order concerns.  In particular:

- Reorganize the optimistic concurrency behavior in accept1() to
  always allocate a file descriptor with falloc() so that if we do
  find a socket, we don't have to encounter the "Oh, there wasn't
  a socket" race that can occur if falloc() sleeps in the current
  code, which broke inbound accept() ordering, not to mention
  requiring backing out socket state changes in a way that raced
  with the protocol level.  We may want to add a lockless read of
  the queue state if polling of empty queues proves to be important
  to optimize.

- In accept1(), soref() the socket while holding the accept lock
  so that the socket cannot be free'd in a race with the protocol
  layer.  Likewise in netgraph equivalents of the accept1() code.

- In sonewconn(), loop waiting for the queue to be small enough to
  insert our new socket once we've committed to inserting it, or
  races can occur that cause the incomplete socket queue to
  overfill.  In the previous implementation, it was sufficient
  to simply test once since calling soabort() didn't release
  synchronization permitting another thread to insert a socket as
  we discard a previous one.

- In soclose()/sofree()/et al, it is the responsibility of the
  caller to remove a socket from the incomplete connection queue
  before calling soabort(), which prevents soabort() from having
  to walk into the accept socket to release the socket from its
  queue, and avoids races when releasing the accept mutex to enter
  soabort(), permitting soabort() to avoid lock ordering issues
  with the caller.

- Generally cluster accept queue related operations together
  throughout these functions in order to facilitate locking.

Annotate new locking in socketvar.h.
2004-06-02 04:15:39 +00:00
Robert Watson
f3d055b6de Rather than assert f_type==DTYPE_VNODE, conditionally perform the
file lock release based on f_type==DTYPE_VNODE.  vn_closefile() is
used by non-vnode types as well (fifo).
2004-06-01 23:36:47 +00:00
Robert Watson
948a4734ed Add GIANT_REQUIRED to kqueue_close(), since kqueue currently requires
Giant.
2004-06-01 18:05:41 +00:00
Robert Watson
63732dce22 Push the VOP_ADVLOCK() call to release advisory locks on vnode file
descriptors out of fdrop_locked() and into vn_closefile().  This
removes all knowledge of vnodes from fdrop_locked(), since the lock
behavior was specific to vnodes.  This also removes the specific
requirement for Giant in fdrop_locked(), it's now only required by
code that it calls into.

Add GIANT_REQUIRED to vn_closefile() since VFS requires Giant.
2004-06-01 18:03:20 +00:00
Bosko Milekic
6bc72ab95a Fix a couple of bugs in the mbuf and packet ctors. In the latter case,
nextpkt within the m_hdr was not being initialized to NULL for
!M_PKTHDR cases.  *Maybe* this will fix weird socket buffer
inconsistency panics, but we'll see.
2004-06-01 16:17:10 +00:00
Poul-Henning Kamp
3a95025ffc Introduce a ttyioctl() cdevsw default function. 2004-06-01 13:39:02 +00:00
Poul-Henning Kamp
be9bd88238 There is no need to explicitly call the stop function. In all likelihood
->l_close() did it and ttyclose certainly will.
2004-06-01 11:57:15 +00:00
Robert Watson
d087080c1f Add a global mutex, accept_filter_mtx, to protect the global list of
accept filters and prevent read-modify-write races.
2004-06-01 04:08:48 +00:00
Robert Watson
36568179e3 The SS_COMP and SS_INCOMP flags in the so_state field indicate whether
the socket is on an accept queue of a listen socket.  This change
renames the flags to SQ_COMP and SQ_INCOMP, and moves them to a new
state field on the socket, so_qstate, as the locking for these flags
is substantially different from the locking on the remainder of the
flags in so_state.
2004-06-01 02:42:56 +00:00
Don Lewis
866046f5a6 Add MSG_NBIO flag option to soreceive() and sosend() that causes
them to behave the same as if the SS_NBIO socket flag had been set
for this call.  The SS_NBIO flag for ordinary sockets is set by
fcntl(fd, F_SETFL, O_NONBLOCK).

Pass the MSG_NBIO flag to the soreceive() and sosend() calls in
fifo_read() and fifo_write() instead of frobbing the SS_NBIO flag
on the underlying socket for each I/O operation.  The O_NONBLOCK
flag is a property of the descriptor, and unlike ordinary sockets,
fifos may be referenced by multiple descriptors.
2004-06-01 01:18:51 +00:00
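The per-call versus per-descriptor distinction in the commit above has a user-space analogue. MSG_NBIO is a kernel-internal flag, so this minimal sketch uses the userland MSG_DONTWAIT flag of recv(2) instead; try_recv_once() and fd_is_nonblocking() are hypothetical helper names.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: attempt one non-blocking read on fd without
 * changing the descriptor's O_NONBLOCK setting.  Returns the byte
 * count, 0 if no data was ready (or the peer closed), or -1 on any
 * other error.
 */
static ssize_t
try_recv_once(int fd, void *buf, size_t len)
{
	ssize_t n;

	assert(buf != NULL);
	n = recv(fd, buf, len, MSG_DONTWAIT);
	if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
		return (0);
	return (n);
}

/* Report whether O_NONBLOCK is set on the descriptor itself. */
static int
fd_is_nonblocking(int fd)
{
	int flags = fcntl(fd, F_GETFL);

	return (flags != -1 && (flags & O_NONBLOCK) != 0);
}
```

Because the flag travels with the call, other descriptors that share the same underlying object are unaffected, which is the property the fifo code needs.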
Bosko Milekic
099a0e588c Bring in mbuma to replace mballoc.
mbuma is an Mbuf & Cluster allocator built on top of a number of
extensions to the UMA framework, all included herein.

Extensions to UMA worth noting:
  - Better layering between slab <-> zone caches; introduce
    Keg structure which splits off slab cache away from the
    zone structure and allows multiple zones to be stacked
    on top of a single Keg (single type of slab cache);
    perhaps we should look into defining a subset API on
    top of the Keg for special use by malloc(9),
    for example.
  - UMA_ZONE_REFCNT zones can now be added, and reference
    counters automagically allocated for them within the end
    of the associated slab structures.  uma_find_refcnt()
    does a kextract to fetch the slab struct reference from
    the underlying page, and lookup the corresponding refcnt.

mbuma things worth noting:
  - integrates mbuf & cluster allocations with extended UMA
    and provides caches for commonly-allocated items; defines
    several zones (two primary, one secondary) and two kegs.
  - change up certain code paths that always used to do:
    m_get() + m_clget() to instead just use m_getcl() and
    try to take advantage of the newly defined secondary
    Packet zone.
  - netstat(1) and systat(1) quickly hacked up to do basic
    stat reporting but additional stats work needs to be
    done once some other details within UMA have been taken
    care of and it becomes clearer how stats will work
    within the modified framework.

From the user perspective, one implication is that the
NMBCLUSTERS compile-time option is no longer used.  The
maximum number of clusters is still capped off according
to maxusers, but it can be made unlimited by setting
the kern.ipc.nmbclusters boot-time tunable to zero.
Work should be done to write an appropriate sysctl
handler allowing dynamic tuning of kern.ipc.nmbclusters
at runtime.

Additional things worth noting/known issues (READ):
   - One report of 'ips' (ServeRAID) driver acting really
     slow in conjunction with mbuma.  Need more data.
     Latest report is that ips is equally sucking with
     and without mbuma.
   - Giant leak in NFS code sometimes occurs, can't
     reproduce but currently analyzing; brueffer is
     able to reproduce but THIS IS NOT an mbuma-specific
     problem and currently occurs even WITHOUT mbuma.
   - Issues in network locking: there is at least one
     code path in the rip code where one or more locks
     are acquired and we end up in m_prepend() with
     M_WAITOK, which causes WITNESS to whine from within
     UMA.  Current temporary solution: force all UMA
     allocations to be M_NOWAIT from within UMA for now
     to avoid deadlocks unless WITNESS is defined and we
     can determine with certainty that we're not holding
     any locks when we're M_WAITOK.
   - I've seen at least one weird socketbuffer empty-but-
     mbuf-still-attached panic.  I don't believe this
     to be related to mbuma but please keep your eyes
     open, turn on debugging, and capture crash dumps.

This change removes more code than it adds.

A paper is available detailing the change and considering
various performance issues; it was presented at BSDCan 2004:
http://www.unixdaemons.com/~bmilekic/netbuf_bmilekic.pdf
Please read the paper for Future Work and implementation
details, as well as credits.

Testing and Debugging:
    rwatson,
    brueffer,
    Ketrien I. Saihr-Kesenchedra,
    ...
Reviewed by: Lots of people (for different parts)
2004-05-31 21:46:06 +00:00
Robert Watson
e79962dbce Assert Giant in vn_start_write() and vn_finished_write(). 2004-05-31 20:56:10 +00:00
Robert Watson
9e6127fe3b Assert Giant in vrele(). 2004-05-31 19:06:01 +00:00
Poul-Henning Kamp
77409fe148 Add missing #include <sys/module.h> 2004-05-30 20:34:58 +00:00
Poul-Henning Kamp
41ee9f1c69 Add some missing <sys/module.h> includes which are masked by the
one on death-row in <sys/kernel.h>
2004-05-30 17:57:46 +00:00
Tim J. Robbins
7671b766a6 Enable MI bits for gcc -ftest-coverage -fprofile-arcs on amd64. 2004-05-29 01:18:14 +00:00
Pawel Jakub Dawidek
d860b24150 Sysctl hw.bus.devctl_disable shouldn't be writable from inside a jail.
Approved by:	imp
2004-05-26 16:36:32 +00:00
Thomas Moestl
65e29c4822 Retire cpu_sched_exit(); it is not used any more. 2004-05-26 12:09:39 +00:00
Dag-Erling Smørgrav
5c1921b779 As previously threatened, give each device its own sysctl context and
subtree (under the new dev top-level node).  This should greatly simplify
drivers which need per-device sysctl variables (such as ndis).
2004-05-25 12:06:26 +00:00
Garance A Drosehn
b8fdc89d79 Implement the new KERN_PROC_RGID option, and also implement the
KERN_PROC_SESSION option which had been previously defined but
never implemented.

PR:		bin/65803  (a very tiny piece of the PR)
Submitted by:	Cyrille Lefevre
2004-05-22 23:11:44 +00:00
David Xu
702ac0f112 Clear KSE thread flags after KSE thread mode is ended. If the flags
are not cleared for the execv() syscall, the new program runs in KSE
thread mode without having enabled it.

Submitted by: tjr
Modified by: davidxu
2004-05-21 14:50:23 +00:00
Bruce Evans
a4c2da1503 Fixed some style bugs in tdsigwakeup(). 2004-05-21 10:02:24 +00:00
John Baldwin
80c4433c18 In tdsigwakeup(), use TD_ON_SLEEPQ() rather than TD_IS_SLEEPING() to see if
a thread is on a sleep queue and should have its sleep aborted.

Reported by:	Thierry Herbelot thierry at herbelot dot com
2004-05-20 20:17:28 +00:00
Bruce Evans
372c2e9613 Fixed printf format errors which helped break GUPROF for arches with
64-bit function pointers.
2004-05-20 16:48:17 +00:00
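The class of format error fixed in the commit above can be sketched in portable C. This is an illustration only, not the kernel's code: format_addr() is a hypothetical helper, and it assumes the address has already been converted to a uintptr_t.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: format an address held in a uintptr_t as hex.
 * PRIxPTR expands to the correct length modifier for the platform, so
 * the value is not truncated where pointers are wider than int.
 * Returns what snprintf() returns: the length the full string needs.
 */
static int
format_addr(char *dst, size_t dstlen, uintptr_t addr)
{
	assert(dst != NULL && dstlen > 0);
	return (snprintf(dst, dstlen, "0x%" PRIxPTR, addr));
}
```

Passing a 64-bit address to a "%x" conversion is undefined behavior and typically prints only part of the value; selecting the conversion through PRIxPTR avoids that mismatch.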
Bruce Evans
c81d4a0396 Initialize the history counter type field in struct gmonparam as
threatened in rev.1.10 of usr.sbin/kgmon/kgmon.c more than 2 years ago.
kgmon has been recovering from the missing initialization for too
long, but the fixup there is ifdefed for i386's and shouldn't be
needed for other arches.
2004-05-20 16:42:39 +00:00
Bruce Evans
e77c22bf45 Moved i386 asms to an i386 header. The asms are for calibration of
high resolution kernel profiling (options GUPROF.  "U" in GUPROF stands
for microseconds resolution, but the resolution is now smaller than 1
nanosecond on multi-GHz machines and the accuracy is heading towards
1 nanosecond too).  Arches that support GUPROF must now provide certain
macros for the calibration.  GUPROF is now only supported for i386's,
so the absence of the new macros for other arches doesn't break anything
that wasn't already broken.  amd64's have uncommitted support for
GUPROF, and sparc64's have support that seems to be complete except
here (there was an #error for non-i386 cases; now there are undefined
macros).

Changed the asms a little:
- declare them as __volatile.  They must not be moved, and exporting a
  label across asms is technically incorrect, so try harder to stop gcc
  moving them.
- don't put the non-clobbered register "bx" in the clobber list.  The
  clobber lists are still more conservative than necessary.
- drop the non-support for gcc-1.  It just gave a better error message,
  and this is not useful since compiling with gcc-1 would cause thousands
  of worse error messages.
- drop the support for aout.
2004-05-20 16:12:19 +00:00
Pawel Jakub Dawidek
2ff8a3496f Fix sysctl name: security.jail.getfsstate_getfsstatroot_only ->
security.jail.getfsstatroot_only.

Approved by:	rwatson
2004-05-20 05:28:44 +00:00
Bruce Evans
5ad6c3b1ea Include <sys/gmon.h> instead of <machine/profile.h> for the declaration
of kmupetext().  The declaration is misplaced in <machine/profile.h>
since it is not MD and not related to the lowest level of profiling.
It will be moved, but getting it via <sys/gmon.h> already works.
2004-05-19 14:36:38 +00:00
Paul Saab
c2696aaf51 syncache broke rev 1.23 which was done to fix the "thundering herd"
problem in Apache.  Fix it.

Reviewed by:	peter
2004-05-19 00:22:10 +00:00
Peter Wemm
4cec6f5d02 If a symbol has section+offset definitions provided, always use them instead
of doing a name lookup for global symbols.  This fixes the snd_pcm module.
2004-05-18 05:15:43 +00:00
Peter Wemm
82d0d1a01b Remove leftover padding variables.
Convert some silent 'ignore programmer error' cases into panics
Remove 'align' field from section table (no longer needed)
2004-05-18 05:14:19 +00:00