Commit Graph

7821 Commits

Author SHA1 Message Date
Julian Elischer
ed062c8d66 Refactor a bunch of scheduler code to give basically the same behaviour
but with slightly cleaned up interfaces.

The KSE structure has become the same as the "per thread scheduler
private data" structure. In order to not make the diffs too great
one is #defined as the other at this time.

The KSE (or td_sched) structure is  now allocated per thread and has no
allocation code of its own.

Concurrency for a KSEGRP is now kept track of via a simple pair of counters
rather than using KSE structures as tokens.

Since the KSE structure is different in each scheduler, kern_switch.c
is now included at the end of each scheduler. Nothing outside the
scheduler knows the contents of the KSE (aka td_sched) structure.

The fields in the ksegrp structure that are to do with the scheduler's
queueing mechanisms are now moved to the kg_sched structure.
(per ksegrp scheduler private data structure). In other words how the
scheduler queues and keeps track of threads is no-one's business except
the scheduler's. This should allow people to write experimental
schedulers with completely different internal structuring.

A scheduler call sched_set_concurrency(kg, N) has been added that
notifies teh scheduler that no more than N threads from that ksegrp
should be allowed to be on concurrently scheduled. This is also
used to enforce 'fainess' at this time so that a ksegrp with
10000 threads can not swamp a the run queue and force out a process
with 1 thread, since the current code will not set the concurrency above
NCPU, and both schedulers will not allow more than that many
onto the system run queue at a time. Each scheduler should eventualy develop
their own methods to do this now that they are effectively separated.

Rejig libthr's kernel interface to follow the same code paths as
linkse for scope system threads. This has slightly hurt libthr's performance
but I will work to recover as much of it as I can.

Thread exit code has been cleaned up greatly.
exit and exec code now transitions a process back to
'standard non-threaded mode' before taking the next step.
Reviewed by:	scottl, peter
MFC after:	1 week
2004-09-05 02:09:54 +00:00
Julian Elischer
00b0483d5c Don't declare a function we are not defining. 2004-09-03 09:19:49 +00:00
Julian Elischer
37c28a022b fix compile for UP 2004-09-03 09:15:10 +00:00
Julian Elischer
293968d8d3 ooops finish last commit.
moved the variables but not the declarations.
2004-09-03 08:19:31 +00:00
Julian Elischer
82a1dfc16d Move 4bsd specific experimental IP code into the 4bsd file.
Move the sysctls into kern.sched
2004-09-03 07:42:31 +00:00
Alan Cox
94ddc7076d Push Giant deep into vm_forkproc(), acquiring it only if the process has
mapped System V shared memory segments (see shmfork_myhook()) or requires
the allocation of an ldt (see vm_fault_wire()).
2004-09-03 05:11:32 +00:00
Robert Watson
b6ac582880 Tag AIO as requiring Giant over the network stack using
NET_NEEDS_GIANT().

RELENG_5 candidate.
2004-09-03 03:19:14 +00:00
Julian Elischer
44692526be remove unused code
MFC after:	 2 days
2004-09-02 23:37:41 +00:00
Scott Long
9923b511ed Turn PREEMPTION into a kernel option. Make sure that it's defined if
FULL_PREEMPTION is defined.  Add a runtime warning to ULE if PREEMPTION is
enabled (code inspired by the PREEMPTION warning in kern_switch.c).  This
is a possible MT5 candidate.
2004-09-02 18:59:15 +00:00
Julian Elischer
7e37fb1729 *Blush* forgot to test non SMP builds.. oddly enough some UP code (particularly
in the acpi code) seems to want this in a UP build. (I guess so you can have
a sigle kernel module that works for both)
2004-09-01 18:05:43 +00:00
Julian Elischer
6804a3ab6d Give the 4bsd scheduler the ability to wake up idle processors
when there is new work to be done.

MFC after:	5 days
2004-09-01 06:42:02 +00:00
Julian Elischer
2630e4c90c Give setrunqueue() and sched_add() more of a clue as to
where they are coming from and what is expected from them.

MFC after:	2 days
2004-09-01 02:11:28 +00:00
David Xu
cf1867f932 Remove TDP_USTATCLOCK, we no longer need it because we now always
update tick count for userland in thread_userret. This change
also removes a "no upcall owned" panic because fuword() schedules
an upcall under heavily loaded, and code assumes there is no upcall
can occur.

Reported and Tested by: Peter Holm <peter@holm.cc>
2004-08-31 11:52:05 +00:00
Julian Elischer
5995adc206 Remove an unneeded argument..
The removed argument could trivially be derived from the remaining one.
That in turn should be the same as curthread, but it is possible that curthread could be expensive to derive on some syste,s so leave it as an argument.
Having both proc and thread as an argumen tjust gives an opportunity for
them to get out sync.

MFC after:	3 days
2004-08-31 07:34:54 +00:00
Julian Elischer
99e9dcb817 Remove sched_free_thread() which was only used
in diagnostics. It has outlived its usefulness and has started
causing panics for people who turn on DIAGNOSTIC, in what is otherwise
good code.

MFC after:	2 days
2004-08-31 06:12:13 +00:00
Warner Losh
2063e05140 Fix BUS_DEBUG case 2004-08-30 05:48:49 +00:00
Pawel Jakub Dawidek
2e4db7cfd7 Add a missing '\n'. 2004-08-30 01:10:20 +00:00
David Xu
45a4bfa17d Only test return_instead if P_SINGLE_EXIT is set, otherwise a fork()
syscall can interrupt other thread's syscall in sleepq_catch_signals().
Current, all callers know thread_suspend_check may suspend thread
itself, so we need't to check return_instead for normal suspension
flags (no P_SINGLE_EXIT set).

Tested by: deischen
Reported by: Maarten L. Hekkelman <m.hekkelman@cmbi.kun.nl>
2004-08-29 23:10:02 +00:00
Warner Losh
f52c5866ea Initial support (disabled) for rebidding devices. I've been running
this in my tree for a while and in its disabled state there are no
issues.  It isn't enabled yet because some drivers (in acpi) have side
effects in their probe routines that need to be resolved in some
manner before this can be turned on.  The consensus at the last
developer's summit was to provide a static method for each driver
class that will return characteristics of the driver, one of which is
if can be reprobed idempotently.
2004-08-29 18:25:21 +00:00
Warner Losh
3cdf2a3f20 MFp4: Merge in the patches, submitted long ago by someone whose email
address I've lost, that move the location information to the atttach
routine as well.  While one could use devinfo to get this data, that
is difficult and error prone and subject to races for short lived
devices.

Would make a good MT5 candidate.
2004-08-29 18:11:10 +00:00
Dag-Erling Smørgrav
0eac4495db Remove the HW_WDOG option; it serves no purpose.
MFC after:	3 days
2004-08-29 11:10:09 +00:00
Ian Dowse
70b7ffee1b Add support for completing the installation of ELF relocatable
object format modules that were read in by the loader. Loading
modules via the loader should now work on the amd64 platform.
2004-08-29 01:21:51 +00:00
David Xu
5897f840f0 1. try to use existing mailbox address in thread_update_usr_ticks.
2. remove '\n' in KASSERT.
2004-08-28 04:16:32 +00:00
David Xu
ad1280b593 Move TDF_CAN_UNBIND to thread private flags td_pflags, this eliminates
need of sched_lock in some places. Also in thread_userret, remove
spare thread allocation code, it is already done in thread_user_enter.

Reviewed by: julian
2004-08-28 04:08:05 +00:00
Peter Wemm
6f96710c60 Backout the previous backout (with scott's ok). sched_ule.c:1.122 is
believed to fix the problem with ULE that this change triggered.
2004-08-28 01:04:44 +00:00
David E. O'Brien
dd68efd05b s/smp_rv_mtx/smp_ipi_mtx/g
Requested by:	jhb
2004-08-28 00:49:55 +00:00
Peter Wemm
91c1172a5a Commit Jeff's suggested changes for avoiding a bug that is exposed by
preemption and/or the rev 1.79 kern_switch.c change that was backed out.

The thread was being assigned to a runq without adding in the load, which
would cause the counter to hit -1.
2004-08-28 00:49:22 +00:00
Andre Oppermann
2580f4e584 Poll() uses the array smallbits that is big enough to hold 32 struct
pollfd's to avoid calling malloc() on small numbers of fd's.  Because
smalltype's members have type char, its address might be misaligned
for a struct pollfd.  Change the array of char to an array of struct
pollfd.

PR:		kern/58214
Submitted by:	Stefan Farfeleder <stefan@fafoe.narf.at>
Reviewed by:	bde (a long time ago)
MFC after:	3 days
2004-08-27 21:23:50 +00:00
Alexander Kabaev
4cef6d5a53 Reintroduce slightly modified patch from kern/69964. Check for
LK_HAVE_EXL in both acquire invocations.

MFC after:	5 days
2004-08-27 01:41:28 +00:00
Ian Dowse
0ca311f6a1 When trying each linker class in turn with a preloaded module, exit
the loop if the preload was successful. Previously a successful
preload was ignored if the linker class was not the last in the
list.
2004-08-27 01:20:26 +00:00
Robert Watson
161a0c7cff Don't hold the UNIX domain socket subsystem lock over the body of the
UNIX domain socket garbage collection implementation, as that risks
holding the mutex over potentially sleeping operations (as well as
introducing some nasty lock order issues, etc).  unp_gc() will hold
the lock long enough to do necessary deferal checks and set that it's
running, but then release it until it needs to reset the gc state.

RELENG_5 candidate.

Discussed with:	alfred
2004-08-25 21:24:36 +00:00
Robert Watson
fe0f2d4e11 Conditional acquisition of socket buffer mutexes when testing socket
buffers with kqueue filters is no longer required: the kqueue framework
will guarantee that the mutex is held on entering the filter, either
due to a call from the socket code already holding the mutex, or by
explicitly acquiring it.  This removes the last of the conditional
socket locking.
2004-08-24 05:28:18 +00:00
Warner Losh
0160658e84 Set the description to NULL in the right detach routine. This should
keep dangling pointers to strings in loaded modules from hanging
around after the drivers are unloaded.
2004-08-24 05:19:15 +00:00
David Xu
d30412a8db Remove checking of single exit flag in thread_user_enter(), this is
generic code for threaded process, should not be here.
2004-08-23 22:54:37 +00:00
Peter Wemm
f1009e1e1f Commit Doug White and Alan Cox's fix for the cross-ipi smp deadlock.
We were obtaining different spin mutexes (which disable interrupts after
aquisition) and spin waiting for delivery.  For example, KSE processes
do LDT operations which use smp_rendezvous, while other parts of the
system are doing things like tlb shootdowns with a different mutex.

This patch uses the common smp_rendezvous mutex for all MD home-grown
IPIs that spinwait for delivery.  Having the single mutex means that
the spinloop to aquire it will enable interrupts periodically, thus
avoiding the cross-ipi deadlock.

Obtained from: dwhite, alc
Reviewed by:   jhb
2004-08-23 21:39:29 +00:00
Alexander Kabaev
cffdaf2dce Temporarily back out r1.74 as it seems to cause a number of regressions
accordimg to numerous reports. It  might get reintroduced some time later
when an exact failure mode is understood better.
2004-08-23 02:39:45 +00:00
Robert Watson
d963815baf Make debug.kdb.stop_cpus also a TUNABLE() so it can be set prior to boot
to help debug early nasty hangs.
2004-08-22 15:10:52 +00:00
Julian Elischer
ad59c36ba1 diff reduction for upcoming patch. Use a macro that masks
some of the odd goings on with sub-structures, because they will
go away anyhow.
2004-08-22 05:21:41 +00:00
Don Lewis
1a1c04b6b3 Don't bother calling the module event handlers from module_shutdown()
in the shutdown_final state if the RB_NOSYNC flag is set.

The specific motivation in this case is that a system panic in an
interrupt context results in a call to module_shutdown(), which
calls g_modevent(), which calls g_malloc(..., M_WAITOK), which
results in a second panic.   While g_modevent() could be fixed to
not call malloc() for MOD_SHUTDOWN events (which it doesn't handle
in any case), it is probably also a good idea to entirely skip the
execution of the module shutdown handlers after a panic.

This may be a MFC candidate for RELENG_5.
2004-08-20 21:47:48 +00:00
Don Lewis
8ded654028 Don't attempt to trigger the syncer thread final sync code in the
shutdown_pre_sync state if the RB_NOSYNC flag is set.  This is the
likely cause of hangs after a system panic that are keeping crash
dumps from being done.

This is a MFC candidate for RELENG_5.

MFC after:	3 days
2004-08-20 19:21:47 +00:00
John Baldwin
55c45354ff Remove some dead code under a straggling APIC_IO #ifdef that I missed
back before 5.2.
2004-08-20 17:24:52 +00:00
Robert Watson
7b38f0d3c3 Back out uipc_socket.c:1.208, as it incorrectly assumes that all
sockets are connection-oriented for the purposes of kqueue
registration.  Since UDP sockets aren't connection-oriented, this
appeared to break a great many things, such as RPC-based
applications and services (i.e., NFS).  Since jmg isn't around I'm
backing this out before too many more feet are shot, but intend to
investigate the right solution with him once he's available.

Apologies to:	jmg
Discussed with:	imp, scottl
2004-08-20 16:24:23 +00:00
Scott Long
2384290ced Revert the previous change. It works great for 4BSD but causes major
problems for ULE.  The reason is quite unknown and worrisome.
2004-08-20 05:58:38 +00:00
Scott Long
2c86298c6c In maybe_preempt(), ignore threads that are in an inconsistent state. This
is an effective band-aid for at least some of the scheduler corruption seen
recently.  The real fix will involve protecting threads while they are
inconsistent, and will come later.

Submitted by: julian
2004-08-20 05:18:50 +00:00
John-Mark Gurney
5d6dd4685a make sure that the socket is either accepting connections or is connected
when attaching a knote to it...  otherwise return EINVAL...

Pointed out by:	benno
2004-08-20 04:15:30 +00:00
Nate Lawson
0b54748fec Add a newline. 2004-08-19 20:16:09 +00:00
Poul-Henning Kamp
d298f91974 Add bioq_takefirst().
If the bioq is empty, NULL is returned.  Otherwise the front element
is removed and returned.

This can simplify locking in many drivers from:

	lock()
	bp = bioq_first(bq);
	if (bp == NULL) {
		unlock()
		return
	}
	bioq_remove(bp, bq)
	unlock
to:
	lock()
	bp = bioq_takefirst(bq);
	unlock()
	if (bp == NULL)
		return;
2004-08-19 19:51:51 +00:00
Nate Lawson
c003dab8ff Add debugging to rman_manage_region() as well. This is useful since we
manage subregions in ACPI.

MFC after:	3 days
2004-08-19 16:41:12 +00:00
Robert Watson
16239786ca Remove GIANT_REQUIRED from setugidsafety() as knote_fdclose() no longer
requires Giant.
2004-08-19 14:59:51 +00:00
John Baldwin
007ddf7e7a Now that the return value semantics of cv's for multithreaded processes
have been unified with that of msleep(9), further refine the sleepq
interface and consolidate some duplicated code:
- Move the pre-sleep checks for theaded processes into a
  thread_sleep_check() function in kern_thread.c.
- Move all handling of TDF_SINTR to be internal to subr_sleepqueue.c.
  Specifically, if a thread is awakened by something other than a signal
  while checking for signals before going to sleep, clear TDF_SINTR in
  sleepq_catch_signals().  This removes a sched_lock lock/unlock combo in
  that edge case during an interruptible sleep.  Also, fix
  sleepq_check_signals() to properly handle the condition if TDF_SINTR is
  clear rather than requiring the callers of the sleepq API to notice
  this edge case and call a non-_sig variant of sleepq_wait().
- Clarify the flags arguments to sleepq_add(), sleepq_signal() and
  sleepq_broadcast() by creating an explicit submask for sleepq types.
  Also, add an explicit SLEEPQ_MSLEEP type rather than a magic number of
  0.  Also, add a SLEEPQ_INTERRUPTIBLE flag for use with sleepq_add() and
  move the setting of TDF_SINTR to sleepq_add() if this flag is set rather
  than sleepq_catch_signals().  Note that it is the caller's responsibility
  to ensure that sleepq_catch_signals() is called if and only if this flag
  is passed to the preceeding sleepq_add().  Note that this also removes a
  sched_lock lock/unlock pair from sleepq_catch_signals().  It also ensures
  that for an interruptible sleep, TDF_SINTR is always set when
  TD_ON_SLEEPQ() is true.
2004-08-19 11:31:42 +00:00
John-Mark Gurney
000968010a add options MPROF_BUFFERS and MPROF_HASH_SIZE that adjust the sizes of
the mutex profiling buffers.  Document them in the man page and in NOTES.
Ensure _HASH_SIZE is larger than _BUFFERS with a cpp error.
2004-08-19 06:38:26 +00:00
Robert Watson
4c5bc1ca39 Add UNP_UNLOCK_ASSERT() to asser that the UNIX domain socket subsystem
lock is not held.

Rather than annotating that the lock is released after calls to
unp_detach() with a comment, annotate with an assertion.

Assert that the UNIX domain socket subsystem lock is not held when
unp_externalize() and unp_internalize() are called.
2004-08-19 01:45:16 +00:00
Robert Watson
2cfe973b62 Annotate call to DELAY() in interrupt storm mitigation as being
something to revisit.

Approved by:	re (scottl)
2004-08-17 04:09:09 +00:00
Alexander Kabaev
c8b876219f Upgrading a lock does not play well together with acquiring an exclusive lock
and can lead to two threads being granted exclusive access. Check that no one
has the same lock in exclusive  mode before proceeding to acquire it.

The LK_WANT_EXCL and LK_WANT_UPGRADE bits act as mini-locks and can block
other threads.  Normally this is not a problem since the mini locks are
upgraded to full locks and the release of the locks will unblock the other
threads.  However if a thread reset the bits without obtaining a full lock
other threads are not awoken. Add missing wakeups for these cases.

PR:		kern/69964
Submitted by:	Stephan Uphoff <ups at tree dot com>
Very good catch by: Stephan Uphoff <ups at tree dot com>
2004-08-16 15:01:22 +00:00
David E. O'Brien
78c37b0de8 s/MAX_SAFE_MAXVNODES/MAXVNODES_MAX/g 2004-08-16 08:33:37 +00:00
Robert Watson
40f2ac28a0 Always acquire the UNIX domain socket subsystem lock (UNP lock)
before dereferencing sotounpcb() and checking its value, as so_pcb
is protected by protocol locking, not subsystem locking.  This
prevents races during close() by one thread and use of ths socket
in another.

unp_bind() now assert the UNP lock, and uipc_bind() now acquires
the lock around calls to unp_bind().
2004-08-16 04:41:03 +00:00
Brian Feldman
8912c44d9f Add the missing knote_fdclose(). 2004-08-16 03:09:01 +00:00
Brian Feldman
1c0f9af5b5 Allocate the marker, when scanning a kqueue, from the "heap" instead of the
stack.  When swapped out, a process's kernel stack would be unavailable,
and we could get a page fault when scanning the same kqueue.

PR:	kern/61849
2004-08-16 03:08:38 +00:00
Robert Watson
ce5f32de11 Annotate the current UNIX domain socket locking strategies, order,
strengths, and weaknesses in a comment.  Assert a copyright over the
changes made as part of the locking work.
2004-08-16 01:52:04 +00:00
Mike Silbersack
5173e8f567 Major enhancements to pipe memory usage:
- pipespace is now able to resize non-empty pipes; this allows
  for many more resizing opportunities

- Backing is no longer pre-allocated for the reverse direction
  of pipes.  This direction is rarely (if ever) used, so this cuts the
  amount of map space allocated to a pipe in half.

- Pipe growth is now much more dynamic; a pipe will now grow when
  the total amount of data it contains and the size of the write are
  larger than the size of pipe.  Previously, only individual writes greater
  than the size of the pipe would cause growth.

- In low memory situations, pipes will now shrink during both read
  and write operations, where possible.  Once the memory shortage
  ends, the growth code will cause these pipes to grow back to an appropriate
  size.

- If the full PIPE_SIZE allocation fails when a new pipe is created, the
  allocation will be retried with SMALL_PIPE_SIZE.  This helps to deal
  with the situation of a fragmented map after a low memory period has
  ended.

- Minor documentation + code changes to support the above.

In total, these changes increase the total number of pipes that
can be allocated simultaneously, drastically reducing the chances that
pipe allocation will fail.

Performance appears unchanged due to dynamic resizing.
2004-08-16 01:27:24 +00:00
Don Lewis
b6915bdbe5 Yet another tweak to the shutdown messages in boot():
Don't count busy buffers before the initial call to sync() and
  don't skip the initial sync() if no busy buffers were called.
  Always call sync() at least once if syncing is requested.  This
  defers the "Syncing disks, buffers remaining..." message until
  after the initial sync() call and the first count of busy
  buffers.  This backs out changes in kern_shutdown 1.162.

  Print a different message when there are no busy buffers after the
  initial sync(), which is now the expected situation.

  Print an additional message when syncing has completed successfully
  in the unusual situation where the work of syncing was done by
  boot().

  Uppercase one message to make it consistent with all of the other
  kernel shutdown messages.

Discussed with:	bde (in a much earlier form, prior to 1.162)
Reviewed by:	njl (in an earlier form)
2004-08-15 19:17:23 +00:00
John-Mark Gurney
ad3b9257c2 Add locking to the kqueue subsystem. This also makes the kqueue subsystem
a more complete subsystem, and removes the knowlege of how things are
implemented from the drivers.  Include locking around filter ops, so a
module like aio will know when not to be unloaded if there are outstanding
knotes using it's filter ops.

Currently, it uses the MTX_DUPOK even though it is not always safe to
aquire duplicate locks.  Witness currently doesn't support the ability
to discover if a dup lock is ok (in some cases).

Reviewed by:	green, rwatson (both earlier versions)
2004-08-15 06:24:42 +00:00
Robert Watson
d8939d82cb Add a new sysctl, debug.kdb.stop_cpus, which controls whether or not we
attempt to IPI other cpus when entering the debugger in order to stop
them while in the debugger.  The default remains to issue the stop;
however, that can result in a hang if another cpu has interrupts disabled
and is spinning, since the IPI won't be received and the KDB will wait
indefinitely.  We probably need to add a timeout, but this is a useful
stopgap in the mean time.

Reviewed by:	marcel
2004-08-15 02:06:27 +00:00
Robert Watson
6cbea71c82 Cause pfind() not to return processes in the PRS_NEW state. As a result,
threads consuming the result of pfind() will not need to check for a NULL
credential pointer or other signs of an incompletely created process.
However, this also means that pfind() cannot be used to test for the
existence or find such a process.  Annotate pfind() to indicate that this
is the case.  A review of curent consumers seems to indicate that this is
not a problem for any of them.  This closes a number of race conditions
that could result in NULL pointer dereferences and related failure modes.
Other related races continue to exist, especially during iteration of the
allproc list without due caution.

Discussed with:	tjr, green
2004-08-14 17:15:16 +00:00
Poul-Henning Kamp
d8e8b6755c Add some KASSERTS. 2004-08-14 08:33:49 +00:00
Julian Elischer
f0017f3321 Whitespace nit. 2004-08-14 07:21:20 +00:00
Robert Watson
b295bdcded After completing a name lookup for a target UNIX domain socket to
connect to, re-check that the local UNIX domain socket hasn't been
closed while we slept, and if so, return EINVAL.  This affects the
system running both with and without Giant over the network stack,
and recent ULE changes appear to cause it to trigger more
frequently than previously under load.  While here, improve catching
of possibly closed UNIX domain sockets in one or two additional
circumstances.  I have a much larger set of related changes in
Perforce, but they require more testing before they can be merged.

One debugging printf is left in place to indicate when such a race
takes place: this is typically triggered by a buggy application
that simultaenously connect()'s and close()'s a UNIX domain socket
file descriptor.  I'll remove this at some point in the future, but
am interested in seeing how frequently this is reported.  In the
case of Martin's reported problem, it appears to be a result of a
non-thread safe syslog() implementation in the C library, which
does not synchronize access to its logging file descriptor.

Reported by:	mbr
2004-08-14 03:43:49 +00:00
John-Mark Gurney
ac77164d64 clean up whitespace... 2004-08-13 17:43:53 +00:00
John-Mark Gurney
7d5e45a391 looks like rwatson forgot tabs... :) 2004-08-13 07:38:58 +00:00
Julian Elischer
c00661f83c Don't keep evaluating our own cpu mask..
it's not likely to have changed....
2004-08-13 00:57:43 +00:00
Robert Watson
44f31f7556 Trim trailing white space. 2004-08-12 18:06:21 +00:00
Warner Losh
9f7f340a0f Minor formatting fixes for lines > 80 characters 2004-08-12 17:26:22 +00:00
Jeff Roberson
f2b74cbf28 - Introduce a new flag KEF_HOLD that prevents sched_add() from doing a
migration.  Use this in sched_prio() and sched_switch() to stop us from
   migrating threads that are in short term sleeps or are runnable.  These
   extra migrations were added in the patches to support KSE.
 - Only set NEEDRESCHED if the thread we're adding in sched_add() is a
   lower priority and is being placed on the current queue.
 - Fix some minor whitespace problems.
2004-08-12 07:56:33 +00:00
Julian Elischer
0f54f48225 Properly keep track of how many kses are on the system run queue(s). 2004-08-11 20:54:48 +00:00
Robert Watson
217a4b6e4e Replace a reference to splnet() with a reference to locking in a comment. 2004-08-11 03:43:10 +00:00
Marcel Moolenaar
4da47b2fec Add __elfN(dump_thread). This function is called from __elfN(coredump)
to allow dumping per-thread machine specific notes. On ia64 we use this
function to flush the dirty registers onto the backingstore before we
write out the PRSTATUS notes.

Tested on: alpha, amd64, i386, ia64 & sparc64
Not tested on: arm, powerpc
2004-08-11 02:35:06 +00:00
Robert Watson
87e83e7d4c In v_addpollinfo(), we allocate storage to back vp->v_pollinfo. However,
we may sleep when doing so; check that we didn't race with another thread
allocating storage for the vnode after allocation is made to a local
pointer, and only update the vnode pointer if it's still NULL.  Otherwise,
accept that another thread got there first, and release the local storage.

Discussed with:	jmg
2004-08-11 01:27:53 +00:00
Alan Cox
fad44deea3 Eliminate the acquisition and release of Giant within physio(). Remove
the spl calls.

Reviewed by: phk@
Discussed with: scottl@
2004-08-10 21:47:11 +00:00
John Baldwin
274f8f48e8 Synchronize the extra SA threading checks and return value handling of
condition variables with that of msleep().

Reviewed by:	davidxu
2004-08-10 17:42:59 +00:00
Jeff Roberson
2454aaf51c - Use a new flag, KEF_XFERABLE, to record with certainty that this kse had
contributed to the transferable load count.  This prevents any potential
   problems with sched_pin() being used around calls to setrunqueue().
 - Change the sched_add() load balancing algorithm to try to migrate on
   wakeup.  This attempts to place threads that communicate with each other
   on the same CPU.
 - Don't clear the idle counts in kseq_transfer(), let the cpus do that when
   they call sched_add() from kseq_assign().
 - Correct a few out of date comments.
 - Make sure the ke_cpu field is correct when we preempt.
 - Call kseq_assign() from sched_clock() to catch any assignments that were
   done without IPI.  Presently all assignments are done with an IPI, but I'm
   trying a patch that limits that.
 - Don't migrate a thread if it is still runnable in sched_add().  Previously,
   this could only happen for KSE threads, but due to changes to
   sched_switch() all threads went through this path.
 - Remove some code that was added with preemption but is not necessary.
2004-08-10 07:52:21 +00:00
Nate Lawson
c8c216d558 Skip the syncing disks loop if there are no dirty buffers. Remove a
variable used to flag the initial printf.

Submitted by:	truckman (earlier version)
2004-08-10 01:32:05 +00:00
Scott Long
0f4ad91810 Add a temporary debugging hack to detect a deadlock in setrunqueue(). This
is here so that we can gather stats on the nature of the recent rash of
hard lockups, and in this particular case panic the machine instead of
letting it deadlock forever.
2004-08-10 00:26:25 +00:00
Julian Elischer
e2105bce2a Slight changes to comments and some whitespace changes. 2004-08-09 21:57:30 +00:00
Julian Elischer
1a5cd27b4b Make kg->kg_runnable actually count runnable threads in the ksegrp run queue
instead of only doing it sometimes.. This is not used outdide of debugging code
in the current code, but that will probably change.
2004-08-09 20:36:03 +00:00
Julian Elischer
332e72ddb7 Remove typos on KASSERT messages. 2004-08-09 20:13:07 +00:00
Brian Feldman
83dd6b37e1 Normalize the VM wiring done with SPARSE_MAPPING: check for errors, and
unmap when done.  For whatever reason, SPARSE_MAPPING is not even a
config option, so this is dead code.
2004-08-09 18:46:13 +00:00
Julian Elischer
732d95288a Increase the amount of data exported by KTR in the KTR_RUNQ setting.
This extra data is needed to really follow what is going on in the
threaded case.
2004-08-09 18:21:12 +00:00
John-Mark Gurney
6141e04a7e add option to automaticly mark core dumps with the nodump flag
PR:		57065
Submitted by:	Walter C. Pelissero
2004-08-09 05:46:46 +00:00
David Xu
604be46d1e 1.Add KSE_INTR_DBSUSPEND command for kse_thr_interrupt to suspend a bound
thread, after the bound thread leaves critical region, the thread should
check debug flag may suspend itself by using the command.
2.Schedule upcall after thread is suspended by debugger
3.Wakeup upcall thread after process suspension.

Reviewed by: deischen
2004-08-08 22:32:20 +00:00
David Xu
2b70a83aff Call thread_user_enter for M:N thread, ast() should be treated as another
entrance of kernel.
2004-08-08 22:28:33 +00:00
David Xu
1f2eac6cf3 Add pl_flags to ptrace_lwpinfo, two flags PL_FLAG_SA and PL_FLAG_BOUND
indicate that a thread is in UTS critical region.

Reviewed by: deischen
Approved by: marcel
2004-08-08 22:26:11 +00:00
Doug Rabson
cfaf7e60cc Make sure that AT_PHDR has a useful value even for static programs. 2004-08-08 09:48:10 +00:00
John-Mark Gurney
227559d11f rearange some code that handles the thread taskqueue so that it is more
generic.  Introduce a new define TASKQUEUE_DEFINE_THREAD that takes a
single arg, which is the name of the queue.

Document these changes.
2004-08-08 02:37:22 +00:00
Robert Watson
b223d06425 We're not yet ready to assert !Giant in kern_fcntl(), as it's called
with Giant from ABI wrappers such as Linux emulation.

Foot shoot off:	phk
2004-08-07 14:09:02 +00:00
Robert Watson
db532b63c2 Flag a broad range of VFS operations as GIANT_REQUIRED in order to
catch leaking into VFS without Giant.

Inch Giant a little lower in several file descriptor operations on
vnodes to cover only VFS operations that need it, rather than file
flag reading, etc.
2004-08-06 22:25:35 +00:00
Robert Watson
cc701b73b8 In thread_exit(), include more information about the thread/process
context in the KTR trace record.  In particular, include the same
information as passed for mi_switch() and fork_exit() KTR trace
records.
2004-08-06 22:06:14 +00:00
Robert Watson
5dd3a4ed6c Push UIDINFO_UNLOCK() slightly earlier in chgsbize(), as it's not
needed if we print the local variable version of the limit rather
than the shared version.
2004-08-06 22:04:33 +00:00
Robert Watson
a0a819747c Avoid acquiring Giant for some common light-weight or already MPSAFE
fcntl() operations, including:

  F_DUPFD          dup() alias
  F_GETFD          retrieve close-on-exec flag
  F_SETFD          set close-on-exec flag
  F_GETFL          retrieve file descriptor flags

For the remaining fcntl() operations, do acquire Giant, especially
where we call into fo_ioctl() as a result.  We're not yet ready to
push Giant into fo_ioctl().  Once we do, this can all become quite a
bit prettier.
2004-08-06 22:00:55 +00:00
Robert Watson
ff7ec58af8 Cut a KTR record whenever a callout is invoked. Mark whether it runs
with Giant or not, and include the function point so it can be looked
up against the kernel symbol table during trace analysis.
2004-08-06 21:49:00 +00:00
John Baldwin
44fe3c1ff0 Don't scare users with a warning about preemption being off when it isn't
yet safe to have on by default.
2004-08-06 15:49:44 +00:00
Robert Watson
6f40c417ca In ithread_schedule(), when we plan to go harvest some entropy as
a result of scheduling an ithread, cut a KTR_INTR trace record so
that it's clear in tracing interrupt activity where and when the
entropy harvesting code is invoked.
2004-08-06 03:39:28 +00:00
Colin Percival
0413bacd09 When reseting a pending callout, perform the deregistration in
callout_reset rather than calling callout_stop.  This results in a few
lines of code duplication, but it provides a significant performance
improvement because it avoids recursing on callout_lock.

Requested by:	rwatson
2004-08-06 02:44:58 +00:00
John Baldwin
5cc00cfc67 Fix the code in rman that merges adjacent unallocated resources to use a
better check for 'adjacent'.  The old code assumed that if two resources
were adjacent in the linked list that they were also adjacent range wise.
This is not true when a resource manager has to manage disparate regions.
For example, the current interrupt code on i386/amd64 will instruct
irq_rman to manage two disjoint regions: 0-1 and 3-15 for the non-APIC
case.  If IRQs 1 and 3 were allocated and then released, the old code
would coalesce across the 1 to 3 boundary because the resources were
adjacent in the linked list thus adding 2 to the area of resources that
irq_rman managed as a side effect.  The fix adds extra checks so that
adjacent unallocated resources are only merged with the resource being
freed if the start and end values of the resources also match up.  The
patch also consolidates the checks for adjacent resources being allocated.
2004-08-05 15:48:18 +00:00
John Baldwin
0e5a07e533 Remove a potential deadlock on i386 SMP by changing the lazypmap ipi and
spin-wait code to use the same spin mutex (smp_tlb_mtx) as the TLB ipi
and spin-wait code snippets so that you can't get into the situation of
one CPU doing a TLB shootdown to another CPU that is doing a lazy pmap
shootdown each of which are waiting on each other.  With this change, only
one of the CPUs would do an IPI and spin-wait at a time.
2004-08-04 20:31:19 +00:00
John Baldwin
c950c15c76 Workaround a possible deadlock on SMP due to a spin lock LOR by disabling
the immediate awakening of proc0 (scheduler kproc, controls swapping
processes in and out).  The scheduler process periodically awakens already,
so this will not result in processes not being swapped in, there will just
be more latency in between a thread being made runnable and the scheduler
waking up to swap the affected process back in.
2004-08-04 20:24:40 +00:00
John Baldwin
bdcfcf5bc4 Cache the value of curthread in the _get_sleep_lock() and _get_spin_lock()
macros and pass the value to the associated _mtx_*() functions to avoid
more curthread dereferences in the function implementations.  This provided
a very modest perf improvement in some benchmarks.

Suggested by:	rwatson
Tested by:	scottl
2004-08-04 20:18:45 +00:00
Robert Watson
7a36e1d6c7 Assert Giant in namei(). Bugs have been reported in which, following
a sleep() call waking up in namei(), a later assertion triggers that
Giant is not held.  By asserting Giant at the start of namei(), we can
know that if that assertion triggers, Giant is lost during the call to
namei(), and not before.
2004-08-04 18:39:07 +00:00
Robert Watson
0be8ad5fbc Assert Giant in the following file descriptor-related functions:
Function             Reason
--------             ------
fdfree()             VFS
setugidsafety()      KQueue
fdcheckstd()         VFS
_fgetvp()            VFS
fgetsock()           Conditional assertion based on debug.mpsafenet
2004-08-04 18:35:33 +00:00
Robert Watson
1b93405c7c Remove spl's from kern_resource.c. 2004-08-04 18:19:09 +00:00
Maxime Henrion
9f1b87f106 Instead of calling ia32_pause() conditionally on __i386__ or __amd64__
being defined, define and use a new MD macro, cpu_spinwait().  It only
expands to something on i386 and amd64, so the compiled code should be
identical.

Name of the macro found by:	jhb
Reviewed by:	jhb
2004-08-03 18:44:27 +00:00
Pawel Jakub Dawidek
24b2151f4d Don't skip permission checks when sending signals to zombie processes.
Pointed out by:	bde
Reviewed by:	rwatson
2004-08-03 15:39:23 +00:00
Mike Silbersack
e10ecdea88 Standardize pipe locking, ensuring that everything is locked via
pipelock(), not via a mixture of mutexes and pipelock().  Additionally,
add a few KASSERTS, and change some statements that should have been
KASSERTS into KASSERTS.

As a result of these cleanups, some segments of code have become
significantly shorter and/or easier to read.
2004-08-03 02:59:15 +00:00
David Xu
4513fb36aa s/TMDF_DONOTRUNUSER/TMDF_SUSPEND/g
Dicussed with: deischen
2004-08-03 02:23:06 +00:00
Julian Elischer
4fd54632b0 Repeat after me:
"Do not apply your tested patches to your commit tree by hand"
2004-08-03 01:43:29 +00:00
Julian Elischer
c94b38af46 Remove an argument that is never used. 2004-08-02 23:48:43 +00:00
David E. O'Brien
64298d52cc Put a cap on the auto-tuning of kern.maxvnodes.
Cap value chosen by:	scottl
2004-08-02 21:52:43 +00:00
Robert Watson
3d3f5f6057 Add what appears to be a missing '*/' at the end of a comment. 2004-08-02 01:38:27 +00:00
Brian Feldman
b23f72e98a * Add a "how" argument to uma_zone constructors and initialization functions
so that they know whether the allocation is supposed to be able to sleep
  or not.
* Allow uma_zone constructors and initialation functions to return either
  success or error.  Almost all of the ones in the tree currently return
  success unconditionally, but mbuf is a notable exception: the packet
  zone constructor wants to be able to fail if it cannot suballocate an
  mbuf cluster, and the mbuf allocators want to be able to fail in general
  in a MAC kernel if the MAC mbuf initializer fails.  This fixes the
  panics people are seeing when they run out of memory for mbuf clusters.
* Allow debug.nosleepwithlocks on WITNESS to be disabled, without changing
  the default.

Both bmilekic and jeff have reviewed the changes made to make failable
zone allocations work.
2004-08-02 00:18:36 +00:00
Julian Elischer
6e0fbb01c5 Comment kse_create() and make a few minor code cleanups
Reviewed by:	davidxu
2004-08-01 23:02:00 +00:00
Poul-Henning Kamp
5e8c582ac2 Put a version element in the VFS filesystem configuration structure
and refuse initializing filesystems with a wrong version.  This will
aid maintenance activites on the 5-stable branch.

s/vfs_mount/vfs_omount/

s/vfs_nmount/vfs_mount/

Name our filesystems mount function consistently.

Eliminate the namiedata argument to both vfs_mount and vfs_omount.
It was originally there to save stack space.  A few places abused
it to get hold of some credentials to pass around.  Effectively
it is unused.

Reorganize the root filesystem selection code.
2004-07-30 22:08:52 +00:00
Alan Cox
9be60284a6 Giant is no longer required by vm_waitproc() and vmspace_exitfree().
Eliminate it acquisition and release around vm_waitproc() in kern_wait().
2004-07-30 20:31:02 +00:00
Nate Lawson
b1c8139147 Minor message cleanup. 2004-07-30 01:30:05 +00:00
Pawel Jakub Dawidek
0b011ea3da Syscall kill(2) called for a zombie process should return 0.
Obtained from:	Darwin
2004-07-29 20:38:19 +00:00
Pawel Jakub Dawidek
cebabef04f Fill some informations about zombie processes as well.
Before this change every zombie process were reported as an owner of PID 0 in
ps(1) output.

Reviewed by:	julian
2004-07-29 20:27:59 +00:00
Poul-Henning Kamp
d634f69316 Remove global variable rootdevs and rootvp, they are unused as such.
Add local rootvp variables as needed.

Remove checks for miniroot's in the swappartition.  We never did that
and most of the filesystems could never be used for that, but it had
still been copy&pasted all over the place.
2004-07-28 20:21:04 +00:00
Alexander Kabaev
00fbcda80d Avoid casts as lvalues. 2004-07-28 06:42:41 +00:00
David Xu
8bda8a620c Use P_SINGLE_EXIT to check single-threading case, P_WEXIT is not for that
purpose.
2004-07-28 06:30:52 +00:00
Poul-Henning Kamp
3dfe213e61 Convert the vfsconf list to a TAILQ.
Introduce vfs_byname() function to find things on it.

Staticize vfs_nmount() function under the name vfs_donmount().

Various cleanups.
2004-07-27 22:32:01 +00:00
Robert Watson
1a8cfbc450 Pass a thread argument into cpu_critical_{enter,exit}() rather than
dereference curthread.  It is called only from critical_{enter,exit}(),
which already dereferences curthread.  This doesn't seem to affect SMP
performance in my benchmarks, but improves MySQL transaction throughput
by about 1% on UP on my Xeon.

Head nodding:	jhb, bmilekic
2004-07-27 16:41:01 +00:00
Robert Watson
a9abdce44a Add "options ADAPTIVE_GIANT" which causes Giant to also be treated in
an adaptive fashion when adaptive mutexes are enabled.  The theory
behind non-adaptive Giant is that Giant will be held for long periods
of time, and therefore spinning waiting on it is wasteful.  However,
in MySQL benchmarks which are relatively Giant-free, running Giant
adaptive makes an observable difference on SMP (5% transaction rate
improvement).  As such, make adaptive behavior on Giant an option so
it can be more widely benchmarked.
2004-07-27 16:34:48 +00:00
Alan Cox
1a276a3f91 - Use atomic ops for updating the vmspace's refcnt and exitingcnt.
- Push down Giant into shmexit().  (Giant is acquired only if the vmspace
   contains shm segments.)
 - Eliminate the acquisition of Giant from proc_rwmem().
 - Reduce the scope of Giant in exit1(), uncovering the destruction of the
   address space.
2004-07-27 03:53:41 +00:00
Bosko Milekic
0047b9a96a Move the schedlock owner state update following the context
switch in fork_exit() to before anything else is done (but keep
schedlock for the deadthread check).  This means one less
nasty bug if ever in the future whatever might have been called
before the update played with schedlock or critical sections.

Discussed with: tjr
2004-07-27 03:46:31 +00:00
Colin Percival
66d5c640fa In revision 1.228, I accidentally broke the "total number of processes in
the system" resource limit code: When checking if the caller has superuser
privileges, we should be checking the *real* user, not the *effective*
user.  (In general, resource limiting is done based on the real user, in
order to avoid resource-exhaustion-by-setuid-program attacks.)

Now that a SUSER_RUID flag to suser_cred exists, use it here to return
this code to its correct behaviour.

Pointed out by:	rwatson
2004-07-26 07:54:39 +00:00
Colin Percival
56f21b9d74 Rename suser_cred()'s PRISON_ROOT flag to SUSER_ALLOWJAIL. This is
somewhat clearer, but more importantly allows for a consistent naming
scheme for suser_cred flags.

The old name is still defined, but will be removed in a few days (unless I
hear any complaints...)

Discussed with:	rwatson, scottl
Requested by:	jhb
2004-07-26 07:24:04 +00:00
Robert Watson
feb9bd18c6 Revert modification of subr_turnstile.c accidentally included in the
last commit; this assertion was provided by jhb for local debugging
and not intended for broader consumption.
2004-07-25 23:32:32 +00:00
Robert Watson
fd179ee91d In uipc_connect(), assert that the passed thread is curthread, and pass
td into unp_connect() instead of reading curthread.
2004-07-25 23:30:43 +00:00
Robert Watson
99901d0afb Do some initial locking on accept filter registration and attach. While
here, close some races that existed in the pre-locking world during low
memory conditions.  This locking isn't perfect, but it's closer than
before.
2004-07-25 23:29:47 +00:00
Poul-Henning Kamp
cf95b5c381 Eliminate unused second argument to reassignbuf() and simplify it
accordingly.
2004-07-25 21:24:23 +00:00
Robert Watson
3ed994c6c3 Add netatalk mutexes to hard-coded WITNESS lock order. 2004-07-25 20:16:51 +00:00
Warner Losh
4411688509 Expand the generic, but bogusly formed, copyright notice to include
the license from /usr/src/COPYRIGHT.  Since cvs annotate shows that
this was written by jasone, julian, jhb, peter, bmilekic and obrien.
cvs log shows that many others may have contributed to this file.  As
such, go ahead and use the author of 'FreeBSD Project' for this file.
If this is a problem, please notify me.

# this eliminates the last file in the kernel with an indirect reference
# to /usr/src/COPYRIGHT in the kernel.  A few more in userland remain.
2004-07-25 19:49:01 +00:00
Poul-Henning Kamp
a3d57cfbfd Neuter this warning for now, I think I know the remaining issues. 2004-07-25 08:09:21 +00:00
Julian Elischer
aa3c8c02ae White space fix..
diff reduction for upcoming commit.
2004-07-24 04:57:41 +00:00
Scott Long
e038d35422 Clean up whitespace, increase consistency and correctness.
Submitted by: bde
2004-07-23 23:09:00 +00:00
Robert Watson
ff381670df Don't include a "\n" in KTR output, it confuses automatic parsing. 2004-07-23 20:12:56 +00:00
Scott Long
18f480f8f6 Remove the previous hack since it doesn't make a difference and is getting
in the way of debugging.
2004-07-23 19:59:16 +00:00
Alan Cox
b332cea583 Use kmem_alloc_nofault() rather than kmem_alloc_pageable() for allocating
KVA for explicitly managed mappings, i.e., mappings created with
pmap_qenter().
2004-07-23 19:36:18 +00:00
Robert Watson
4da86f8826 Export KTR_COMPILE as a sysctl so you can easily check from user space
what event mask has been compiled into the kernel.
2004-07-23 17:41:44 +00:00
Robert Watson
46b25cb5f6 Don't perform pipe endpoint locking during pipe_create(), as the pipe
can't yet be referenced by other threads.

In microbenchmarks, this appears to reduce the cost of
pipe();close();close() on UP by 10%, and SMP by 7%.  The vast majority
of the cost of allocating a pipe remains VM magic.

Suggested by:	silby
2004-07-23 14:11:04 +00:00
Robert Watson
71a057bc73 In setpgid(), since td is passed in as a system call argument, use it
in preference to curthread, which costs slightly more.
2004-07-23 04:26:49 +00:00
Robert Watson
a6719c82b1 Push Giant acquisition down into fo_stat() from most callers. Acquire
Giant conditional on debug.mpsafenet in the socket soo_stat() routine,
unconditionally in vn_statfile() for VFS, and otherwise don't acquire
Giant.  Accept an unlocked read in kqueue_stat(), and cryptof_stat() is
a no-op.  Don't acquire Giant in fstat() system call.

Note: in fdescfs, fo_stat() is called while holding Giant due to the VFS
stack sitting on top, and therefore there will still be Giant recursion
in this case.
2004-07-22 20:40:23 +00:00
Robert Watson
1c1ce9253f Push acquisition of Giant from fdrop_closed() into fo_close() so that
individual file object implementations can optionally acquire Giant if
they require it:

- soo_close(): depends on debug.mpsafenet
- pipe_close(): Giant not acquired
- kqueue_close(): Giant required
- vn_close(): Giant required
- cryptof_close(): Giant required (conservative)

Notes:

  Giant is still acquired in close() even when closing MPSAFE objects
  due to kqueue requiring Giant in the calling closef() code.
  Microbenchmarks indicate that this removal of Giant cuts 3%-3% off
  of pipe create/destroy pairs from user space with SMP compiled into
  the kernel.

  The cryptodev and opencrypto code appears MPSAFE, but I'm unable to
  test it extensively and so have left Giant over fo_close().  It can
  probably be removed given some testing and review.
2004-07-22 18:35:43 +00:00
Robert Watson
df04411ac4 suser() accepts a thread argument; as suser() dereferences td_ucred, a
thread-local pointer, in practice that thread needs to be curthread.  If
we're running with INVARIANTS, generate a warning if not.  If we have
KDB compiled in, generate a stack trace.  This doesn't fire at all in my
local test environment, but could be irritating if it fires frequently
for someone, so there will be motivation to fix things quickly when it
does.
2004-07-22 17:05:04 +00:00
Scott Long
9493183e77 Disable the PREEMPTION-enabled code in critical_exit() that encourages
switching to a different thread.  This is just a hack to try to improve
stability some more, but likely points closer to the real culprit.
2004-07-22 14:32:48 +00:00
Bosko Milekic
01e9ccbd9c Back out just a portion of Alfred's last commit. Remove the MBUF_CHECK
(WITNESS) for code paths that always call uma_zalloc_arg() shortly
after where the check was, because uma_zalloc_arg() already does
a similar check.

No objections from Alfred.  Thanks Alfred.
2004-07-21 21:03:01 +00:00
Robert Watson
46e38ce826 Don't sync the file system on panic by default. This seems to basically
work very infrequently, and often results in a compound panic which
confuses debugging; locking/SMP have made the layering violation (and
risks) of this more obvious over time.

Discussed with:	green, bde, et al.
2004-07-21 16:04:46 +00:00
Alfred Perlstein
05656b6e2b put several of the options for DEBUG_VFS_LOCKS under control of sysctls. 2004-07-21 07:13:14 +00:00
Alfred Perlstein
063d811465 Make sure we don't call mbuf allocation functions with mutexes held.
Discussed with: rwatson
2004-07-21 07:12:24 +00:00
Marcel Moolenaar
3d4f313695 Add kdb_thr_from_pid(), which given a PID returns the first thread
in the process. This is useful when working from or with a process.
2004-07-21 04:49:48 +00:00
Mike Silbersack
eb3d2c61b4 Fix a minor error in pipe_stat - st_size was always reported as 0
when direct writes kicked in.  Whether this affected any applications
is unknown.
2004-07-20 07:06:43 +00:00
Peter Wemm
b09cb1027b #ifdef __i386__ -> __i386__ || __amd64__ 2004-07-20 02:15:10 +00:00
Julian Elischer
3a63b92c12 You always spot the typos after you have committed.. Start sentence
with a Cap.
2004-07-19 18:06:12 +00:00
Julian Elischer
f6449d9d31 Allow the user who calls doadump() from the kernel debugger
to not get a page fault if he has not defined a dump device.
Panic can often not do a dump as it can hang forever in some cases.
 The original PR was for amd64 only. This is a generalised version of
that change.

PR:		amd64/67712
Submitted by:	wjw@withagen.nl <Willen Jan Withagen>
2004-07-19 18:03:02 +00:00
Brian Feldman
4362fada8f Reimplement contigmalloc(9) with an algorithm which stands a greatly-
improved chance of working despite pressure from running programs.
Instead of trying to throw a bunch of pages out to swap and hope for
the best, only a range that can potentially fulfill contigmalloc(9)'s
request will have its contents paged out (potentially, not forcibly)
at a time.

The new contigmalloc operation still operates in three passes, but it
could potentially be tuned to more or less.  The first pass only looks
at pages in the cache and free pages, so they would be thrown out
without having to block.  If this is not enough, the subsequent passes
page out any unwired memory.  To combat memory pressure refragmenting
the section of memory being laundered, each page is removed from the
systems' free memory queue once it has been freed so that blocking
later doesn't cause the memory laundered so far to get reallocated.

The page-out operations are now blocking, as it would make little sense
to try to push out a page, then get its status immediately afterward
to remove it from the available free pages queue, if it's unlikely to
have been freed.  Another change is that if KVA allocation fails, the
allocated memory segment will be freed and not leaked.

There is a sysctl/tunable, defaulting to on, which causes the old
contigmalloc() algorithm to be used.  Nonetheless, I have been using
vm.old_contigmalloc=0 for over a month.  It is safe to switch at
run-time to see the difference it makes.

A new interface has been used which does not require mapping the
allocated pages into KVA: vm_page.h functions vm_page_alloc_contig()
and vm_page_release_contig().  These are what vm.old_contigmalloc=0
uses internally, so the sysctl/tunable does not affect their operation.

When using the contigmalloc(9) and contigfree(9) interfaces, memory
is now tracked with malloc(9) stats.  Several functions have been
exported from kern_malloc.c to allow other subsystems to use these
statistics, as well.  This invalidates the BUGS section of the
contigmalloc(9) manpage.
2004-07-19 06:21:27 +00:00
Julian Elischer
55d44f79ea When calling scheduler entrypoints for creating new threads and processes,
specify "us" as the thread not the process/ksegrp/kse.
You can always find the others from the thread but the converse is not true.
Theorotically this would lead to runtime being allocated to the wrong
entity in some cases though it is not clear how often this actually happenned.
(would only affect threaded processes and would probably be pretty benign,
but it WAS a bug..)

Reviewed by: peter
2004-07-18 23:36:13 +00:00
Pawel Jakub Dawidek
ece2d9891e Now we have NO_ADAPTIVE_MUTEXES option, so use it here too.
Missed by:	scottl
2004-07-18 23:27:14 +00:00
Marcel Moolenaar
1f7a1baa37 After maintaining previous behaviour in writing out the core notes, it's
time now to break with the past: do not write the PID in the first note.
Rationale:
1.  [impact of the breakage] Process IDs in core files serve no immediate
    purpose to the debugger itself. They are only useful to relate a core
    file to a process. This can provide context to the person looking at
    the core file, provided one keeps track of this. Overall, not having
    the PID in the core file is only in very rare occasions unfortunate.
2.  [reason of the breakage] Having one PRSTATUS note contain the PID,
    while all others contain the LWPID of the corresponding kernel thread
    creates an irregularity for the debugger that cannot easily be worked
    around. This is caused by libthread_db correlating user thread IDs to
    kernel thread (aka LWP) IDs and thus aware of the actual LWPIDs.

Update comments accordingly.
2004-07-18 20:28:07 +00:00
David Malone
cdb71f7526 The recent changes to control message passing broke some things
that get certain types of control messages (ping6 and rtsol are
examples). This gets the new code closer to working:

	1) Collect control mbufs for processing in the controlp ==
	NULL case, so that they can be freed by externalize.

	2) Loop over the list of control mbufs, as the externalize
	function may not know how to deal with chains.

	3) In the case where there is no externalize function,
	remember to add the control mbuf to the controlp list so
	that it will be returned.

	4) After adding stuff to the controlp list, walk to the
	end of the list of stuff that was added, incase we added
	a chain.

This code can be further improved, but this is enough to get most
things working again.

Reviewed by:	rwatson
2004-07-18 19:10:36 +00:00
Doug Rabson
4c4392e791 Add doxygen doc comments for most of newbus and the BUS interface. 2004-07-18 16:30:31 +00:00
Scott Long
701f140800 Enable ADAPTIVE_MUTEXES by default by changing the sense of the option to
NO_ADAPTIVE_MUTEXES.  This option has been enabled by default on amd64 for
quite some time, and has been extensively tested on i386 and sparc64.  It
shows measurable performance gains in many circumstances, and few negative
effects.  It would be nice in t he future if adaptive mutexes actually went
to sleep after a certain amount of spinning, but that will require quite a
bit more testing.
2004-07-18 15:59:03 +00:00
Alan Cox
d8582da660 Remove GIANT_REQUIRED from vmapbuf(). 2004-07-18 04:57:49 +00:00
Robert Watson
2260c03d77 Drop Giant and acquire the UNIX domain socket subsystem lock a bit
earlier in unp_connect() so that vp->v_socket can't change between
our copying its value to a local variable and later use of that
variable.  This may have been responsible for a panic during
shutdown that I experienced where simultaneous closing of a listen
socket by rpcbind and a new connection being made to rpcbind by
mountd.
2004-07-18 01:29:43 +00:00
David Xu
c3d88cbab8 Fix typo. 2004-07-17 23:15:41 +00:00
David Malone
e140eb430c Add a kern_setsockopt and kern_getsockopt which can read the option
values from either user land or from the kernel. Use them for
[gs]etsockopt and to clean up some calls to [gs]etsockopt in the
Linux emulation code that uses the stackgap.
2004-07-17 21:06:36 +00:00
John Baldwin
52eb84641d - Move TDF_OWEPREEMPT, TDF_OWEUPC, and TDF_USTATCLOCK over to td_pflags
since they are only accessed by curthread and thus do not need any
  locking.
- Move pr_addr and pr_ticks out of struct uprof (which is per-process)
  and directly into struct thread as td_profil_addr and td_profil_ticks
  as these variables are really per-thread.  (They are used to defer an
  addupc_intr() that was too "hard" until ast()).
2004-07-16 21:04:55 +00:00
John Baldwin
d3373e371b Whitespace fix. 2004-07-16 21:01:52 +00:00
John Baldwin
6dbc085016 Improve readability a bit by changing some code at the end of a function
that did:

	if (foo)
		return
	else
		blah

to just do the simpler

	if (!foo)
		blah

instead.
2004-07-16 21:00:50 +00:00
Colin Percival
24283cc01b Add a SUSER_RUID flag to suser_cred. This flag indicates that we want to
check if the *real* user is the superuser (vs. the normal behaviour, which
checks the effective user).

Reviewed by:	rwatson
2004-07-16 15:57:16 +00:00
Robert Watson
dad7b41a9b When entering soclose(), assert that SS_NOFDREF is not already set. 2004-07-16 00:37:34 +00:00
Poul-Henning Kamp
672c05d49c Preparation commit for the tty cleanups that will follow in the near
future:

rename ttyopen() -> tty_open() and ttyclose() -> tty_close().

We need the ttyopen() and ttyclose() for the new generic cdevsw
functions for tty devices in order to have consistent naming.
2004-07-15 20:47:41 +00:00
Poul-Henning Kamp
3e019deaed Do a pass over all modules in the kernel and make them return EOPNOTSUPP
for unknown events.

A number of modules return EINVAL in this instance, and I have left
those alone for now and instead taught MOD_QUIESCE to accept this
as "didn't do anything".
2004-07-15 08:26:07 +00:00
Alfred Perlstein
bb5faea34f Cleanup shutdown output. 2004-07-15 08:01:00 +00:00
Alfred Perlstein
da6303bacc Tidy up system shutdown. 2004-07-15 04:29:48 +00:00
Alfred Perlstein
a88295bb83 Disable SIGIO for now, leave a comment as to why it's busted and hard
to fix.
2004-07-15 03:49:52 +00:00
Nate Lawson
8916adb1c9 Clean up the output on reboot by keeping completion messages on the same
line as the announcement.  Someone should probably update the "buffers
remaining" message since we now no longer should have any buffers remaining
at that point.
2004-07-15 03:20:08 +00:00
Poul-Henning Kamp
e2ad640e13 A module with no modevent function gets modevent_nop() as default.
Until now the function has just returned zero for any event, but
that is downright wrong for MOD_UNLOAD and not very useful for any
future events we add where it may be crucial to be able to tell
if the event was unhandled or successful.

Change the function to return as follows:

	MOD_LOAD -> 0
	MOD_UNLOAD -> EBUSY
	anything else -> EOPNOTSUPP
2004-07-14 22:37:36 +00:00
Christian S.J. Peron
ed6c545cf0 In addition to the real user ID check, do an explicit jail
check to ensure that the caller is not prison root.

The intention is to fix file descriptor creation so that
prison root can not use the last remaining file descriptors.
This privilege should be reserved for non-jailed root users.

Approved by:	bmilekic (mentor)
2004-07-14 19:04:31 +00:00
Alfred Perlstein
67543ab1e3 Make FIOASYNC, FIOSETOWN and FIOGETOWN work on kqueues. 2004-07-14 07:02:03 +00:00
John Baldwin
6942d4339e Set TDF_NEEDRESCHED when a higher priority thread is scheduled in
sched_add() rather than just doing it in sched_wakeup().  The old
ithread preemption code used to set NEEDRESCHED unconditionally if it
didn't preempt which masked this bug in SCHED_4BSD.

Noticed by:	jake
Reported by:	kensmith, marcel
2004-07-13 20:49:13 +00:00
Poul-Henning Kamp
65a311fcb2 Give kldunload a -f(orce) argument.
Add a MOD_QUIESCE event for modules.  This should return error (EBUSY)
of the module is in use.

MOD_UNLOAD should now only fail if it is impossible (as opposed to
inconvenient) to unload the module.  Valid reasons are memory references
into the module which cannot be tracked down and eliminated.

When kldunloading, we abandon if MOD_UNLOAD fails, and if -force is
not given, MOD_QUIESCE failing will also prevent the unload.

For backwards compatibility, we treat EOPNOTSUPP from MOD_QUIESCE as
success.

Document that modules should return EOPNOTSUPP for unknown events.
2004-07-13 19:36:59 +00:00
Poul-Henning Kamp
1a946b9fef Add kldunloadf() system call. Stay tuned for follwing commit messages. 2004-07-13 19:35:11 +00:00
Poul-Henning Kamp
49bddf0c9f fix compilation. 2004-07-13 16:33:38 +00:00
Colin Percival
65bba83fef Replace "uid != 0" with "suser(td->td_ucred) != 0" when checking if we've
hit the maximum number of processes.  The last ten processes are reserved
for the *non-jailed* superuser.
2004-07-13 13:10:07 +00:00
David Xu
4d47dc5549 Add code to support debugging threaded process.
1. Add tm_lwpid into kse_thr_mailbox to indicate which kernel
     thread current user thread is running on. Add tm_dflags into
     kse_thr_mailbox, the flags is written by debugger, it tells
     UTS and kernel what should be done when the process is being
     debugged, current, there two flags TMDF_SSTEP and TMDF_DONOTRUNUSER.

     TMDF_SSTEP is used to tell kernel to turn on single stepping,
     or turn off if it is not set.

     TMDF_DONOTRUNUSER is used to tell kernel to schedule upcall
     whenever possible, to UTS, it means do not run the user thread
     until debugger clears it, this behaviour is necessary because
     gdb wants to resume only one thread when the thread's pc is
     at a breakpoint, and thread needs to go forward, in order to
     avoid other threads sneak pass the breakpoints, it needs to remove
     breakpoint, only wants one thread to go. Also, add km_lwp to
     kse_mailbox, the lwp id is copied to kse_thr_mailbox at context
     switch time when process is not being debugged, so when process
     is attached, debugger can map kernel thread to user thread.

  2. Add p_xthread to proc strcuture and td_xsig to thread structure.
     p_xthread is used by a thread when it wants to report event
     to debugger, every thread can set the pointer, especially, when
     it is used in ptracestop, it is the last thread reporting event
     will win the race. Every thread has a td_xsig to exchange signal
     with debugger, thread uses TDF_XSIG flag to indicate it is reporting
     signal to debugger, if the flag is not cleared, thread will keep
     retrying until it is cleared by debugger, p_xthread may be
     used by debugger to indicate CURRENT thread. The p_xstat is still
     in proc structure to keep wait() to work, in future, we may
     just use td_xsig.

  3. Add TDF_DBSUSPEND flag, the flag is used by debugger to suspend
     a thread. When process stops, debugger can set the flag for
     thread, thread will check the flag in thread_suspend_check,
     enters a loop, unless it is cleared by debugger, process is
     detached or process is existing. The flag is also checked in
     ptracestop, so debugger can temporarily suspend a thread even
     if the thread wants to exchange signal.

  4. Current, in ptrace, we always resume all threads, but if a thread
     has already a TDF_DBSUSPEND flag set by debugger, it won't run.

Encouraged by: marcel, julian, deischen
2004-07-13 07:33:40 +00:00
David Xu
ef9457becb Implement following commands: PT_CLEARSTEP, PT_SETSTEP, PT_SUSPEND
PT_RESUME, PT_GETNUMLWPS, PT_GETLWPLIST.
2004-07-13 07:25:24 +00:00
David Xu
cbf4e354ec Add code to support debugging threaded process.
1. Add tm_lwpid into kse_thr_mailbox to indicate which kernel
   thread current user thread is running on. Add tm_dflags into
   kse_thr_mailbox, the flags is written by debugger, it tells
   UTS and kernel what should be done when the process is being
   debugged, current, there two flags TMDF_SSTEP and TMDF_DONOTRUNUSER.

   TMDF_SSTEP is used to tell kernel to turn on single stepping,
   or turn off if it is not set.

   TMDF_DONOTRUNUSER is used to tell kernel to schedule upcall
   whenever possible, to UTS, it means do not run the user thread
   until debugger clears it, this behaviour is necessary because
   gdb wants to resume only one thread when the thread's pc is
   at a breakpoint, and thread needs to go forward, in order to
   avoid other threads sneak pass the breakpoints, it needs to remove
   breakpoint, only wants one thread to go. Also, add km_lwp to
   kse_mailbox, the lwp id is copied to kse_thr_mailbox at context
   switch time when process is not being debugged, so when process
   is attached, debugger can map kernel thread to user thread.

2. Add p_xthread to proc strcuture and td_xsig to thread structure.
   p_xthread is used by a thread when it wants to report event
   to debugger, every thread can set the pointer, especially, when
   it is used in ptracestop, it is the last thread reporting event
   will win the race. Every thread has a td_xsig to exchange signal
   with debugger, thread uses TDF_XSIG flag to indicate it is reporting
   signal to debugger, if the flag is not cleared, thread will keep
   retrying until it is cleared by debugger, p_xthread may be
   used by debugger to indicate CURRENT thread. The p_xstat is still
   in proc structure to keep wait() to work, in future, we may
   just use td_xsig.

3. Add TDF_DBSUSPEND flag, the flag is used by debugger to suspend
   a thread. When process stops, debugger can set the flag for
   thread, thread will check the flag in thread_suspend_check,
   enters a loop, unless it is cleared by debugger, process is
   detached or process is existing. The flag is also checked in
   ptracestop, so debugger can temporarily suspend a thread even
   if the thread wants to exchange signal.

4. Current, in ptrace, we always resume all threads, but if a thread
   has already a TDF_DBSUSPEND flag set by debugger, it won't run.

Encouraged by: marcel, julian, deischen
2004-07-13 07:20:10 +00:00
Alan Cox
ce8da3091f Push down the acquisition and release of the page queues lock into
pmap_remove_pages().  (The implementation of pmap_remove_pages() is
optional.  If pmap_remove_pages() is unimplemented, the acquisition and
release of the page queues lock is unnecessary.)

Remove spl calls from the alpha, arm, and ia64 pmap_remove_pages().
2004-07-13 02:49:22 +00:00
David Malone
dcee93dcf9 Rename Alfred's kern_setsockopt to so_setsockopt, as this seems a
a better name. I have a kern_[sg]etsockopt which I plan to commit
shortly, but the arguments to these function will be quite different
from so_setsockopt.

Approved by:	alfred
2004-07-12 21:42:33 +00:00
Mike Makonnen
c21e3b38bd writers must hold both sched_lock and the process lock; therefore, readers
need only obtain the process lock.
2004-07-12 15:28:31 +00:00
Alfred Perlstein
f257b7a54b Make VFS_ROOT() and vflush() take a thread argument.
This is to allow filesystems to decide based on the passed thread
which vnode to return.
Several filesystems used curthread, they now use the passed thread.
2004-07-12 08:14:09 +00:00
David Xu
507b03186a Change kse_switchin to accept kse_thr_mailbox pointer, the syscall
will be used heavily in debugging KSE threads. This breaks libpthread
on IA64, but because libpthread was not in 5.2.1 release, I would like
to change it so we needn't to introduce another syscall.
2004-07-12 07:39:20 +00:00
Alfred Perlstein
d58d3648dd Use SO_REUSEADDR and SO_REUSEPORT when reconnecting NFS mounts.
Tune the timeout from 5 seconds to 12 seconds.
Provide a sysctl to show how many reconnects the NFS client has done.

Seems to fix IPv6 from: kuriyama
2004-07-12 06:22:42 +00:00
Marcel Moolenaar
fbc3247d81 Implement the PT_LWPINFO request. This request can be used by the
tracing process to obtain information about the LWP that caused the
traced process to stop. Debuggers can use this information to select
the thread currently running on the LWP as the current thread.

The request has been made compatible with NetBSD for as much as
possible. This implementation differs from NetBSD in the following
ways:
1.  The data argument is allowed to be smaller than the size of the
    ptrace_lwpinfo structure known to the kernel, but not 0. This
    is opposite to what NetBSD allows. The reason for this is that
    we can extend the structure without affecting older binaries.
2.  On NetBSD the tracing process is to set the pl_lwpid field to
    the Id of the LWP it wants information of. We don't do that.
    Our ptrace interface allows passing the LWP Id instead of the
    PID. The tracing process is to set the PID to the LWP Id it
    wants information of.
3.  When the PID is actually the PID of the tracing process, this
    request returns the information about the LWP that caused the
    process to stop. This was the whole purpose of the request in
    the first place.

When the traced process has exited, this request will return the
LWP Id 0, indicating that the process state is not the result of
an event specific to a LWP.
2004-07-12 05:07:50 +00:00
Alfred Perlstein
7ae8ce5df1 Dump the actual bad values when this assertion is tripped. 2004-07-12 04:13:38 +00:00
Marcel Moolenaar
3bcd2440db Make kdb_dbbe_select() available as an interface function. This allows
changing the backend from outside the KDB frontend. For example from
within a backend. Rewrite kdb_sysctl_current to make use of this
function as well.
2004-07-12 01:15:55 +00:00
Robert Watson
a294c3664f Use sockbuf_pushsync() to synchronize stack and socket buffer state
in soreceive() after removing an MT_SONAME mbuf from the head of the
socket buffer.

When processing MT_CONTROL mbufs in soreceive(), first remove all of
the MT_CONTROL mbufs from the head of the socket buffer to a local
mbuf chain, then feed them into dom_externalize() as a set, which
both avoids thrashing the socket buffer lock when handling multiple
control mbufs, and also avoids races with other threads acting on
the socket buffer when the socket buffer mutex is released to enter
the externalize code.  Existing races that might occur if the protocol
externalize method blocked during processing have also been closed.

Now that we synchronize socket buffer and stack state following
modifications to the socket buffer, turn the manual synchronization
that previously followed control mbuf processing with a set of
assertions.  This can eventually be removed.

The soreceive() code is now substantially more MPSAFE.
2004-07-11 23:13:14 +00:00
Robert Watson
b7562e178c Add sockbuf_pushsync(), an inline function that, following a change to
the head of the mbuf chains in a socket buffer, re-synchronizes the
cache pointers used to optimize socket buffer appends.  This will be
used by soreceive() before dropping socket buffer mutexes to make sure
a consistent version of the socket buffer is visible to other threads.

While here, update copyright to account for substantial rewrite of much
socket code required for fine-grained locking.
2004-07-11 22:59:32 +00:00
Poul-Henning Kamp
9ef295b7e0 Better descriptions of the cdev malloc class and mutex. 2004-07-11 19:26:43 +00:00
Robert Watson
d861372b14 Add additional annotations to soreceive(), documenting the effects of
locking on 'nextrecord' and concerns regarding potentially inconsistent
or stale use of socket buffer or stack fields if they aren't carefully
synchronized whenever the socket buffer mutex is released.  Document
that the high-level sblock() prevents races against other readers on
the socket.

Also document the 'type' logic as to how soreceive() guarantees that
it will only return one of normal data or inline out-of-band data.
2004-07-11 18:29:47 +00:00
Doug Rabson
67f8f14a6e Expand and rewrite documentation using doxygen markup so that we can
generate funky web pages from it.
2004-07-11 16:17:42 +00:00
Marcel Moolenaar
a8bfba1a27 Fix braino: Make sure there is a current backend before we return its
name in the debug.kdb.current sysctl. All other dereferences are
properly guarded, but this one was overlooked.

Reported by: Morten Rodal (morten at rodal dot no)
2004-07-11 15:22:43 +00:00
Poul-Henning Kamp
911dbd84c7 Introduce ttygone() which indicates that the hardware is detached.
Move dtrwait logic to the generic TTY level.
2004-07-11 15:18:39 +00:00
Robert Watson
0014b343e0 In the 'dontblock' section of soreceive(), assert that the mbuf on hand
('m') is in fact the first mbuf in the receive socket buffer.
2004-07-11 01:44:12 +00:00
Robert Watson
5e44d93ffc Break out non-inline out-of-band data receive code from soreceive()
and put it in its own helper function soreceive_rcvoob().
2004-07-11 01:34:34 +00:00
Robert Watson
a04b09398c Assign pointers values of NULL rather than 0 in soreceive(). 2004-07-11 01:22:40 +00:00
Marcel Moolenaar
32240d082c Update for the KDB framework:
o  Call kdb_enter() instead of Debugger().
2004-07-10 21:47:53 +00:00
Robert Watson
7e17bc9f26 When the MT_SONAME mbuf is popped off of a receive socket buffer
associated with a PR_ADDR protocol, make sure to update the m_nextpkt
pointer of the new head mbuf on the chain to point to the next record.
Otherwise, when we release the socket buffer mutex, the socket buffer
mbuf chain may be in an inconsistent state.
2004-07-10 21:43:35 +00:00
Marcel Moolenaar
82ebaee7a3 Update for the KDB framework:
o  Check kdb_active instead of db_active and do so unconditionally.
2004-07-10 21:43:23 +00:00
Marcel Moolenaar
eba21ad501 Update for the KDB framework:
o  Make debugging code conditional upon KDB instead of DDB.
o  s/WITNESS_DDB/WITNESS_KDB/g
o  s/witness_ddb/witness_kdb/g
o  Rename the debug.witness_ddb sysctl to debug.witness_kdb.
o  Call kdb_backtrace() instead of backtrace().
o  Call kdb_enter() instead Debugger().
o  Assert kdb_active instead of db_active.
2004-07-10 21:42:16 +00:00
Marcel Moolenaar
2c3490b1a8 Update for the KDB framework:
o  Call kdb_backtrace() instead of backtrace().
2004-07-10 21:38:22 +00:00
Marcel Moolenaar
ecb01c64d2 Make the GDB dynamic linker hooks (r_debug_state) conditional upon
GDB instead of DDB.
2004-07-10 21:37:30 +00:00
Marcel Moolenaar
2d50560abc Update for the KDB framework:
o  Make debugging code conditional upon KDB instead of DDB.
o  Call kdb_enter() instead of Debugger().
o  Call kdb_backtrace() instead of db_print_backtrace() or backtrace().

kern_mutex.c:
o  Replace checks for db_active with checks for kdb_active and make
   them unconditional.

kern_shutdown.c:
o  s/DDB_UNATTENDED/KDB_UNATTENDED/g
o  s/DDB_TRACE/KDB_TRACE/g
o  Save the TID of the thread doing the kernel dump so the debugger
   knows which thread to select as the current when debugging the
   kernel core file.
o  Clear kdb_active instead of db_active and do so unconditionally.
o  Remove backtrace() implementation.

kern_synch.c:
o  Call kdb_reenter() instead of db_error().
2004-07-10 21:36:01 +00:00
Marcel Moolenaar
cbc174356c Introduce the KDB debugger frontend. The frontend provides a framework
in which multiple (presumably different) debugger backends can be
configured and which provides basic services to those backends.
Besides providing services to backends, it also serves as the single
point of contact for any and all code that wants to make use of the
debugger functions, such as entering the debugger or handling of the
alternate break sequence. For this purpose, the frontend has been
made non-optional.
All debugger requests are forwarded or handed over to the current
backend, if applicable. Selection of the current backend is done by
the debug.kdb.current sysctl. A list of configured backends can be
obtained with the debug.kdb.available sysctl. One can enter the
debugger by writing to the debug.kdb.enter sysctl.
2004-07-10 18:40:12 +00:00
Poul-Henning Kamp
552afd9c12 Clean up and wash struct iovec and struct uio handling.
Add copyiniov() which copies a struct iovec array in from userland into
a malloc'ed struct iovec.  Caller frees.

Change uiofromiov() to malloc the uio (caller frees) and name it
copyinuio() which is more appropriate.

Add cloneuio() which returns a malloc'ed copy.  Caller frees.

Use them throughout.
2004-07-10 15:42:16 +00:00
Robert Watson
5c2b7a2273 Now socket buffer locks are being asserted at higher code blocks in
soreceive(), remove some leaf assertions that are redundant.
2004-07-10 04:38:06 +00:00
Robert Watson
32775a01da Assert socket buffer lock at strategic points between sections of code
in soreceive() to confirm we've moved from block to block properly
maintaining locking invariants.
2004-07-10 03:47:15 +00:00
John Baldwin
776b99ee1a Check the lock lists to see if they are empty directly rather than
assigning a pointer to the list and then dereferencing the pointer as a
second step.  When the first spin lock is acquired, curthread is not in
a critical section so it may be preempted and would end up using another
CPUs lock list instead of its own.

When this code was in witness_lock() this sequence was safe as curthread
was in a critical section already since witness_lock() is called after the
lock is acquired.

Tested by:	Daniel Lang dl at leo.org
2004-07-09 17:46:27 +00:00
Dag-Erling Smørgrav
520df27692 Cosmetic adjustment to previous commit: name the second argument to
sbuf_bcat() and sbuf_bcpy() "buf" rather than "data".
2004-07-09 11:37:44 +00:00
Dag-Erling Smørgrav
d751f0a935 Have sbuf_bcat() and sbuf_bcpy() take a const void * instead of a
const char *, since callers are likely to pass in pointers to all
kinds of structs and whatnot.
2004-07-09 11:35:30 +00:00
Alan Cox
0049f8b27b Eliminate struct shm_handle. It is an unnecessary level of indirection to
a vm_object.
2004-07-09 05:28:38 +00:00
Robert Watson
6ec70e64c6 Remove spl()'s from do_sendfile(). 2004-07-09 01:46:03 +00:00
John Baldwin
63fcce68f1 - Move contents of sched_add() into a sched_add_internal() function that
takes an argument to specify if it should preempt or not.  Don't preempt
  when sched_add_internal() is called from kseq_idled() or kseq_assign()
  as in those cases we are about to call mi_switch() anyways.  Also, doing
  so during the first context switch on an AP leads to a NULL pointer deref
  because curthread is NULL.
- Reenable preemption for ULE.

Submitted by:	Taku YAMAMOTO taku at tackymt.homeip.net
2004-07-08 21:45:04 +00:00
Alfred Perlstein
057589c485 fixup sysctl by fsid node 2004-07-08 06:11:36 +00:00
Alfred Perlstein
14d543ddfb style(9) 2004-07-07 07:00:02 +00:00
Alfred Perlstein
81d16e2d64 do the vfsstd thing instead of messing up our VFS_SYSCTL macro. 2004-07-07 06:58:29 +00:00
Peter Edwards
0f01586867 Fix bug introduced in rev 1.434:
When avoiding the zeroing of "bogus_page" when it appears in a buf,
be sure to advance the pointers into the data for successive pages.

The bug caused file corruption when read(2)ing from a "hole" in a
file where a previous page of the read block had already been faulted
in: fsx tripped up on this pretty quickly. The particular access
pattern is probably pretty unusual, so other applications probably
wouldn't have had problems, but you'd never know.

Reviewed By: alc@
2004-07-06 23:40:40 +00:00
Alfred Perlstein
1ea6061793 Use vfs_suser() where appropriate. 2004-07-06 09:39:32 +00:00
Alfred Perlstein
ea0104b032 Introduce vfs_suser(), used to test if a user should have special privs
for a mount.
2004-07-06 09:37:43 +00:00
Alfred Perlstein
c713aaaeca NFS mobility PHASE I, II & III (phase VI, and V pending):
Rebind the client socket when we experience a timeout.  This fixes
the case where our IP changes for some reason.

Signal a VFS event when NFS transitions from up to down and vice
versa.

Add a placeholder vfs_sysctl where we will put status reporting
shortly.

Also:
Make down NFS mounts return EIO instead of EINTR when there is a
soft timeout or force unmount in progress.
2004-07-06 09:12:03 +00:00
Robert Watson
df623e3c2f Temporarily disable preemption in SCHED_ULE due to reported panics and
hangs due to recent preemption changes.  This change appears to remove
the panic that I was running into, but at the cost of increasing
ithread scheduling latency, and as such is a temporary band-aid until
jhb has a chance to resolve the ule<->preemption interaction that is
the source of the problem.  If it doesn't fix the problem for others--
sorry!
2004-07-06 05:57:29 +00:00
Don Lewis
27875d9c88 Unconditionally set last_work_seen while in the SYNCER_RUNNING state
so that last_work_seen has a reasonable value at the transition
to the SYNCER_SHUTTING_DOWN state, even if net_worklist_len happened
to be zero at the time.

Initialize last_work_seen to zero as a safety measure in case the
syncer never ran in the SYNCER_RUNNING state.

Tested by:	phk
2004-07-05 21:32:01 +00:00
Robert Watson
6a72b225b7 Drop the socket buffer lock around a call to m_copym() with M_TRYWAIT.
A subset of locking changes to soreceive() in the queue for merging.

Bumped into by:	Willem Jan Withagen <wjw@withagen.nl>
2004-07-05 19:29:33 +00:00
Don Lewis
faf1b66d1d Rework syncer termination code:
Speed up the syncer when shutting down by sleeping for a shorter
    period of time instead of cranking up rushjob and using the
    normal one second sleep.

    Skip empty worklist slots when shutting down to avoid lengthy
    intervals of inactivity.

    Give I/O more time to complete between steps by not speeding the
    syncer quite as much.

    Terminate the syncer after one full pass through the worklist
    plus one second with the worklist containing nothing but syncer
    vnodes.

    Print an indication of shutdown progress to the console.

Add a sysctl, vfs.worklist_len, to allow the size of the syncer worklist
to be monitored.
2004-07-05 01:07:33 +00:00
Poul-Henning Kamp
c555963fd1 Give synthetic root filesystem device vnodes a v_bsize of DEV_BSIZE. 2004-07-04 22:33:22 +00:00
Alfred Perlstein
2d1dca73ee Pass the operation in with the fsidctl.
Remove some fsidctls that we will not be using.
Correct prototypes for fs sysctls.
2004-07-04 20:21:58 +00:00
Poul-Henning Kamp
7f6599fec6 Make the last commit handle non-phk root devices better. 2004-07-04 19:42:25 +00:00
Stefan Farfeleder
5908d366fb Consistently use __inline instead of __inline__ as the former is an empty macro
in <sys/cdefs.h> for compilers without support for inline.
2004-07-04 16:11:03 +00:00
Poul-Henning Kamp
1cbb1e02c4 Blocksize for I/O should be a property of the vnode and not found by groping
around in the vnodes surroundings when we allocate a block.

Assign a blocksize when we create a vnode, and yell a warning (and ignore it)
if we got the wrong size.

Please email all such warnings to me.
2004-07-04 12:49:04 +00:00
Alfred Perlstein
94ed9c8af5 Introduce a new kevent filter. EVFILT_FS that will be used to signal
generic filesystem events to userspace.  Currently only mount and unmount
of filesystems are signalled.  Soon to be added, up/down status of NFS.

Introduce a sysctl node used to route requests to/from filesystems
based on filesystem ids.

Introduce a new vfsop, vfs_sysctl(mp, req) that is used as the callback/
entrypoint by the sysctl code to change individual filesystems.
2004-07-04 10:52:54 +00:00
Alfred Perlstein
903ac7c219 Revision 1.496 would not boot on my system due to
ffs_mount -> bdevvp -> getnewvnode(..., mp = NULL, ...) ->
 insmntqueue(vp, mp = NULL) -> KASSERT -> panic

Make getnewvnode() only call insmntqueue() if the mountpoint parameter
is not NULL.
2004-07-04 10:19:15 +00:00
Poul-Henning Kamp
e3c5a7a4dd When we traverse the vnodes on a mountpoint we need to look out for
our cached 'next vnode' being removed from this mountpoint.  If we
find that it was recycled, we restart our traversal from the start
of the list.

Code to do that is in all local disk filesystems (and a few other
places) and looks roughly like this:

		MNT_ILOCK(mp);
	loop:
		for (vp = TAILQ_FIRST(&mp...);
		    (vp = nvp) != NULL;
		    nvp = TAILQ_NEXT(vp,...)) {
			if (vp->v_mount != mp)
				goto loop;
			MNT_IUNLOCK(mp);
			...
			MNT_ILOCK(mp);
		}
		MNT_IUNLOCK(mp);

The code which takes vnodes off a mountpoint looks like this:

	MNT_ILOCK(vp->v_mount);
	...
	TAILQ_REMOVE(&vp->v_mount->mnt_nvnodelist, vp, v_nmntvnodes);
	...
	MNT_IUNLOCK(vp->v_mount);
	...
	vp->v_mount = something;

(Take a moment and try to spot the locking error before you read on.)

On a SMP system, one CPU could have removed nvp from our mountlist
but not yet gotten to assign a new value to vp->v_mount while another
CPU simultaneously get to the top of the traversal loop where it
finds that (vp->v_mount != mp) is not true despite the fact that
the vnode has indeed been removed from our mountpoint.

Fix:

Introduce the macro MNT_VNODE_FOREACH() to traverse the list of
vnodes on a mountpoint while taking into account that vnodes may
be removed from the list as we go.  This saves approx 65 lines of
duplicated code.

Split the insmntque() which potentially moves a vnode from one mount
point to another into delmntque() and insmntque() which does just
what the names say.

Fix delmntque() to set vp->v_mount to NULL while holding the
mountpoint lock.
2004-07-04 08:52:35 +00:00