Commit Graph

7167 Commits

Author SHA1 Message Date
John Baldwin
f4daf05619 - Correct the translation of old rlimit values to properly handle the old
RLIM_INFINITY case for ogetrlimit().
- Use %jd and intmax_t to output negative time in usec in calcru().
- Rework getrusage() to make a copy of the rusage struct into a local
  variable while holding Giant and then do the copyout from the local
  variable to avoid having to have the original process rusage struct
  locked while doing the copyout (which would not be safe).  This also
  includes a few style fixes from Bruce to getrusage().

Submitted by:	bde (1, parts of 3)
Suggested by:	bde (2)
2004-02-06 19:30:12 +00:00
John Baldwin
99b6e02ba6 A few more style fixes from Bruce including a few I missed last time.
Submitted by:	bde
2004-02-06 19:25:34 +00:00
John Baldwin
4c3558aa82 Always set a process' state to normal when it is fully constructed in
fork1() rather than only doing it for the RFSTOPPED case and then having
to fix it up in other places later on.
2004-02-05 21:01:37 +00:00
John Baldwin
b4323d7729 - A lot of style and whitespace fixes.
- Update a few comments regarding locking notes.

Submitted by:	bde (1, mostly)
2004-02-05 20:53:25 +00:00
Jacques Vidrine
b00a3c85da Correct a reference counting bug in shmat(2). If vm_map_find(9)
failed, the reference count for the virtual memory object referenced
by the specified shared memory segment would have been erroneously
incremented.

Reported by:	Joost Pol <joost@pine.nl>
2004-02-05 18:00:35 +00:00
Alexander Kabaev
dec8868dcc Rename cn_unavailable to cnunavailable for a little more consistency.
Garbage collect unused cndebug() function.

Suggested by:	bde
2004-02-05 17:35:28 +00:00
Mike Silbersack
b711d74eaf Style fixes: don't indent variable names.
Submitted by:	bde
2004-02-05 08:29:27 +00:00
Alexander Kabaev
e99c09e2dc Eliminate global cons_unavailable flag and replace it by the status
bit maintained on a per-device basis. Single variable is inadequate
on machines running with multiple consoles enabled.
2004-02-05 01:56:43 +00:00
John Baldwin
91d5354a2c Locking for the per-process resource limits structure.
- struct plimit includes a mutex to protect a reference count.  The plimit
  structure is treated similarly to struct ucred in that it is always copy
  on write, so having a reference to a structure is sufficient to read from
  it without needing a further lock.
- The proc lock protects the p_limit pointer and must be held while reading
  limits from a process to keep the limit structure from changing out from
  under you while reading from it.
- Various global limits that are ints are not protected by a lock since
  int writes are atomic on all the archs we support and thus a lock
  wouldn't buy us anything.
- All accesses to individual resource limits from a process are abstracted
  behind a simple lim_rlimit(), lim_max(), and lim_cur() API that return
  either an rlimit, or the current or max individual limit of the specified
  resource from a process.
- dosetrlimit() was renamed to kern_setrlimit() to match existing style of
  other similar syscall helper functions.
- The alpha OSF/1 compat layer no longer calls getrlimit() and setrlimit()
  (it didn't use the stackgap when it should have) but uses lim_rlimit()
  and kern_setrlimit() instead.
- The svr4 compat no longer uses the stackgap for resource limits calls,
  but uses lim_rlimit() and kern_setrlimit() instead.
- The ibcs2 compat no longer uses the stackgap for resource limits.  It
  also no longer uses the stackgap for accessing sysctl's for the
  ibcs2_sysconf() syscall but uses kernel_sysctl() instead.  As a result,
  ibcs2_sysconf() no longer needs Giant.
- The p_rlimit macro no longer exists.

Submitted by:	mtm (mostly, I only did a few cleanups and catchups)
Tested on:	i386
Compiled on:	alpha, amd64
2004-02-04 21:52:57 +00:00
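
The copy-on-write rule described in the plimit commit above can be sketched in user space as follows. This is only an illustration under stated assumptions: the names plimit_sketch, lim_alloc_sketch(), lim_hold_sketch(), lim_free_sketch() and lim_set_sketch(), and the NLIMITS constant, are hypothetical and are not the kernel's API, and the mutex that protects the reference count in the kernel is omitted.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NLIMITS 4	/* hypothetical; stands in for the kernel's limit count */

struct limit {
	long cur;	/* current (soft) limit */
	long max;	/* maximum (hard) limit */
};

/* Hypothetical stand-in for struct plimit. */
struct plimit_sketch {
	struct limit lims[NLIMITS];
	int refcnt;	/* the kernel guards this with a mutex; omitted here */
};

/* Allocate a zeroed limit structure holding one reference. */
struct plimit_sketch *
lim_alloc_sketch(void)
{
	struct plimit_sketch *l = calloc(1, sizeof(*l));

	if (l != NULL)
		l->refcnt = 1;
	return (l);
}

/* Take an additional reference; readers only need a reference. */
struct plimit_sketch *
lim_hold_sketch(struct plimit_sketch *l)
{
	l->refcnt++;
	return (l);
}

/* Drop a reference and free the structure when the last one goes away. */
void
lim_free_sketch(struct plimit_sketch *l)
{
	assert(l->refcnt > 0);
	if (--l->refcnt == 0)
		free(l);
}

/*
 * Copy on write: a shared structure is never modified in place.  The
 * caller's reference on 'old' is consumed and the structure to install
 * is returned.  On allocation failure NULL is returned and 'old' is
 * left untouched.
 */
struct plimit_sketch *
lim_set_sketch(struct plimit_sketch *old, int which, struct limit newlim)
{
	struct plimit_sketch *l = old;

	assert(which >= 0 && which < NLIMITS);
	if (old->refcnt > 1) {
		l = lim_alloc_sketch();
		if (l == NULL)
			return (NULL);
		memcpy(l->lims, old->lims, sizeof(l->lims));
		lim_free_sketch(old);
	}
	l->lims[which] = newlim;
	return (l);
}
```

Because a shared structure is never written to, any holder of a reference can read its limits without taking a further lock, which is the property the commit message relies on.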
Mike Silbersack
ff5e43a3fd Rename iov_to_uio to uiofromiov to be more consistent with other
uio* functions.

Suggested by:	bde
2004-02-04 08:43:21 +00:00
Pawel Jakub Dawidek
19b0efd32d Allow asserting that the current thread does not hold the sx(9) lock.
Reviewed by:		jhb
In cooperation with:	juli, jhb
Approved by:		jhb, scottl (mentor)
2004-02-04 08:14:58 +00:00
Mike Silbersack
2ccbe4b596 Style fixes
Submitted by:	bde
2004-02-04 08:14:47 +00:00
Robert Watson
5e312ddcc6 A variety of further cleanups to ttyinfo():
- Rename temporary variable names ("tmp", "tmp2") to more informative
  names ("load", "pctcpu", "rss", ...)

- Unclutter indentation and return paths: rather than lots of nested
  ifs, simply return earlier if it's not going to work out.  Simplify
  general structure and avoid "deep" code.

- Comment on the thread/process selection and locking.

- Correct handling of "running"/"runnable" states, avoid "unknown"
  that people were seeing for running processes.  This was due to
  a misunderstanding of the more complex state machine / inhibitors
  behavior of KSE.

- Do perform ttyinfo() printing on KSE (P_SA) processes, it seems
  generally to work.

While I initially attempted to formulate this as two commits (one
layout, the other content), I concluded that the layout changes were
really structural changes.

Many elements submitted by:  bde
2004-02-04 05:46:05 +00:00
John Baldwin
3e9ac3ebf2 Remove a bogus assertion.
Noticed by:	bde
Pointy hat to:	jhb
2004-02-03 15:14:27 +00:00
Daniel Eischen
b5426f096b Regen after adding ksem_timedwait(). 2004-02-03 05:11:31 +00:00
Daniel Eischen
aae94fbbb6 Add ksem_timedwait() to complement ksem_wait().
Glanced at by:	alfred
2004-02-03 05:08:32 +00:00
Robert Watson
4f638130c3 Don't dec/inc the amountpipes counter every time we resize a pipe --
instead, just dec/inc in the ctor/dtor.  For now, increment/decrement
in two's, since we're now performing the operation once per pair,
not once per pipe.  Not really any measurable performance change
in my micro-benchmarks, but doing less work is good, especially when
it comes to atomic operations.

Suggested by:	alc
2004-02-03 04:55:24 +00:00
Robert Watson
9a830ddc54 Catch instances of (pipe == NULL) that were obsoleted with recent
changes to jointly allocated pipe pairs.  Replace these checks
with pipe_present checks.  This avoids a NULL pointer dereference
when a pipe is half-closed.

Submitted by:	Peter Edwards <peter.edwards@openet-telecom.com>
2004-02-03 02:50:51 +00:00
John Baldwin
9c9c52a3ed - Assert that witness_cold is not true in enroll().
- Only check witness_watch once in enroll().

Reported by:	ru (2)
2004-02-02 22:15:17 +00:00
Pawel Jakub Dawidek
3410b19324 Fix many issues related to mount/unmount:
1. Root from inside a jail was able to unmount any file system
   (except /).
2. Unprivileged root was able to unmount file systems mounted by
   privileged root (except /).
3. User from inside a jail was able to mount file system when
   sysctl vfs.usermount was set to 1.
4. User was able to mount file system when vfs.usermount was set to 1
   (that's ok) and unmount it even if vfs.usermount was equal to 0
   (that's not correct).

Possibility from point 1 was reported by: Dariusz Kowalski <darek@76.pl>

Only a part of this fix will be MFC'ed (if approved).

PR:		kern/60149
Reviewed by:	rwatson
Approved by:	scottl (mentor)
MFC after:	3 days
2004-02-02 19:02:05 +00:00
Mike Silbersack
02ec600572 Remove debugging code that slipped into the previous commit.
Spotted by:	bde
2004-02-02 09:09:59 +00:00
Jeff Roberson
b209e5e3e4 - style fixes to the critical_exit() KASSERT().
Submitted by:	bde
2004-02-02 08:13:27 +00:00
Jeff Roberson
0392e39dff - Allow interactive tasks to use the maximum time-slice. This is not as
detrimental as I thought it would be in the case of massive process
   storms from a shell and it makes regular desktop usage noticeably
   better.
2004-02-01 10:38:13 +00:00
Mike Silbersack
beb699c7ba Rewrite sendfile's header support so that headers are now sent in the first
packet along with data, instead of in their own packet.  When serving files
of size (packetsize - headersize) or smaller, this will result in one less
packet crossing the network.  Quick testing with thttpd and http_load has
shown a noticeable performance improvement in this case (350 vs 330 fetches
per second.)

Included in this commit are two support routines, iov_to_uio, and m_uiotombuf;
these routines are used by sendfile to construct the header mbuf chain that
will be linked to the rest of the data in the socket buffer.
2004-02-01 07:56:44 +00:00
Jeff Roberson
f2f51f8ab8 - Disable ithread binding in all cases for now. This doesn't make as much
sense with sched_4bsd as it does with sched_ule.
 - Use P_NOLOAD instead of the absence of td->td_ithd to determine whether or
   not a thread should be accounted for in sched_tdcnt.
2004-02-01 06:20:18 +00:00
Robert Watson
4795b82c13 Coalesce pipe allocations and frees. Previously, the pipe code
would allocate two 'struct pipe's from the pipe zone, and malloc a
mutex.

- Create a new "struct pipepair" object holding the two 'struct
  pipe' instances, struct mutex, and struct label reference.  Pipe
  structures now have a back-pointer to the pipe pair, and a
  'pipe_present' flag to indicate whether the half has been
  closed.

- Perform mutex init/destroy in zone init/destroy, avoiding
  reallocating the mutex for each pipe.  Perform most pipe structure
  setup in zone constructor.

- VM memory mappings for pageable buffers are still done outside of
  the UMA zone.

- Change MAC API to speak 'struct pipepair' instead of 'struct pipe',
  update many policies.  MAC labels are also handled outside of the
  UMA zone for now.  Label-only policy modules don't have to be
  recompiled, but if a module is recompiled, its pipe entry points
  will need to be updated.  If a module actually reached into the
  pipe structures (unlikely), that would also need to be modified.

These changes substantially simplify failure handling in the pipe
code as there are many fewer possible failure modes.

On half-close, pipes no longer free the 'struct pipe' for the closed
half until a full-close takes place.  However, VM mapped buffers
are still released on half-close.

Some code refactoring is now possible to clean up some of the back
references, etc; this patch attempts not to change the structure
of most of the pipe implementation, only allocation/free code
paths, so as to avoid introducing bugs (hopefully).

This cuts about 8%-9% off the cost of sequential pipe allocation
and free in system call tests on UP and SMP in my micro-benchmarks.
May or may not make a difference in macro-benchmarks, but doing
less work is good.

Reviewed by:	juli, tjr
Testing help:	dwhite, fenestro, scottl, et al
2004-02-01 05:56:51 +00:00
Jeff Roberson
40ece05382 - Revert rev 1.240 we no longer need a kthread for loadav(). 2004-02-01 05:37:36 +00:00
Jeff Roberson
e7f004fe23 - Use sched_load() rather than grabbing the sx lock and traversing the proc
table to discover the load.
2004-02-01 02:51:33 +00:00
Jeff Roberson
33916c360e - Add a new member to struct kseq called ksq_sysload. This is intended to
track the load for the sched_load() function.  In the SMP case this member
   is not defined because it would be redundant with the ksg_load member
   which already tracks the non ithd load.
 - For sched_load() in the UP case simply return ksq_sysload.  In the SMP
   case traverse the list of kseq groups and sum up their ksg_load fields.
2004-02-01 02:48:36 +00:00
Jeff Roberson
ca59f15272 - Keep a variable 'sched_tdcnt' that is used for the local implementation
of sched_load().  This variable tracks the number of running and runnable
   non ithd threads.  This removes the need to traverse the proc table and
   discover how many threads are runnable.
2004-02-01 02:46:47 +00:00
Robert Watson
fca542bcaa Move KASSERT regarding td_critnest to after the value of td is set to
curthread, to avoid warning and incorrect behavior.

Hoped not to mind:	jeff
2004-02-01 02:31:36 +00:00
Jeff Roberson
6767c6547b - Assert that td_critnest > 0 in critical_exit() to catch cases of
unbalanced uses of the critical_* api.
2004-02-01 01:24:54 +00:00
Robert Watson
26518e8d8c Fix an error in a KASSERT string: it's pipe_free_kmem(), not
pipespace(), that contains this KASSERT.
2004-01-31 23:03:22 +00:00
Poul-Henning Kamp
be8a62e821 Introduce the SO_BINTIME option which takes a high-resolution timestamp
at packet arrival.

For benchmarking purposes SO_BINTIME is preferable to SO_TIMEVAL
since it has higher resolution and lower overhead.  Simultaneous
use of the two options is possible and they will return consistent
timestamps.

This introduces an extra test and a function call for SO_TIMEVAL, but I have
not been able to measure that.
2004-01-31 10:40:25 +00:00
Robert Watson
30a9f26db2 Assert process lock in ptracestop(), since we're going to rely
on it, and later unlock it.
2004-01-29 00:58:21 +00:00
Robert Watson
94ffb20d72 Add a reset sysctl for mutex profiling: zeros all of the mutex
profiling buffers and hash table.  This makes it a lot easier to
do multiple profiling runs without rebooting or performing
gratuitous arithmetic.  Sysctl is named debug.mutex.prof.reset.

Reviewed by:	jake
2004-01-28 22:11:53 +00:00
John Baldwin
d5b75694e7 Move the loadav() callout into its own kthread since it uses allproc_lock
which is a sleepable lock and thus is not safe to acquire from a callout
routine.
2004-01-28 20:44:41 +00:00
John Baldwin
8d768e7676 Rework witness_lock() to make it slightly more useful and flexible.
- witness_lock() is split into two pieces: witness_checkorder() and
  witness_lock().  Witness_checkorder() determines if acquiring a specified
  lock at the time it is called would result in a lock order reversal.  It
  optionally adds a new lock order relationship as well.  witness_lock()
  updates witness's data structures to assume that a lock has been acquired
  by sticking a new lock instance in the appropriate lock instance list.
- The mutex and sx lock functions now call checkorder() prior to trying to
  acquire a lock and continue to call witness_lock() after the acquire is
  completed.  This will let witness catch a deadlock before it happens
  rather than trying to do so after the threads have deadlocked (i.e. never
  actually report it).
- A new function witness_defineorder() has been added that adds a lock
  order between two locks at runtime without having to acquire the locks.
  If the lock order cannot be added it will return an error.  This function
  is available to programmers via the WITNESS_DEFINEORDER() macro which
  accepts either two mutexes or two sx locks as its arguments.
- A few simple wrapper macros were added to allow developers to call
  witness_checkorder() anywhere as a way of enforcing locking assertions
  in code that might acquire a certain lock in some situations.  The
  macros are: witness_check_{mutex,shared_sx,exclusive_sx} and take an
  appropriate lock as the sole argument.
- The code to remove a lock instance from a lock list in witness_unlock()
  was unnested by using a goto to vastly improve the readability of this
  function.
2004-01-28 20:39:57 +00:00
John Baldwin
62a0fd943c Use mtx_assert() rather than using a home-rolled version. 2004-01-28 20:26:39 +00:00
Alexander Kabaev
975634280a Move the part of the comment which applies to osigsuspend where
it belongs. The current sigsuspend syscall does expect a pointer
to the mask as argument.

Submitted by:	Igor Sysoev <is at rambler-co dot ru>
2004-01-28 06:06:04 +00:00
Dag-Erling Smørgrav
84344f9fbf Rename the kern.vm.kmem.size tunable to the more logical vm.kmem_size. To
assure backward compatibility (conditional on !BURN_BRIDGES), look it up
by its old name first, and log a warning (but accept the setting) if it
was found.  If both the old and new name are defined, the new name takes
precedence.

Also export vm.kmem_size as a read-only sysctl variable; I find it hard to
tune a parameter when I don't know its default value, especially when that
default value is computed at boot time.
2004-01-27 15:59:38 +00:00
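
The lookup order described in the commit above (old name first, new name overriding it) might look roughly like the sketch below. The tunable table, tunable_fetch() and kmem_size_resolve() are hypothetical helpers invented for this illustration; the kernel's real tunable interface and the !BURN_BRIDGES conditional are not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the loader environment. */
struct tunable {
	const char *name;
	long value;
};

/* Store the value of 'name' in *out and return 1 if it is set, else 0. */
static int
tunable_fetch(const struct tunable *env, size_t n, const char *name,
    long *out)
{
	for (size_t i = 0; i < n; i++) {
		if (strcmp(env[i].name, name) == 0) {
			*out = env[i].value;
			return (1);
		}
	}
	return (0);
}

/*
 * Resolve the kmem size: start from the default, accept the old name,
 * then let the new name override it.  *used_old reports whether the
 * old name was found, i.e. whether a warning would be logged.
 */
long
kmem_size_resolve(const struct tunable *env, size_t n, long dflt,
    int *used_old)
{
	long v = dflt;

	assert(used_old != NULL);
	*used_old = tunable_fetch(env, n, "kern.vm.kmem.size", &v);
	(void)tunable_fetch(env, n, "vm.kmem_size", &v);
	return (v);
}
```

Fetching the old name first and the new name second gives the stated precedence without any explicit comparison: the later fetch simply overwrites the earlier result when both are set.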
Robert Watson
6bea667f63 When aborting fork() due to a failure, if using MAC, make sure to clean
up the p_label field.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, McAfee Research
2004-01-25 18:42:18 +00:00
Ruslan Ermilov
33fe8fd0df Register the uart(4)'s spin lock with witness(4). 2004-01-25 15:04:37 +00:00
Jeff Roberson
c77ac1fdee - sched_strict has been dead for a long time now. Get rid of it. 2004-01-25 08:58:14 +00:00
Jeff Roberson
c494ddc8a1 - Clean up KASSERTS. 2004-01-25 08:57:38 +00:00
Jeff Roberson
5a2b158d8d - Correct function names listed in KASSERTs. These were copied from other
code and it was sloppy of me not to adjust these sooner.
2004-01-25 08:21:46 +00:00
Jeff Roberson
e17c57b14b - Implement cpu pinning and binding. This is accomplished by keeping a
   per-cpu run queue that is only used for pinned or bound threads.

Submitted by:	Chris Bradfield <chrisb@ation.org>
2004-01-25 08:00:04 +00:00
Jeff Roberson
d1605f0ac9 - Use a unique string for the sched_setup SYSINIT and rename sched_setup to
synch_setup.  The schedulers use the sched_setup function name.
2004-01-25 07:49:45 +00:00
Jeff Roberson
29bcc4514f - Add a flags parameter to mi_switch. The value of flags may be SW_VOL or
   SW_INVOL.  Assert that one of these is set in mi_switch() and properly
   adjust the rusage statistics.  This is to simplify the large number of
   users of this interface which were previously all required to adjust the
   proper counter prior to calling mi_switch().  This also facilitates more
   switch and locking optimizations.
 - Change all callers of mi_switch() to pass the appropriate parameter and
   remove direct references to the process statistics.
2004-01-25 03:54:52 +00:00
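
A minimal sketch of the idea in the commit above: the switch routine itself, rather than each caller, bumps the voluntary or involuntary context switch counter based on a flag. The flag values and the rusage_sketch structure are assumptions made for this example, and mi_switch_sketch() is a hypothetical stand-in, not the kernel function.

```c
#include <assert.h>

/* Flag names follow the commit message; the values are assumptions. */
#define SW_VOL		0x1	/* voluntary switch */
#define SW_INVOL	0x2	/* involuntary switch */

/* Hypothetical stand-in for the two rusage counters involved. */
struct rusage_sketch {
	long nvcsw;	/* voluntary context switches */
	long nivcsw;	/* involuntary context switches */
};

/*
 * Centralised accounting: callers pass the reason for the switch and
 * no longer touch the counters themselves.
 */
void
mi_switch_sketch(int flags, struct rusage_sketch *ru)
{
	assert((flags & (SW_VOL | SW_INVOL)) != 0);
	if (flags & SW_VOL)
		ru->nvcsw++;
	else
		ru->nivcsw++;
	/* The actual context switch would follow here. */
}
```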
Robert Watson
8dc10be885 Add some basic support for measuring sleep mutex contention to the
mutex profiling code.  As with existing mutex profiling, measurement
is done with respect to mtx_lock() instances in the code, as opposed
to specific mutexes.  In particular, measure two things:

(1) Lock contention.  How often did this mtx_lock() call get made and
    have to sleep (or almost sleep) waiting for the lock.  This helps
    identify the "victims" of contention.

(2) Hold contention.  How often, while the lock was held by a thread
    as a result of this mtx_lock(), did another thread try to acquire
    the same mutex.  This helps identify the causes of contention.

I'm currently exploring adding measurement of "time waited for the
lock", but the current implementation has proven useful to me so far
so I figured I'd commit it so others could try it out.  Note that this
increases the size of mutexes when MUTEX_PROFILING is enabled, so you
might find you need to further bump UMA_BOOT_PAGES.  Fixes welcome.

The once over:	des, others
2004-01-25 01:59:27 +00:00
Poul-Henning Kamp
551260fc36 Deal with MOD_FREQUENCY before MOD_OFFSET because the latter is the
one which runs the actual update.  This fixes a bug where there was
a delay in applying the frequency adjustment.  In extreme cases this
could result in marginal stability of the kernel-pll.
2004-01-24 21:48:43 +00:00
Jeff Roberson
b9509b56fa - Move smp_topology to subr_smp.c so that it is defined on all architectures. 2004-01-24 19:52:48 +00:00
Robert Watson
646e29ccac Don't grab Giant in crfree(), since prison_free() no longer requires it.
The uidinfo code appears to be MPSAFE, and is referenced without Giant
elsewhere.  While this grab of Giant was only made in fairly rare
circumstances (actually GC'ing on refcount==0), grabbing Giant here
potentially introduces lock order issues with any locks held by the
caller.  So this probably won't help performance much unless you change
credentials a lot in an application, and leave a lot of file descriptors
and cached credentials around.  However, it simplifies locking down
consumers of the credential interfaces.

Bumped into by:	sam
Appeased:	tjr
2004-01-23 21:07:52 +00:00
Robert Watson
b3059e09f6 Defer the vrele() on a jail's root vnode reference from prison_free()
to a new prison_complete() task run by a task queue.  This removes
a requirement for grabbing Giant in crfree().  Embed the 'struct task'
in 'struct prison' so that we don't have to allocate memory from
prison_free() (which means we also defer the FREE()).

With this change, I believe grabbing Giant from crfree() can now be
removed, but need to check the uidinfo code paths.

To avoid header pollution, move the definition of 'struct task'
to _task.h, and recursively include from taskqueue.h and jail.h; much
preferable to all files including jail.h picking up a requirement to
include taskqueue.h.

Bumped into by:	sam
Reviewed by:	bde, tjr
2004-01-23 20:44:26 +00:00
Poul-Henning Kamp
ee57aeea65 Write 100 times for tomorrow:
"Always print time_t as %jd, you never know what width it has"
2004-01-22 19:50:06 +00:00
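
The rule in the commit above can be illustrated with a small user-space helper. format_time() is a hypothetical name; the point is only the cast to intmax_t paired with the %jd conversion, which stays correct whatever integer width time_t has on the platform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/*
 * Format a time_t portably: widen it to intmax_t and print it with
 * %jd.  Returns what snprintf() returns, i.e. the number of characters
 * the full result needs, excluding the terminating NUL.
 */
int
format_time(char *buf, size_t len, time_t t)
{
	assert(buf != NULL);
	return (snprintf(buf, len, "%jd", (intmax_t)t));
}
```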
Ralf S. Engelschall
446655ac4f Fix generation of random multicast MAC address.
In case no real/physical IEEE 802 address is available, both the expired
"draft-leach-uuids-guids-01" (section "4. Node IDs when no IEEE 802
network card is available") and RFC 2518 (section "6.4.1 Node Field
Generation Without the IEEE 802 Address") recommend (quoted from RFC
2518):

  "The ideal solution is to obtain a 47 bit cryptographic quality random
  number, and use it as the low 47 bits of the node ID, with the _most_
  significant bit of the first octet of the node ID set to 1. This bit
  is the unicast/multicast bit, which will never be set in IEEE 802
  addresses obtained from network cards; hence, there can never be a
  conflict between UUIDs generated by machines with and without network
  cards."

Unfortunately, this incorrectly explains how to implement this and
the FreeBSD UUID generator code inherited this generation bug from
the broken reference code in the standards draft. They should instead
specify the "_least_ significant bit of the first octet of the node ID"
as the multicast bit in a memory and hexadecimal string representation
of a 48-bit IEEE 802 MAC address.

This standards bug arose from a false interpretation, as the multicast
bit is actually the _most_ significant bit in IEEE 802.3 (Ethernet)
_transmission order_ of an IEEE 802 MAC address. The standards authors
forgot that the bitwise order of an _octet_ from a MAC address _memory_
and hexadecimal string representation is still always from left (MSB,
bit 7) to right (LSB, bit 0).

Fortunately, this UUID generation bug could have occurred on systems
without any Ethernet NICs only.
2004-01-22 13:34:11 +00:00
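
The bit the commit above refers to can be shown in a few lines. This is a sketch with hypothetical helper names, not the FreeBSD UUID generator code: in the in-memory (and hexadecimal string) form of a 48-bit node ID, the multicast bit is the least significant bit of the first octet, mask 0x01, and not 0x80.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NODE_LEN 6	/* a 48-bit IEEE 802 style node ID */

/* Mark a randomly generated node ID as multicast (first octet, bit 0). */
void
node_set_multicast(uint8_t node[NODE_LEN])
{
	assert(node != NULL);
	node[0] |= 0x01;
}

/* Return nonzero if the node ID has the multicast bit set. */
int
node_is_multicast(const uint8_t node[NODE_LEN])
{
	return ((node[0] & 0x01) != 0);
}
```

Setting 0x80 instead, as the quoted text suggests, leaves the multicast bit clear, so such an ID could in principle collide with a real unicast address.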
Poul-Henning Kamp
4e74721cac Add a sysctl (default: off) which enables a log(LOG_INFO...) warning
if the clock is stepped.
2004-01-21 21:05:40 +00:00
Robert Watson
679365e7b9 Reduce gratuitous includes: don't include jail.h if it's not needed.
Presumably, at some point, you had to include jail.h if you included
proc.h, but that is no longer required.

Result of:	self injury involving adding something to struct prison
2004-01-21 17:10:47 +00:00
Andrey A. Chernov
9bbee25931 pread/pwrite:
follow lseek spirit - return EINVAL on negative offset for non-VCHR
2004-01-20 01:27:42 +00:00
Poul-Henning Kamp
50d23be140 Add linenumber and source filename to panic(9) output.
Ideally a traceback should be printed too, any takers ?
2004-01-19 21:27:11 +00:00
Alexander Kabaev
54556cc7b8 One more instance of magic number used in place of IO_SEQSHIFT.
Submitted by:	alc
2004-01-19 20:45:43 +00:00
Ruslan Ermilov
0541040c46 Since "m" is not part of the "mp" chain, need to free() it.
Reported by:	Stanford Metacompilation research group
2004-01-18 14:02:53 +00:00
Andrew Gallatin
1c318b9665 Handle sf_buf_alloc() returning null. This can happen if the
process takes a signal while waiting for an sf_buf to become available.

Reviewed by: alc
2004-01-17 21:16:51 +00:00
Dag-Erling Smørgrav
a6d4491c71 Restore correct semantics for F_DUPFD fcntl. This should fix the errors
people have been getting with configure scripts.
2004-01-17 00:59:04 +00:00
Dag-Erling Smørgrav
56a9fc0e93 WITNESS won't let us hold two filedesc locks at the same time, so juggle
fdp and newfdp around a bit.
2004-01-16 21:54:56 +00:00
Robert Watson
bafc8f255a KASSERT() that initproc->p_pid is 1. Very bad things happen if init's
pid isn't 1, and it can actually occur if kthread_create() is called
before SUB_SI_CREATE_INIT without RFHIGHPID.

Discussed with:	jhb
2004-01-16 20:29:23 +00:00
Dag-Erling Smørgrav
ddce426f69 Remove two KASSERTs which were overly paranoid. 2004-01-16 08:45:56 +00:00
Dag-Erling Smørgrav
12d568c2b1 Take care to drop locks when calling malloc() 2004-01-15 18:50:11 +00:00
Dag-Erling Smørgrav
a2fe44e8cf New file descriptor allocation code, derived from similar code introduced
in OpenBSD by Niels Provos.  The patch introduces a bitmap of allocated
file descriptors which is used to locate available descriptors when a new
one is needed.  It also moves the task of growing the file descriptor table
out of fdalloc(), reducing complexity in both fdalloc() and do_dup().

Debts of gratitude are owed to tjr@ (who provided the original patch on
which this work is based), grog@ (for the gdb(4) man page) and rwatson@
(for assistance with pxeboot(8)).
2004-01-15 10:15:04 +00:00
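
The bitmap approach mentioned in the commit above can be sketched as follows. The type and function names (slot_t, fd_first_free(), fd_mark_used(), fd_mark_unused()) are hypothetical and do not reflect the names used in the actual patch; the sketch scans bit by bit for clarity, whereas a real implementation would likely skip fully allocated words.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

typedef unsigned long slot_t;			/* one word of the bitmap */
#define SLOT_BITS (sizeof(slot_t) * CHAR_BIT)	/* descriptors per word */

/* Mark descriptor fd as allocated. */
void
fd_mark_used(slot_t *map, int fd)
{
	map[(size_t)fd / SLOT_BITS] |= (slot_t)1 << ((size_t)fd % SLOT_BITS);
}

/* Mark descriptor fd as free again. */
void
fd_mark_unused(slot_t *map, int fd)
{
	map[(size_t)fd / SLOT_BITS] &= ~((slot_t)1 << ((size_t)fd % SLOT_BITS));
}

/*
 * Return the lowest free descriptor >= minfd, or -1 if every
 * descriptor below nfds is in use.  The caller would then grow the
 * table, which is the step the commit moves out of fdalloc().
 */
int
fd_first_free(const slot_t *map, int nfds, int minfd)
{
	assert(map != NULL && minfd >= 0);
	for (int fd = minfd; fd < nfds; fd++) {
		slot_t bit = (slot_t)1 << ((size_t)fd % SLOT_BITS);

		if ((map[(size_t)fd / SLOT_BITS] & bit) == 0)
			return (fd);
	}
	return (-1);
}
```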
Don Lewis
288e351b55 If a device attach routine fails during boot and calls bus_teardown_intr(),
ithread_remove_handler() may fail to remove the interrupt handler if
it decides to let the ithread do the removal.  The problem is that during
boot "cold" is set, which causes msleep() to return immediately.  This
will cause ithread_remove_handler() to fail to wait for the ithread
to do the removal from the handler TAILQ before freeing the handler
back to the heap.  Bad things will happen when some other user of the
TAILQ, such as ithread_add_handler() or the actual ithread attempts to use
the freed handler.  Fix the problem by forcing ithread_remove_handler()
to do the actual removal itself if the "cold" flag is set.

Reviewed by:	jhb
2004-01-13 22:55:46 +00:00
Dag-Erling Smørgrav
ac34dc4e79 Back out 1.160, which was committed by mistake. 2004-01-11 20:08:57 +00:00
Dag-Erling Smørgrav
d7a1c7e34b Back out 1.166, which was committed by mistake. 2004-01-11 20:07:15 +00:00
Dag-Erling Smørgrav
f1ea6d813d Mechanical whitespace cleanup + other minor style nits. 2004-01-11 19:56:42 +00:00
Dag-Erling Smørgrav
0e5dfade00 Mechanical whitespace cleanup. 2004-01-11 19:54:45 +00:00
Dag-Erling Smørgrav
05c3c5c8b6 Mechanical whitespace cleanup; parenthesize return values; other minor
style nits.  The #ifdefs in this file give me a headache...
2004-01-11 19:52:10 +00:00
Dag-Erling Smørgrav
e5aeaa0c67 Mechanical whitespace cleanup; parenthesize return values; other minor
style nits.
2004-01-11 19:48:19 +00:00
Dag-Erling Smørgrav
012b5531f4 Mechanical whitespace cleanup + minor style nits. 2004-01-11 19:43:14 +00:00
Dag-Erling Smørgrav
c9de31f55f Mechanical whitespace cleanup. 2004-01-11 19:39:14 +00:00
Alan Cox
0e88a71798 Remove long dead code, specifically, code related to munmapfd().
(See also vm/vm_mmap.c revision 1.173.)
2004-01-11 06:59:21 +00:00
Robert Watson
def055686c When not creating a core dump due to resource limits specifying
a maximum dump size of 0, return a size-related error, rather
than returning success.  Otherwise, waitpid() will incorrectly
return a status indicating that a core dump was created.  Note
that the specific error doesn't actually matter, since it's lost.

MFC after:	2 weeks
PR:		60367
Submitted by:	Valentin Nechayev <netch@netch.kiev.ua>
2004-01-11 02:28:06 +00:00
Jens Schweikhardt
85495c72ff s/Muliple/Multiple
Removed whitespace at EOL and EOF.
2004-01-10 18:34:01 +00:00
Dag-Erling Smørgrav
d41457da80 More unparenthesized return values. 2004-01-10 17:14:53 +00:00
Dag-Erling Smørgrav
b91a599717 Style: parenthesize return values. 2004-01-10 13:03:43 +00:00
Don Lewis
2b77864f1e Add a somewhat redundant check on the len argument to getsockaddr() to
avoid relying on the minimum memory allocation size to avoid problems.
The check is somewhat redundant because the consumers of the returned
structure will check that sa_len is a protocol-specific larger size.

Submitted by:	Matthew Dillon <dillon@apollo.backplane.com>
Reviewed by:	nectar
MFC after:	30 days
2004-01-10 08:28:54 +00:00
Olivier Houchard
5cded90454 Prevent a race condition between fork1() and whatever changes the pgrp by
setting the new process' p_pgrp again before inserting it in the p_pglist.
Without it we can get the new process to be inserted in a different p_pglist
than the one p2->p_pgrp points to, and this is not something we want to happen.
This is not a fix, merely a bandaid, but it will work until someone finds a
better way to do it.

Discussed with: 	jhb (a long time ago)
2004-01-09 23:42:36 +00:00
Robert Watson
07eacae0d2 Improve the expressiveness of ttyinfo (^T) when dealing with threads
in slightly less usual states:

  If the thread is on a run queue, display "running" if the thread is
  actually running, otherwise, "runnable".

  If the thread is sleeping, and it's on a sleep queue, display the
  name of the queue, otherwise "unknown" -- previously, in this situation
  we would display "iowait".

  If the thread is waiting on a lock, display *lockname.

  If the thread is suspended, display "suspended" -- previously, in
  this situation we would display "iowait".

  If the thread is waiting for an interrupt, display "intrwait" --
  previously, in this situation we would display "iowait".

  If the thread is in a state not handled by the above, display
  "unknown" -- previously, we would print "iowait".

Among other things, this avoids displaying "iowait" when the foreground
process turns out to be suspended waiting for a debugger to properly
attach.
2004-01-08 22:49:23 +00:00
Robert Watson
047aa39b25 Drop the sigacts mutex around calls to stopevent() to avoid sleeping
holding the mutex.  Because the sigacts pointer can't change while
the process is "live" (proc locking (x)), we know our pointer is still
valid.

In communication with:	truckman
Reviewed by:		jhb
2004-01-08 22:44:54 +00:00
Alexander Kabaev
c969c60c60 Add pid to the info printed in lockmgr_printinfo. This makes VFS
diagnostic messages slightly more useful.
2004-01-06 04:34:13 +00:00
Alexander Kabaev
580ddfa64b More style fixes.
Obtained from:	bde
2004-01-05 23:40:46 +00:00
John Baldwin
eac097962f - Allow mtx_trylock() to recurse on a recursive mutex. Attempts to recurse
on a non-recursive mutex will fail but will not trigger any assertions.
- Add an assertion to mtx_lock() that one never recurses on a non-recursive
  mutex.  This is mostly useful for the non-WITNESS case.

Requested by:	deischen, julian, others (1)
2004-01-05 23:09:51 +00:00
Alexander Kabaev
b0fdf71656 style(9):
Add empty line before first code line in functions with no local
variables.
Properly terminate comment sentences.
Indent lines which are longer than 80 characters.
Move v_addpollinfo closer to the rest of poll-related functions.
Move DEBUG_VFS_LOCKS ifdefed block to the end of file.

Obtained from:	bde (partly)
2004-01-05 19:04:29 +00:00
Alexander Kabaev
3ff1b7c23f Cosmetics: strip '\n' from a string passed to Debugger(). 2004-01-04 03:42:20 +00:00
David Xu
a30ec4b99c Make sigaltstack per-thread, because per-process sigaltstack state
is useless for threaded programs: multiple threads cannot share the same
stack.
The alternate signal stack is private to a thread, so no lock is needed.
The original P_ALTSTACK is now moved into td_pflags and renamed to
TDP_ALTSTACK.
For single-threaded or Linux clone() based threaded programs, there is no
semantic change, because those programs have only one kernel thread
in every process.

Reviewed by: deischen, dfr
2004-01-03 02:02:26 +00:00
Nate Lawson
44bb5f52d3 Move the kernel power change printf under bootverbose since the
power_profile script now duplicates the message via syslog.
2004-01-02 18:24:13 +00:00
Sam Leffler
4f9f9cf3a4 m_tag fixups in preparation for heavier use:
o promote several m_tag_* routines to inline
o add an m_tag_setup inline to set the fixed fields in a packet tag
o add an m_tag_free method pointer to each mtag to support, for example,
  allocating tags from zones
o have m_tag_find check if the tag list is not empty before calling
  m_tag_locate to search

Reviewed by:	brooks, silence from others
2004-01-02 17:27:39 +00:00
David Malone
70ad6c2190 Plug a leak of open files that happens when you exec a suid program
with one of std{in,out,err} open. This helps with the file descriptor
leaks reported on -current. This should probably be merged into 5.2.

Reviewed by:	ru
Tested by:	Bjoern A. Zeeb <bzeeb-lists@lists.zabbadoz.net>
2003-12-28 19:27:14 +00:00
Bruce Evans
9efe7d9d83 v_vxproc was a bogus name for a thread (pointer). 2003-12-28 09:12:56 +00:00
Mike Silbersack
ddeb5b242e Track three new sendfile-related statistics:
- The number of times sendfile had to do disk I/O
- The number of times sfbuf allocation failed
- The number of times sfbuf allocation had to wait
2003-12-28 08:57:09 +00:00
Bruce Evans
d6c847f378 Fixed some style bugs (mainly, try to always use explicit comparisons with
NULL when checking for null pointers).
2003-12-28 04:37:59 +00:00
Bruce Evans
ca46e90ef4 Fixed some disordering in revs.1.194 and 1.196. Moved the execve() syscall
function back to near the beginning of the file.  Rev.1.194 moved it into
the middle of auxiliary functions following kern_execve().  Moved the
__mac_execve() syscall function up together with execve().  It was new in
rev.1.196 and perfectly misplaced after execve().
2003-12-28 04:18:13 +00:00
Mike Silbersack
69fba1650a Fix the maxpipekva warning message so that it points to the correct
sysctl, and shorten the message.

Noticed by:	bde
2003-12-28 01:19:58 +00:00
Alan Cox
34d2675761 Remove GIANT_REQUIRED from exec_unmap_first_page(). 2003-12-27 19:40:03 +00:00
Mike Silbersack
5eda9873e9 Track current and peak sfbuf usage, export the values via sysctl. 2003-12-27 07:52:47 +00:00
John Baldwin
c55bbb6cb7 Create a separate kthread that executes sched_cpu() once a second. Because
sched_cpu() locks an sx lock (allproc_lock) which can sleep if it fails to
acquire the lock, it is not safe to execute this in a callout handler from
softclock().
2003-12-26 17:07:29 +00:00
Alfred Perlstein
866e3b7e73 Put restrict back in, the compilation failure was my fault when I
did a bad merge from the PR.

Thanks to Bruce Evans for explaining.
2003-12-26 05:58:16 +00:00
Alfred Perlstein
4abb4ff34d Add __restrict qualifiers to copyinfrom, copyinstrfrom, copystr, copyinstr,
copyin and copyout.
2003-12-26 05:54:35 +00:00
David Malone
9322078275 In socket(2) we only need Giant around the call to socreate, so just
grab it there.
2003-12-25 23:44:38 +00:00
David Malone
1c58509c25 Don't TAILQ_INIT kq_head twice, once is enough. 2003-12-25 23:42:36 +00:00
Mike Silbersack
8dee2f6746 Fix another 0 / NULL mixup. 2003-12-25 01:17:27 +00:00
Alfred Perlstein
6502da1307 We're not ready for restrict qualifiers here. 2003-12-24 19:09:45 +00:00
Alfred Perlstein
9f144cff85 Add restrict qualifiers.
PR: 44394
Submitted by: Craig Rodrigues <rodrige@attbi.com>
2003-12-24 18:47:43 +00:00
Robert Watson
69546b2fbb Document that when we are addressing an open()/close() race, the reason
we call vn_close() manually rather than letting fdrop() take care of it
is that we haven't yet hooked up the various 'struct file' fields.
2003-12-24 17:13:01 +00:00
Alfred Perlstein
1805ed0772 Introduce mp_maxcpus which can be used by libkvm utils to find out
how many CPUs the system was compiled for.
Export the variable via a sysctl node 'kern.smp.maxcpus' as well.
2003-12-23 13:54:16 +00:00
Peter Wemm
2c74309622 Regen - this should be essentially a NOP, except for rcsid changes. 2003-12-23 03:52:14 +00:00
Peter Wemm
eec525a435 Remove namespc column and attempt to un-fold some of the longer lines
that now fit.
2003-12-23 03:51:36 +00:00
Peter Wemm
1a58b07149 Remove the namespace column from the syscalls tables. We don't actually
use it, if we ever did.  They have been VERY poorly maintained for
some time, possibly because they were a NOP.  FWIW, This brings our table
formats back closer to the other *BSD's.
2003-12-23 03:50:43 +00:00
Peter Wemm
9b68618df0 Add an additional field to the elf brandinfo structure to support
quicker exec-time replacement of the elf interpreter on an emulation
environment where an entire /compat/* tree isn't really warranted.
2003-12-23 02:42:39 +00:00
Peter Wemm
a89ec05e3e Catch a few places where NULL (pointer) was used where 0 (integer) was
expected.
2003-12-23 02:36:43 +00:00
Peter Wemm
55cdddc0d8 Don't use NULL (pointer) when we mean 0 (integer) for the number of ticks
in msleep.
2003-12-23 02:28:42 +00:00
Jeff Roberson
249e0bea8f - Make our transfer decisions based on load and not transferable load. A
cpu could have been bogged down with non-transferable load and still not
   migrated a new thread to an idle cpu.  This required some benchmarking and
   tuning to get right as the comment above it suggests.
2003-12-20 22:35:20 +00:00
Jeff Roberson
e7a976f415 - Enable ithread migration on x86. This is done to work around a bug in the
IO APIC on Xeons that prevents round-robin interrupt assignment from
   working.
2003-12-20 20:36:19 +00:00
Alan Cox
96a7b42213 Remove a variable that has been initialized but otherwise unused since
revision 1.315.
2003-12-20 19:46:21 +00:00
Jeff Roberson
670c524f08 - In kseq_transfer() return if smp has not been started.
- In sched_add(), do the idle check prior to the transfer check so that we
   don't try to transfer load from an idle cpu.  This fixes panics caused by
   IPIs on UP machines running SMP kernels.

Reported/Debugged by:	seanc
2003-12-20 14:03:14 +00:00
Jeff Roberson
9b5f6f623d - Running interactive tasks with the minimum time-slice is fine for vi and
sh, but not so great for mozilla, X, etc.  Add a fixed define for the slice
   size granted to interactive KSEs.
2003-12-20 12:54:35 +00:00
Tim J. Robbins
f5925b7436 Reduce the overhead of semop() by using the kernel stack instead of
malloc'd memory to store the operations array if it is small enough
to fit.
2003-12-19 13:07:17 +00:00
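The stack-versus-malloc choice described in f5925b7436 can be sketched in userland C. This is only an illustration under stated assumptions, not the kernel's semop() code: `sk_op`, `sk_sum_ops()` and the threshold `SK_SMALL_OPS` are hypothetical names and values chosen for the example.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define SK_SMALL_OPS	8	/* assumed threshold, not the kernel's value */

/* Illustrative operation record, loosely modelled on struct sembuf. */
struct sk_op {
	unsigned short	num;
	short		op;
	short		flg;
};

/*
 * Work on a private copy of the caller's operations array.  A small
 * array is copied into a buffer on the stack; only a larger one pays
 * for a heap allocation.  Returns 0 on success or ENOMEM.
 */
static int
sk_sum_ops(const struct sk_op *src, size_t nops, long *sum)
{
	struct sk_op small[SK_SMALL_OPS];
	struct sk_op *ops = small;
	size_t i;

	if (nops > SK_SMALL_OPS) {
		ops = malloc(nops * sizeof(*ops));
		if (ops == NULL)
			return (ENOMEM);
	}
	memcpy(ops, src, nops * sizeof(*ops));

	*sum = 0;
	for (i = 0; i < nops; i++)
		*sum += ops[i].op;

	/* Free only what was actually allocated. */
	if (ops != small)
		free(ops);
	return (0);
}
```

The common case (a handful of operations) then never touches the allocator; the cost is a fixed amount of stack, which is why the threshold has to stay small.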
John Baldwin
eb5b0e0565 Various style fixes.
Submitted by:	bde (mostly, if not all)
2003-12-17 21:13:04 +00:00
Jeff Roberson
958557e9c7 - In vget() if LK_NOWAIT is specified we should return EBUSY and not ENOENT.
Submitted by:	Stephan Uphoff <ups@stups.com>
2003-12-16 17:08:27 +00:00
Jeff Roberson
d85213669b - When doing a forced unmount, VFS attempts to keep VCHR vnodes valid by
reassigning their v_ops field to specfs, detaching from the mountpoint, etc.
   However, this is not sufficient.  If we vclean() the vnode the pages owned
   by the vnode are lost, potentially while buffers reference them.  Implement
   parts of vclean() separately in vgonechrl() so that the pages and bufs
   associated with a device vnode are not destroyed while in use.
2003-12-16 17:05:05 +00:00
Bruce M Simpson
5406529771 style(9) pass and type fixups.
Submitted by:	bde
2003-12-16 14:13:47 +00:00
Bruce M Simpson
37621fd5d9 Push m_apply() and m_getptr() up into the collection of standard mbuf
routines, and purge them from opencrypto.

Reviewed by:	sam
Obtained from:	NetBSD
Sponsored by:	spc.org
2003-12-15 21:49:41 +00:00
Jeff Roberson
86e1c22aa4 - Assign the ke_cpu field in kseq_notify() so that all of our callers do not
have to do it.
 - Set the ke_runq to NULL in sched_add() before calling kseq_notify().
   Otherwise we may panic in sched_add() if INVARIANTS is on.
2003-12-14 02:06:29 +00:00
Robert Watson
09a4a69c1d Although sometimes to the uninitiated, it may seem like goup, KSEGOUP
is actually spelt KSEGROUP.  Go figure.

Reported by:	samy@kerneled.com
2003-12-12 21:25:56 +00:00
Jeff Roberson
cac77d0422 - Now that we have kseq groups, balance them separately.
- The new sched_balance_groups() function does intra-group balancing while
   sched_balance() balances the available groups.
 - Pick a random time between 0 ticks and hz * 2 ticks to restart each
   balancing process.  Each balancer has its own timeout.
 - Pick a random place in the list of groups to start the search for lowest
   and highest group loads.  This prevents us from preferring a group based on
   numeric position.
 - Use a nasty hack to stop us from preferring cpu 0.  The problem is that
   softclock always runs on cpu 0, so it always has a little extra load.  We
   ignore this load in the balancer for now.  In the future softclock should
   run on a random cpu and these hacks can go away.
2003-12-12 07:33:51 +00:00
Jeff Roberson
2e227f0406 - Don't let the pctcpu rate limiter throttle us if we have recorded over
SCHED_CPU_TICKS ticks.  This was allowing processes to display
   (1/SCHED_CPU_TIME * 100) % more cpu than they had used.
2003-12-11 04:23:39 +00:00
Jeff Roberson
b11fdad0fc - In sched_switch(), if a thread has been assigned, don't touch the runqueues
or load.  These things have already been taken care of in sched_bind()
   which should be the only place that we're switching in an assigned thread.
2003-12-11 04:00:49 +00:00
Jeff Roberson
80f86c9f88 - Add support for CPU groups to ule. All SMT cores on the same physical
cpu are added to a group.
 - Don't place a cpu into the kseq_idle bitmask until all cpus in that group
   have idled.
 - Prefer idle groups over idle group members in the new kseq_transfer()
   function.  In this way we will prefer to balance load across full cores
   rather than add further load to a partial core.
 - Before a cpu goes idle, check the other group members for threads.  Since
   SMT cpus may freely share threads, this is cheap.
 - SMT cores may be individually pinned and bound to now.  This contrasts the
   old mechanism where binding or pinning would have allowed a thread to run
   on any available cpu.
 - Remove some unnecessary logic from sched_switch().  Priority propagation
   should be properly taken care of in sched_prio() now.
2003-12-11 03:57:10 +00:00
Peter Wemm
5be4b10c89 Regen 2003-12-10 22:18:54 +00:00
Peter Wemm
5352eb6bb1 Update file locations for syscall tables to copy to. 2003-12-10 22:08:37 +00:00
Marcel Moolenaar
ccb46feb8e Write the thread pointer (val) in the kse mailbox (loc) before we
set the new context in kse_switchin(2). This allows us to return
an error to the calling context when the suword() fails.
2003-12-10 01:59:23 +00:00
John Baldwin
67ba867827 Adjust an assertion for the TDF_TSNOBLOCK race handling in
turnstile_unpend().  A racing thread that does not have TDI_LOCK set may
either be running on another CPU or it may be sitting on a run queue if it
was preempted during the very small window in turnstile_wait() between
unlocking the turnstile chain lock and locking sched_lock.
2003-12-09 21:14:31 +00:00
John Baldwin
da1d503b22 Assert that we never give a thread a NULL turnstile when waking it up. 2003-12-09 21:09:54 +00:00
John Baldwin
6b6bd95ee5 Revert the previous race fix and replace it with a more general fix. The
case of a turnstile having no threads is just one instance of the more
general case where the thread we are examining has been partially awakened
already in that it has been removed from the turnstile's blocked list but
still has TDI_LOCK set.  We detect that case by checking to see if the
thread has already had a turnstile reassigned to it.
2003-12-09 21:09:04 +00:00
David Xu
a9a48d6862 Lock and unlock sched_lock when walking through the thread list. Currently
we insert the kse upcall thread into the thread list at mi_switch time, so
the process lock is not enough.
2003-12-07 23:47:15 +00:00
Don Lewis
50105bcf1a Pass MTX_DEF as the last argument to mtx_init() instead of 0. This
is not a functional change.  The code happened to work properly only
because MTX_DEF is defined as 0.
2003-12-07 21:53:41 +00:00
Poul-Henning Kamp
377e7be416 Make the DIAGNOSTIC code which complains about long {call|time}out(9)
functions less noisy:  We printf if a new function took longer than
the previous record holder, or if the previous record holder took
more than twice as long as the current record.
2003-12-07 20:03:28 +00:00
Marcel Moolenaar
cfa4b1e7b1 Regen due to kse_switchin(2). 2003-12-07 19:36:16 +00:00
Marcel Moolenaar
702b2a179c Add kse_switchin(2). This syscall can be used by KSE implementations
to have the kernel switch to a new thread, instead of doing it in
userland. It is in fact needed on ia64 where syscall restarts do not
return to userland first. It's completely handled inside the kernel.
As such, any context created by the kernel as part of an upcall and
caused by some syscall needs to be restored by the kernel.
2003-12-07 19:34:29 +00:00
Peter Wemm
a2640c9ba9 rqb_bits[] may be an int64_t (eg: on alpha, and recently on amd64).
Be sure to shift (long)1 << 33 and higher, not (int)1.  Otherwise bad
things happen(TM).  This is why beast.freebsd.org panicked with ULE.

Reviewed by:  jeff
2003-12-07 09:57:51 +00:00
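The shift-width problem fixed in a2640c9ba9 is easy to reproduce outside the kernel. The following is a minimal sketch, not the run-queue code itself: `sk_rqb_word_t` and `sk_rqb_bit()` are hypothetical names, and an unsigned 64-bit word is assumed so that bit 63 is also well defined.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 64-bit run-queue status word. */
typedef uint64_t sk_rqb_word_t;

/*
 * Build the mask for bit n (0..63).  The constant is widened to the
 * word type BEFORE the shift.  Writing (int)1 << 33 instead shifts a
 * 32-bit value by more than its width, which is undefined behaviour
 * in C and in practice does not set the intended bit.
 */
static sk_rqb_word_t
sk_rqb_bit(unsigned n)
{
	return ((sk_rqb_word_t)1 << n);
}
```

The cast has to be applied to the operand being shifted; casting the result of `1 << n` afterwards is too late, because the damage is done in the narrower type.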
Scott Long
774114995e Re-arrange and consolidate some random debugging stuff 2003-12-07 05:04:49 +00:00
Alan Cox
bca62663ab - Giant is no longer required by vm_thread_new(). 2003-12-07 04:16:49 +00:00
Robert Watson
56d9e93207 Rename mac_create_cred() MAC Framework entry point to mac_copy_cred(),
and the mpo_create_cred() MAC policy entry point to
mpo_copy_cred_label().  This is more consistent with similar entry
points for creation and label copying, as mac_create_cred() was
called from crdup() as opposed to during process creation.  For
a number of policies, this removes the requirement for special
handling when copying credential labels, and improves consistency.

Approved by:	re (scottl)
Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-12-06 21:48:03 +00:00
John Baldwin
b6c71225a9 Fix all users of mp_maxid to use the same semantics, namely:
1) mp_maxid is a valid FreeBSD CPU ID in the range 0 .. MAXCPU - 1.
2) For all active CPUs in the system, PCPU_GET(cpuid) <= mp_maxid.

Approved by:	re (scottl)
Tested on:	i386, amd64, alpha
2003-12-03 14:57:26 +00:00
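The two rules stated in b6c71225a9 imply that a loop over CPU IDs must use an inclusive upper bound. A small sketch of that convention follows; the `sk_` variables and the `SK_CPU_ABSENT()` macro are stand-ins invented for this example, not the kernel's mp_maxid, all_cpus or CPU_ABSENT().

```c
#define SK_MAXCPU	32		/* hypothetical compile-time limit */

static int	sk_mp_maxid;		/* highest valid CPU ID, 0 .. SK_MAXCPU - 1 */
static unsigned	sk_all_cpus;		/* bit i set if CPU i is present */

#define SK_CPU_ABSENT(id)	((sk_all_cpus & (1u << (id))) == 0)

/*
 * Count the CPUs that are present.  Because sk_mp_maxid is itself a
 * valid ID, the loop bound is "<=", not "<".  IDs need not be dense,
 * so each one is checked before it is used.
 */
static int
sk_count_cpus(void)
{
	int id, n = 0;

	for (id = 0; id <= sk_mp_maxid; id++)
		if (!SK_CPU_ABSENT(id))
			n++;
	return (n);
}
```

With CPUs 0 and 3 present and the maximum ID set to 3, a loop written with `<` would miss CPU 3 and report one CPU instead of two.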
John Baldwin
45c1c90f6a Export a few SMP related symbols in UP kernels as well. This is needed to
aid other kernel code, especially code which can be in a module such as
the acpi_cpu(4) driver, to work properly with both SMP and UP kernels.
The exported symbols include mp_ncpus, all_cpus, mp_maxid, smp_started, and
the smp_rendezvous() function.  This also means that CPU_ABSENT() is now
always implemented the same on all kernels.

Approved by:	re (scottl)
2003-12-03 14:55:31 +00:00
David Greenman
186e347f2c Fixed a bug in sendfile(2) where the sent data would be corrupted due
to sendfile(2) being erroneously automatically restarted after a signal
is delivered. Fixed by converting ERESTART to EINTR prior to exiting.

Updated manual page to indicate the potential EINTR error, its cause
and consequences.

Approved by: re@freebsd.org
2003-12-01 22:12:50 +00:00
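The error conversion described in 186e347f2c amounts to a one-line fixup on the way out of the call. The sketch below is an assumption-laden illustration rather than the sendfile(2) code: `SK_ERESTART` is a placeholder for the kernel-internal "restart this syscall" code (its value here is assumed), and `sk_fixup_error()` is a hypothetical helper.

```c
#include <errno.h>

/* Placeholder for the kernel-internal restart code; value assumed. */
#define SK_ERESTART	(-1)

/*
 * Map "restart the syscall" to "interrupted".  A call that has already
 * transferred part of its data must not be silently re-run from the
 * start, so the caller is told about the interruption instead.
 */
static int
sk_fixup_error(int error)
{
	if (error == SK_ERESTART)
		error = EINTR;
	return (error);
}
```

All other error values, including success, pass through unchanged, so the fixup can be applied unconditionally just before returning.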
Ian Dowse
25cb5d7a6b In dounmount(), only call checkdirs() prior to VFS_UNMOUNT() in the
forced unmount case. Otherwise, a file system that is referenced
only by process fd_cdir/fd_rdir references to the file system root
vnode will be successfully unmounted without the MNT_FORCE flag.

The previous behaviour was not compatible with the unmount semantics
required by amd(8), so file systems could be unexpectedly unmounted
while there were still references to the file system root directory.

Reported by:	Erez Zadok <ezk@cs.sunysb.edu>
Approved by:	re (scottl)
2003-11-30 23:30:09 +00:00
Jeff Roberson
a6c6a93c89 - Don't forget to unlock the vnode interlock in the LK_NOWAIT case.
Submitted by:	Stephan Uphoff <ups@stups.com>
Approved by:	re (rwatson)
2003-11-30 22:09:58 +00:00
Alexander Kabaev
97c43a540a Do not attempt to destroy NULL vfs options list.
Approved by: re (scottl)
Reported by: Christian Laursen <xi atborderworlds dot dk>
2003-11-23 17:13:48 +00:00
John Baldwin
798a45964d - Split cpu_mp_probe() into two parts. cpu_mp_setmaxid() is still called
very early (SI_SUB_TUNABLES - 1) and is responsible for setting mp_maxid.
  cpu_mp_probe() is now called at SI_SUB_CPU and determines if SMP is
  actually present and sets mp_ncpus and all_cpus.  Splitting these up
  allows an architecture to probe CPUs later than SI_SUB_TUNABLES by just
  setting mp_maxid to MAXCPU in cpu_mp_setmaxid().  This could allow the
  CPU probing code to live in a module, for example, since
  sysinit's in modules cannot be invoked prior to SI_SUB_KLD.  This is
  needed to re-enable the ACPI module on i386.
- For the alpha SMP probing code, use LOCATE_PCS() instead of duplicating
  its contents in a few places.  Also, add a smp_cpu_enabled() function
  to avoid duplicating some code.  There is room for further code
  reduction later since much of this code is also present in cpu_mp_start().
- All archs besides i386 still set mp_maxid to the same values they set it
  to before this change.  i386 now sets mp_maxid to MAXCPU.

Tested on:	alpha, amd64, i386, ia64, sparc64
Approved by:	re (scottl)
2003-11-21 22:23:26 +00:00
Mark Murray
4e3a7a14d9 Fix a major faux pas of mine. I was causing 2 very bad things to
happen in interrupt context; 1) sleep locks, and 2) malloc/free
calls.

1) is fixed by using spin locks instead.

2) is fixed by preallocating a FIFO (implemented with a STAILQ)
   and using elements from this FIFO instead. This turns out
   to be rather fast.

OK'ed by:	re (scottl)
Thanks to:	peter, jhb, rwatson, jake
Apologies to:	*
2003-11-20 15:35:48 +00:00
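The preallocated FIFO mentioned in 4e3a7a14d9 can be sketched with the STAILQ macros from <sys/queue.h>. This is a userland illustration only: the `sk_` names and the pool size are invented for the example, and the spin locks that the commit uses to protect the queues are deliberately left out.

```c
#include <stddef.h>
#include <sys/queue.h>

#define SK_POOL_SIZE	4	/* assumed pool size for the sketch */

struct sk_event {
	int			data;
	STAILQ_ENTRY(sk_event)	link;
};

/* All storage is reserved up front; nothing is allocated later. */
static struct sk_event sk_pool[SK_POOL_SIZE];
static STAILQ_HEAD(sk_q, sk_event) sk_free = STAILQ_HEAD_INITIALIZER(sk_free);
static struct sk_q sk_ready = STAILQ_HEAD_INITIALIZER(sk_ready);

/* Put every preallocated element on the free list. */
static void
sk_fifo_init(void)
{
	size_t i;

	STAILQ_INIT(&sk_free);
	STAILQ_INIT(&sk_ready);
	for (i = 0; i < SK_POOL_SIZE; i++)
		STAILQ_INSERT_TAIL(&sk_free, &sk_pool[i], link);
}

/* Producer side: take a free element, fill it, queue it.  -1 if full. */
static int
sk_fifo_put(int data)
{
	struct sk_event *e = STAILQ_FIRST(&sk_free);

	if (e == NULL)
		return (-1);	/* pool exhausted: drop instead of allocating */
	STAILQ_REMOVE_HEAD(&sk_free, link);
	e->data = data;
	STAILQ_INSERT_TAIL(&sk_ready, e, link);
	return (0);
}

/* Consumer side: dequeue the oldest element and recycle it.  -1 if empty. */
static int
sk_fifo_get(int *data)
{
	struct sk_event *e = STAILQ_FIRST(&sk_ready);

	if (e == NULL)
		return (-1);
	STAILQ_REMOVE_HEAD(&sk_ready, link);
	*data = e->data;
	STAILQ_INSERT_TAIL(&sk_free, e, link);
	return (0);
}
```

Because elements only ever move between the two lists, the producer path never calls an allocator, which is the property needed in interrupt context.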
Mark Murray
3fed54aaaa Hackfix to patch around a kernel panic I introduced. Real fix to
follow. In the meanwhile, we are not harvesting interrupt entropy.

Approved by:	re (jhb)
2003-11-18 14:35:43 +00:00
Robert Watson
a557af222b Introduce a MAC label reference in 'struct inpcb', which caches
the MAC label referenced from 'struct socket' in the IPv4 and
IPv6-based protocols.  This permits MAC labels to be checked during
network delivery operations without dereferencing inp->inp_socket
to get to so->so_label, which will eventually avoid our having to
grab the socket lock during delivery at the network layer.

This change introduces 'struct inpcb' as a labeled object to the
MAC Framework, along with the normal circus of entry points:
initialization, creation from socket, destruction, as well as a
delivery access control check.

For most policies, the inpcb label will simply be a cache of the
socket label, so a new protocol switch method is introduced,
pr_sosetlabel() to notify protocols that the socket layer label
has been updated so that the cache can be updated while holding
appropriate locks.  Most protocols implement this using
pru_sosetlabel_null(), but IPv4/IPv6 protocols using inpcbs use
the worker function in_pcbsosetlabel(), which calls into the
MAC Framework to perform a cache update.

Biba, LOMAC, and MLS implement these entry points, as do the stub
policy, and test policy.

Reviewed by:	sam, bms
Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-11-18 00:39:07 +00:00
Robert Watson
64d19c2ea7 Add a sysctl, security.bsd.see_other_gids, similar in semantics
to see_other_uids but with the logical conversion.  This is based
on (but not identical to) the patch submitted by Samy Al Bahra.

Submitted by:	Samy Al Bahra <samy@kerneled.com>
2003-11-17 20:20:53 +00:00
Peter Wemm
0d2a298904 Initial landing of SMP support for FreeBSD/amd64.
- This is heavily derived from John Baldwin's apic/pci cleanup on i386.
- I have completely rewritten or drastically cleaned up some other parts.
  (in particular, bootstrap)
- This is still a WIP.  It seems that there are some highly bogus bioses
  on nVidia nForce3-150 boards.  I can't stress how broken these boards
  are.  I have a workaround in mind, but right now the Asus SK8N is broken.
  The Gigabyte K8NPro (nVidia based) is also mind-numbingly hosed.
- Most of my testing has been with SCHED_ULE.  SCHED_4BSD works.
- the apic and acpi components are 'standard'.
- If you have an nVidia nForce3-150 board, you are stuck with 'device
  atpic' in addition, because they somehow managed to forget to connect the
  8254 timer to the apic, even though it's in the same silicon!  ARGH!
  This directly violates the ACPI spec.
2003-11-17 08:58:16 +00:00
Jeff Roberson
fa9c971710 - Mark ksq_assigned as volatile so that when this code is used without
sched_lock we can be sure that we'll pick up the new value.
2003-11-17 08:27:11 +00:00
Jeff Roberson
093c05e39d - Remove long dead code. rslices hasn't been used in some time and neither
has sched_pickcpu().
2003-11-17 08:24:14 +00:00
Peter Wemm
90e3387e54 Expand the argument to the ithread enable/disable helper hooks from an
int to something big enough to hold a pointer.  amd64 needs this.
2003-11-17 06:08:10 +00:00
Robert Watson
b0323ea3aa Implement sockets support for __mac_get_fd() and __mac_set_fd()
system calls, and prefer these calls over getsockopt()/setsockopt()
for ABI reasons.  When addressing UNIX domain sockets, these calls
retrieve and modify the socket label, not the label of the
rendezvous vnode.

- Create mac_copy_socket_label() entry point based on
  mac_copy_pipe_label() entry point, intended to copy the socket
  label into temporary storage that doesn't require a socket lock
  to be held (currently Giant).

- Implement mac_copy_socket_label() for various policies.

- Expose socket label allocation, free, internalize, externalize
  entry points as non-static from mac_net.c.

- Use mac_socket_label_set() in __mac_set_fd().

MAC-aware applications may now use mac_get_fd(), mac_set_fd(), and
mac_get_peer() to retrieve and set various socket labels without
directly invoking the getsockopt() interface.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-11-16 23:31:45 +00:00
Robert Watson
9e71dd0feb Reduce gratuitous redundancy and length in function names:
mac_setsockopt_label_set() -> mac_setsockopt_label()
  mac_getsockopt_label_get() -> mac_getsockopt_label()
  mac_getsockopt_peerlabel_get() -> mac_getsockopt_peerlabel()

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-11-16 18:25:20 +00:00
Alan Cox
e45db9b837 - Modify alpha's sf_buf implementation to use the direct virtual-to-
physical mapping.
 - Move the sf_buf API to its own header file; make struct sf_buf's
   definition machine dependent.  In this commit, we remove an
   unnecessary field from struct sf_buf on the alpha, amd64, and ia64.
   Ultimately, we may eliminate struct sf_buf on those architectures
   except as an opaque pointer that references a vm page.
2003-11-16 06:11:26 +00:00
Robert Watson
12cbb9dc56 When implementing getsockopt() for SO_LABEL and SO_PEERLABEL, make
sure to sooptcopyin() the (struct mac) so that the MAC Framework
knows which label types are being requested.  This fixes process
queries of socket labels.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-11-16 03:53:36 +00:00
Bruce Evans
416ab90e6b Localized the cy driver's locking. 2003-11-16 00:55:54 +00:00
Poul-Henning Kamp
d87526cf43 Rename the debugging mutex "callout_no_sleep" to "dont_sleep_in_callout". 2003-11-15 18:33:54 +00:00
Tim J. Robbins
4d93f53e74 Initialize sequence numbers to 0 in seminit() instead of using whatever
garbage happens to be in memory. This did not seem to cause any problems
except making semaphore ID's unpredictable (and ugly in ipcs(1) output).
2003-11-15 11:56:53 +00:00
Poul-Henning Kamp
00cbe31bd8 Send B_PHYS out to pasture, it no longer serves any function. 2003-11-15 09:28:09 +00:00
Alan Cox
28c9416429 - Remove the remaining now unnecessary checks for the buf's b_object being
NULL.  See revision 1.421 for more detail.
 - Remove GIANT_REQUIRED from vfs_unbusy_pages().  Discussed with: jeff
2003-11-15 08:45:36 +00:00
Jeff Roberson
155b9987a3 - Introduce kseq_runq_{add,rem}() which are used to insert and remove
kses from the run queues.  Also, on SMP, we track the transferable
   count here.  Threads are transferable only as long as they are on the
   run queue.
 - Previously, we adjusted our load balancing based on the transferable count
   minus the number of actual cpus.  This was done to account for the threads
   which were likely to be running.  All of this logic is simpler now that
   transferable accounts for only those threads which can actually be taken.
   Updated various places in sched_add() and kseq_balance() to account for
   this.
 - Rename kseq_{add,rem} to kseq_load_{add,rem} to reflect what they're
   really doing.  The load is accounted for separately from the runq because
   the load is accounted for even as the thread is running.
 - Fix a bug in sched_class() where we weren't properly using the PRI_BASE()
   version of the kg_pri_class.
 - Add a large comment that describes the impact of a seemingly simple
   conditional in sched_add().
 - Also in sched_add() check the transferable count and KSE_CAN_MIGRATE()
   prior to checking kseq_idle.  This reduces the frequency of access for
   kseq_idle which is a shared resource.
2003-11-15 07:32:07 +00:00
Olivier Houchard
1a29c80648 Better fix than my previous commit:
in exit1(), make sure the p_klist is empty after sending NOTE_EXIT.
The process won't report fork() or execve() and won't be able to handle
NOTE_SIGNAL knotes anyway.
This fixes some race conditions with do_tdsignal() calling knote() while
the process is exiting.

Reported by:	Stefan Farfeleder <stefan@fafoe.narf.at>
MFC after:	1 week
2003-11-14 18:49:01 +00:00
Alexander Kabaev
3b39740df8 Fix a number of style(9) bugs introduced in r1.113 by me.
Suggested by:	bde
2003-11-14 05:27:41 +00:00
Jeff Roberson
808674fd0e - regen. 2003-11-14 03:49:41 +00:00
Jeff Roberson
5c49a0566a - Revision 1.156 marked ptrace() SMP safe. Unfortunately, alpha implements
parts of ptrace using proc_rwmem().  proc_rwmem() requires giant, and
   giant must be acquired prior to the proc lock, so ptrace must require giant
   still.
2003-11-14 03:48:37 +00:00
Poul-Henning Kamp
555a5de270 Various minor details:
Give the HZ/overflow check a 10% margin.
	Eliminate bogus newline.
	If timecounters have equal quality, prefer higher frequency.

Some inspiration from:	bde
2003-11-13 10:03:58 +00:00
John Baldwin
79a13d0182 - Close a race where a thread on another CPU could release a contested lock
and empty its turnstile while the blocking threads still pointed to the
  turnstile.  If the thread on the first CPU blocked on a lock owned by
  one of the threads blocked on the turnstile just woken up, then the
  first CPU could try to manipulate a bogus thread queue in the turnstile
  during priority propagation.
- Update locking notes for ts_owner and always clear ts_owner, not just
  under INVARIANTS.

Tested by:      sam (1)
2003-11-12 23:48:42 +00:00
Kirk McKusick
48b0f4b67d At the request of several developers, restore the DIAGNOSTIC code
deleted in 1.81. Increase the initial timeout limit to 2ms to
eliminate spurious messages of excessive timeouts in the NFS
client code.

Requested by:	Poul-Henning Kamp <phk@phk.freebsd.dk>
Requested by:	Mike Silbersack <silby@silby.com>
Requested by:	Sam Leffler <sam@errno.com>
2003-11-12 22:28:27 +00:00
Robert Watson
f0ab044241 Mark __mac_get_pid() as MPSAFE in the comment, as it runs without
Giant and is also MPSAFE.

Push Giant further down into __mac_get_fd() and __mac_set_fd(),
grabbing it only for constrained regions dealing with VFS, and
dropping it entirely for operations related to labeling of pipes.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-11-12 22:19:15 +00:00
Peter Wemm
cde6302bf0 MNAMELEN is back to an int again after Kirk's statfs commit
kern/vfs_mount.c:1305: warning: signed size_t format, different type arg (arg 4)
*** Error code 1
2003-11-12 17:09:12 +00:00
John Baldwin
861a7db56f Fix a typo in a comment.
Submitted by:	das
2003-11-12 14:55:45 +00:00
Poul-Henning Kamp
1415a09d42 Replace B_PHYS conditional assignment to bio_offset with KASSERT check
to see that the originating code already did it right.
2003-11-12 10:27:06 +00:00
Kirk McKusick
1977597b34 Update the five files derived from /sys/kern/syscalls.master
after the additions made for the new statfs structure (version
1.157). These must be updated in a separate checkin after
syscalls.master has been checked in so that they reflect its
new CVS identity. As these are purely derived files, it is not
clear to me why they are under CVS at all. I presume that it has
something to do with having `make world' operate properly.
2003-11-12 08:09:19 +00:00
Kirk McKusick
fde81c7d8e Update the statfs structure with 64-bit fields to allow
accurate reporting of multi-terabyte filesystem sizes.

You should build and boot a new kernel BEFORE doing a `make world'
as the new kernel will know about binaries using the old statfs
structure, but an old kernel will not know about the new system
calls that support the new statfs structure. Running an old kernel
after a `make world' will cause programs such as `df' that do a
statfs system call to fail with a bad system call.

Reviewed by:	Bruce Evans <bde@zeta.org.au>
Reviewed by:	Tim Robbins <tjr@freebsd.org>
Reviewed by:	Julian Elischer <julian@elischer.org>
Reviewed by:	the hordes of <arch@freebsd.org>
Sponsored by:   DARPA & NAI Labs.
2003-11-12 08:01:40 +00:00
Robert Watson
eca8a663d4 Modify the MAC Framework so that instead of embedding a (struct label)
in various kernel objects to represent security data, we embed a
(struct label *) pointer, which now references labels allocated using
a UMA zone (mac_label.c).  This allows the size and shape of struct
label to be varied without changing the size and shape of these kernel
objects, which become part of the frozen ABI with 5-STABLE.  This opens
the door for boot-time selection of the number of label slots, and hence
changes to the bound on the number of simultaneous labeled policies
at boot-time instead of compile-time.  This also makes it easier to
embed label references in new objects as required for locking/caching
with fine-grained network stack locking, such as inpcb structures.

This change also moves us further in the direction of hiding the
structure of kernel objects from MAC policy modules, not to mention
dramatically reducing the number of '&' symbols appearing in both the
MAC Framework and MAC policy modules, and improving readability.

While this results in minimal performance change with MAC enabled, it
will observably shrink the size of a number of critical kernel data
structures (e.g., struct vnode, struct socket) for the !MAC case, and
should have a small (but measurable) performance benefit due to memory
conservation and reduced cost of zeroing memory.

NOTE: Users of MAC must recompile their kernel and all MAC modules as a
result of this change.  Because this is an API change, third party
MAC modules will also need to be updated to make less use of the '&'
symbol.

Suggestions from:	bmilekic
Obtained from:		TrustedBSD Project
Sponsored by:		DARPA, Network Associates Laboratories
2003-11-12 03:14:31 +00:00
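The layout change described in eca8a663d4 can be illustrated with two toy structures. This is a sketch under assumptions: `sk_label`, its slot count and the two object types are invented for the example, and the label is simply supplied by the caller here, whereas the commit allocates labels from a UMA zone.

```c
#include <stddef.h>

#define SK_MAX_SLOTS	4	/* assumed slot count, for illustration only */

struct sk_label {
	int	 flags;
	void	*slots[SK_MAX_SLOTS];
};

/* Old layout: the whole label lives inside the object. */
struct sk_obj_embedded {
	int		id;
	struct sk_label	label;
};

/* New layout: the object carries only a reference to the label. */
struct sk_obj_pointer {
	int		 id;
	struct sk_label	*label;
};

/* With an embedded label every access needs the address-of operator. */
static struct sk_label *
sk_embedded_label(struct sk_obj_embedded *o)
{
	return (&o->label);
}

/* With a pointer the field is passed along as-is. */
static struct sk_label *
sk_pointer_label(struct sk_obj_pointer *o)
{
	return (o->label);
}
```

In the pointer layout the size of the object no longer depends on `SK_MAX_SLOTS`, so the label can change shape without changing the object, which is the ABI point the commit message makes.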
Alexander Kabaev
5c957adbf1 1. Consolidate mount struct allocation/destruction into a common code in
vfs_mount_alloc/vfs_mount_destroy functions and take care to completely
destroy the mount point along with its locks. Mount struct has grown in
complexity recently and depending on each failure path to destroy it
completely isn't working anymore.

2. Eliminate largely identical vfs_mount and vfs_unmount question by
moving the code to handle both cases into a newly introduced vfs_domount
function.

3. Simplify nfs_mount_diskless to always expect an allocated mount
struct and never attempt an allocation/destruction itself. The
vfs_allocroot allocation was there to support 'magic' swap space
configuration for diskless clients that was already removed by PHK some
time ago.

4. Include a vfs_buildopts cleanups by Peter Edwards to validate the
sanity of nmount parameters passed from userland.

Submitted by:  (4) Peter Edwards <peter.edwards@openet-telecom.com>
Reviewed by:    rwatson
2003-11-12 02:54:47 +00:00
John Baldwin
961a7b244d Add an implementation of turnstiles and change the sleep mutex code to use
turnstiles to implement blocking instead of implementing a thread queue
directly.  These turnstiles are somewhat similar to those used in Solaris 7
as described in Solaris Internals but are also different.

Turnstiles do not come out of a fixed-sized pool.  Rather, each thread is
assigned a turnstile when it is created that it frees when it is destroyed.
When a thread blocks on a lock, it donates its turnstile to that lock to
serve as a queue of blocked threads.  The queue associated with a given lock
is found by a lookup in a simple hash table.  The turnstile itself is
protected by a lock associated with its entry in the hash table.  This
means that sched_lock is no longer needed to contend on a mutex.  Instead,
sched_lock is only used when manipulating run queues or thread priorities.
Turnstiles also implement priority propagation inherently.

Currently turnstiles only support mutexes.  Eventually, however, turnstiles
may grow two queues to support a non-sleepable reader/writer lock
implementation.  For more details, see the comments in sys/turnstile.h and
kern/subr_turnstile.c.

The two primary advantages from the turnstile code include: 1) the size
of struct mutex shrinks by four pointers as it no longer stores the
thread queue linkages directly, and 2) less contention on sched_lock in
SMP systems including the ability for multiple CPUs to contend on different
locks simultaneously (not that this last detail is necessarily that much of
a big win).  Note that 1) means that this commit is a kernel ABI breaker,
so don't mix old modules with a new kernel and vice versa.
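
The lookup and donation described above can be sketched roughly as follows.
This is a simplified user-space illustration, not the code from
kern/subr_turnstile.c: the table size, hash function, structure layouts and
the ts_lookup()/ts_block() names are assumptions, and the per-chain lock,
the handling of later waiters' spare turnstiles, and priority propagation
are all omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TC_TABLESIZE 128	/* hypothetical number of hash chains */
#define TC_HASH(lock) (((uintptr_t)(lock) >> 4) & (TC_TABLESIZE - 1))

struct thread;

struct turnstile {
	const void *ts_lock;		/* lock this turnstile queues for */
	struct thread *ts_blocked;	/* list of threads blocked on ts_lock */
	struct turnstile *ts_next;	/* next turnstile on the same hash chain */
};

struct thread {
	struct turnstile *td_turnstile;	/* owned turnstile, NULL once donated */
	struct thread *td_next;		/* linkage on ts_blocked */
};

/* One chain per hash bucket; the real design pairs each chain with a lock. */
static struct turnstile *ts_chains[TC_TABLESIZE];

/* Find the turnstile currently attached to a lock, if any. */
struct turnstile *
ts_lookup(const void *lock)
{
	struct turnstile *ts;

	for (ts = ts_chains[TC_HASH(lock)]; ts != NULL; ts = ts->ts_next)
		if (ts->ts_lock == lock)
			return (ts);
	return (NULL);
}

/* Queue td on lock; the first waiter donates its own turnstile. */
void
ts_block(struct thread *td, const void *lock)
{
	struct turnstile *ts = ts_lookup(lock);

	if (ts == NULL) {
		assert(td->td_turnstile != NULL);
		ts = td->td_turnstile;
		td->td_turnstile = NULL;
		ts->ts_lock = lock;
		ts->ts_blocked = NULL;
		ts->ts_next = ts_chains[TC_HASH(lock)];
		ts_chains[TC_HASH(lock)] = ts;
	}
	/* Later waiters keep their turnstile in this sketch. */
	td->td_next = ts->ts_blocked;
	ts->ts_blocked = td;
}
```

Because the queue lives in a per-thread object found through the hash
table, the lock itself only needs to store its owner word, which is where
the size reduction in point 1) comes from.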

Tested on:	i386 SMP, sparc64 SMP, alpha SMP
2003-11-11 22:07:29 +00:00
Joseph Koshy
a5896914f0 Bound the number of iterations a thread can perform inside
ktr_resize_pool(); this eliminates a potential livelock.

Return ENOSPC only if we encountered an out-of-memory condition when
trying to increase the pool size.

Reviewed by:	jhb, bde (style)
2003-11-11 09:09:26 +00:00
Joseph Koshy
b10221ffd9 Have utrace(2) return ENOMEM if malloc() fails. Document this error
return in its manual page.

Reviewed by:	jhb
2003-11-11 04:54:11 +00:00
Alan Cox
e35e0182c3 - Revision 1.469 of vfs_subr.c resulted in the buf's b_object field being
consistently initialized.  Consequently, a number of conditionals that
   checked the validity of b_object before passing it to VM_OBJECT_LOCK()
   and VM_OBJECT_UNLOCK() are no longer needed.
2003-11-11 04:45:37 +00:00
Robert Watson
c8e7bf92ad Whitespace sync to MAC branch, expand comment at the head of the file. 2003-11-11 03:40:04 +00:00
Alfred Perlstein
cd3c61b93d Fix a bug where the taskqueue kproc was being parented by init
because RFNOWAIT was being passed to kproc_create.

The result was that shutdown took quite a bit longer because this
errant "child" would not respond to termination signals from init
at system shutdown.

RFNOWAIT disassociates itself from the caller by attaching to init
as a parent proc.  We could have had the taskqueue proc listen for
SIGKILL, but being able to SIGKILL a potentially critical system
process doesn't seem like a good idea.
2003-11-10 20:39:44 +00:00
Tim J. Robbins
541c3b66b5 When there are no free sem_undo structs available in semu_alloc(), only
free one sem_undo with un_cnt == 0 instead of all of them. This is a
temporary workaround until the SLIST_FOREACH_PREVPTR loop gets fixed so
that it doesn't cause cycles in semu_list when removing multiple adjacent
items. It might be easier to just use (doubly-linked) LISTs here instead
of complicated SLIST code to achieve O(1) removals.

This bug manifested itself as a complete lockup under heavy semaphore use
by multiple processes with the SEM_UNDO flag set.

PR:		58984
2003-11-10 07:22:41 +00:00
Marcel Moolenaar
fcaa2925a9 Change the clear_ret argument of get_mcontext() to be a flags argument.
Since all callers either passed 0 or 1 for clear_ret, define bit 0 in
the flags for use as clear_ret. Reserve bits 1, 2 and 3 for use by MI
code for possible (but unlikely) future use. The remaining bits are for
use by MD code.

This change is triggered by a need on ia64 to have another knob for
get_mcontext().
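
The bit layout described above could be expressed along these lines. The
macro and helper names here are illustrative assumptions, not the ones
added by this commit; only the split (bit 0 for clear_ret, bits 1-3
reserved for MI code, the rest for MD code) comes from the message.

```c
#include <stdbool.h>

#define MC_FLAG_CLEAR_RET	0x1u	/* bit 0: the old clear_ret argument */
#define MC_FLAG_MI_MASK		0xfu	/* bits 0-3: MI use (1-3 reserved) */
#define MC_FLAG_MD_MASK		(~MC_FLAG_MI_MASK)	/* remaining bits: MD use */

/* True if the caller asked for the return values to be cleared. */
static inline bool
mc_wants_clear_ret(unsigned int flags)
{
	return ((flags & MC_FLAG_CLEAR_RET) != 0);
}

/* Extract only the machine-dependent part of the flags word. */
static inline unsigned int
mc_md_flags(unsigned int flags)
{
	return (flags & MC_FLAG_MD_MASK);
}
```

Existing callers that passed 0 or 1 keep their meaning unchanged, since
those values only ever touch bit 0.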
2003-11-09 20:31:04 +00:00
Bruce Evans
b698380f33 Quick fix for scaling of statclock ticks in the SMP case. As explained
in the log message for kern_sched.c 1.83 (which should have been
repo-copied to preserve history for this file), the (4BSD) scheduler
algorithm only works right if stathz is nearly 128 Hz.  The old
commit log said 64 Hz; the scheduler actually wants nearly 16 Hz
but there was a scale factor of 4 to give the requirement of 64 Hz,
and rev.1.83 changed the scale factor so that the requirement became
128 Hz.  The change of the scale factor was incomplete in the SMP
case.  Then scheduling ticks are provided by smp_ncpu CPUs, and the
scheduler cannot tell the difference between this and 1 CPU providing
scheduling ticks smp_ncpu times faster, so we need another scale
factor of smp_ncpu or an algorithm change.

This quick fix uses the scale factor without even trying to optimize
the runtime divisions required for this as is done for the other
scale factor.

The main algorithmic problem is the clamp on the scheduling tick counts.
This was 295; it is now approximately 295 * smp_ncpu.  When the limit
is reached, threads get free timeslices and scheduling becomes very
unfair to the threads that don't hit the limit.  The limit can be
reached and maintained in the worst case if the load average is larger
than (limit / effective_stathz - 1) / 2 = 0.65 now (was just 0.08 with
2 CPUs before this change), so there are algorithmic problems even for
a load average of 1.  Fortunately, the worst case isn't common enough
for the problem to be very noticeable (it is mainly for niced CPU hogs
competing with less nice CPU hogs).
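
The two load-average figures can be checked with the formula quoted above.
This is only an arithmetic sketch: the function name is made up for
illustration, and it assumes the effective stathz is 128 Hz multiplied by
the number of CPUs supplying ticks.

```c
#include <assert.h>

/*
 * Load average above which the tick clamp can be reached and maintained
 * in the worst case: (limit / effective_stathz - 1) / 2.
 */
double
clamp_load_threshold(double limit, double effective_stathz)
{
	assert(effective_stathz > 0.0);
	return ((limit / effective_stathz - 1.0) / 2.0);
}
```

With 2 CPUs, the new limit gives clamp_load_threshold(295.0 * 2, 128.0 * 2),
about 0.65, while the old unscaled limit gives
clamp_load_threshold(295.0, 128.0 * 2), about 0.08.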
2003-11-09 13:45:54 +00:00
Seigo Tanimura
512824f8f7 - Implement selwakeuppri() which allows raising the priority of a
thread being woken up.  The thread woken up can run at a priority as
  high as after tsleep().

- Replace selwakeup()s with selwakeuppri()s and pass appropriate
  priorities.

- Add cv_broadcastpri() which raises the priority of the broadcast
  threads.  Used by selwakeuppri() if collision occurs.

Not objected in:	-arch, -current
2003-11-09 09:17:26 +00:00
Sam Leffler
7902224c6b o add a flags parameter to netisr_register that is used to specify
whether or not the isr needs to hold Giant when running; Giant-less
  operation is also controlled by the setting of debug_mpsafenet
o mark all netisr's except NETISR_IP as needing Giant
o add a GIANT_REQUIRED assertion to the top of netisr's that need Giant
o pickup Giant (when debug_mpsafenet is 1) inside ip_input before
  calling up with a packet
o change netisr handling so swi_net runs w/o Giant; instead we grab
  Giant before invoking handlers based on whether the handler needs Giant
o change netisr handling so that netisr's that are marked MPSAFE may
  have multiple instances active at a time
o add netisr statistics for packets dropped because the isr is inactive

Supported by:	FreeBSD Foundation
2003-11-08 22:28:40 +00:00
David Xu
685a6c448a Return a reasonable number for top or ps to display for an M:N thread.
Since there is no direct association between an M:N thread and a kse,
a thread sometimes does not have a kse; in that case, return the pctcpu
from its last kse.  It is not perfect, but gives a good number to be
displayed.
2003-11-08 03:03:17 +00:00
John Baldwin
dac33f12cc Regen. 2003-11-07 20:30:30 +00:00
John Baldwin
c055e5d412 Mark ptrace(), ktrace(), utrace(), sysarch(), and issetugid() as MP safe.
The parts of these calls that are not yet MP safe acquire Giant explicitly.
2003-11-07 20:23:23 +00:00
Robert Watson
a2f88a8b7c Slight whitespace consistency improvement:
Trim trailing whitespace.
  Remove unmatched " " before ")".
2003-11-07 04:47:14 +00:00
Jeff Roberson
f28b3340c1 - Somehow I botched my last commit. Add an extra ( to fix things up. I'm
still not sure how this happened.

Reported by:	ps
2003-11-06 07:56:01 +00:00
Alan Cox
3b2c54e7bc - Delay the allocation of memory for the pipe mutex until we need it.
This avoids the need to free said memory in various error cases along
   the way.
2003-11-06 05:58:26 +00:00
Alan Cox
fc17df5264 - Simplify pipespace() by eliminating the explicit creation of vm objects.
Instead, let the vm objects be lazily instantiated at fault time.  This
   results in the allocation of fewer vm objects and vm map entries due to
   aggregation in the vm system.
2003-11-06 05:08:12 +00:00
Robert Watson
83b7b0edca Remove the flags argument from mac_externalize_*_label(), as it's not
passed into policies or used internally to the MAC Framework.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-11-06 03:42:43 +00:00
Jeff Roberson
a70d729bff - Remove the local definition of sched_pin and unpin. They are provided in
sched.h now.
 - Respect the td pin count.
2003-11-06 03:09:51 +00:00
Sam Leffler
d3be1471c7 o make debug_mpsafenet globally visible
o move it from subr_bus.c to netisr.c where it more properly belongs
o add NET_PICKUP_GIANT and NET_DROP_GIANT macros that will be used to
  grab Giant as needed when MPSAFE operation is enabled

Supported by:	FreeBSD Foundation
2003-11-05 23:42:51 +00:00
Warner Losh
252af39a96 Minor style(9) nit 2003-11-05 06:14:48 +00:00
Jeff Roberson
46f8b26550 - It's ok if sched_runnable() has races in it, we don't need the sched_lock
here unless we have something on the assigned queue.
2003-11-05 05:30:12 +00:00
Alexander Kabaev
ca430f2e92 Remove mntvnode_mtx and replace it with per-mountpoint mutex.
Introduce two new macros MNT_ILOCK(mp)/MNT_IUNLOCK(mp) to
operate on this mutex transparently.

Eventually new mutex will be protecting more fields in
struct mount, not only vnode list.

Discussed with: jeff
2003-11-05 04:30:08 +00:00
Max Khon
2332251c6a Back out the following revisions:
1.36      +73 -60    src/sys/compat/linux/linux_ipc.c
1.83      +102 -48   src/sys/kern/sysv_shm.c
1.8       +4 -0      src/sys/sys/syscallsubr.h

That change was intended to support vmware3, but the
wantrem parameter is useless because vmware3 uses SYSV shared memory
to talk with the X server and the X server is a native application.
The patch worked because the check for wantrem was not valid
(wantrem and SHMSEG_REMOVED were never checked for SHMSEG_ALLOCATED segments).

Add kern.ipc.shm_allow_removed (integer, rw) sysctl (default 0) which, when set
to 1, allows removed segments to be returned by
shm_find_segment_by_shmid() and shm_find_segment_by_shmidx().

MFC after:	1 week
2003-11-05 01:53:10 +00:00
Kirk McKusick
b932dd9b28 Get rid of DIAGNOSTIC that gives false positives on slow CPUs. 2003-11-04 08:03:11 +00:00
Jeff Roberson
9bacd788a1 - Add initial support for pinning and binding. 2003-11-04 07:45:41 +00:00
Kirk McKusick
15a93fcc31 Allow the bufdaemon and update daemon processes to skip the
waitrunningbufspace() calls so that they are always able to
proceed and clean up buffer space.

Submitted by:	Brian Fundakowski Feldman <green@freebsd.org>
2003-11-04 06:30:00 +00:00
Sam Leffler
3465702f13 disable MPSAFE network drivers; we aren't ready yet 2003-11-04 02:01:42 +00:00
Olivier Houchard
7922cdc855 I believe kbyanc@ really meant this in rev 1.58.
Use zpfind() to see if the process became a zombie if pfind() doesn't find it
and if the caller wants to know about process death, so that the caller knows
the process died even if it happened before the kevent was actually registered.

MFC after:	1 week
2003-11-04 01:41:47 +00:00
Olivier Houchard
f44004690c Do not attempt to report proc event if NOTE_EXIT has already been received.
This fixes a race condition (specifically with signal events) that could
lead to the kn being re-inserted into the list after it has been destroyed,
which is not something we want to happen.

PR:		kern/58258
2003-11-04 01:14:58 +00:00
John Baldwin
8bc0846476 Don't require INTR_FAST handlers to be exclusive in the MI layer. Instead,
let the MD code choose whether or not to implement such a policy.  The new
i386 interrupt code allows multiple FAST handlers for a given source for
example.  However, the code does not allow FAST and non-FAST handlers to be
mixed.
2003-11-03 22:42:58 +00:00
John Baldwin
b95bb3e62b Update spin lock order list for new i386 interrupt and SMP code. 2003-11-03 22:38:30 +00:00
Robert Watson
730ecf8254 Unlock pipe mutex when failing MAC pipe ioctl access control check.
Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-11-03 17:58:23 +00:00
Jeff Roberson
112b6d3aa9 - Remove kseq_find(), we no longer scan other cpu's run queues when we go
idle.  They figure out that we're idle fast enough that the cache pollution
   introduced by scanning their run queue is more expensive than waiting
   a little longer.
 - Add kseq_setidle() to mark us as being idle.  Use this in place of
   kseq_find().
 - Remove kseq_load_highest(), kseq_find() was the only consumer of this
   interface.  kseq_balance() has its own customized version that finds the
   lowest and highest loads simultaneously.

Continuously told that this would be faster by:	terry
2003-11-03 03:27:22 +00:00
Jeff Roberson
ef1134c9ad - Remove the ksq_loads[] array. We are only interested in three counts,
the total load, the timeshare load, and the number of threads that can
   be migrated to another cpu.  Account for these separately.
 - Introduce a KSE_CAN_MIGRATE() macro which determines whether or not a KSE
   can be migrated to another CPU.  Currently, this only checks to see if
   we're an interrupt handler.  Eventually this will also be used to support
   CPU binding.
2003-11-02 10:56:48 +00:00
Alexander Kabaev
cb9ddc80ae Take care not to call vput if thread used in corresponding vget
wasn't curthread, i.e. when we receive a thread pointer to use
as a function argument. Use VOP_UNLOCK/vrele in these cases.

The only case where td != curthread known at the moment is
boot() calling sync with thread0 pointer.

This fixes the panic on shutdown people have reported.
2003-11-02 04:52:53 +00:00
Jeff Roberson
769a363537 - In sched_prio() only force us onto the current queue if our priority is
being elevated (numerically smaller).
2003-11-02 04:25:59 +00:00
Jeff Roberson
7d1a81b4dc - Rename SCHED_PRI_NTHRESH to SCHED_SLICE_NTHRESH since it is only used in
slice assignment.  Add a comment describing what it does.
 - Remove a stale XXX comment, the nice should not impact the interactivity,
   nice adjustments only affect non-interactive tasks in ULE.
 - Don't allow nice -20 tasks to totally starve nice 0 tasks.  Give them at
   least SCHED_SLICE_MIN ticks.  We still allow nice 0 tasks to starve nice
   +20 tasks as intended.
2003-11-02 04:10:15 +00:00
Jeff Roberson
a0a931cec7 - Remove uses of PRIO_TOTAL and replace them with SCHED_PRI_NRESV
- SCHED_PRI_NRESV does not have the off by one error in PRIO_TOTAL so we
   do not have to account for it in the few places that we use it.

Requested by:	bde
2003-11-02 03:49:32 +00:00
Jeff Roberson
d322132c62 - Change sched_interact_update() to only accept slp+runtime values between
0 and SCHED_SLP_RUN_MAX * 2.  This allows us to simplify the algorithm
   quite a bit.  Before, it dealt with arbitrary values which required us
   to do nasty integer division tricks that didn't quite work out correctly.
 - Change sched_wakeup() to detect conditions where the slp+runtime could
   exceed SCHED_SLP_RUN_MAX * 2.  This can happen if we go to sleep for
   longer than 6 seconds.  In this case, we'll just clear the runtime and
   set the sleep time to the max.
 - Define a new function, sched_interact_fork() which updates the slp+runtime
   of a newly forked thread.  We want to limit the amount of history retained
   from the parent so that we learn the child's behavior quickly.  We don't,
   however, want to decay it to nothing.  Previously, we would simply divide
   each parameter by 100 whenever we forked.  After a few forks the values
   would reach 0 and tasks would not be considered interactive.
 - Add another KTR entry, cleanup some existing entries.
 - Remove a useless sched_interact_update() from sched_priority().  This is
   already done by the callers that require it.
2003-11-02 03:36:33 +00:00
Alexander Kabaev
492c1e68fb Temporarily undo parts of the struct mount locking commit by jeff.
It is unsafe to hold a mutex across vput/vrele calls.

This will be redone when a better locking strategy is agreed upon.

Discussed with: jeff
2003-11-01 05:51:54 +00:00
Jeff Roberson
22bf7d9a0e - Add static to local functions and data where it was missing.
- Add an IPI based mechanism for migrating kses.  This mechanism is
   broken down into several components.  This is intended to reduce cache
   thrashing by eliminating most cases where one cpu touches another's
   run queues.
 - kseq_notify() appends a kse to a lockless singly linked list and
   conditionally sends an IPI to the target processor.  Right now this is
   protected by sched_lock but at some point I'd like to get rid of the
   global lock.  This is why I used something more complicated than a
   standard queue.
 - kseq_assign() processes our list of kses that have been assigned to us
   by other processors.  This simply calls sched_add() for each item on the
   list after clearing the new KEF_ASSIGNED flag.  This flag is used to
   indicate that we have been appended to the assigned queue but not
   added to the run queue yet.
 - In sched_add(), instead of adding a KSE to another processor's queue we
   use kseq_notify() so that we don't touch their queue.  Also in sched_add(),
   if KEF_ASSIGNED is already set return immediately.  This can happen if
   a thread is removed and readded so that the priority is recorded properly.
 - In sched_rem() return immediately if KEF_ASSIGNED is set.  All callers
   immediately readd simply to adjust priorities etc.
 - In sched_choose(), if we're running an IDLE task or the per cpu idle thread
   set our cpumask bit in 'kseq_idle' so that other processors may know that
   we are idle.  Before this, make a single pass through the run queues of
   other processors so that we may find work more immediately if it is
   available.
 - In sched_runnable(), don't scan each processor's run queue, they will IPI
   us if they have work for us to do.
 - In sched_add(), if we're adding a thread that can be migrated and we have
   plenty of work to do, try to migrate the thread to an idle kseq.
 - Simplify the logic in sched_prio() and take the KEF_ASSIGNED flag into
   consideration.
 - No longer use kseq_choose() to steal threads, it can lose its last
   argument.
 - Create a new function runq_steal() which operates like runq_choose() but
   skips threads based on some criteria.  Currently it will not steal
   PRI_ITHD threads.  In the future this will be used for CPU binding.
 - Create a kseq_steal() that checks each run queue with runq_steal(), use
   kseq_steal() in the places where we used kseq_choose() to steal with
   before.
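
The hand-off between kseq_notify() and kseq_assign() described above can be
pictured as a lock-free push onto a singly linked list plus a take-all on
the receiving side. The following is a user-space sketch using C11 atomics,
not the scheduler code: the structure, the function names and the single
global list head are assumptions, and the IPI, the KEF_ASSIGNED flag and
the call to sched_add() are left out.

```c
#include <stdatomic.h>
#include <stddef.h>

struct kse_sketch {
	struct kse_sketch *next;	/* linkage on the assigned list */
	int id;				/* placeholder payload */
};

/* Head of the list of entries assigned to this CPU by other CPUs. */
static _Atomic(struct kse_sketch *) assigned_head;

/* Sender side: push an entry without touching the target's run queue. */
void
kseq_notify_sketch(struct kse_sketch *ke)
{
	struct kse_sketch *old = atomic_load(&assigned_head);

	do {
		ke->next = old;
	} while (!atomic_compare_exchange_weak(&assigned_head, &old, ke));
}

/* Receiver side: detach the whole list in one step for local processing. */
struct kse_sketch *
kseq_take_assigned(void)
{
	return (atomic_exchange(&assigned_head, NULL));
}
```

The receiver would walk the returned list and add each entry to its own
run queue, so only the owning CPU ever modifies that queue.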
2003-10-31 11:16:04 +00:00
John Baldwin
e57ea233d9 Ensure that mp_ncpus is set to 1 if mp_cpu_probe() fails. 2003-10-30 21:44:01 +00:00
Alexander Kabaev
0823d2996c Relock mntvnode_mtx if vget fails in vfs_stdsync. The loop
should always be entered with the mutex locked.
2003-10-30 16:22:51 +00:00
David Xu
7eeaaf9b97 Try to fetch the thread mailbox address in the page fault trap, so that when a thread
blocks in the page fault handler, an upcall thread can be scheduled. This is
useful if a process is doing lots of mmap-based I/O.
2003-10-30 02:55:43 +00:00
Sam Leffler
90fc7b7cb8 Add a temporary mechanism to disable INTR_MPSAFE from network interface
drivers.  This is preparatory to running more parts of the network system
w/o Giant.
2003-10-29 18:29:50 +00:00
Bruce Evans
b3aeaf2ed1 Removed mostly-dead code for setting switchtime after the idle loop
clobbers this variable.  Long ago, when the idle loop wasn't in a
process, it set switchtime.tv_sec to zero to indicate that the time
needs to be read after the idle loop finishes.  The special case for
this isn't needed now that there is an idle process (for each CPU).
The time is read in the normal way when the idle process is switched
away from.  The seconds component of the time is only zero for the
first second after the uptime is set, and the mostly-dead code was only
executed during this time.  (This was slightly broken by using uptimes
instead of times relative to the Epoch -- in the original version the
seconds component of the time was only 0 for the first second after
the Epoch.)

In mi_switch(), moved the setting of switchticks to just after the
first (and now only) setting of switchtime.  This setting used to be
delayed since a late setting was needed for the idle case and an early
setting was not needed.  Now the early setting is needed so that
fork_exit() doesn't need to set either switchtime or switchticks.
Removed now-completely-rotted comment attached to this.  Most of the
code described by the comment had already moved to sched_switch().
2003-10-29 15:23:09 +00:00
Bruce Evans
89674a9f77 Removed sched_nest variable in sched_switch(). Context switches always
begin with sched_lock held but not recursed, so this variable was
always 0.

Removed fixup of sched_lock.mtx_recurse after context switches in
sched_switch().  Context switches always end with this variable in the
same state that it began in, so there is no need to fix it up.  Only
sched_lock.mtx_lock really needs a fixup.

Replaced fixup of sched_lock.mtx_recurse in fork_exit() by an assertion
that sched_lock is owned and not recursed after it is fixed up.  This
assertion must match the one in mi_switch(), and if sched_lock were
recursed then a non-null fixup of sched_lock.mtx_recurse would probably
be needed again, unlike in sched_switch(), since fork_exit() doesn't
return to its caller in the normal way.
2003-10-29 14:40:41 +00:00
Sam Leffler
9c855a36c1 Introduce the notion of "persistent mbuf tags"; these are tags that stay
with an mbuf until it is reclaimed.  This is in contrast to tags that
vanish when an mbuf chain passes through an interface.  Persistent tags
are used, for example, by MAC labels.

Add an m_tag_delete_nonpersistent function to strip non-persistent tags
from mbufs and use it to strip such tags from packets as they pass through
the loopback interface and when turned around by icmp.  This fixes problems
with "tag leakage".

Pointed out by:	Jonathan Stone
Reviewed by:	Robert Watson
2003-10-29 05:40:07 +00:00
Sam Leffler
395bb18680 speedup stream socket recv handling by tracking the tail of
the mbuf chain instead of walking the list for each append
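
The idea can be illustrated with a generic singly linked buffer list that
remembers its tail. This is a stand-alone sketch, not the sockbuf code: the
structure and field names are assumptions chosen for the example.

```c
#include <assert.h>
#include <stddef.h>

struct buf_sketch {
	struct buf_sketch *next;	/* next buffer in the chain */
	int len;			/* bytes of data in this buffer */
};

struct chain_sketch {
	struct buf_sketch *head;	/* first buffer, NULL if empty */
	struct buf_sketch *tail;	/* last buffer, NULL if empty */
	int cc;				/* total bytes queued */
};

/* Append one buffer in O(1) by using the cached tail pointer. */
void
chain_append(struct chain_sketch *c, struct buf_sketch *m)
{
	assert(m != NULL);
	m->next = NULL;
	if (c->tail == NULL)
		c->head = m;		/* empty chain: m is also the head */
	else
		c->tail->next = m;	/* no walk from head needed */
	c->tail = m;
	c->cc += m->len;
}
```

Without the tail pointer each append has to walk from the head, so filling
a long receive queue costs time quadratic in the number of buffers.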

Submitted by:	ps/jayanth
Obtained from:	netbsd (jason thorpe)
2003-10-28 05:47:40 +00:00
Jeff Roberson
1aca9909e5 - Only change the run queue in sched_prio() if the kse is non null. threads
can be in the TD_ON_RUNQ state and not have an associated kse.
 - Remove the PRI_IDLE special case from sched_clock(), it was not actually
   necessary.
2003-10-28 03:28:48 +00:00
Jeff Roberson
eab9cabf34 - Don't set td_priority directly here, use sched_prio(). 2003-10-27 07:15:47 +00:00
Jeff Roberson
3f741ca117 - Use a better algorithm in sched_pctcpu_update()
Contributed by:	Thomaswuerfl@gmx.de

 - In sched_prio(), adjust the run queue for threads which may need to move
   to the current queue due to priority propagation.
 - In sched_switch(), fix style bug introduced when the KSE support went in.
   Columns are 80 chars wide, not 90.
 - In sched_switch(), Fix the comparison in the idle case and explicitly
   re-initialize the runq in the not propagated case.
 - Remove dead code in sched_clock().
 - In sched_clock(), If we're an IDLE class td set NEEDRESCHED so that threads
   that have become runnable will get a chance to.
 - In sched_runnable(), if we're not the IDLETD, we should not consider
   curthread when examining the load.  This mimics the 4BSD behavior of
   returning 0 when the only runnable thread is running.
 - In sched_userret(), remove the code for setting NEEDRESCHED entirely.
   This is not necessary and is not implemented in 4BSD.
 - Use the correct comparison in sched_add() when checking to see if an idle
   prio task has had its priority temporarily elevated.
2003-10-27 06:47:05 +00:00
Alfred Perlstein
6ff7636ea5 constify the second args to timevaladd() and timevalsub(). 2003-10-26 02:19:00 +00:00
Robert Watson
36bbf86ba6 Check (locked) before performing an advisory unlock following a failure
of vn_start_write().  Otherwise, we may inconsistently attempt to release
the advisory lock.

Pointed out by:	teggej
2003-10-25 16:43:50 +00:00
Robert Watson
c447f5b2f4 When generating a core dump, use advisory locking in an advisory way:
if we do acquire an advisory lock, great!  We'll release it later.
However, if we fail to acquire a lock, we perform the coredump
anyway.  This problem became particularly visible with NFS after
the introduction of rpc.lockd: if the lock manager isn't running,
then locking calls will fail, aborting the core dump (resulting in
a zero-byte dump file).

Reported by:	Yogeshwar Shenoy <ynshenoy@alumni.cs.ucsb.edu>
2003-10-25 16:14:09 +00:00
Robert Watson
67536f038c Allow MAC policies to block/revoke kern_alq write access to a file.
Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
Reviewed by:	jeff
2003-10-25 16:10:41 +00:00
Warner Losh
17e02bb39b Convenience functions to generate notifications from the kernel. The ACPI
code will start using these shortly.

Reviewed by: njl
2003-10-24 22:41:54 +00:00