Commit Graph

7225 Commits

Author SHA1 Message Date
Scott Long
740d9ba692 Convert the other use of flags to mflags in soalloc(). 2004-03-01 01:14:28 +00:00
Robert Watson
2bc87dcfbe Modify soalloc() API so that it accepts a malloc flags argument rather
than a "waitok" argument.  Callers now passing M_WAITOK or M_NOWAIT
rather than 0 or 1.  This simplifies the soalloc() logic, and also
makes the waiting behavior of soalloc() more clear in the calling
context.

Submitted by:	sam
2004-02-29 17:54:05 +00:00
Poul-Henning Kamp
2cf6bdac50 Loudly announce WITNESS and DIAGNOSTIC options and warn about reduced
performance.
2004-02-29 16:56:54 +00:00
Poul-Henning Kamp
3d6e5ccb06 Make sure to disable the watchdog if we cannot honour the timeout. 2004-02-28 22:01:19 +00:00
Poul-Henning Kamp
4103b7652d Rename the WATCHDOG option to SW_WATCHDOG and make it use the
generic watchdoc(9) interface.

Make watchdogd(8) perform as watchdog(8) as well, and make it
possible to specify a check command to run, timeout and sleep
periods.

Update watchdog(4) to talk about the generic interface and add
new watchdog(8) page.
2004-02-28 20:56:35 +00:00
John Baldwin
44f3b09204 Switch the sleep/wakeup and condition variable implementations to use the
sleep queue interface:
- Sleep queues attempt to merge some of the benefits of both sleep queues
  and condition variables.  Having sleep qeueus in a hash table avoids
  having to allocate a queue head for each wait channel.  Thus, struct cv
  has shrunk down to just a single char * pointer now.  However, the
  hash table does not hold threads directly, but queue heads.  This means
  that once you have located a queue in the hash bucket, you no longer have
  to walk the rest of the hash chain looking for threads.  Instead, you have
  a list of all the threads sleeping on that wait channel.
- Outside of the sleepq code and the sleep/cv code the kernel no longer
  differentiates between cv's and sleep/wakeup.  For example, calls to
  abortsleep() and cv_abort() are replaced with a call to sleepq_abort().
  Thus, the TDF_CVWAITQ flag is removed.  Also, calls to unsleep() and
  cv_waitq_remove() have been replaced with calls to sleepq_remove().
- The sched_sleep() function no longer accepts a priority argument as
  sleep's no longer inherently bump the priority.  Instead, this is soley
  a propery of msleep() which explicitly calls sched_prio() before
  blocking.
- The TDF_ONSLEEPQ flag has been dropped as it was never used.  The
  associated TDF_SET_ONSLEEPQ and TDF_CLR_ON_SLEEPQ macros have also been
  dropped and replaced with a single explicit clearing of td_wchan.
  TD_SET_ONSLEEPQ() would really have only made sense if it had taken
  the wait channel and message as arguments anyway.  Now that that only
  happens in one place, a macro would be overkill.
2004-02-27 18:52:44 +00:00
John Baldwin
e5bb601d87 Drop sched_lock around the wakeup of the parent process after setting
the process state to zombie when a process exits to avoid a lock order
reversal with the sleepqueue locks.  This appears to be the only place
that we call wakeup() with sched_lock held.
2004-02-27 18:39:09 +00:00
John Baldwin
dd75b0a90d Add an implementation of a generic sleep queue abstraction that is used
to queue threads sleeping on a wait channel similar to how turnstiles are
used to queue threads waiting for a lock.  This subsystem will be used as
the backend for sleep/wakeup and condition variables initially.  Eventually
it will also be used to replace the ithread-specific iwait thread
inhibitor.

Sleep queues are also not locked by sched_lock, so this splits sched_lock
up a bit further increasing concurrency within the scheduler.  Sleep queues
also natively support timeouts on sleeps and interruptible sleeps allowing
for the reduction of a lot of duplicated code between the sleep/wakeup and
condition variable implementations.  For more details on the sleep queue
implementation, check the comments in sys/sleepqueue.h and
kern/subr_sleepqueue.c.
2004-02-27 18:33:09 +00:00
Dag-Erling Smørgrav
21885af505 Add sysctl_move_oid() which reparents an existing OID. 2004-02-27 17:13:23 +00:00
John Baldwin
5b7de7e19e Clarify and tweak some comments. 2004-02-27 16:14:27 +00:00
John Baldwin
03129ba97f Fix _sx_assert() to panic() rather than printf() when an assertion fails
and ignore assertions if we have already paniced.
2004-02-27 16:13:44 +00:00
John Baldwin
f4114c3d7f Replace the ktrace queue's semaphore with a condition variable instead as
it is slightly more efficient since we already have a mutex to protect the
queue.  Ktrace originally used a semaphore more as a proof of concept.
2004-02-26 19:30:22 +00:00
Don Lewis
47934cef8f Split the mlock() kernel code into two parts, mlock(), which unpacks
the syscall arguments and does the suser() permission check, and
kern_mlock(), which does the resource limit checking and calls
vm_map_wire().  Split munlock() in a similar way.

Enable the RLIMIT_MEMLOCK checking code in kern_mlock().

Replace calls to vslock() and vsunlock() in the sysctl code with
calls to kern_mlock() and kern_munlock() so that the sysctl code
will obey the wired memory limits.

Nuke the vslock() and vsunlock() implementations, which are no
longer used.

Add a member to struct sysctl_req to track the amount of memory
that is wired to handle the request.

Modify sysctl_wire_old_buffer() to return an error if its call to
kern_mlock() fails.  Only wire the minimum of the length specified
in the sysctl request and the length specified in its argument list.
It is recommended that sysctl handlers that use sysctl_wire_old_buffer()
should specify reasonable estimates for the amount of data they
want to return so that only the minimum amount of memory is wired
no matter what length has been specified by the request.

Modify the callers of sysctl_wire_old_buffer() to look for the
error return.

Modify sysctl_old_user to obey the wired buffer length and clean up
its implementation.

Reviewed by:	bms
2004-02-26 00:27:04 +00:00
Robert Watson
049ffe98a8 Assert pipe mutex in pipeselwakeup(), as we manipulate pipe_state
in a non-atomic manner.  It appears to always be called with the
mutex (good).
2004-02-26 00:18:22 +00:00
Robert Watson
094bdd260c Update comment regarding MAC labels: we no longer pass endpoints
into the MAC Framework, just the pipe pair.

GC 'hadpeer' used in pipedestroy(), which is no longer needed as
we check pipe_present flags on the pair.
2004-02-25 23:30:56 +00:00
Dag-Erling Smørgrav
854a417d92 Whitespace cleanup 2004-02-24 19:31:30 +00:00
Poul-Henning Kamp
652d04726d Fix two oversights here: don't trash the freelist, and properly cleanup
the cdevsw{}.

Submitted by:	tegge
2004-02-23 08:42:55 +00:00
Brian Feldman
240160d48b Correct some major SMP-harmful problems in the pipe implementation. First
of all, PIPE_EOF is not checked pervasively after everything that can drop
the pipe mutex and msleep(), so fix.  Additionally, though it might not
harm anything, pipelock() and pipeunlock() are not used consistently.
Third, the kqueue support functions do not use the pipe mutex correctly.
Last, but absolutely not least, is a race: if pipe_busy is not set on
the closing side of the pipe, the other side that is trying to write to
that will crash BECAUSE PIPE_EOF IS NOT SET!  Unconditionally set
PIPE_EOF, and get rid of all the lockups/crashes I have seen trying
to build ports.
2004-02-22 23:00:14 +00:00
Daniel Eischen
2648efa621 Add sysctls to allow showing threads for pgrp, tty, uid, ruid,
and pid.
2004-02-22 17:54:32 +00:00
Pawel Jakub Dawidek
63dba32b76 Reimplement sysctls handling by MAC framework.
Now I believe it is done in the right way.

Removed some XXMAC cases, we now assume 'high' integrity level for all
sysctls, except those with CTLFLAG_ANYBODY flag set. No more magic.

Reviewed by:	rwatson
Approved by:	rwatson, scottl (mentor)
Tested with:	LINT (compilation), mac_biba(4) (functionality)
2004-02-22 12:31:44 +00:00
Colin Percival
b17dd2bcc0 If we're going to panic(), do it before dereferencing a NULL pointer.
Reported by:	"Ted Unangst" <tedu@coverity.com>
Approved by:	rwatson (mentor)
2004-02-22 01:11:53 +00:00
Robert Watson
f6a4109212 Update my personal copyrights and NETA copyrights in the kernel
to use the "year1-year3" format, as opposed to "year1, year2, year3".
This seems to make lawyers more happy, but also prevents the
lines from getting excessively long as the years start to add up.

Suggested by:	imp
2004-02-22 00:33:12 +00:00
Poul-Henning Kamp
ded67d0f77 Check for NODEV return from udev2dev() 2004-02-21 23:52:03 +00:00
Poul-Henning Kamp
cd690b60de Device megapatch 6/6:
This is what we came here for:  Hang dev_t's from their cdevsw,
refcount cdevsw and dev_t and generally keep track of things a lot
better than we used to:

Hold a cdevsw reference around all entrances into the device driver,
this will be necessary to safely determine when we can unload driver
code.

Hold a dev_t reference while the device is open.

KASSERT that we do not enter the driver on a non-referenced dev_t.

Remove old D_NAG code, anonymous dev_t's are not a problem now.

When destroy_dev() is called on a referenced dev_t, move it to
dead_cdevsw's list.  When the refcount drops, free it.

Check that cdevsw->d_version is correct.  If not, set all methods
to the dead_*() methods to prevent entrance into driver.  Print
warning on console to this effect.  The device driver may still
explode if it is also incompatible with newbus, but in that case
we probably didn't get this far in the first place.
2004-02-21 21:57:26 +00:00
Poul-Henning Kamp
816d62bbb9 Device megapatch 5/6:
Remove the unused second argument from udev2dev().

Convert all remaining users of makedev() to use udev2dev().  The
semantic difference is that udev2dev() will only locate a pre-existing
dev_t, it will not line makedev() create a new one.

Apart from the tiny well controlled windown in D_PSEUDO drivers,
there should no longer be any "anonymous" dev_t's in the system
now, only dev_t's created with make_dev() and make_dev_alias()
2004-02-21 21:32:15 +00:00
Poul-Henning Kamp
dc08ffec87 Device megapatch 4/6:
Introduce d_version field in struct cdevsw, this must always be
initialized to D_VERSION.

Flip sense of D_NOGIANT flag to D_NEEDGIANT, this involves removing
four D_NOGIANT flags and adding 145 D_NEEDGIANT flags.
2004-02-21 21:10:55 +00:00
Poul-Henning Kamp
8e1f1df080 Device megapatch 3/6:
Add missing D_TTY flags to various drivers.

Complete asserts that dev_t's passed to ttyread(), ttywrite(),
ttypoll() and ttykqwrite() have (d_flags & D_TTY) and a struct tty
pointer.

Make ttyread(), ttywrite(), ttypoll() and ttykqwrite() the default
cdevsw methods for D_TTY drivers and remove the explicit initializations
in various drivers cdevsw structures.
2004-02-21 20:41:11 +00:00
Poul-Henning Kamp
b0b0334878 Device megapatch 2/6:
This commit adds a couple of functions for pseudodrivers to use for
implementing cloning in a manner we will be able to lock down (shortly).

Basically what happens is that pseudo drivers get a way to ask for
"give me the dev_t with this unit number" or alternatively "give
me a dev_t with the lowest guaranteed free unit number" (there is
unfortunately a lot of non-POLA in the exact numeric value of this
number, just live with it for now)

Managing the unit number space this way removes the need to use
rman(9) to do so in the drivers this greatly simplifies the code in
the drivers because even using rman(9) they still needed to manage
their dev_t's anyway.

I have taken the if_tun, if_tap, snp and nmdm drivers through the
mill, partly because they (ab)used makedev(), but mostly because
together they represent three different problems for device-cloning:

if_tun and snp is the plain case: just give me a device.

if_tap has two kinds of devices, with a flag for device type.

nmdm has paired devices (ala pty) can you can clone either of them.
2004-02-21 20:29:52 +00:00
Brian Feldman
fe5f3a72ac Make sure to wake up any select waiters when closing a kqueue (also, not
crash).  I am fairly sure that only people with SMP and multi-threaded
apps using kqueue will be affected by this, so I have a stress-testing
program on my web site:
<URL:http://green.homeunix.org/~green/getaddrinfo-pthreads-stresstest.c>
2004-02-20 04:00:48 +00:00
John Baldwin
712f57d8ab Tidy up the thread taskqueue implementation and close a lost wakeup race.
Instead of creating a mutex that we msleep on but don't actually lock when
doing the corresponding wakeup(), in the kthread, lock the mutex associated
with our taskqueue and msleep while the queue is empty.  Assert that the
queue is locked when the callback function is called to wake the kthread.
2004-02-19 22:03:52 +00:00
Jacques Vidrine
57f22bd4af Rework jail_attach(2) so that an already jailed process cannot hop
to another jail.

Submitted by:	rwatson
2004-02-19 21:03:20 +00:00
Pawel Jakub Dawidek
461167c289 Added sysctl security.jail.jailed.
It returns 1 is process is inside of jail and 0 if it is not.
Information if we are in jail or not is not a secret, there is plenty of
ways to discover it. Many people are using own hack to check this and
this will be a legal way from now on.

It will be great if our starting scripts will take advantage of this sysctl
to allow clean "boot" inside jail.

Approved by:	rwatson, scottl (mentor)
2004-02-19 14:29:14 +00:00
Pawel Jakub Dawidek
f6739b1ddc Simplify check. We are only able to check exclusive lock and if
2nd condition is true, first one is true for sure.

Approved by:	jhb, scottl (mentor)
2004-02-19 14:19:31 +00:00
Don Lewis
cf93aa166c When reparenting a process in the PT_DETACH code, only set p_sigparent
to SIGCHLD if the new parent process is initproc.

MFC after:	2 weeks
2004-02-19 10:39:42 +00:00
Don Lewis
6567eef757 A Linux thread created using clone() should not send SIGCHLD to its
parent if no signal is specified in the clone() flags argument.

PR:		42457
MFC after:	2 weeks
2004-02-19 06:43:48 +00:00
Nate Lawson
32869e71fb Add support for 'h' and 'hh' modifiers for printf(9).
Submitted by:	Bruno Ducrot <ducrot AT poupinou.org>
Reviewed by:	bde
2004-02-19 05:29:39 +00:00
Colin Percival
3a1bdbf8d1 Don't ignore errors from vfs_allocate_syncvnode.
PR:		kern/18503
Submitted by:	Anatoly Vorobey <mellon@pobox.com>
Approved by:	rwatson (mentor)
2004-02-18 05:20:54 +00:00
Peter Wemm
df7c361e64 Checkpoint a hack to enable running i386 libc_r binaries on a 64 bit
kernel.  I'm not happy with it yet - refinements are to come.
This hack allows the kern.ps_strings and kern.usrstack sysctls to respond
to a 32 bit request, such as those coming from emulated i386 binaries.
2004-02-18 00:54:17 +00:00
David Malone
a1cc6206fb Correct a comment.
Reviewed by:	alfred, tanimura
2004-02-17 12:30:32 +00:00
Dag-Erling Smørgrav
963385cf22 Mechanical whistespace cleanup. 2004-02-17 10:21:03 +00:00
Dag-Erling Smørgrav
44f4b94b38 Don't bother storing a result when all you need are the side effects. 2004-02-16 18:38:46 +00:00
David Malone
a82294d01c In fdcheckstd the descriptor table should never be shared, so just
KASSERT this rather than trying to deal with what happens when file
descriptors change out from under us.
2004-02-15 21:14:48 +00:00
Bruce Evans
72632ef235 Fixed style bugs near previous commit (mainly formatting errors and
missing parentheses).  Use default handling (trap to debugger) for
udev2dev(x, 1) since it is an error and doesn't happen anywhere in
the sys tree except in bogusly commented out code in coda.
2004-02-15 20:14:47 +00:00
Colin Percival
a20e9655b9 Remove opv_desc_vector from vfs_add_vnodeops, since it is defined
and given a value, but never used.  This has no effect on the
resulting binaries, since gcc optimizes the variable away anyway.

PR:		kern/62684
Approved by:	rwatson (mentor)
2004-02-15 17:27:33 +00:00
Poul-Henning Kamp
2a3faf2fbd Split the initialization of the cdevsw into a separate function. 2004-02-15 10:35:33 +00:00
Robert Watson
402d7aa884 Remove excess brackets. 2004-02-15 00:43:22 +00:00
Poul-Henning Kamp
d60d18d491 Use standard style for cdevsw initialization. 2004-02-14 20:03:36 +00:00
Robert Watson
679a106075 By default, don't allow processes in a jail to list the set of
jails in the system.  Previous behavior (allowed) may be restored
by setting security.jail.list_allowed=1.
2004-02-14 19:19:47 +00:00
Robert Watson
7e440242e5 Fix mismerge in last commit: check that cred->cr_prison is NULL
before dereferencing the prison pointer.
2004-02-14 18:52:43 +00:00
Robert Watson
f08df373a3 By default, when a process in jail calls getfsstat(), only return the
data for the file system on which the jail's root vnode is located.
Previous behavior (show data for all mountpoints) can be restored
by setting security.jail.getfsstatroot_only to 0.  Note: this also
has the effect of hiding other mounts inside a jail, such as /dev,
/tmp, and /proc, but errs on the side of leaking less information.
2004-02-14 18:31:11 +00:00
Poul-Henning Kamp
d08c5d0b9b Remove the check which used to protect us against make_dev() being
called until DEVFS had a chance to initialize.  Since DEVFS is mandatory
and things over in that department coincidentally works from without
any initialization now, this is safe.
2004-02-14 17:19:43 +00:00
Brian Feldman
a0ed09c0af T -CURRENT DO NOT CRASH UPON ^T K PLZ THX.
Also, use sched_pctcpu() instead of assuming td->td_kse is non-NULL.
2004-02-14 01:30:06 +00:00
Brian Feldman
f662a93197 Always socantsendmore() before deallocating a socket. This, in turn,
calls selwakeup() if necessary (which it is, if you don't want freed
memory hanging around on your td->td_selq).

Props to:	alfred
2004-02-12 01:48:40 +00:00
Don Lewis
55b5f2a202 When reparenting a process to init, make sure that p_sigparent is
set to SIGCHLD.  This avoids the creation of orphaned Linux-threaded
zombies that init is unable to reap.  This can occur when the parent
process sets its SIGCHLD to SIG_IGN.  Fix a similar situation in the
PT_DETACH code.

Tested by:	"Steven Hartland" <killing AT multiplay.co.uk>
2004-02-11 22:06:02 +00:00
John Baldwin
e7a44cace2 Argh! Fix a bogon. lim_cur() was returning the hard (max) limit rather
than the soft (cur) limit.

Submitted by:	bde
2004-02-11 18:04:13 +00:00
Mike Silbersack
b49d824e8b Add the SF_NODISKIO flag to sendfile. This flag causes sendfile to be
mindful of blocking on disk I/O and instead return EBUSY when such
blocking would occur.

Results from the DeBox project indicate that blocking on disk I/O
can slow the performance of a kqueue/poll based webserver.  Using
a flag such as SF_NODISKIO and throwing connections that would block
to helper processes/threads helped increase performance.

Currently, only the Flash webserver uses this flag, although it could
probably be applied to thttpd with relative ease.

Idea by:	Yaoping Ruan & Vivek Pai
2004-02-08 07:35:48 +00:00
Alan Cox
c5aebf380c swp_pager_async_iodone() no longer requires Giant. Modify bufdone()
and swapgeom_done() to perform swp_pager_async_iodone() without Giant.

Reviewed by:	tegge
2004-02-07 08:54:50 +00:00
John Baldwin
a875f38546 - Convert the plimit lock to a pool mutex lock.
- Hide struct plimit from userland.

Submitted by:	bde (2)
2004-02-06 19:35:14 +00:00
John Baldwin
f4daf05619 - Correct the translation of old rlimit values to properly handle the old
RLIM_INFINITY case for ogetrlimit().
- Use %jd and intmax_t to output negative time in usec in calcru().
- Rework getrusage() to make a copy of the rusage struct into a local
  variable while holding Giant and then do the copyout from the local
  variable to avoid having to have the original process rusage struct
  locked while doing the copyout (which would not be safe).  This also
  includes a few style fixes from Bruce to getrusage().

Submitted by:	bde (1, parts of 3)
Suggested by:	bde (2)
2004-02-06 19:30:12 +00:00
John Baldwin
99b6e02ba6 A few more style fixes from Bruce including a few I missed last time.
Submitted by:	bde
2004-02-06 19:25:34 +00:00
John Baldwin
4c3558aa82 Always set a process' state to normal when it is fully constructed in
fork1() rather than only doing it for the RFSTOPPED case and then having
to fix it up in other places later on.
2004-02-05 21:01:37 +00:00
John Baldwin
b4323d7729 - A lot of style and whitespace fixes.
- Update a few comments regarding locking notes.

Submitted by:	bde (1, mostly)
2004-02-05 20:53:25 +00:00
Jacques Vidrine
b00a3c85da Correct a reference counting bug in shmat(2). If vm_map_find(9)
failed, the reference count for the virtual memory object referenced
by the specified shared memory segment would have been erroneously
incremented.

Reported by:	Joost Pol <joost@pine.nl>
2004-02-05 18:00:35 +00:00
Alexander Kabaev
dec8868dcc Rename cn_unavailable to cnunavailable for little more consistency.
Garbage collect unused cndebug() function.

Suggested by:	bde
2004-02-05 17:35:28 +00:00
Mike Silbersack
b711d74eaf Style fixes: don't indent variable names.
Submitted by:	bde
2004-02-05 08:29:27 +00:00
Alexander Kabaev
e99c09e2dc Eliminate global cons_unavailable flag and replace it by the status
bit maintained on a per-device basis. Single variable is inadequate
on machines running with multiple consoles enabled.
2004-02-05 01:56:43 +00:00
John Baldwin
91d5354a2c Locking for the per-process resource limits structure.
- struct plimit includes a mutex to protect a reference count.  The plimit
  structure is treated similarly to struct ucred in that is is always copy
  on write, so having a reference to a structure is sufficient to read from
  it without needing a further lock.
- The proc lock protects the p_limit pointer and must be held while reading
  limits from a process to keep the limit structure from changing out from
  under you while reading from it.
- Various global limits that are ints are not protected by a lock since
  int writes are atomic on all the archs we support and thus a lock
  wouldn't buy us anything.
- All accesses to individual resource limits from a process are abstracted
  behind a simple lim_rlimit(), lim_max(), and lim_cur() API that return
  either an rlimit, or the current or max individual limit of the specified
  resource from a process.
- dosetrlimit() was renamed to kern_setrlimit() to match existing style of
  other similar syscall helper functions.
- The alpha OSF/1 compat layer no longer calls getrlimit() and setrlimit()
  (it didn't used the stackgap when it should have) but uses lim_rlimit()
  and kern_setrlimit() instead.
- The svr4 compat no longer uses the stackgap for resource limits calls,
  but uses lim_rlimit() and kern_setrlimit() instead.
- The ibcs2 compat no longer uses the stackgap for resource limits.  It
  also no longer uses the stackgap for accessing sysctl's for the
  ibcs2_sysconf() syscall but uses kernel_sysctl() instead.  As a result,
  ibcs2_sysconf() no longer needs Giant.
- The p_rlimit macro no longer exists.

Submitted by:	mtm (mostly, I only did a few cleanups and catchups)
Tested on:	i386
Compiled on:	alpha, amd64
2004-02-04 21:52:57 +00:00
Mike Silbersack
ff5e43a3fd Rename iov_to_uio to uiofromiov to be more consistent with other
uio* functions.

Suggested by:	bde
2004-02-04 08:43:21 +00:00
Pawel Jakub Dawidek
19b0efd32d Allow assert that the current thread does not hold the sx(9) lock.
Reviewed by:		jhb
In cooperation with:	juli, jhb
Approved by:		jhb, scottl (mentor)
2004-02-04 08:14:58 +00:00
Mike Silbersack
2ccbe4b596 Style fixes
Submitted by:	bde
2004-02-04 08:14:47 +00:00
Robert Watson
5e312ddcc6 A variety of further cleanups to ttyinfo():
- Rename temporary variable names ("tmp", "tmp2") to more informative
  names ("load", "pctcpu", "rss", ...)

- Unclutter indentation and return paths: rather than lots of nested
  ifs, simply return earlier if it's not going to work out.  Simplify
  general structure and avoid "deep" code.

- Comment on the thread/process selection and locking.

- Correct handling of "running"/"runnable" states, avoid "unknown"
  that people were seeing for running processes.  This was due to
  a misunderstanding of the more complex state machine / inhibitors
  behavior of KSE.

- Do perform ttyinfo() printing on KSE (P_SA) processes, it seems
  generally to work.

While I initially attempted to formulate this as two commits (one
layout, the other content), I concluded that the layout changes were
really structural changes.

Many elements submitted by:  bde
2004-02-04 05:46:05 +00:00
John Baldwin
3e9ac3ebf2 Remove a bogus assertion.
Noticed by:	bde
Pointy hat to:	jhb
2004-02-03 15:14:27 +00:00
Daniel Eischen
b5426f096b Regen after adding ksem_timedwait(). 2004-02-03 05:11:31 +00:00
Daniel Eischen
aae94fbbb6 Add ksem_timedwait() to complement ksem_wait().
Glanced at by:	alfred
2004-02-03 05:08:32 +00:00
Robert Watson
4f638130c3 Don't dec/inc the amountpipes counter every time we resize a pipe --
instead, just dec/inc in the ctor/dtor.  For now, increment/decrement
in two's, since we're now performing the operation once per pair,
not once per pipe.  Not really any measurable performance change
in my micro-benchmarks, but doing less work is good, especially when
it comes to atomic operations.

Suggested by:	alc
2004-02-03 04:55:24 +00:00
Robert Watson
9a830ddc54 Catch instances of (pipe == NULL) that were obsoleted with recent
changes to jointly allocated pipe pairs.  Replace these checks
with pipe_present checks.  This avoids a NULL pointer dereference
when a pipe is half-closed.

Submitted by:	Peter Edwards <peter.edwards@openet-telecom.com>
2004-02-03 02:50:51 +00:00
John Baldwin
9c9c52a3ed - Assert that witness_cold is not true in enroll().
- Only check witness_watch once in enroll().

Reported by:	ru (2)
2004-02-02 22:15:17 +00:00
Pawel Jakub Dawidek
3410b19324 Fix many issues related to mount/unmount:
1. Root from inside a jail was able to unmount any file system
   (except /).
2. Unprivileged root was able to unmount file systems mounted by
   privileged root (execpt /).
3. User from inside a jail was able to mount file system when
   sysctl vfs.usermount was set to 1.
4. User was able to mount file system when vfs.usermount was set to 1
   (that's ok) and unmount it even if vfs.usermount was equal to 0
   (that's not correct).

Possibility from point 1 was reported by: Dariusz Kowalski <darek@76.pl>

Only a part of this fix will be MFC'ed (if approved).

PR:		kern/60149
Reviewed by:	rwatson
Approved by:	scottl (mentor)
MFC after:	3 days
2004-02-02 19:02:05 +00:00
Mike Silbersack
02ec600572 Remove debugging code that slipped into the previous commit.
Spotted by:	bde
2004-02-02 09:09:59 +00:00
Jeff Roberson
b209e5e3e4 - style fixes to the critical_exit() KASSERT().
Submitted by:	bde
2004-02-02 08:13:27 +00:00
Jeff Roberson
0392e39dff - Allow interactive tasks to use the maximum time-slice. This is not as
detrimental as I thought it would be in the case of massive process
   storms from a shell and it makes regular desktop usage noticeably
   better.
2004-02-01 10:38:13 +00:00
Mike Silbersack
beb699c7ba Rewrite sendfile's header support so that headers are now sent in the first
packet along with data, instead of in their own packet.  When serving files
of size (packetsize - headersize) or smaller, this will result in one less
packet crossing the network.  Quick testing with thttpd and http_load has
shown a noticeable performance improvement in this case (350 vs 330 fetches
per second.)

Included in this commit are two support routines, iov_to_uio, and m_uiotombuf;
these routines are used by sendfile to construct the header mbuf chain that
will be linked to the rest of the data in the socket buffer.
2004-02-01 07:56:44 +00:00
Jeff Roberson
f2f51f8ab8 - Disable ithread binding in all cases for now. This doesn't make as much
sense with sched_4bsd as it does with sched_ule.
 - Use P_NOLOAD instead of the absence of td->td_ithd to determine whether or
   not a thread should be accounted for in sched_tdcnt.
2004-02-01 06:20:18 +00:00
Robert Watson
4795b82c13 Coalesce pipe allocations and frees. Previously, the pipe code
would allocate two 'struct pipe's from the pipe zone, and malloc a
mutex.

- Create a new "struct pipepair" object holding the two 'struct
  pipe' instances, struct mutex, and struct label reference.  Pipe
  structures now have a back-pointer to the pipe pair, and a
  'pipe_present' flag to indicate whether the half has been
  closed.

- Perform mutex init/destroy in zone init/destroy, avoiding
  reallocating the mutex for each pipe.  Perform most pipe structure
  setup in zone constructor.

- VM memory mappings for pageable buffers are still done outside of
  the UMA zone.

- Change MAC API to speak 'struct pipepair' instead of 'struct pipe',
  update many policies.  MAC labels are also handled outside of the
  UMA zone for now.  Label-only policy modules don't have to be
  recompiled, but if a module is recompiled, its pipe entry points
  will need to be updated.  If a module actually reached into the
  pipe structures (unlikely), that would also need to be modified.

These changes substantially simplify failure handling in the pipe
code as there are many fewer possible failure modes.

On half-close, pipes no longer free the 'struct pipe' for the closed
half until a full-close takes place.  However, VM mapped buffers
are still released on half-close.

Some code refactoring is now possible to clean up some of the back
references, etc; this patch attempts not to change the structure
of most of the pipe implementation, only allocation/free code
paths, so as to avoid introducing bugs (hopefully).

This cuts about 8%-9% off the cost of sequential pipe allocation
and free in system call tests on UP and SMP in my micro-benchmarks.
May or may not make a difference in macro-benchmarks, but doing
less work is good.

Reviewed by:	juli, tjr
Testing help:	dwhite, fenestro, scottl, et al
2004-02-01 05:56:51 +00:00
Jeff Roberson
40ece05382 - Revert rev 1.240 we no longer need a kthread for loadav(). 2004-02-01 05:37:36 +00:00
Jeff Roberson
e7f004fe23 - Use sched_load() rather than grabbing the sx lock and traversing the proc
table to discover the load.
2004-02-01 02:51:33 +00:00
Jeff Roberson
33916c360e - Add a new member to struct kseq called ksq_sysload. This is intended to
track the load for the sched_load() function.  In the SMP case this member
   is not defined because it would be redundant with the ksg_load member
   which already tracks the non ithd load.
 - For sched_load() in the UP case simply return ksq_sysload.  In the SMP
   case traverse the list of kseq groups and sum up their ksg_load fields.
2004-02-01 02:48:36 +00:00
Jeff Roberson
ca59f15272 - Keep a variable 'sched_tdcnt' that is used for the local implementation
of sched_load().  This variable tracks the number of running and runnable
   non ithd threads.  This removes the need to traverse the proc table and
   discover how many threads are runnable.
2004-02-01 02:46:47 +00:00
Robert Watson
fca542bcaa Move KASSERT regarding td_critnest to after the value of td is set to
curthread, to avoid warning and incorrect behavior.

Hoped not to mind:	jeff
2004-02-01 02:31:36 +00:00
Jeff Roberson
6767c6547b - Assert that td_critnest > 0 in critical_exit() to catch cases of
unbalanced uses of the critical_* api.
2004-02-01 01:24:54 +00:00
Robert Watson
26518e8d8c Fix an error in a KASSERT string: it's pipe_free_kmem(), not
pipespace(), that contains this KASSERT.
2004-01-31 23:03:22 +00:00
Poul-Henning Kamp
be8a62e821 Introduce the SO_BINTIME option which takes a high-resolution timestamp
at packet arrival.

For benchmarking purposes SO_BINTIME is preferable to SO_TIMEVAL
since it has higher resolution and lower overhead.  Simultaneous
use of the two options is possible and they will return consistent
timestamps.

This introduces an extra test and a function call for SO_TIMEVAL, but I have
not been able to measure that.
2004-01-31 10:40:25 +00:00
Robert Watson
30a9f26db2 Assert process lock in ptracestop(), since we're going to rely
on it, and later unlock it.
2004-01-29 00:58:21 +00:00
Robert Watson
94ffb20d72 Add a reset sysctl for mutex profiling: zeros all of the mutex
profiling buffers and hash table.  This makes it a lot easier to
do multiple profiling runs without rebooting or performing
gratuitous arithmetic.  Sysctl is named debug.mutex.prof.reset.

Reviewed by:	jake
2004-01-28 22:11:53 +00:00
John Baldwin
d5b75694e7 Move the loadav() callout into its own kthread since it uses allproc_lock
which is a sleepable lock and thus is not safe to acquire from a callout
routine.
2004-01-28 20:44:41 +00:00
John Baldwin
8d768e7676 Rework witness_lock() to make it slightly more useful and flexible.
- witness_lock() is split into two pieces: witness_checkorder() and
  witness_lock().  Witness_checkorder() determines if acquiring a specified
  lock at the time it is called would result in a lock order.  It
  optionally adds a new lock order relationship as well.  witness_lock()
  updates witness's data structures to assume that a lock has been acquired
  by stick a new lock instance in the appropriate lock instance list.
- The mutex and sx lock functions now call checkorder() prior to trying to
  acquire a lock and continue to call witness_lock() after the acquire is
  completed.  This will let witness catch a deadlock before it happens
  rather than trying to do so after the threads have deadlocked (i.e. never
  actually report it).
- A new function witness_defineorder() has been added that adds a lock
  order between two locks at runtime without having to acquire the locks.
  If the lock order cannot be added it will return an error.  This function
  is available to programmers via the WITNESS_DEFINEORDER() macro which
  accepts either two mutexes or two sx locks as its arguments.
- A few simple wrapper macros were added to allow developers to call
  witness_checkorder() anywhere as a way of enforcing locking assertions
  in code that might acquire a certain lock in some situations.  The
  macros are: witness_check_{mutex,shared_sx,exclusive_sx} and take an
  appropriate lock as the sole argument.
- The code to remove a lock instance from a lock list in witness_unlock()
  was unnested by using a goto to vastly improve the readability of this
  function.
2004-01-28 20:39:57 +00:00
John Baldwin
62a0fd943c Use mtx_assert() rather than using a home-rolled version. 2004-01-28 20:26:39 +00:00
Alexander Kabaev
975634280a Move the part of the comment which applies to osigsuspend where
it belongs. The current sigsuspend syscall does expect a pointer
to the mask as argument.

Submitted by:	Igor Sysoev <is at rambler-co dot ru>
2004-01-28 06:06:04 +00:00
Dag-Erling Smørgrav
84344f9fbf Rename the kern.vm.kmem.size tunable to the more logical vm.kmem_size. To
assure backward compatibility (conditional on !BURN_BRIDGES), look it up
by its old name first, and log a warning (but accept the setting) if it
was found.  If both the old and new name are defined, the new name takes
precedence.

Also export vm.kmem_size as a read-only sysctl variable; I find it hard to
tune a parameter when I don't know its default value, especially when that
default value is computed at boot time.
2004-01-27 15:59:38 +00:00
Robert Watson
6bea667f63 When aborting fork() due to a failure, if using MAC, make sure to clean
up the p_label field.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, McAfee Research
2004-01-25 18:42:18 +00:00
Ruslan Ermilov
33fe8fd0df Register the uart(4)'s spin lock with witness(4). 2004-01-25 15:04:37 +00:00
Jeff Roberson
c77ac1fdee - sched_strict has been dead for a long time now. Get rid of it. 2004-01-25 08:58:14 +00:00
Jeff Roberson
c494ddc8a1 - Clean up KASSERTS. 2004-01-25 08:57:38 +00:00
Jeff Roberson
5a2b158d8d - Correct function names listed in KASSERTs. These were copied from other
code and it was sloppy of me not to adjust these sooner.
2004-01-25 08:21:46 +00:00
Jeff Roberson
e17c57b14b - Implement cpu pinning and binding. This is acomplished by keeping a per-
cpu run queue that is only used for pinned or bound threads.

Submitted by:	Chris Bradfield <chrisb@ation.org>
2004-01-25 08:00:04 +00:00
Jeff Roberson
d1605f0ac9 - Use a unique string for the sched_setup SYSINIT and rename sched_setup to
synch_setup.  The schedulers use the sched_setup function name.
2004-01-25 07:49:45 +00:00
Jeff Roberson
29bcc4514f - Add a flags parameter to mi_switch. The value of flags may be SW_VOL or
SW_INVOL.  Assert that one of these is set in mi_switch() and propery
   adjust the rusage statistics.  This is to simplify the large number of
   users of this interface which were previously all required to adjust the
   proper counter prior to calling mi_switch().  This also facilitates more
   switch and locking optimizations.
 - Change all callers of mi_switch() to pass the appropriate paramter and
   remove direct references to the process statistics.
2004-01-25 03:54:52 +00:00
Robert Watson
8dc10be885 Add some basic support for measuring sleep mutex contention to the
mutex profiling code.  As with existing mutex profiling, measurement
is done with respect to mtx_lock() instances in the code, as opposed
to specific mutexes.  In particular, measure two things:

(1) Lock contention.  How often did this mtx_lock() call get made and
    have to sleep (or almost sleep) waiting for the lock.  This helps
    identify the "victims" of contention.

(2) Hold contention.  How often, while the lock was held by a thread
    as a result of this mtx_lock(), did another thread try to acquire
    the same mutex.  This helps identify the causes of contention.

I'm currently exploring adding measurement of "time waited for the
lock", but the current implementation has proven useful to me so far
so I figured I'd commit it so others could try it out.  Note that this
increases the size of mutexes when MUTEX_PROFILING is enabled, so you
might find you need to further bump UMA_BOOT_PAGES.  Fixes welcome.

The once over:	des, others
2004-01-25 01:59:27 +00:00
Poul-Henning Kamp
551260fc36 Deal with MOD_FREQUENCY before MOD_OFFSET because the latter is the
one which runs the actual update.  This fixes a bug where there were
a delay in applying the frequency adjustment.  In extreme cases this
could result in marginal stability of the kernel-pll.
2004-01-24 21:48:43 +00:00
Jeff Roberson
b9509b56fa - Move smp_topology to subr_smp.c so that it is defined on all architectures. 2004-01-24 19:52:48 +00:00
Robert Watson
646e29ccac Don't grab Giant in crfree(), since prison_free() no longer requires it.
The uidinfo code appears to be MPSAFE, and is referenced without Giant
elsewhere.  While this grab of Giant was only made in fairly rare
circumstances (actually GC'ing on refcount==0), grabbing Giant here
potentially introduces lock order issues with any locks held by the
caller.  So this probably won't help performance much unless you change
credentials a lot in an application, and leave a lot of file descriptors
and cached credentials around.  However, it simplifies locking down
consumers of the credential interfaces.

Bumped into by:	sam
Appeased:	tjr
2004-01-23 21:07:52 +00:00
Robert Watson
b3059e09f6 Defer the vrele() on a jail's root vnode reference from prison_free()
to a new prison_complete() task run by a task queue.  This removes
a requirement for grabbing Giant in crfree().  Embed the 'struct task'
in 'struct prison' so that we don't have to allocate memory from
prison_free() (which means we also defer the FREE()).

With this change, I believe grabbing Giant from crfree() can now be
removed, but need to check the uidinfo code paths.

To avoid header pollution, move the definition of 'struct task'
to _task.h, and recursively include from taskqueue.h and jail.h; much
preferably to all files including jail.h picking up a requirement to
include taskqueue.h.

Bumped into by:	sam
Reviewed by:	bde, tjr
2004-01-23 20:44:26 +00:00
Poul-Henning Kamp
ee57aeea65 Write 100 times for tomorrow:
"Always print time_t as %jd, you never know what width it has"
2004-01-22 19:50:06 +00:00
Ralf S. Engelschall
446655ac4f Fix generation of random multicast MAC address.
In case no real/physical IEEE 802 address is available, both the expired
"draft-leach-uuids-guids-01" (section "4. Node IDs when no IEEE 802
network card is available") and RFC 2518 (section "6.4.1 Node Field
Generation Without the IEEE 802 Address") recommend (quoted from RFC
2518):

  "The ideal solution is to obtain a 47 bit cryptographic quality random
  number, and use it as the low 47 bits of the node ID, with the _most_
  significant bit of the first octet of the node ID set to 1. This bit
  is the unicast/multicast bit, which will never be set in IEEE 802
  addresses obtained from network cards; hence, there can never be a
  conflict between UUIDs generated by machines with and without network
  cards."

Unfortunately, this incorrectly explains how to implement this and
the FreeBSD UUID generator code inherited this generation bug from
the broken reference code in the standards draft. They should instead
specify the "_least_ significant bit of the first octet of the node ID"
as the multicast bit in a memory and hexadecimal string representation
of a 48-bit IEEE 802 MAC address.

This standards bug arised from a false interpretation, as the multicast
bit is actually the _most_ significant bit in IEEE 802.3 (Ethernet)
_transmission order_ of an IEEE 802 MAC address. The standards authors
forgot that the bitwise order of an _octet_ from a MAC address _memory_
and hexadecimal string representation is still always from left (MSB,
bit 7) to right (LSB, bit 0).

Fortunately, this UUID generation bug could have occurred on systems
without any Ethernet NICs only.
2004-01-22 13:34:11 +00:00
Poul-Henning Kamp
4e74721cac Add a sysctl (default: off) which enables a log(LOG_INFO...) warning
if the clock is stepped.
2004-01-21 21:05:40 +00:00
Robert Watson
679365e7b9 Reduce gratuitous includes: don't include jail.h if it's not needed.
Presumably, at some point, you had to include jail.h if you included
proc.h, but that is no longer required.

Result of:	self injury involving adding something to struct prison
2004-01-21 17:10:47 +00:00
Andrey A. Chernov
9bbee25931 pread/pwrite:
follow lseek spirit - return EINVAL on negative offset for non-VCHR
2004-01-20 01:27:42 +00:00
Poul-Henning Kamp
50d23be140 Add linenumber and source filename to panic(9) output.
Ideally a traceback should be printed too, any takers ?
2004-01-19 21:27:11 +00:00
Alexander Kabaev
54556cc7b8 One more instance of magic number used in place of IO_SEQSHIFT.
Submitted by:	alc
2004-01-19 20:45:43 +00:00
Ruslan Ermilov
0541040c46 Since "m" is not part of the "mp" chain, need to free() it.
Reported by:	Stanford Metacompilation research group
2004-01-18 14:02:53 +00:00
Andrew Gallatin
1c318b9665 Handle sf_buf_alloc() returning null. This can happen if the
process takes a signal while waiting for an sf_buf to become available.

Reviewed by: alc
2004-01-17 21:16:51 +00:00
Dag-Erling Smørgrav
a6d4491c71 Restore correct semantics for F_DUPFD fcntl. This should fix the errors
people have been getting with configure scripts.
2004-01-17 00:59:04 +00:00
Dag-Erling Smørgrav
56a9fc0e93 WITNESS won't let us hold two filedesc locks at the same time, so juggle
fdp and newfdp around a bit.
2004-01-16 21:54:56 +00:00
Robert Watson
bafc8f255a KASSERT() that initproc->p_pid is 1. Very bad things happen if init's
pid isn't 1, and it can actually occur if kthread_create() is called
before SUB_SI_CREATE_INIT without RFHIGHPID.

Discussed with:	jhb
2004-01-16 20:29:23 +00:00
Dag-Erling Smørgrav
ddce426f69 Remove two KASSERTs which were overly paranoid. 2004-01-16 08:45:56 +00:00
Dag-Erling Smørgrav
12d568c2b1 Take care to drop locks when calling malloc() 2004-01-15 18:50:11 +00:00
Dag-Erling Smørgrav
a2fe44e8cf New file descriptor allocation code, derived from similar code introduced
in OpenBSD by Niels Provos.  The patch introduces a bitmap of allocated
file descriptors which is used to locate available descriptors when a new
one is needed.  It also moves the task of growing the file descriptor table
out of fdalloc(), reducing complexity in both fdalloc() and do_dup().

Debts of gratitude are owed to tjr@ (who provided the original patch on
which this work is based), grog@ (for the gdb(4) man page) and rwatson@
(for assistance with pxeboot(8)).
2004-01-15 10:15:04 +00:00
Don Lewis
288e351b55 If a device attach routine fails during boot and calls bus_teardown_intr(),
ithread_remove_handler() may fail to remove the interrupt handler if
it decides to let the ithread do the removal.  The problem is that during
boot "cold" is set, which causes msleep() to return immediately.  This
will cause ithread_remove_handler() to fail to wait for the ithread
to do the removal from the handler TAILQ before freeing the handler
back to the heap.  Bad things will happen when some other user of the
TAILQ, such as ithread_add_handler() or the actual ithread attempts to use
the freed handler.  Fix the problem by forcing ithread_remove_handler()
to do the actual removal itself if the "cold" flag is set.

Reviewed by:	jhb
2004-01-13 22:55:46 +00:00
Dag-Erling Smørgrav
ac34dc4e79 Back out 1.160, which was committed by mistake. 2004-01-11 20:08:57 +00:00
Dag-Erling Smørgrav
d7a1c7e34b Back out 1.166, which was committed by mistake. 2004-01-11 20:07:15 +00:00
Dag-Erling Smørgrav
f1ea6d813d Mechanical whitespace cleanup + other minor style nits. 2004-01-11 19:56:42 +00:00
Dag-Erling Smørgrav
0e5dfade00 Mechanical whitespace cleanup. 2004-01-11 19:54:45 +00:00
Dag-Erling Smørgrav
05c3c5c8b6 Mechanical whitespace cleanup; parenthesize return values; other minor
style nits.  The #ifdefs in this file give me a headache...
2004-01-11 19:52:10 +00:00
Dag-Erling Smørgrav
e5aeaa0c67 Mechanical whitespace cleanup; parenthesize return values; other minor
style nits.
2004-01-11 19:48:19 +00:00
Dag-Erling Smørgrav
012b5531f4 Mechanical whitespace cleanup + minor style nits. 2004-01-11 19:43:14 +00:00
Dag-Erling Smørgrav
c9de31f55f Mechanical whitespace cleanup. 2004-01-11 19:39:14 +00:00
Alan Cox
0e88a71798 Remove long dead code, specifically, code related to munmapfd().
(See also vm/vm_mmap.c revision 1.173.)
2004-01-11 06:59:21 +00:00
Robert Watson
def055686c When not creating a core dump due to resource limits specifying
a maximum dump size of 0, return a size-related error, rather
than returning success.  Otherwise, waitpid() will incorrectly
return a status indicating that a core dump was created.  Note
that the specific error doesn't actually matter, since it's lost.

MFC after:	2 weeks
PR:		60367
Submitted by:	Valentin Nechayev <netch@netch.kiev.ua>
2004-01-11 02:28:06 +00:00
Jens Schweikhardt
85495c72ff s/Muliple/Multiple
Removed whitespace at EOL and EOF.
2004-01-10 18:34:01 +00:00
Dag-Erling Smørgrav
d41457da80 More unparenthesized return values. 2004-01-10 17:14:53 +00:00
Dag-Erling Smørgrav
b91a599717 Style: parenthesize return values. 2004-01-10 13:03:43 +00:00
Don Lewis
2b77864f1e Add a somewhat redundant check on the len arguement to getsockaddr() to
avoid relying on the minimum memory allocation size to avoid problems.
The check is somewhat redundant because the consumers of the returned
structure will check that sa_len is a protocol-specific larger size.

Submitted by:	Matthew Dillon <dillon@apollo.backplane.com>
Reviewed by:	nectar
MFC after:	30 days
2004-01-10 08:28:54 +00:00
Olivier Houchard
5cded90454 Prevent a race condition between fork1() and whatever changes the pgrp by
setting the new process' p_pgrp again before inserting it in the p_pglist.
Without it we can get the new process to be inserted in a different p_pglist
than the one p2->p_pgrp points to, and this is not something we want to happen.
This is not a fix, merely a bandaid, but it will work until someone finds a
better way to do it.

Discussed with: 	jhb (a long time ago)
2004-01-09 23:42:36 +00:00
Robert Watson
07eacae0d2 Improve the expressiveness of ttyinfo (^T) when dealing with threads
in slightly less usual states:

  If the thread is on a run queue, display "running" if the thread is
  actually running, otherwise, "runnable".

  If the thread is sleeping, and it's on a sleep queue, display the
  name of the queue, otherwise "unknown" -- previously, in this situation
  we would display "iowait".

  If the thread is waiting on a lock, display *lockname.

  If the thread is suspended, display "suspended" -- previously, in
  this situation we would display "iowait".

  If the thread is waiting for an interrupt, display "intrwait" --
  previously, in this situation we would display "iowait".

  If the thread is in a state not handled by the above, display
  "unknown" -- previously, we would print "iowait".

Among other things, this avoids displaying "iowait" when the foreground
process turns out to be suspended waiting for a debugger to properly
attach.
2004-01-08 22:49:23 +00:00
Robert Watson
047aa39b25 Drop the sigacts mutex around calls to stopevent() to avoid sleeping
holding the mutex.  Because the sigacts pointer can't change while
the process is "live" (proc locking (x)), we know our pointer is still
valid.

In communication with:	truckman
Reviewed by:		jhb
2004-01-08 22:44:54 +00:00
Alexander Kabaev
c969c60c60 Add pid to the info printed in lockmgr_printinfo. This makes VFS
diagnostic messages slightly more useful.
2004-01-06 04:34:13 +00:00
Alexander Kabaev
580ddfa64b More style fixes.
Obtained from:	bde
2004-01-05 23:40:46 +00:00
John Baldwin
eac097962f - Allow mtx_trylock() to recurse on a recursive mutex. Attempts to recurse
on a non-recursive mutex will fail but will not trigger any assertions.
- Add an assertion to mtx_lock() that one never recurses on a non-recursive
  mutex.  This is mostly useful for the non-WITNESS case.

Requested by:	deischen, julian, others (1)
2004-01-05 23:09:51 +00:00
Alexander Kabaev
b0fdf71656 style(9):
Add empty line before first code line in functions with no local
variables.
Properly terminate comment sentences.
Indent lines which are longer that 80 characters.
Move v_addpollinfo closer to the rest of poll-related functions.
Move DEBUG_VFS_LOCKS ifdefed block to the end of file.

Obtained from:	bde (partly)
2004-01-05 19:04:29 +00:00
Alexander Kabaev
3ff1b7c23f Cosmetics: strip '\n' from a string passed to Debugger(). 2004-01-04 03:42:20 +00:00
David Xu
a30ec4b99c Make sigaltstack as per-threaded, because per-process sigaltstack state
is useless for threaded programs, multiple threads can not share same
stack.
The alternative signal stack is private for thread, no lock is needed,
the orignal P_ALTSTACK is now moved into td_pflags and renamed to
TDP_ALTSTACK.
For single thread or Linux clone() based threaded program, there is no
semantic changed, because those programs only have one kernel thread
in every process.

Reviewed by: deischen, dfr
2004-01-03 02:02:26 +00:00
Nate Lawson
44bb5f52d3 Move the kernel power change printf under bootverbose since the
power_profile script now duplicates the message via syslog.
2004-01-02 18:24:13 +00:00
Sam Leffler
4f9f9cf3a4 m_tag fixups in preparation for heavier use:
o promote several m_tag_* routines to inline
o add an m_tag_setup inline to set the fixed fields in a packet tag
o add an m_tag_free method pointer to each mtag to support, for example,
  allocating tags from zones
o have m_tag_find check if the tag list is not empty before calling
  m_tag_locate to search

Reviewed by:	brooks, silence from others
2004-01-02 17:27:39 +00:00
David Malone
70ad6c2190 Plug a leak of open files that happens when you exec a suid program
with one of std{in,out,err} open. This helps with the file descriptor
leaks reported on -current. This should probably be merged into 5.2.

Reviewed by:	ru
Tested by:	Bjoern A. Zeeb <bzeeb-lists@lists.zabbadoz.net>
2003-12-28 19:27:14 +00:00
Bruce Evans
9efe7d9d83 v_vxproc was a bogus name for a thread (pointer). 2003-12-28 09:12:56 +00:00
Mike Silbersack
ddeb5b242e Track three new sendfile-related statistics:
- The number of times sendfile had to do disk I/O
- The number of times sfbuf allocation failed
- The number of times sfbuf allocation had to wait
2003-12-28 08:57:09 +00:00
Bruce Evans
d6c847f378 Fixed some style bugs (mainly, try to always use explicit comparisons with
NULL when checking for null pointers).
2003-12-28 04:37:59 +00:00
Bruce Evans
ca46e90ef4 Fixed some disordering in revs.1.194 and 1,196. Moved the exceve() syscall
function back to near the beginning of the file.  Rev.1.194 moved it into
the middle of auxiliary functions following kern_execve().  Moved the
__mac_execve() syscall function up together with execve().  It was new in
rev1.1.196 and perfectly misplaced after execve().
2003-12-28 04:18:13 +00:00
Mike Silbersack
69fba1650a Fix the maxpipekva warning message so that it points to the correct
sysctl, and shorten the message.

Noticed by:	bde
2003-12-28 01:19:58 +00:00
Alan Cox
34d2675761 Remove GIANT_REQUIRED from exec_unmap_first_page(). 2003-12-27 19:40:03 +00:00
Mike Silbersack
5eda9873e9 Track current and peak sfbuf usage, export the values via sysctl. 2003-12-27 07:52:47 +00:00
John Baldwin
c55bbb6cb7 Create a separate kthread that executes sched_cpu() once a second. Because
sched_cpu() locks an sx lock (allproc_lock) which can sleep if it fails to
acquire the lock, it is not safe to execute this in a callout handler from
softclock().
2003-12-26 17:07:29 +00:00
Alfred Perlstein
866e3b7e73 Put restrict back in, the compilation failure was my fault when I
did a bad merge from the PR.

Thanks to Bruce Evans for explaining.
2003-12-26 05:58:16 +00:00
Alfred Perlstein
4abb4ff34d Add __restrict qualifiers to copyinfrom, copyinstrfrom, copystr, copyinstr,
copyin and copyout.
2003-12-26 05:54:35 +00:00
David Malone
9322078275 In socket(2) we only need Giant around the call to socreate, so just
grab it there.
2003-12-25 23:44:38 +00:00
David Malone
1c58509c25 Don't TAILQ_INIT kq_head twice, once is enough. 2003-12-25 23:42:36 +00:00
Mike Silbersack
8dee2f6746 Fix another 0 / NULL mixup. 2003-12-25 01:17:27 +00:00
Alfred Perlstein
6502da1307 We're not ready for restrict qualifiers here. 2003-12-24 19:09:45 +00:00
Alfred Perlstein
9f144cff85 Add restrict qualifiers.
PR: 44394
Submitted by: Craig Rodrigues <rodrige@attbi.com>
2003-12-24 18:47:43 +00:00
Robert Watson
69546b2fbb Document that when we are addressing an open()/close() race, the reason
we call vn_close() manually rather than letting fdrop() take care of it
is that we haven't yet hooked up the various 'struct file' fields.
2003-12-24 17:13:01 +00:00
Alfred Perlstein
1805ed0772 Introduce mp_maxcpus which can be used by libkvm utils to find out
how many CPUs the system was compiled for.
Export the variable via a sysctl node 'kern.smp.maxcpus' as well.
2003-12-23 13:54:16 +00:00
Peter Wemm
2c74309622 Regen - this should be essentially a NOP, except for rcsid changes. 2003-12-23 03:52:14 +00:00
Peter Wemm
eec525a435 Remove namespc column and attempt to un-fold some of the longer lines
that now fit.
2003-12-23 03:51:36 +00:00
Peter Wemm
1a58b07149 Remove the namespace column from the syscalls tables. We don't actually
use it, if we ever did.  They have been been VERY poorly maintained for
some time, possibly because they were a NOP.  FWIW, This brings our table
formats back closer to the other *BSD's.
2003-12-23 03:50:43 +00:00
Peter Wemm
9b68618df0 Add an additional field to the elf brandinfo structure to support
quicker exec-time replacement of the elf interpreter on an emulation
environment where an entire /compat/* tree isn't really warranted.
2003-12-23 02:42:39 +00:00
Peter Wemm
a89ec05e3e Catch a few places where NULL (pointer) was used where 0 (integer) was
expected.
2003-12-23 02:36:43 +00:00
Peter Wemm
55cdddc0d8 Don't use NULL (pointer) when we mean 0 (integer) for the number of ticks
in msleep.
2003-12-23 02:28:42 +00:00
Jeff Roberson
249e0bea8f - Make our transfer decisions based on load and not transferable load. A
cpu could have been bogged down with non-transferable load and still not
   migrated a new thread to an idle cpu.  This required some benchmarking and
   tuning to get right as the comment above it suggests.
2003-12-20 22:35:20 +00:00
Jeff Roberson
e7a976f415 - Enable ithread migration on x86. This is done to work around a bug in the
IO APIC on Xeons that prevents round-robin interrupt assignment from
   working.
2003-12-20 20:36:19 +00:00
Alan Cox
96a7b42213 Remove a variable that has been initialized but otherwise unused since
revision 1.315.
2003-12-20 19:46:21 +00:00
Jeff Roberson
670c524f08 - In kseq_transfer() return if smp has not been started.
- In sched_add(), do the idle check prior to the transfer check so that we
   don't try to transfer load from an idle cpu.  This fixes panics caused by
   IPIs on UP machines running SMP kernels.

Reported/Debugged by:	seanc
2003-12-20 14:03:14 +00:00
Jeff Roberson
9b5f6f623d - Running interactive tasks with the minimum time-slice is fine for vi and
sh, but not so great for mozilla, X, etc.  Add a fixed define for the slice
   size granted to interactive KSEs.
2003-12-20 12:54:35 +00:00
Tim J. Robbins
f5925b7436 Reduce the overhead of semop() by using the kernel stack instead of
malloc'd memory to store the operations array if it is small enough
to fit.
2003-12-19 13:07:17 +00:00
John Baldwin
eb5b0e0565 Various style fixes.
Submitted by:	bde (mostly, if not all)
2003-12-17 21:13:04 +00:00
Jeff Roberson
958557e9c7 - In vget() if LK_NOWAIT is specified we should return EBUSY and not ENOENT.
Submitted by:	Stephan Uphoff <ups@stups.com>
2003-12-16 17:08:27 +00:00
Jeff Roberson
d85213669b - When doing a forced unmount, VFS attempts to keep VCHR vnodes valid by
reassigning their v_ops field to specfs, detaching from the mountpoint, etc.
   However, this is not sufficient.  If we vclean() the vnode the pages owned
   by the vnode are lost, potentially while buffers reference them.  Implement
   parts of vclean() seperately in vgonechrl() so that the pages and bufs
   associated with a device vnode are not destroyed while in use.
2003-12-16 17:05:05 +00:00
Bruce M Simpson
5406529771 style(9) pass and type fixups.
Submitted by:	bde
2003-12-16 14:13:47 +00:00
Bruce M Simpson
37621fd5d9 Push m_apply() and m_getptr() up into the colleciton of standard mbuf
routines, and purge them from opencrypto.

Reviewed by:	sam
Obtained from:	NetBSD
Sponsored by:	spc.org
2003-12-15 21:49:41 +00:00
Jeff Roberson
86e1c22aa4 - Assign the ke_cpu field in kseq_notify() so that all of our callers do not
have to do it.
 - Set the ke_runq to NULL in sched_add() before calling kseq_notify().
   Otherwise we may panic in sched_add() if INVARIANTS is on.
2003-12-14 02:06:29 +00:00
Robert Watson
09a4a69c1d Although sometimes to the uninitiated, it may seem like goup, KSEGOUP
is actually spelt KSEGROUP.  Go figure.

Reported by:	samy@kerneled.com
2003-12-12 21:25:56 +00:00
Jeff Roberson
cac77d0422 - Now that we have kseq groups, balance them seperately.
- The new sched_balance_groups() function does intra-group balancing while
   sched_balance() balances the available groups.
 - Pick a random time between 0 ticks and hz * 2 ticks to restart each
   balancing process.  Each balancer has its own timeout.
 - Pick a random place in the list of groups to start the search for lowest
   and highest group loads.  This prevents us from prefering a group based on
   numeric position.
 - Use a nasty hack to stop us from preferring cpu 0.  The problem is that
   softclock always runs on cpu 0, so it always has a little extra load.  We
   ignore this load in the balancer for now.  In the future softclock should
   run on a random cpu and these hacks can go away.
2003-12-12 07:33:51 +00:00
Jeff Roberson
2e227f0406 - Don't let the pctcpu rate limiter throttle us if we have recorded over
SCHED_CPU_TICKS ticks.  This was allowing processes to display
   (1/SCHED_CPU_TIME * 100) % more cpu than they had used.
2003-12-11 04:23:39 +00:00
Jeff Roberson
b11fdad0fc - In sched_switch(), if a thread has been assigned, don't touch the runqueues
or load.  These things have already been taken care of in sched_bind()
   which should be the only place that we're switching in an assigned thread.
2003-12-11 04:00:49 +00:00
Jeff Roberson
80f86c9f88 - Add support for CPU groups to ule. All SMT cores on the same physical
cpu are added to a group.
 - Don't place a cpu into the kseq_idle bitmask until all cpus in that group
   have idled.
 - Prefer idle groups over idle group members in the new kseq_transfer()
   function.  In this way we will prefer to balance load across full cores
   rather than add further load a partial core.
 - Before a cpu goes idle, check the other group members for threads.  Since
   SMT cpus may freely share threads, this is cheap.
 - SMT cores may be individually pinned and bound to now.  This contrasts the
   old mechanism where binding or pinning would have allowed a thread to run
   on any available cpu.
 - Remove some unnecessary logic from sched_switch().  Priority propagation
   should be properly taken care of in sched_prio() now.
2003-12-11 03:57:10 +00:00
Peter Wemm
5be4b10c89 Regen 2003-12-10 22:18:54 +00:00
Peter Wemm
5352eb6bb1 Update file locations for syscall tables to copy to. 2003-12-10 22:08:37 +00:00
Marcel Moolenaar
ccb46feb8e Write the thread pointer (val) in the kse mailbox (loc) before we
set the new context in kse_switchin(2). This allows us to return
an error to the calling context when the suword() fails.
2003-12-10 01:59:23 +00:00
John Baldwin
67ba867827 Adjust an assertion for the TDF_TSNOBLOCK race handling in
turnstile_unpend().  A racing thread that does not have TDI_LOCK set may
either be running on another CPU or it may be sitting on a run queue if it
was preempted during the very small window in turnstile_wait() between
unlocking the turnstile chain lock and locking sched_lock.
2003-12-09 21:14:31 +00:00
John Baldwin
da1d503b22 Assert that the we never give a thread a NULL turnstile when waking it up. 2003-12-09 21:09:54 +00:00
John Baldwin
6b6bd95ee5 Revert the previous race fix and replace it with a more general fix. The
case of a turnstile having no threads is just one instance of the more
general case where the thread we are examining has been partially awakened
already in that it has been removed from the turnstile's blocked list but
still has TDI_LOCK set.  We detect that case by checking to see if the
thread has already had a turnstile reassigned to it.
2003-12-09 21:09:04 +00:00
David Xu
a9a48d6862 Lock and unlock sched_lock when walking through thread list, current we
insert kse upcall thread into thread list at mi_switch time, process lock
is not enough.
2003-12-07 23:47:15 +00:00
Don Lewis
50105bcf1a Pass MTX_DEF as the last argument to mtx_init() instead of 0. This
is not a functional change.  The code happened to work properly only
because MTX_DEF is defined as 0.
2003-12-07 21:53:41 +00:00
Poul-Henning Kamp
377e7be416 Make the DIAGNOSTIC code which complains about long {call|time}out(9)
functions less noisy:  We printf if a new function took longer than
the previous record holder, or of the previous record holder took
more than twice as long as the current record.
2003-12-07 20:03:28 +00:00
Marcel Moolenaar
cfa4b1e7b1 Regen due to kse_switchin(2). 2003-12-07 19:36:16 +00:00
Marcel Moolenaar
702b2a179c Add kse_switchin(2). This syscall can be used by KSE implementations
to have the kernel switch to a new thread, instead of doing it in
userland. It is in fact needed on ia64 where syscall restarts do not
return to userland first. It's completely handled inside the kernel.
As such, any context created by the kernel as part of an upcall and
caused by some syscall needs to be restored by the kernel.
2003-12-07 19:34:29 +00:00
Peter Wemm
a2640c9ba9 rqb_bits[] may be an int64_t (eg: on alpha, and recently on amd64).
Be sure to shift (long)1 << 33 and higher, not (int)1.  Otherwise bad
things happen(TM).  This is why beast.freebsd.org paniced with ULE.

Reviewed by:  jeff
2003-12-07 09:57:51 +00:00
Scott Long
774114995e Re-arrange and consolidate some random debugging stuff 2003-12-07 05:04:49 +00:00
Alan Cox
bca62663ab - Giant is no longer required by vm_thread_new(). 2003-12-07 04:16:49 +00:00
Robert Watson
56d9e93207 Rename mac_create_cred() MAC Framework entry point to mac_copy_cred(),
and the mpo_create_cred() MAC policy entry point to
mpo_copy_cred_label().  This is more consistent with similar entry
points for creation and label copying, as mac_create_cred() was
called from crdup() as opposed to during process creation.  For
a number of policies, this removes the requirement for special
handling when copying credential labels, and improves consistency.

Approved by:	re (scottl)
Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-12-06 21:48:03 +00:00
John Baldwin
b6c71225a9 Fix all users of mp_maxid to use the same semantics, namely:
1) mp_maxid is a valid FreeBSD CPU ID in the range 0 .. MAXCPU - 1.
2) For all active CPUs in the system, PCPU_GET(cpuid) <= mp_maxid.

Approved by:	re (scottl)
Tested on:	i386, amd64, alpha
2003-12-03 14:57:26 +00:00
John Baldwin
45c1c90f6a Export a few SMP related symbols in UP kernels as well. This is needed to
aid other kernel code, especially code which can be in a module such as
the acpi_cpu(4) driver, to work properly with both SMP and UP kernels.
The exported symbols include mp_ncpus, all_cpus, mp_maxid, smp_started, and
the smp_rendezvous() function.  This also means that CPU_ABSENT() is now
always implemented the same on all kernels.

Approved by:	re (scottl)
2003-12-03 14:55:31 +00:00
David Greenman
186e347f2c Fixed a bug in sendfile(2) where the sent data would be corrupted due
to sendfile(2) being erroneously automatically restarted after a signal
is delivered. Fixed by converting ERESTART to EINTR prior to exiting.

Updated manual page to indicate the potential EINTR error, its cause
and consequences.

Approved by: re@freebsd.org
2003-12-01 22:12:50 +00:00
Ian Dowse
25cb5d7a6b In dounmount(), only call checkdirs() prior to VFS_UNMOUNT() in the
forced unmount case. Otherwise, a file system that is referenced
only by process fd_cdir/fd_rdir references to the file system root
vnode will be successfully unmounted without the MNT_FORCE flag.

The previous behaviour was not compatible with the unmount semantics
required by amd(8), so file systems could be unexpectedly unmounted
while there were still references to the file system root directory.

Reported by:	Erez Zadok <ezk@cs.sunysb.edu>
Approved by:	re (scottl)
2003-11-30 23:30:09 +00:00
Jeff Roberson
a6c6a93c89 - Don't forget to unlock the vnode interlock in the LK_NOWAIT case.
Submitted by:	Stephan Uphoff <ups@stups.com>
Approved by:	re (rwatson)
2003-11-30 22:09:58 +00:00
Alexander Kabaev
97c43a540a Do not attempt to destroy NULL vfs options list.
Approved by: re (scottl)
Reported by: Christian Laursen <xi atborderworlds dot dk>
2003-11-23 17:13:48 +00:00
John Baldwin
798a45964d - Split cpu_mp_probe() into two parts. cpu_mp_setmaxid() is still called
very early (SI_SUB_TUNABLES - 1) and is responsible for setting mp_maxid.
  cpu_mp_probe() is now called at SI_SUB_CPU and determines if SMP is
  actually present and sets mp_ncpus and all_cpus.  Splitting these up
  allows an architecture to probe CPUs later than SI_SUB_TUNABLES by just
  setting mp_maxid to MAXCPU in cpu_mp_setmaxid().  This could allow the
  CPU probing code to live in a module, for example, since modules
  sysinit's in modules cannot be invoked prior to SI_SUB_KLD.  This is
  needed to re-enable the ACPI module on i386.
- For the alpha SMP probing code, use LOCATE_PCS() instead of duplicating
  its contents in a few places.  Also, add a smp_cpu_enabled() function
  to avoid duplicating some code.  There is room for further code
  reduction later since much of this code is also present in cpu_mp_start().
- All archs besides i386 still set mp_maxid to the same values they set it
  to before this change.  i386 now sets mp_maxid to MAXCPU.

Tested on:	alpha, amd64, i386, ia64, sparc64
Approved by:	re (scottl)
2003-11-21 22:23:26 +00:00
Mark Murray
4e3a7a14d9 Fix a major faux pas of mine. I was causing 2 very bad things to
happen in interrupt context; 1) sleep locks, and 2) malloc/free
calls.

1) is fixed by using spin locks instead.

2) is fixed by preallocating a FIFO (implemented with a STAILQ)
   and using elements from this FIFO instead. This turns out
   to be rather fast.

OK'ed by:	re (scottl)
Thanks to:	peter, jhb, rwatson, jake
Apologies to:	*
2003-11-20 15:35:48 +00:00
Mark Murray
3fed54aaaa Hackfix to patch around a kernel panic I introduced. Real fix to
follow. In the meanwhile, we are not harvesting interrupt entropy.

Approved by:	re (jhb)
2003-11-18 14:35:43 +00:00
Robert Watson
a557af222b Introduce a MAC label reference in 'struct inpcb', which caches
the   MAC label referenced from 'struct socket' in the IPv4 and
IPv6-based protocols.  This permits MAC labels to be checked during
network delivery operations without dereferencing inp->inp_socket
to get to so->so_label, which will eventually avoid our having to
grab the socket lock during delivery at the network layer.

This change introduces 'struct inpcb' as a labeled object to the
MAC Framework, along with the normal circus of entry points:
initialization, creation from socket, destruction, as well as a
delivery access control check.

For most policies, the inpcb label will simply be a cache of the
socket label, so a new protocol switch method is introduced,
pr_sosetlabel() to notify protocols that the socket layer label
has been updated so that the cache can be updated while holding
appropriate locks.  Most protocols implement this using
pru_sosetlabel_null(), but IPv4/IPv6 protocols using inpcbs use
the the worker function in_pcbsosetlabel(), which calls into the
MAC Framework to perform a cache update.

Biba, LOMAC, and MLS implement these entry points, as do the stub
policy, and test policy.

Reviewed by:	sam, bms
Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-11-18 00:39:07 +00:00
Robert Watson
64d19c2ea7 Add a sysctl, security.bsd.see_other_gids, similar in semantics
to see_other_uids but with the logical conversion.  This is based
on (but not identical to) the patch submitted by Samy Al Bahra.

Submitted by:	Samy Al Bahra <samy@kerneled.com>
2003-11-17 20:20:53 +00:00
Peter Wemm
0d2a298904 Initial landing of SMP support for FreeBSD/amd64.
- This is heavily derived from John Baldwin's apic/pci cleanup on i386.
- I have completely rewritten or drastically cleaned up some other parts.
  (in particular, bootstrap)
- This is still a WIP.  It seems that there are some highly bogus bioses
  on nVidia nForce3-150 boards.  I can't stress how broken these boards
  are.  I have a workaround in mind, but right now the Asus SK8N is broken.
  The Gigabyte K8NPro (nVidia based) is also mind-numbingly hosed.
- Most of my testing has been with SCHED_ULE.  SCHED_4BSD works.
- the apic and acpi components are 'standard'.
- If you have an nVidia nForce3-150 board, you are stuck with 'device
  atpic' in addition, because they somehow managed to forget to connect the
  8254 timer to the apic, even though its in the same silicon!  ARGH!
  This directly violates the ACPI spec.
2003-11-17 08:58:16 +00:00
Jeff Roberson
fa9c971710 - Mark ksq_assigned as volatile so that when this code is used without
sched_lock we can be sure that we'll pick up the new value.
2003-11-17 08:27:11 +00:00
Jeff Roberson
093c05e39d - Remove long dead code. rslices hasn't been used in some time and neither
has sched_pickcpu().
2003-11-17 08:24:14 +00:00
Peter Wemm
90e3387e54 Expand the argument to the ithread enable/disable helper hooks from an
int to something big enough to hold a pointer.  amd64 needs this.
2003-11-17 06:08:10 +00:00
Robert Watson
b0323ea3aa Implement sockets support for __mac_get_fd() and __mac_set_fd()
system calls, and prefer these calls over getsockopt()/setsockopt()
for ABI reasons.  When addressing UNIX domain sockets, these calls
retrieve and modify the socket label, not the label of the
rendezvous vnode.

- Create mac_copy_socket_label() entry point based on
  mac_copy_pipe_label() entry point, intended to copy the socket
  label into temporary storage that doesn't require a socket lock
  to be held (currently Giant).

- Implement mac_copy_socket_label() for various policies.

- Expose socket label allocation, free, internalize, externalize
  entry points as non-static from mac_net.c.

- Use mac_socket_label_set() in __mac_set_fd().

MAC-aware applications may now use mac_get_fd(), mac_set_fd(), and
mac_get_peer() to retrieve and set various socket labels without
directly invoking the getsockopt() interface.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-11-16 23:31:45 +00:00
Robert Watson
9e71dd0feb Reduce gratuitous redundancy and length in function names:
mac_setsockopt_label_set() -> mac_setsockopt_label()
  mac_getsockopt_label_get() -> mac_getsockopt_label()
  mac_getsockopt_peerlabel_get() -> mac_getsockopt_peerlabel()

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-11-16 18:25:20 +00:00
Alan Cox
e45db9b837 - Modify alpha's sf_buf implementation to use the direct virtual-to-
physical mapping.
 - Move the sf_buf API to its own header file; make struct sf_buf's
   definition machine dependent.  In this commit, we remove an
   unnecessary field from struct sf_buf on the alpha, amd64, and ia64.
   Ultimately, we may eliminate struct sf_buf on those architecures
   except as an opaque pointer that references a vm page.
2003-11-16 06:11:26 +00:00
Robert Watson
12cbb9dc56 When implementing getsockopt() for SO_LABEL and SO_PEERLABEL, make
sure to sooptcopyin() the (struct mac) so that the MAC Framework
knows which label types are being requested.  This fixes process
queries of socket labels.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-11-16 03:53:36 +00:00
Bruce Evans
416ab90e6b Localized the cy driver's locking. 2003-11-16 00:55:54 +00:00
Poul-Henning Kamp
d87526cf43 Rename the debugging mutex "callout_no_sleep" to "dont_sleep_in_callout". 2003-11-15 18:33:54 +00:00
Tim J. Robbins
4d93f53e74 Initialize sequence numbers to 0 in seminit() instead of using whatever
garbage happens to be in memory. This did not seem to cause any problems
except making semaphore ID's unpredictable (and ugly in ipcs(1) output).
2003-11-15 11:56:53 +00:00
Poul-Henning Kamp
00cbe31bd8 Send B_PHYS out to pasture, it no longer serves any function. 2003-11-15 09:28:09 +00:00
Alan Cox
28c9416429 - Remove the remaining now unnecessary checks for the buf's b_object being
NULL.  See revision 1.421 for more detail.
 - Remove GIANT_REQUIRED from vfs_unbusy_pages().  Discussed with: jeff
2003-11-15 08:45:36 +00:00
Jeff Roberson
155b9987a3 - Introduce kseq_runq_{add,rem}() which are used to insert and remove
kses from the run queues.  Also, on SMP, we track the transferable
   count here.  Threads are transferable only as long as they are on the
   run queue.
 - Previously, we adjusted our load balancing based on the transferable count
   minus the number of actual cpus.  This was done to account for the threads
   which were likely to be running.  All of this logic is simpler now that
   transferable accounts for only those threads which can actually be taken.
   Updated various places in sched_add() and kseq_balance() to account for
   this.
 - Rename kseq_{add,rem} to kseq_load_{add,rem} to reflect what they're
   really doing.  The load is accounted for seperately from the runq because
   the load is accounted for even as the thread is running.
 - Fix a bug in sched_class() where we weren't properly using the PRI_BASE()
   version of the kg_pri_class.
 - Add a large comment that describes the impact of a seemingly simple
   conditional in sched_add().
 - Also in sched_add() check the transferable count and KSE_CAN_MIGRATE()
   prior to checking kseq_idle.  This reduces the frequency of access for
   kseq_idle which is a shared resource.
2003-11-15 07:32:07 +00:00
Olivier Houchard
1a29c80648 Better fix than my previous commit:
in exit1(), make sure the p_klist is empty after sending NOTE_EXIT.
The process won't report fork() or execve() and won't be able to handle
NOTE_SIGNAL knotes anyway.
This fixes some race conditions with do_tdsignal() calling knote() while
the process is exiting.

Reported by:	Stefan Farfeleder <stefan@fafoe.narf.at>
MFC after:	1 week
2003-11-14 18:49:01 +00:00
Alexander Kabaev
3b39740df8 Fix a number of style(9) bugs introduced in r1.113 by me.
Suggested by:	bde
2003-11-14 05:27:41 +00:00
Jeff Roberson
808674fd0e - regen. 2003-11-14 03:49:41 +00:00
Jeff Roberson
5c49a0566a - Revision 1.156 marked ptrace() SMP safe. Unfortunately, alpha implements
parts of ptrace using proc_rwmem().  proc_rwmem() requires giant, and
   giant must be acquired prior to the proc lock, so ptrace must require giant
   still.
2003-11-14 03:48:37 +00:00
Poul-Henning Kamp
555a5de270 Various minor details:
Give the HZ/overflow check a 10% margin.
	Eliminate bogus newline.
	If timecounters have equal quality, prefer higher frequency.

Some inspiration from:	bde
2003-11-13 10:03:58 +00:00
John Baldwin
79a13d0182 - Close a race where a thread on another CPU could release a contested lock
and empty its turnstile while the blocking threads still pointed to the
  turnstile.  If the thread on the first CPU blocked on a lock owned by
  one of the threads blocked on the turnstile just woken up, then the
  first CPU could try to manipulate a bogus thread queue in the turnstile
  during priority propagation.
- Update locking notes for ts_owner and always clear ts_owner, not just
  under INVARIANTS.

Tested by:      sam (1)
2003-11-12 23:48:42 +00:00
Kirk McKusick
48b0f4b67d At the request of several developers, restore the DIAGNOSIC code
deleted in 1.81. Increase the initial timeout limit to 2ms to
eliminate spurious messages of excessive timeouts in the NFS
client code.

Requested by:	Poul-Henning Kamp <phk@phk.freebsd.dk>
Requested by:	Mike Silbersack <silby@silby.com>
Requested by:	Sam Leffler <sam@errno.com>
2003-11-12 22:28:27 +00:00
Robert Watson
f0ab044241 Mark __mac_get_pid() as MPSAFE in the comment, as it runs without
Giant and is also MPSAFE.

Push Giant further down into __mac_get_fd() and __mac_set_fd(),
grabbing it only for constrained regions dealing with VFS, and
dropping it entirely for operations related to labeling of pipes.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-11-12 22:19:15 +00:00
Peter Wemm
cde6302bf0 MNAMELEN is back to an int again after Kirk's statfs commit
kern/vfs_mount.c:1305: warning: signed size_t format, different type arg (arg 4)
*** Error code 1
2003-11-12 17:09:12 +00:00
John Baldwin
861a7db56f Fix a typo in a comment.
Submitted by:	das
2003-11-12 14:55:45 +00:00
Poul-Henning Kamp
1415a09d42 Replace B_PHYS conditional assignment to bio_offset with KASSERT check
to see that the originating code already did it right.
2003-11-12 10:27:06 +00:00
Kirk McKusick
1977597b34 Update the five files derived from /sys/kern/syscalls.master
after the additions made for the new statfs structure (version
1.157). These must be updated in a separate checkin after
syscalls.master has been checked in so that they reflect its
new CVS identity. As these are purely derived files, it is not
clear to me why they are under CVS at all. I presume that it has
something to do with having `make world' operate properly.
2003-11-12 08:09:19 +00:00
Kirk McKusick
fde81c7d8e Update the statfs structure with 64-bit fields to allow
accurate reporting of multi-terabyte filesystem sizes.

You should build and boot a new kernel BEFORE doing a `make world'
as the new kernel will know about binaries using the old statfs
structure, but an old kernel will not know about the new system
calls that support the new statfs structure. Running an old kernel
after a `make world' will cause programs such as `df' that do a
statfs system call to fail with a bad system call.

Reviewed by:	Bruce Evans <bde@zeta.org.au>
Reviewed by:	Tim Robbins <tjr@freebsd.org>
Reviewed by:	Julian Elischer <julian@elischer.org>
Reviewed by:	the hoards of <arch@freebsd.org>
Sponsored by:   DARPA & NAI Labs.
2003-11-12 08:01:40 +00:00
Robert Watson
eca8a663d4 Modify the MAC Framework so that instead of embedding a (struct label)
in various kernel objects to represent security data, we embed a
(struct label *) pointer, which now references labels allocated using
a UMA zone (mac_label.c).  This allows the size and shape of struct
label to be varied without changing the size and shape of these kernel
objects, which become part of the frozen ABI with 5-STABLE.  This opens
the door for boot-time selection of the number of label slots, and hence
changes to the bound on the number of simultaneous labeled policies
at boot-time instead of compile-time.  This also makes it easier to
embed label references in new objects as required for locking/caching
with fine-grained network stack locking, such as inpcb structures.

This change also moves us further in the direction of hiding the
structure of kernel objects from MAC policy modules, not to mention
dramatically reducing the number of '&' symbols appearing in both the
MAC Framework and MAC policy modules, and improving readability.

While this results in minimal performance change with MAC enabled, it
will observably shrink the size of a number of critical kernel data
structures for the !MAC case, and should have a small (but measurable)
performance benefit (i.e., struct vnode, struct socket) do to memory
conservation and reduced cost of zeroing memory.

NOTE: Users of MAC must recompile their kernel and all MAC modules as a
result of this change.  Because this is an API change, third party
MAC modules will also need to be updated to make less use of the '&'
symbol.

Suggestions from:	bmilekic
Obtained from:		TrustedBSD Project
Sponsored by:		DARPA, Network Associates Laboratories
2003-11-12 03:14:31 +00:00
Alexander Kabaev
5c957adbf1 1. Consolidate mount struct allocation/destruction into a common code in
vfs_mount_alloc/vfs_mount_destroy functions and take care to completely
destroy the mount point along with its locks. Mount struct has grown in
coplexity recently and depending on each failure path to destroy it
completely isn't working anymore.

2. Eliminate largely identical vfs_mount and vfs_unmount question by
moving the code to handle both cases into a newly introduced vfs_domount
function.

3. Simplify nfs_mount_diskless to always expect an allocated mount
struct and never attempt an allocation/destruction itself. The
vfs_allocroot allocation was there to support 'magic' swap space
configuration for diskless clients that was already removed by PHK some
time ago.

4. Include a vfs_buildopts cleanups by Peter Edwards to validate the
sanity of nmount parameters passed from userland.

Submitted by:  (4) Peter Edwards <peter.edwards@openet-telecom.com>
Reviewed by:    rwatson
2003-11-12 02:54:47 +00:00
John Baldwin
961a7b244d Add an implementation of turnstiles and change the sleep mutex code to use
turnstiles to implement blocking isntead of implementing a thread queue
directly.  These turnstiles are somewhat similar to those used in Solaris 7
as described in Solaris Internals but are also different.

Turnstiles do not come out of a fixed-sized pool.  Rather, each thread is
assigned a turnstile when it is created that it frees when it is destroyed.
When a thread blocks on a lock, it donates its turnstile to that lock to
serve as queue of blocked threads.  The queue associated with a given lock
is found by a lookup in a simple hash table.  The turnstile itself is
protected by a lock associated with its entry in the hash table.  This
means that sched_lock is no longer needed to contest on a mutex.  Instead,
sched_lock is only used when manipulating run queues or thread priorities.
Turnstiles also implement priority propagation inherently.

Currently turnstiles only support mutexes.  Eventually, however, turnstiles
may grow two queue's to support a non-sleepable reader/writer lock
implementation.  For more details, see the comments in sys/turnstile.h and
kern/subr_turnstile.c.

The two primary advantages from the turnstile code include: 1) the size
of struct mutex shrinks by four pointers as it no longer stores the
thread queue linkages directly, and 2) less contention on sched_lock in
SMP systems including the ability for multiple CPUs to contend on different
locks simultaneously (not that this last detail is necessarily that much of
a big win).  Note that 1) means that this commit is a kernel ABI breaker,
so don't mix old modules with a new kernel and vice versa.

Tested on:	i386 SMP, sparc64 SMP, alpha SMP
2003-11-11 22:07:29 +00:00