Commit Graph

10771 Commits

Author SHA1 Message Date
Konstantin Belousov
a888d54d39 Introduce the VV_FORCEINSMQ vnode flag. It instructs the insmnque() function
to ignore the unmounting and forces insertion of the vnode into the mount
vnode list.

Change insmntque() to fail when forced unmount is in progress and
VV_FORCEINSMQ is not specified.

Add an assertion to the insmntque(), requiring the vnode to be
exclusively locked for mp-safe filesystems.

Use the VV_FORCEINSMQ for the creation of the syncvnode.

Tested by:	pho
Reviewed by:	tegge
MFC after:	1 month
2008-08-28 09:08:15 +00:00
Ed Schouten
ceef66c0e3 Properly unlock the init/lock-state devices when invoking TIOCSETA.
For some reason a return-statement crept into this code, where it
shouldn't belong. This means we didn't properly unlock the TTY before
returning to userspace.

Submitted by:	Tor Egge <tor egge cvsup no freebsd org>
2008-08-27 19:37:21 +00:00
John Baldwin
9c2bf0cce2 - Only count the number of CPUs in the rendezvous map once rather than
doing it on every CPU.
- Use CPU_ABSENT() rather than pcpu_find() to determine if a CPU is not
  present.
- Count up to mp_maxid rather than MAXCPU when iterating over CPUs to
  match the rest of the code in the kernel.

MFC after:	1 week
2008-08-27 18:23:55 +00:00
Konstantin Belousov
cbc158449b Implement WNOWAIT flag for wait4(2). It specifies that process whose status
is returned shall be kept in the waitable state.
Add WSTOPPED as an alias for WUNTRACED.

Submitted by:	Jukka Ukkonen <jau at iki fi>
PR:	standards/116221
MFC after:	2 weeks
2008-08-26 12:37:16 +00:00
Konstantin Belousov
eaad109973 When calculating arguments to the interpreter for the shebang script
executed by fexecve(2), imgp->args->fname is NULL. Moreover, there is
no way to recover the path to the script being executed.
Do what some other U*ixes do unconditionally, namely supply /dev/fd/n
as the script path when called from fexecve(). Document requirement of
having fdescfs mounted as caveat.
2008-08-26 10:53:32 +00:00
John Baldwin
cf22c63dd5 Resort a few accessor routines so that they are consistently grouped
with 'set_foo/get_foo' adjacent to each other.
2008-08-25 16:16:57 +00:00
Robert Watson
3f3978840e More fully audit fexecve(2) and its arguments.
Obtained from:	TrustedBSD Project
Sponsored by:	Google, Inc.
2008-08-25 13:50:01 +00:00
Robert Watson
5ae504055a Regenerate following r182123. 2008-08-24 21:23:08 +00:00
Robert Watson
e484af13ed When MPSAFE ttys were merged, a new BSM audit event identifier was
allocated for posix_openpt(2).  Unfortunately, that identifier
conflicts with other events already allocated to other systems in
OpenBSM.  Assign a new globally unique identifier and conform
better to the AUE_ event naming scheme.

This is a stopgap until a new OpenBSM import is done with the
correct identifier, so we'll maintain this as a local diff in svn
until then.

Discussed with:	ed
Obtained from:	TrustedBSD Project
2008-08-24 21:20:35 +00:00
Christian S.J. Peron
e451733718 Remove worrying printf warning on bootup when processing vnodes which
have NULL mount-points.  This is the case for special vnodes, such as the
one used in nameiinit() which is used for crossing mount points in lookup()
to avoid  lock ordering issues.

MFC after:	2 weeks
Discussed with:	rwatson, kib
2008-08-24 20:16:44 +00:00
Ed Schouten
1a643b0f02 Allow the user to suppress the rate-limited pty(4) warning.
The pty(4) driver raises up to warnings when an old BSD-style PTY is
created. The reason why I added this warning, was to make it easier to
spot applications that allocate BSD-style PTY's, while they should just
use openpty() or posix_openpt().

Add a sysctl, which allows you to override the number of remaining
messages, making it possible to suppress the warnings.

Requested by:	kib
Reviewed by:	kib
2008-08-23 16:03:00 +00:00
Robert Watson
6356dba0b4 Introduce two related changes to the TrustedBSD MAC Framework:
(1) Abstract interpreter vnode labeling in execve(2) and mac_execve(2)
    so that the general exec code isn't aware of the details of
    allocating, copying, and freeing labels, rather, simply passes in
    a void pointer to start and stop functions that will be used by
    the framework.  This change will be MFC'd.

(2) Introduce a new flags field to the MAC_POLICY_SET(9) interface
    allowing policies to declare which types of objects require label
    allocation, initialization, and destruction, and define a set of
    flags covering various supported object types (MPC_OBJECT_PROC,
    MPC_OBJECT_VNODE, MPC_OBJECT_INPCB, ...).  This change reduces the
    overhead of compiling the MAC Framework into the kernel if policies
    aren't loaded, or if policies require labels on only a small number
    or even no object types.  Each time a policy is loaded or unloaded,
    we recalculate a mask of labeled object types across all policies
    present in the system.  Eliminate MAC_ALWAYS_LABEL_MBUF option as it
    is no longer required.

MFC after:	1 week ((1) only)
Reviewed by:	csjp
Obtained from:	TrustedBSD Project
Sponsored by:	Apple, Inc.
2008-08-23 15:26:36 +00:00
John Baldwin
969bf150df Fix a race condition with concurrent LOOKUP namecache operations for a vnode
not in the namecache when shared lookups are enabled (vfs.lookup_shared=1,
it is currently off by default) and the filesystem supports shared lookups
(e.g. NFS client).  Specifically, if multiple concurrent LOOKUPs both miss
in the name cache in parallel, each of the lookups may each end up adding an
entry to the namecache resulting in duplicate entries in the namecache
for the same pathname.  A subsequent removal of the mapping of that
pathname to that vnode (via remove or rename) would only evict one of the
entries from the name cache.  As a result, subseqent lookups for that
pathname would still return the old vnode.

This race was observed with shared lookups over NFS where a file was updated
by writing a new file out to a temporary file name and then renaming that
temporary file to the "real" file to effect atomic updates of a file.  Other
processes on the same client that were periodically reading the file would
occasionally receive an ESTALE error from open(2) because the VOP_GETATTR()
in nfs_open() would receive that error when given the stale vnode.

The fix here is to check for duplicates in cache_enter() and just return
if an entry for this same directory and leaf file name for this vnode is
already in the cache.  The check for duplicates is done by walking the
per-vnode list of name cache entries.  It is expected that this list should
be very small in the common case (usually 0 or 1 entries during a
cache_enter() since most files only have 1 "leaf" name).

Reviewed by:	ups, scottl
MFC after:	2 months
2008-08-23 15:13:39 +00:00
Ed Schouten
ce570f82cc Remove unused tty_gone() checks inside ttyoutq_read_uio().
When my earlier MPSAFE TTY prototypes still implemented line
disciplines, we needed a mechanism to abort read()'s on PTY master
devices when inside the line discipline. Because this is no longer the
case, these checks have become unneeded.
2008-08-23 13:32:21 +00:00
Craig Rodrigues
d5bdb2f68d In nmount(), when we see the "force" option,
set the MNT_FORCE flag, but do not persist "force"
in the options list, since it is a command, not a persistent property
of a mount.

Similarly, when we see "reload", set MNT_RELOAD,
but delete "reload" from the options list.

MFC after:	1 week
2008-08-23 01:16:09 +00:00
Kip Macy
6205924afd Submit a band-aid for interrupt set up race.
MFC after:	1 month
2008-08-22 23:24:53 +00:00
Ed Schouten
0f0a7c27c5 Fix two small bugs in tcsetattr().
- According to POSIX, tcsetattr() must not fail when any of the bits in
  the structure are unsupported, but it must leave the unsupported flags
  alone.

- The CIGNORE flag (set by TCSASOFT, extension) was not cleared from
  c_cflag, which means using it would cause it to be applied during its
  entire lifespan. Eventually make sure we clear the flag.

I don't really like CIGNORE, but I think we must keep it alive right
now. With our new TTY layer, we don't actually need this mechanism,
because if you leave c_cflag, c_ispeed and c_ospeed alone, we won't make
a call into the device driver anyway.

Reported by:	naddy
Tested by:	naddy
2008-08-22 21:27:37 +00:00
John Baldwin
7847a9daec A suspended thread can, in fact, be swapped out. Thus,
thread_unsuspend_one() needs to optionally wakeup the swapper.  Since we
hold the thread lock for that entire function, however, we have to push
that requirement up into the caller.

Found by:	rwatson
2008-08-22 16:15:58 +00:00
John Baldwin
814f26da8a Use |= rather than += when aggregrating requests to wakeup the swapper.
What we really want is an inclusive or of all the requests, and += can
in theory roll over to 0.
2008-08-22 16:14:23 +00:00
Ed Schouten
6137be4386 Fix pts(4) error codes when slave device is closed.
Unlike pre-MPSAFE TTY, the pts(4) driver always returned ENXIO when a
read() or write() was performed on a pseudo-terminal master device when
the slave device was not opened. The old implementation had different
semantics:

- When the slave device had not been opened yet, read() and write() just
  blocked.
- When the slave device had been closed, a read() call would return 0
  bytes length.
- When the slave device had been closed, a write() call would return
  EIO.

Change the new implementation to return 0 and EIO as well. We don't
implement the first rule, but I suspect this is not needed, because
routines like openpty() also open the slave device node. posix_openpt()
users also do similar things.

Reported by:	rink
Tested by:	rink
2008-08-22 10:40:21 +00:00
Ed Schouten
7dc843ca92 Prevent VSTART flooding when turning on software flow control.
It turned out we transmitted VSTART after each successful read on a TTY
when software flow control was turned on. This was because of a very
evil bug where we tested the TF_HIWAT_IN flag the other way around.

Reported by:	Christian Weisgerber <naddy mips inka de>
2008-08-22 05:15:52 +00:00
David E. O'Brien
35c316caaf Add comments on NOARGS, NODEF, and NOPROTO. 2008-08-21 22:57:31 +00:00
Ed Schouten
40572ab385 Properly lock proctree_lock before locking the process while accounting.
During the import of the MPSAFE TTY layer (r181905), I changed
acct_process() to lock proctree_lock instead of SESS_LOCK, because
s_ttyp is now locked using proctree_lock. One of the things I forgot,
was to lock it before we PROC_LOCK.

Commit this patch, written by kib@. To ensure we hold proctree_lock as
short as possible, obtaining `ac_tty' has now been made the first step
of filling `acct'.

Reported by:	Kevin <kevinxlinuz 163 com>
Solved by:	kib
2008-08-21 15:02:17 +00:00
Ed Schouten
040b1db930 Remove the now unused `lbolt' variable from the kernel.
We used to have a single wait channel inside the kernel which could be
used by threads that just wanted to sleep for some time (the next
second). The old TTY layer was the only piece of code that still used
lbolt, because I already removed the use of lbolt from the NFS clients
and the VFS syncer.

Approved by:	philip
2008-08-20 12:20:22 +00:00
Kip Macy
fc3a86f6e9 remove scheduler_running as xenbus no longer needs it
MFC after:	1 month
2008-08-20 09:21:24 +00:00
Ed Schouten
18cf135421 Update system call tables.
The previous commit also included changes to all the system call lists,
but it is a tradition to update these lists in a second commit, so rerun
make sysent to update the $FreeBSD$ tags inside these files to refer to
the latest version of syscalls.master.

Requested by:	rwatson
2008-08-20 08:39:10 +00:00
Ed Schouten
bc093719ca Integrate the new MPSAFE TTY layer to the FreeBSD operating system.
The last half year I've been working on a replacement TTY layer for the
FreeBSD kernel. The new TTY layer was designed to improve the following:

- Improved driver model:

  The old TTY layer has a driver model that is not abstract enough to
  make it friendly to use. A good example is the output path, where the
  device drivers directly access the output buffers. This means that an
  in-kernel PPP implementation must always convert network buffers into
  TTY buffers.

  If a PPP implementation would be built on top of the new TTY layer
  (still needs a hooks layer, though), it would allow the PPP
  implementation to directly hand the data to the TTY driver.

- Improved hotplugging:

  With the old TTY layer, it isn't entirely safe to destroy TTY's from
  the system. This implementation has a two-step destructing design,
  where the driver first abandons the TTY. After all threads have left
  the TTY, the TTY layer calls a routine in the driver, which can be
  used to free resources (unit numbers, etc).

  The pts(4) driver also implements this feature, which means
  posix_openpt() will now return PTY's that are created on the fly.

- Improved performance:

  One of the major improvements is the per-TTY mutex, which is expected
  to improve scalability when compared to the old Giant locking.
  Another change is the unbuffered copying to userspace, which is both
  used on TTY device nodes and PTY masters.

Upgrading should be quite straightforward. Unlike previous versions,
existing kernel configuration files do not need to be changed, except
when they reference device drivers that are listed in UPDATING.

Obtained from:		//depot/projects/mpsafetty/...
Approved by:		philip (ex-mentor)
Discussed:		on the lists, at BSDCan, at the DevSummit
Sponsored by:		Snow B.V., the Netherlands
dcons(4) fixed by:	kan
2008-08-20 08:31:58 +00:00
Konstantin Belousov
2bb4c6f922 In brelse, put the B_NEEDSGIANT buffer on the QUEUE_DIRTY_GIANT queue,
instead of QUEUE_DIRTY.

Tested by:	pho
Reviewed by:	attilio
MFC after:	3 days
2008-08-19 11:31:49 +00:00
Bjoern A. Zeeb
603724d3ab Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).

This is the first in a series of commits over the course
of the next few weeks.

Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.

We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.

Obtained from:	//depot/projects/vimage-commit2/...
Reviewed by:	brooks, des, ed, mav, julian,
		jamie, kris, rwatson, zec, ...
		(various people I forgot, different versions)
		md5 (with a bit of help)
Sponsored by:	NLnet Foundation, The FreeBSD Foundation
X-MFC after:	never
V_Commit_Message_Reviewed_By:	more people than the patch
2008-08-17 23:27:27 +00:00
Alfred Perlstein
cbd3ba3edf Prevent crashes due to unlocked access to hash buckets in two sysctls.
Use CACHE_LOCK to prevent crashes.

Sysctls fixed: debug.hashstat.nchash and debug.hashstat.rawnchash.

Obtained from: Juniper Networks
MFC After: 1 week
2008-08-16 21:48:10 +00:00
Kip Macy
e77d7f7143 Add flag to indicate to xen support code that threads are running (and thus we can block).
MFC after:	1 month
2008-08-15 21:03:13 +00:00
Attilio Rao
3d06b4b330 Introduce some WITNESS improvements:
- Speedup the lock orderings lookup modifying the witness graph from a
  linked tree to a matrix. A table lookup caches the lock orderings in
  order to make a O(1) access for them. Any witness object has an unique
  index withing this lookup cache table.
- Reduce the lock contention on w_mtx acquiring it only when the LOR
  actually happens and not in a sane case. In order to do this don't totally
  flush lock lists (per-CPU spinlocks list and per-thread sleeplocks list)
  but check for ll_count anytime we need to have to verify allocations sanity.
- Introduce the function witness_thread_exit() in the witness namespace which
  should verify a thread doesn't hold any witness occurrence why exiting.
- Rename the sysctl debug.witness.graphs into debug.witness.fullgraph and
  add debug.witness.badstacks which prints out stacks for LOR revealed.
  This is implemented using the stack(9) support, which makes WITNESS to be
  dependent by the STACK option or by the DDB (including STACK) option.
- Fix style(9) for src/sys/kern/subr_witness.c

The hash table approach has been developed by Ilya Maykov on the behalf of
Isilon Systems which kindly released the patch.
Jeff Roberson, ported the patch to -CURRENT and fixed w_mtx contention, on the
behalf of Nokia.

Submitted by:	Ilya Maykov <ivmaykov at gmail dot com> (Isilon Systems), jeff
Sponsored by:	Nokia
2008-08-13 18:24:22 +00:00
Christian S.J. Peron
ded7d39cb9 Reduce the scope of the vnode lock such that it does not cover
the various copyouts associated with initializing the process's
argv/env data in userspace.  It is possible that these copyout
operations can fault under memory pressure, possibly resulting
in dead locks.  This is believed to be safe since none of the
copyout_strings() operations need to interact with the vnode here.

Submitted by:	Zhouyi Zhou
PR:		kern/111260
Discussed with:	kib
MFC after:	3 weeks
2008-08-12 21:27:48 +00:00
Konstantin Belousov
e792b09be2 Revert r181345.
Move the NULL pointer check to the vfs_deleteopt() function.

Discussed with:	rodrigc
MFC after:	3 days
2008-08-10 12:15:36 +00:00
Ed Schouten
79da190c16 Remove unneeded D_NEEDGIANT from /dev/fd/{0,1,2}.
There is no reason the fdopen() routine needs Giant. It only sets
curthread->td_dupfd, based on the device unit number of the cdev.

I guess we won't get massive performance improvements here, but still, I
assume we eventually want to get rid of Giant.
2008-08-09 12:42:12 +00:00
Dag-Erling Smørgrav
2616144e43 Add sbuf_new_auto as a shortcut for the very common case of creating a
completely dynamic sbuf.

Obtained from:	Varnish
MFC after:	2 weeks
2008-08-09 11:14:05 +00:00
Dag-Erling Smørgrav
546d78908b Switch to simplified BSD license (with phk's approval), plus whitespace
and style(9) cleanup.
2008-08-09 10:26:21 +00:00
John Baldwin
414e7679cb Permit Giant to be passed as the explicit interlock either to
msleep/mtx_sleep or the various cv_*wait*() routines.  Currently, the
"unlock" behavior of PDROP and cv_wait_unlock() with Giant is not
permitted as it is will be confusing since Giant is fully unrecursed and
unlocked during a thread sleep.

This is handy for subsystems which wish to allow unlocked drivers to
continue to use Giant such as CAM, the new TTY layer, and the new USB
stack.  CAM currently uses a hack that I told Scott to use because I
really didn't want to permit this behavior, and the TTY and USB patches
both have various patches to permit this.

MFC after:	2 weeks
2008-08-07 21:00:13 +00:00
John Baldwin
da7bbd2c08 If a thread that is swapped out is made runnable, then the setrunnable()
routine wakes up proc0 so that proc0 can swap the thread back in.
Historically, this has been done by waking up proc0 directly from
setrunnable() itself via a wakeup().  When waking up a sleeping thread
that was swapped out (the usual case when waking proc0 since only sleeping
threads are eligible to be swapped out), this resulted in a bit of
recursion (e.g. wakeup() -> setrunnable() -> wakeup()).

With sleep queues having separate locks in 6.x and later, this caused a
spin lock LOR (sleepq lock -> sched_lock/thread lock -> sleepq lock).
An attempt was made to fix this in 7.0 by making the proc0 wakeup use
the ithread mechanism for doing the wakeup.  However, this required
grabbing proc0's thread lock to perform the wakeup.  If proc0 was asleep
elsewhere in the kernel (e.g. waiting for disk I/O), then this degenerated
into the same LOR since the thread lock would be some other sleepq lock.

Fix this by deferring the wakeup of the swapper until after the sleepq
lock held by the upper layer has been locked.  The setrunnable() routine
now returns a boolean value to indicate whether or not proc0 needs to be
woken up.  The end result is that consumers of the sleepq API such as
*sleep/wakeup, condition variables, sx locks, and lockmgr, have to wakeup
proc0 if they get a non-zero return value from sleepq_abort(),
sleepq_broadcast(), or sleepq_signal().

Discussed with:	jeff
Glanced at by:	sam
Tested by:	Jurgen Weber  jurgen - ish com au
MFC after:	2 weeks
2008-08-05 20:02:31 +00:00
John Baldwin
0f3dd6ff0d Close two different races with concurrent opens of pty master devices
that could result in leaked ttys or a leaked pty + tty pair.

MFC after:	1 week
2008-08-04 19:51:23 +00:00
John Baldwin
0bc7bc0ec8 - Close a race with concurrent open's of a pts master device which could
result in leaked tty structures.
- When constructing a new pty, allocate it's tty structure before adding
  it to the list.

MFC after:	1 week
2008-08-04 19:49:05 +00:00
Antoine Brodin
f8062a0b0f Kill a dead variable
PR:		126223
Submitted by:	Mateusz Guzik
2008-08-03 21:07:19 +00:00
Robert Watson
1d986c5ff1 Remove broken code to replace st_mode value with ACCESSPERMS when
lstat(2) is called on symlinks -- this code appears never to have
worked.  The PR this addresses suggests that the intended
original behavior is the right one, but as bde points out in the
PR comments, we do actually support storing a mode on symlinks,
so returning it seems reasonable.

This is consistent with Mac OS X, which despite documentation to
the contrary does return the mode set on a symlink, but not some
other platforms.  The Single Unix Spec requires only that the
returned bits be "meaningful", which seems at best unhelpful as
advice goes.

PR:		25018
MFC after:	3 days
2008-08-03 15:44:56 +00:00
Konstantin Belousov
4f7afc20e0 Calling linker_load_dependencies() while holding the module'
vnode lock may cause a LOR between kld_sx lock and vnode lock.
linker_load_dependencies() drops kld_sx, and another thread may attempt
to load the same kld.

Reported and tested by:	pjd
MFC after:	1 week
2008-08-03 13:33:45 +00:00
Sam Leffler
6e0186d5ee add callout_schedule; besides being useful it also improves
compatibility with other systems

Reviewed by:	ed, battlez
2008-08-02 17:42:38 +00:00
Christian S.J. Peron
dfc714fba1 Currently, BSM audit pathname token generation for chrooted or jailed
processes are not producing absolute pathname tokens.  It is required
that audited pathnames are generated relative to the global root mount
point.  This modification changes our implementation of audit_canon_path(9)
and introduces a new function: vn_fullpath_global(9) which performs a
vnode -> pathname translation relative to the global mount point based
on the contents of the name cache.  Much like vn_fullpath,
vn_fullpath_global is a wrapper function which called vn_fullpath1.

Further, the string parsing routines have been converted to use the
sbuf(9) framework.  This change also removes the conditional acquisition
of Giant, since the vn_fullpath1 method will not dip into file system
dependent code.

The vnode locking was modified to use vhold()/vdrop() instead the vref()
and vrele().  This will modify the hold count instead of modifying the
user count.  This makes more sense since it's the kernel that requires
the reference to the vnode.  This also makes sure that the vnode does not
get recycled we hold the reference to it. [1]

Discussed with:	rwatson
Reviewed by:	kib [1]
MFC after:	2 weeks
2008-07-31 16:57:41 +00:00
Ed Schouten
e7ea30e404 Remove the use of lbolt from the VFS syncer.
It seems we only use `lbolt' inside the VFS syncer and the TTY layer
now.  Because I'm planning to replace the TTY layer next month, there's
no reason to keep `lbolt' if it's only used in a single thread inside
the kernel.

Because the syncer code wanted to wake up the syncer thread before the
timeout, it called sleepq_remove(). Because we now just use a condvar(9)
with a timeout value of `hz', we can wake it up using cv_broadcast()
without waking up any unrelated threads.

Reviewed by:	phk
2008-07-30 12:39:18 +00:00
Ed Schouten
911d490140 Don't make subr_clist.c depend on the TTY layer.
After the import of the new TTY layer, the TTY_QUOTE definition will not
be present anymore. To make sure clists will still work as expected,
introduce an internal definition called QUOTEMASK.

Maybe we can decide to remove the quote bits entirely, but we still have
to look into this. There may be drivers that still use the quote bits.

Obtained from:	//depot/projects/mpsafetty
2008-07-30 12:32:42 +00:00
John Baldwin
c3ea337801 When choosing a CPU for a thread in a cpuset, prefer the last CPU that the
thread ran on if there are no other CPUs in the set with a shorter per-CPU
runqueue.
2008-07-28 20:39:21 +00:00
John Baldwin
f7f1cc1518 Really fix this. 2008-07-28 18:33:43 +00:00
Pawel Jakub Dawidek
7224dd4dad Properly check if td_name is empty and if it is, print process name,
instead of empty thread name.

Reviewed by:	jhb
2008-07-28 18:10:26 +00:00
John Baldwin
f200843b72 Implement support for cpusets in the 4BSD scheduler.
- When a cpuset is applied to a thread, walk the cpuset to see if it is a
  "full" cpuset (includes all available CPUs).  If not, set a new
  TDS_AFFINITY flag to indicate that this thread can't run on all CPUs.
  When inheriting a cpuset from another thread during thread creation, the
  new thread also inherits this flag.  It is in a new ts_flags field in
  td_sched rather than using one of the TDF_SCHEDx flags because fork()
  clears td_flags after invoking sched_fork().
- When placing a thread on a runqueue via sched_add(), if the thread is not
  pinned or bound but has the TDS_AFFINITY flag set, then invoke a new
  routine (sched_pickcpu()) to pick a CPU for the thread to run on next.
  sched_pickcpu() walks the cpuset and picks the CPU with the shortest
  per-CPU runqueue length.  Note that the reason for the TDS_AFFINITY flag
  is to avoid having to walk the cpuset and examine runq lengths in the
  common case.
- To avoid walking the per-CPU runqueues in sched_pickcpu(), add an array
  of counters to hold the length of the per-CPU runqueues and update them
  when adding and removing threads to per-CPU runqueues.

MFC after:	2 weeks
2008-07-28 17:25:24 +00:00
John Baldwin
8aa3d7ffc0 Various and sundry style and whitespace fixes. 2008-07-28 15:52:02 +00:00
Kip Macy
947265b6bd - track maximum wait time
- resize columns based on actual observed numerical values

MFC after:	3 days
2008-07-27 21:45:20 +00:00
Pawel Jakub Dawidek
5573021d78 Assert for exclusive vnode lock in vinactive(), vrecycle() and vgonel()
functions.

Reviewed by:	kib
2008-07-27 11:48:15 +00:00
Pawel Jakub Dawidek
610507ae00 - Move vp test for beeing NULL under IGNORE_LOCK().
- Check if panicstr isn't set, if it is ignore the lock. This helps to avoid
  confusion, because lockmgr is a no-op when panicstr isn't NULL, so
  asserting anything at this point doesn't make sense and can just race with
  other panic.

Discussed with:	kib
2008-07-27 11:46:42 +00:00
Tom Rhodes
be6b130476 Fill in a few sysctl descriptions.
Approved by:	rwatson
2008-07-26 00:55:35 +00:00
Ed Schouten
bea45cdda3 Move ttyinfo() into its own C file.
The ttyinfo() routine generates the fancy output when pressing ^T. Right
now it is stored in tty.c. In the MPSAFE TTY code it is already stored
in tty_info.c. To make integration of the MPSAFE TTY code a little
easier, take the same approach.

This makes the TTY code a little bit more readable, because having the
proc_*/thread_* routines in tty.c is very distractful.

Approved by:	philip (mentor)
2008-07-25 14:31:00 +00:00
Konstantin Belousov
58e8af1bf5 Call pargs_drop() unconditionally in do_execve(), the function correctly
handles the NULL argument.
Make pargs_free() static.

MFC after:	1 week
2008-07-25 11:55:32 +00:00
Konstantin Belousov
96f1567fa7 s/alredy/already/ in the comments and the log message. 2008-07-25 11:22:25 +00:00
Konstantin Belousov
8b4a2800de Do the pargs_hold() on the copy of the pointer to the p_args of the
child process immediately after bulk bcopy() without dropping the
process lock.

Since process is not single-threaded when forking, dropping and
reacquiring the lock allows an other thread to change the process title
of the parent in between, and results in hold being done on the invalid
pointer. The problem manifested itself as the double free of the old
p_args.

Reported by:	kris
Reviewed by:	jhb
MFC after:	1 week
2008-07-23 08:45:25 +00:00
Attilio Rao
09400d5abe - Disallow XFS mounting in write mode. The write support never worked really
and there is no need to maintain it.
- Fix vn_get() in order to let it call vget(9) with a valid locking
  request.  vget(9) returns the vnode locked in order to prevent recycling,
  but in this case internal XFS locks alredy prevent it from happening, so
  it is safe to drop the vnode lock before to return by vn_get().
- Add a VNASSERT() in vget(9) in order to catch malformed locking requests.

Discussed with:	kan, kib
Tested by:	Lothar Braun <lothar at lobraun dot de>
2008-07-21 23:01:09 +00:00
Robert Watson
828e07694c If run_interrupt_driven_config_hooks() waits 360 seconds and INVARIANTS
is compiled into the kernel, then panic.

MFC after:	3 days
Discussed with:	scottl
2008-07-21 20:50:49 +00:00
Pawel Jakub Dawidek
7f41115ef6 Implement the following macros for completeness:
SYSCTL_QUAD()
	SYSCTL_ADD_QUAD()
	TUNABLE_QUAD()
	TUNABLE_QUAD_FETCH()

Now we can use 64bit tunables on 32bit systems.
2008-07-21 15:05:25 +00:00
Kip Macy
dd0e6c383a Add accessor functions for socket fields.
MFC after:	1 week
2008-07-21 00:49:34 +00:00
Alan Cox
14e69e48b8 Eliminate dead code. (The commit message for revision 1.287 explains why
this code is dead.)
2008-07-20 04:13:51 +00:00
Robert Watson
1cc2bd820b Rather than simply waiting silently and indefinitely for all
interrupt-driven configuration handlers to complete, print out a
diagnostic message every 60 second indicating which handlers are
still running.  Do this at most 5 times per run so as to avoid
scrolling out any useful information from the kernel message
buffer.

The interval of 60 seconds was selected based on a best guess as
to the nature of "long enough" and may want to be tuned higher
or lower depending on real-world tolerances.

MFC after:	3 days
Discussed with:	scottl
2008-07-19 19:08:35 +00:00
Robert Watson
1a4b919f8e witness_addgraph() is required even if DDB isn't compiled into the kernel,
so exclude it from #ifdef DDB.

Submitted by:	attilio
2008-07-19 17:47:23 +00:00
Robert Watson
51c0f94ed7 Add DDB "show conifhk" command, which lists hooks currently waiting
for completion in run_interrupt_driven_config_hooks().  This is
helpful when trying to figure out which device drivers have gone
into la-la land during boot-time autoconfiguration.

MFC after:	3 days
2008-07-19 12:12:54 +00:00
Jeff Roberson
9fc51b0bf4 Fix a race which could result in some timeout buckets being skipped.
- When a tick occurs on a cpu, iterate from cs_softticks until ticks.
   The per-cpu tick processing happens asynchronously with the actual
   adjustment of the 'ticks' variable.  Sometimes the results may
   be visible before the local call and sometimes after.  Previously this
   could cause a one tick window where we didn't evaluate the bucket.
 - In softclock fetch curticks before incrementing cc_softticks so we
   don't skip insertions which were made for the current time.

Sponsored by:	Nokia
2008-07-19 05:18:29 +00:00
Jeff Roberson
e980fff622 - Check whether we've recorded this tick in ts_ticks on another cpu in
sched_tick() to prevent multiple increments for one tick.  This pushes
   the value out of range and breaks priority calculation.

Reviewed by:	kib
Found by:	pho/nokia
Sponsored by:	Nokia
MFC after:	3 days
2008-07-19 05:13:47 +00:00
Kip Macy
694382c8eb revert local change 2008-07-18 07:10:33 +00:00
Kip Macy
2976b312a1 revert change from local tree 2008-07-18 07:07:57 +00:00
Kip Macy
4af83c8cff import vendor fixes to cxgb 2008-07-18 06:12:31 +00:00
Konstantin Belousov
9a75ea2333 Pair the VOP_OPEN call from do_execve() with the reciprocal VOP_CLOSE.
This was unnoticed because local filesystems usually do nothing
non-trivial in the close vop.

Reported and tested by:	Rick Macklem
MFC after:	2 weeks
2008-07-17 16:44:07 +00:00
Antoine Brodin
23d5e112eb Staticize M_STACK.
Approved by:	rwatson (mentor)
MFC after:	1 month
2008-07-13 17:15:05 +00:00
Craig Rodrigues
1aad294b4e In nmount(), if we see "update" in the mount options,
set MNT_UPDATE in fsflags, and delete the
"update" option from the global mount options.

MNT_UPDATE is a command, and not a property of a mount
that should persist after the command is executed.

We need to do similar things for MNT_FORCE and MNT_RELOAD.

All mount flags are prefixed by MNT_..... it would
be nice if flags which were commands were named differently
from flags which are persistent properties of a mount.
This was not such a big deal in the pre-nmount() days,
but with nmount() it is more important.

Requested by:	yar
MFC after:	2 weeks
2008-07-12 20:12:40 +00:00
David E. O'Brien
b474c780b5 Improve readability and cscope searches a little bit by not using the
same variable name in closely related (but not conflicting) contexts.
2008-07-11 14:48:28 +00:00
Konstantin Belousov
ae95dc623a Make it atomic for the devfs_populate_loop() to see the setting of
SI_ALIAS flag and initialization of the si_parent when alias is created.
Assert that supplied parent device is not NULL.

Both situations could cause NULL dereference in the
devfs_populate_loop() when creating a symlink for SI_ALIAS'ed device.
Namely, cdp->cdp_c.si_parent may be NULL.

Reported by:	mav
MFC after:	2 weeks
2008-07-11 11:22:19 +00:00
David E. O'Brien
4f2945f832 Revert r180431.
r180431 broke the AMD64 build (the only arch using kern/link_elf_obj.c)
2008-07-11 01:10:40 +00:00
David E. O'Brien
f55ffb3990 Allow 'elf_file_t' to be used in a wider scope. 2008-07-10 16:35:57 +00:00
Edwin Groothuis
552f9f63c1 Improve the output of kldload(8) to show which module can't be loaded.
Was:		kldload: Unsupported file type
Is now:		kldload: /boot/modules/test.ko: Unsupported file type

PR:		kern/121276
Submitted by:	Edwin Groothuis <edwin@mavetju.org>
Approved by:	bde (mentor)
MFC after:	1 week
2008-07-08 23:51:38 +00:00
Bjoern A. Zeeb
dea0ed6690 Add a `show cpusets' DDB command to print numbered root and
assigned CPU affinity sets.

Reviewed by:	brooks
2008-07-07 21:32:02 +00:00
Bjoern A. Zeeb
45e48455cd MFp4 144659:
Plug a memory leak with jail services.

PR:		125257
Submitted by:	Mateusz Guzik <mjguzik gmail.com>
MFC after:	6 days
2008-07-07 20:53:49 +00:00
Bjoern A. Zeeb
7a8f695a21 Move cpuset_refroot and cpuset_refbase functions up, grouping the
cpuset_ref* functions together. Will make it easier to read and
add code without forward declarations.
No functional changes.
2008-07-07 20:45:55 +00:00
Konstantin Belousov
7054ee4e38 The kqueue_register() function assumes that it is called from the top of
the syscall code and acquires various event subsystem locks as needed.
The handling of the NOTE_TRACK for EVFILT_PROC is currently done by
calling the kqueue_register() from filt_proc() filter, causing recursive
entrance of the kqueue code. This results in the LORs and recursive
acquisition of the locks.

Implement the variant of the knote() function designed to only handle
the fork() event. It mostly copies the knote() body, but also handles
the NOTE_TRACK, removing the handling from the filt_proc(), where it
causes problems described above. The function is called from the fork1()
instead of knote().

When encountering NOTE_TRACK knote, it marks the knote as influx
and drops the knlist and kqueue lock. In this context call to
kqueue_register is safe from the problems.

An error from the kqueue_register() is reported to the observer as
NOTE_TRACKERR fflag.

PR:	108201
Reviewed by:	jhb, Pramod Srinivasan <pramod juniper net> (previous version)
Discussed with:	jmg
Tested by:	pho
MFC after:	2 weeks
2008-07-07 09:30:11 +00:00
Konstantin Belousov
e1a32fd42b The r178914 I erronously put the setting of the KQ_FLUXWAIT flag before
KQ_FLUX_WAKEUP(). Since the later macro clears the KQ_FLUXWAIT, the
kqueue_scan() thread may be not woken up.

Move the setting of KQ_FLUXWAIT after wakeup to correct the issue.

Reported and tested by:	pho
MFC after:	3 days
2008-07-07 09:15:29 +00:00
Alan Cox
b89eaf4e9f Enable the creation of a kmem map larger than 4GB.
Submitted by: Tz-Huan Huang

Make several variables related to kmem map auto-sizing static.
Found by: CScout
2008-07-05 19:34:33 +00:00
Robert Watson
4f7d1876d5 Introduce a new lock, hostname_mtx, and use it to synchronize access
to global hostname and domainname variables.  Where necessary, copy
to or from a stack-local buffer before performing copyin() or
copyout().  A few uses, such as in cd9660 and daemon_saver, remain
under-synchronized and will require further updates.

Correct a bug in which a failed copyin() of domainname would leave
domainname potentially corrupted.

MFC after:	3 weeks
2008-07-05 13:10:10 +00:00
Alan Cox
6819e13eeb Correct an error in the comments for init_param3().
Discussed with: silby
2008-07-04 19:36:58 +00:00
Robert Watson
59dd72d040 Remove NETISR_MPSAFE, which allows specific netisr handlers to be directly
dispatched without Giant, and add NETISR_FORCEQUEUE, which allows specific
netisr handlers to always be dispatched via a queue (deferred).  Mark the
usb and if_ppp netisr handlers as NETISR_FORCEQUEUE, and explicitly
acquire Giant in those handlers.

Previously, any netisr handler not marked NETISR_MPSAFE would necessarily
run deferred and with Giant acquired.  This change removes Giant
scaffolding from the netisr infrastructure, but NETISR_FORCEQUEUE allows
non-MPSAFE handlers to continue to force deferred dispatch so as to avoid
lock order reversals between their acqusition of Giant and any calling
context.

It is likely we will be able to remove NETISR_FORCEQUEUE once
IFF_NEEDSGIANT is removed, as non-MPSAFE usb and if_ppp drivers will no
longer be supported.

Reviewed by:	bz
MFC after:	1 month
X-MFC note:	We can't remove NETISR_MPSAFE from stable/7 for KPI reasons,
		but the rest can go back.
2008-07-04 00:21:38 +00:00
Ed Maste
7928893d83 Use bcopy instead of strlcpy in uipc_bind and unp_connect, since
soun->sun_path isn't a null-terminated string.  As UNIX(4) states, "the
terminating NUL is not part of the address."  Since strlcpy has to return
"the total length of the string [it] tried to create," it walks off the end
of soun->sun_path looking for a \0.

This reverts r105332.

Reported by:    Ryan Stone
2008-07-03 23:26:10 +00:00
Julian Elischer
f44e6e2ecc Change a variable name to not shadow a global
Obtained from:	vimage
2008-07-03 08:35:59 +00:00
Robert Watson
6992381eca Update copyright date in light of soreceive_dgram(9). 2008-07-03 06:47:45 +00:00
Robert Watson
5df3e83946 Add soreceive_dgram(9), an optimized socket receive function for use by
datagram-only protocols, such as UDP.  This version removes use of
sblock(), which is not required due to an inability to interlace data
improperly with datagrams, as well as avoiding some of the larger loops
and state management that don't apply on datagram sockets.

This is experimental code, so hook it up only for UDPv4 for testing; if
there are problems we may need to revise it or turn it off by default,
but it offers *significant* performance improvements for threaded UDP
applications such as BIND9, nsd, and memcached using UDP.

Tested by:	kris, ps
2008-07-02 23:23:27 +00:00
Roman Divacky
bff2d4d5ff Use msleep_spin() instead of unlock/tsleep/lock. This was
already commited but with a wrong msleep variant and then
backed out. Note that this changes the semantic a little
as msleep_spin does not let us to specify priority after
wakeup.

Approved by:	wkoszek, cognet
Approved by:	kib (mentor)
2008-07-02 20:44:33 +00:00
Bjoern A. Zeeb
04a58b9d5f Remove an unneeded error variable to make clear that if reaching
the end of the function we never return an error.
2008-06-29 18:26:07 +00:00
Bjoern A. Zeeb
ba931c0855 Add a new priv 'PRIV_SCHED_CPUSET' to check if manipulating cpusets is
allowed and replace the suser() call. Do not allow it in jails.

Reviewed by:	rwatson
2008-06-29 17:58:16 +00:00
John Baldwin
6bc1e9cd84 Rework the lifetime management of the kernel implementation of POSIX
semaphores.  Specifically, semaphores are now represented as new file
descriptor type that is set to close on exec.  This removes the need for
all of the manual process reference counting (and fork, exec, and exit
event handlers) as the normal file descriptor operations handle all of
that for us nicely.  It is also suggested as one possible implementation
in the spec and at least one other OS (OS X) uses this approach.

Some bugs that were fixed as a result include:
- References to a named semaphore whose name is removed still work after
  the sem_unlink() operation.  Prior to this patch, if a semaphore's name
  was removed, valid handles from sem_open() would get EINVAL errors from
  sem_getvalue(), sem_post(), etc.  This fixes that.
- Unnamed semaphores created with sem_init() were not cleaned up when a
  process exited or exec'd.  They were only cleaned up if the process
  did an explicit sem_destroy().  This could result in a leak of semaphore
  objects that could never be cleaned up.
- On the other hand, if another process guessed the id (kernel pointer to
  'struct ksem' of an unnamed semaphore (created via sem_init)) and had
  write access to the semaphore based on UID/GID checks, then that other
  process could manipulate the semaphore via sem_destroy(), sem_post(),
  sem_wait(), etc.
- As part of the permission check (UID/GID), the umask of the proces
  creating the semaphore was not honored.  Thus if your umask denied group
  read/write access but the explicit mode in the sem_init() call allowed
  it, the semaphore would be readable/writable by other users in the
  same group, for example.  This includes access via the previous bug.
- If the module refused to unload because there were active semaphores,
  then it might have deregistered one or more of the semaphore system
  calls before it noticed that there was a problem.  I'm not sure if
  this actually happened as the order that modules are discovered by the
  kernel linker depends on how the actual .ko file is linked.  One can
  make the order deterministic by using a single module with a mod_event
  handler that explicitly registers syscalls (and deregisters during
  unload after any checks).  This also fixes a race where even if the
  sem_module unloaded first it would have destroyed locks that the
  syscalls might be trying to access if they are still executing when
  they are unloaded.

  XXX: By the way, deregistering system calls doesn't do any blocking
  to drain any threads from the calls.
- Some minor fixes to errno values on error.  For example, sem_init()
  isn't documented to return ENFILE or EMFILE if we run out of semaphores
  the way that sem_open() can.  Instead, it should return ENOSPC in that
  case.

Other changes:
- Kernel semaphores now use a hash table to manage the namespace of
  named semaphores nearly in a similar fashion to the POSIX shared memory
  object file descriptors.  Kernel semaphores can now also have names
  longer than 14 chars (up to MAXPATHLEN) and can include subdirectories
  in their pathname.
- The UID/GID permission checks for access to a named semaphore are now
  done via vaccess() rather than a home-rolled set of checks.
- Now that kernel semaphores have an associated file object, the various
  MAC checks for POSIX semaphores accept both a file credential and an
  active credential.  There is also a new posixsem_check_stat() since it
  is possible to fstat() a semaphore file descriptor.
- A small set of regression tests (using the ksem API directly) is present
  in src/tools/regression/posixsem.

Reported by:	kris (1)
Tested by:	kris
Reviewed by:	rwatson (lightly)
MFC after:	1 month
2008-06-27 05:39:04 +00:00
Julian Elischer
9dcc73ed79 Someone cut and pasted a bunch of stuff here so lots of
indents were spaces when they should have been tabs,
screwing up diffs and patches..

Whitespace commit as my first SVN commit. (yay)

MFC after:	1 week
2008-06-26 22:45:04 +00:00
Doug Rabson
c675522fc4 Re-implement the client side of rpc.lockd in the kernel. This implementation
provides the correct semantics for flock(2) style locks which are used by the
lockf(1) command line tool and the pidfile(3) library. It also implements
recovery from server restarts and ensures that dirty cache blocks are written
to the server before obtaining locks (allowing multiple clients to use file
locking to safely share data).

Sponsored by:	Isilon Systems
PR:		94256
MFC after:	2 weeks
2008-06-26 10:21:54 +00:00
Ruslan Ermilov
d03c587ffa Fix a chicken-and-egg problem: this files implements SSP support,
so we cannot compile it with -fstack-protector[-all] flags (or
it will self-recurse); this is ensured in sys/conf/files.  This
OTOH means that checking for defines __SSP__ and __SSP_ALL__ to
determine if we should be compiling the support is impossible
(which it was trying, resulting in an empty object file).  Fix
this by always compiling the symbols in this files.  It's good
because it allows us to always have SSP support, and then compile
with SSP selectively.

Repoted by:	tinderbox
2008-06-26 07:52:45 +00:00
Ruslan Ermilov
042df2e2da Enable GCC stack protection (aka Propolice) for userland:
- It is opt-out for now so as to give it maximum testing, but it may be
  turned opt-in for stable branches depending on the consensus.  You
  can turn it off with WITHOUT_SSP.
- WITHOUT_SSP was previously used to disable the build of GNU libssp.
  It is harmless to steal the knob as SSP symbols have been provided
  by libc for a long time, GNU libssp should not have been much used.
- SSP is disabled in a few corners such as system bootstrap programs
  (sys/boot), process bootstrap code (rtld, csu) and SSP symbols themselves.
- It should be safe to use -fstack-protector-all to build world, however
  libc will be automatically downgraded to -fstack-protector because it
  breaks rtld otherwise.
- This option is unavailable on ia64.

Enable GCC stack protection (aka Propolice) for kernel:
- It is opt-out for now so as to give it maximum testing.
- Do not compile your kernel with -fstack-protector-all, it won't work.

Submitted by:	Jeremie Le Hen <jeremie@le-hen.org>
2008-06-25 21:33:28 +00:00
David Xu
7de1ecef2d Add two commands to _umtx_op system call to allow a simple mutex to be
locked and unlocked completely in userland. by locking and unlocking mutex
in userland, it reduces the total time a mutex is locked by a thread,
in some application code, a mutex only protects a small piece of code, the
code's execution time is less than a simple system call, if a lock contention
happens, however in current implemenation, the lock holder has to extend its
locking time and enter kernel to unlock it, the change avoids this disadvantage,
it first sets mutex to free state and then enters kernel and wake one waiter
up. This improves performance dramatically in some sysbench mutex tests.

Tested by: kris
Sounds great: jeff
2008-06-24 07:32:12 +00:00
John Baldwin
c4f3a35a54 Remove the posixsem_check_destroy() MAC check. It is semantically identical
to doing a MAC check for close(), but no other types of close() (including
close(2) and ksem_close(2)) have MAC checks.

Discussed with:	rwatson
2008-06-23 21:37:53 +00:00
Robert Watson
3319d71265 If S_IFIFO is passed to mknod(2), invoke kern_mkfifoat(9) to create a
FIFO, as required by SUSv3.  No specific privilege check is performed
in this case, as FIFOs may be created by unprivileged processes
(subject to the normal file system name space restrictions that may be
in place).

Unlike the Apple implementation, we reject requests to create a FIFO
using mknod(2) if there is a non-zero dev argument to the system call,
which is permitted by the Open Group specification ("... undefined
...").  We might want to revise this if we find it causes
compatibility problems for applications in practice.

PR:		kern/74242, kern/68459
Obtained from:	Apple, Inc.
MFC after:	3 weeks
2008-06-22 21:51:32 +00:00
Oleksandr Tymoshenko
22035f4727 Use minimum of max_aio_procs and target_aio_procs when spawning new
aiod since there should be no more then max_aio_procs processes.
2008-06-21 11:34:34 +00:00
Warner Losh
c14909b6e2 Split out the probing magic of device_probe_and_attach into
device_probe() so that it can be used by busses that may wish to do
additional processing between probe and attach.

Reviewed by:	dfr@
2008-06-20 16:58:15 +00:00
Alan Cox
ac68d1c960 Enforce the mapping of kernel loadable modules in the uppermost 2GB of the
kernel virtual address space on amd64.
2008-06-20 06:24:34 +00:00
Xin LI
2110d913c0 Revert rev. 178124 as requested by kris@. Having jail id not being
reused too frequently is useful for script controlled environment.
2008-06-19 21:41:57 +00:00
Oleksandr Tymoshenko
23c8064e66 Renew semaphore's pointer after wakeup since during msleep
sem_base may have been modified by destroying one of semaphores
and semptr would not be valid in this case.

PR: kern/123731
2008-06-19 18:08:42 +00:00
Konstantin Belousov
05427aafc6 Struct cdev is always the member of the struct cdev_priv. When devfs
needed to promote cdev to cdev_priv, the si_priv pointer was followed.

Use member2struct() to calculate address of the wrapping cdev_priv.
Rename si_priv to __si_reserved.

Tested by:	pho
Reviewed by:	ed
MFC after:	2 weeks
2008-06-16 17:34:59 +00:00
John Birrell
5d846378f7 Remove code that isn't required. It actually breaks the case where KDTRACE_HOOKS
is defined and KDB isn't. This is the case that it was intended for.
2008-06-16 04:44:29 +00:00
Ed Schouten
0f03ce1bb8 Turn dev2unit(), minor(), unit2minor() and minor2unit() into macro's.
Now that we got rid of the minor-to-unit conversion and the constraints
on device minor numbers, we can convert the functions that operate on
minor and unit numbers to simple macro's. The unit2minor() and
minor2unit() macro's are now no-ops.

The ZFS code als defined a macro named `minor'. Change the ZFS code to
use umajor() and uminor() here, as it is the correct approach to do
this. Also add $FreeBSD$ to keep SVN happy.

Approved by:	philip (mentor), pjd
2008-06-12 08:30:54 +00:00
Ed Schouten
29d4cb241b Don't enforce unique device minor number policy anymore.
Except for the case where we use the cloner library (clone_create() and
friends), there is no reason to enforce a unique device minor number
policy. There are various drivers in the source tree that allocate unr
pools and such to provide minor numbers, without using them themselves.

Because we still need to support unique device minor numbers for the
cloner library, introduce a new flag called D_NEEDMINOR. All cdevsw's
that are used in combination with the cloner library should be marked
with this flag to make the cloning work.

This means drivers can now freely use si_drv0 to store their own flags
and state, making it effectively the same as si_drv1 and si_drv2. We
still keep the minor() and dev2unit() routines around to make drivers
happy.

The NTFS code also used the minor number in its hash table. We should
not do this anymore. If the si_drv0 field would be changed, it would no
longer end up in the same list.

Approved by:	philip (mentor)
2008-06-11 18:55:19 +00:00
Oleksandr Tymoshenko
c9688a603b Keep proper track of nsegs counter: sem_free is called for all
allocated semaphores, so it's wrong to increase it conditionally,
  in this case for every over-the-limit semaphore nsegs is decreased
  without being previously increased.

  PR:	kern/123685
  Approved by:	cognet (mentor)
2008-06-10 20:55:10 +00:00
Konstantin Belousov
a70537835f Provide the mutual exclusion between the nfs export list modifications
and nfs requests processing. Lockmgr lock provides the shared locking for
nfs requests, while exclusive mode is used for modifications. The writer
starvation is handled by lockmgr too.

Reported by:	kris, pho, many
Based on the submission by:	mohan
Tested by:	pho
MFC after:	2 weeks
2008-06-09 10:31:38 +00:00
Wojciech A. Koszek
2e75877f12 Remove checks against DDB, which isn't used in this file.
My intention is to bring no functional change.

Discussion on:	IRC
Reviewed by:	ed, kan, rink,
2008-06-08 20:43:27 +00:00
Ed Schouten
5db88944ac Remove unneeded Giant locking of /dev/tty.
The Giant lock is acquired in two places in tty_tty.c. In both places,
it is unneeded.

There is no reason to specify D_NEEDGIANT on this device node. The
device node has only been designed to return ENXIO when opened. It
doesn't make any sense to lock/unlock Giant, just to return this error.
D_TTY is also unneeded. The unimplemented functions don't need to be
patched by devfs.

We don't need to lock Giant when we want to lookup the proper TTY vnode.
s_ttyvp is already protected by proctree_lock (see devfs_vnops.c).

Approved by:	philip (mentor)
2008-06-03 12:38:00 +00:00
David Xu
6e24e61797 Use a seperated hash table for mutex and rwlock, avoid wasting some time
on walking through idle threads sleeping on condition variables.
2008-05-30 02:18:54 +00:00
Ed Schouten
06d425f92e Remove the distinction between device minor and unit numbers.
Even though we got rid of device major numbers some time ago, device
drivers still need to provide unique device minor numbers to make_dev().
These numbers are only used inside the kernel. They are not related to
device major and minor numbers which are visible in devfs. These are
actually based on the inode number of the device.

It would eventually be nice to remove minor numbers entirely, but we
don't want to be too agressive here.

Because the 8-15 bits of the device number field (si_drv0) are still
reserved for the major number, there is no 1:1 mapping of the device
minor and unit numbers. Because this is now unused, remove the
restrictions on these numbers.

The MAXMAJOR definition was actually used for two purposes. It was used
to convert both the userspace and kernelspace device numbers to their
major/minor pair, which is why it is now named UMINORMASK.

minor2unit() and unit2minor() have now become useless. Both minor() and
dev2unit() now serve the same purpose. We should eventually remove some
of them, at least turning them into macro's. If devfs would become
completely minor number unaware, we could consider using si_drv0 directly,
just like si_drv1 and si_drv2.

Approved by:	philip (mentor)
2008-05-29 12:50:46 +00:00
Ed Schouten
cc8945d204 Remove redundant checks from fcntl()'s F_DUPFD.
Right now we perform some of the checks inside the fcntl()'s F_DUPFD
operation twice. We first validate the `fd' argument. When finished,
we validate the `arg' argument. These checks are also performed inside
do_dup().

The reason we need to do this, is because fcntl() should return different
errno's when the `arg' argument is out of bounds (EINVAL instead of
EBADF). To prevent the redundant locking of the PROC_LOCK and
FILEDESC_SLOCK, patch do_dup() to support the error semantics required
by fcntl().

Approved by:	philip (mentor)
2008-05-28 20:25:19 +00:00
Ed Schouten
09a80aba8e Rename tty_subr.c' to subr_clist.c'.
Because clists are also used outside the TTY layer, rename the file
containing the clist routines to something more accurate.

The mpsafetty TTY layer doesn't use clists. It uses its own buffers,
which also implement the unbuffered copying to userspace. We cannot
simply remove the clist routines then, because this would break various
drivers that are present within the kernel.

Approved by:	philip (mentor)
2008-05-27 06:41:50 +00:00
Attilio Rao
48972152ee Improve a comment which, in the actual CVS stock, doesn't completely
explain the logic of the code chunk.
2008-05-27 00:27:50 +00:00
Konstantin Belousov
887aedc64e Take into account possible overflow when multiplying. The casuality is
the malloc call later, panicing kernel due to the oversized allocation.

Reported by:	pho
Reviewed by:	jeff
2008-05-26 10:01:13 +00:00
Robert Watson
e4372ceba0 Remove netatm from HEAD as it is not MPSAFE and relies on the now removed
NET_NEEDS_GIANT.  netatm has been disconnected from the build for ten
months in HEAD/RELENG_7.  Specifics:

- netatm include files
- netatm command line management tools
- libatm
- ATM parts in rescue and sysinstall
- sample configuration files and documents
- kernel support as a module or in NOTES
- netgraph wrapper nodes for netatm
- ctags data for netatm.
- netatm-specific device drivers.

MFC after:	3 weeks
Reviewed by:	bz
Discussed with:	bms, bz, harti
2008-05-25 22:11:40 +00:00
Attilio Rao
5047a8fd88 The "if" semantic is not needed, just fix this. 2008-05-25 16:11:27 +00:00
Attilio Rao
258f4727f1 Replace direct atomic operation for the file refcount witht the
refcount interface.
It also introduces the correct usage of memory barriers, as sometimes
fdrop() and fhold() are used with shared locks, which don't use any
release barrier.
2008-05-25 14:57:43 +00:00
John Birrell
6f5f25e521 Add the vtime (virtual time) hooks for DTrace. 2008-05-25 01:44:58 +00:00
John Birrell
5d217f173c Add DTrace 'proc' provider probes using the Statically Defined Trace
(sdt) mechanism.
2008-05-24 06:22:16 +00:00
Craig Rodrigues
a9722ace80 Do not convert the "snapshot" string to the MNT_SNAPSHOT flag here, since
we do it further down in ffs_vfsops.c

MFC after:	1 month
2008-05-23 23:33:07 +00:00
Konstantin Belousov
15822fcdbe Rev. 1.274 put the ttyrel() call before the destroy_dev() in the
ttyfree(), freeing the tty. Since destroy_dev() may call d_purge()
cdevsw method, that is the ttypurge() for the tty, the code ends up
accessing freed tty structure.

Put the ttyrel() after destroy_dev() in the ttyfree. To prevent the
panic the rev. 1.274 provided fix for, check the TS_GONE in sysctl
handler and refuse to provide information on such tty.

Reported, debugging help and tested by:	pho
DIscussed with and reviewed by:	jhb
MFC after:	1 week
2008-05-23 16:47:55 +00:00
Konstantin Belousov
cc57af357b The dev_refthread() in the tty_gettp() may fail, because Giant is taken
in the giant_trick routines after the dev_refthread increments the
si_threadcount. Remove assert, do not perform dev_relthread() for failed
dev_refthread(), and handle failure in the tty_gettp() callers (cdevsw
tty methods).

Before kern_conf.c 1.210 and 1.211, the kernel usually paniced in the
giant_trick routines dereferencing NULL cdevsw, not taking this fault.

Reported by:	Vince Hoffman <jhary unsane co uk>
Debugging help and tested by:	pho
Reviewed by:	jhb
MFC after:	1 week
2008-05-23 16:46:27 +00:00
Konstantin Belousov
ca091c56e3 Use the t_state for the TS_GONE test.
Submitted by:   jhb
MFC after:	3 days
2008-05-23 16:43:59 +00:00
Konstantin Belousov
06fe11294d Assert that si_threadcount > 0 before decrementing it. This helps catching
the improper use of the dev_refthread/dev_relthread.

Tested by:	pho
MFC after:	1 week
2008-05-23 16:38:38 +00:00
Ed Schouten
8837b0dd09 Move TTY unrelated bits out of <sys/tty.h>.
For some reason, the <sys/tty.h> header file also contains routines of the
clists and console that are used inside the TTY layer. Because the clists
are not only used by the TTY layer (example: various input drivers), we'd
better move the entire clist programming interface into <sys/clist.h>. Also
remove a declaration of nonexistent variable.

The <sys/tty.h> header also contains various definitions for the console
code (tty_cons.c). Also move these to <sys/cons.h>, because they are
not implemented inside the TTY layer.

While there, create separate malloc pools for the clist and console code.

Approved by:	philip (mentor)
2008-05-23 16:06:35 +00:00
Konstantin Belousov
741b6cf8a5 Another problem caused by the knlist_cleardel() potentially dropping
PIPE_MTX().

Since the pipe_present is cleared before (potentially) sleeping, the
second thread may enter the pipeclose() for the reciprocal pipe end.
The test at the end of the pipeclose() for the pipe_present == 0 would
succeed, allowing the second thread to free the pipe memory. First
threads then accesses the freed memory after being woken up.

Properly track the closing state of the pipe in the pipe_present.
Introduce the intermediate state that marks the pipe as mostly
dismantled but might be sleeping waiting for the knote list to be
cleared. Free the pipe pair memory only when both ends pass that point.

Debugging help and tested by:	pho
Discussed with:	jmg
MFC after:	2 weeks
2008-05-23 11:14:03 +00:00
Konstantin Belousov
e2e1693f15 Destruction of the pipe calls knlist_cleardel() to remove the knotes
monitoring the pipe. The code sets pipe_present = 0 and enters
knlist_cleardel(), where the PIPE_MTX might be dropped when knl->kl_list
cannot be cleared due to influx knotes.

If the following often encountered code fragment
                if (!(kn->kn_status & KN_DETACHED))
                        kn->kn_fop->f_detach(kn);
                knote_drop(kn, td); [1]
is executed while the knlist lock is dropped, then the knote memory is freed
by the knote_drop() without knote being removed from the knlist, since
the filt_pipedetach() contains the following:
        if (kn->kn_filter == EVFILT_WRITE) {
                if (!cpipe->pipe_peer->pipe_present) {
                        PIPE_UNLOCK(cpipe);
                        return;

Now, the memory may be reused in the zone, causing the access to the
freed memory. I got the panics caused by the marker knote appearing on
the knlist, that, I believe, manifestation of the issue. In the Peter
Holm test scenarious, we got unkillable processes too.

The pipe_peer that has the knote for write shall be present. Ignore the
pipe_present value for EVFILT_WRITE in filt_pipedetach().

Debugging help and tested by:	pho
Discussed with:	jmg
MFC after:	2 weeks
2008-05-23 11:09:50 +00:00
John Birrell
4b3d60930a Add the ctf_get function and update the args to linker_file_function_listall. 2008-05-23 07:08:59 +00:00
John Birrell
82c4945b5b Add the ctf_get method. 2008-05-23 04:06:49 +00:00
John Birrell
833b4a131a Allow a rendezvous with just a specified CPU too.
Make the API work in the non-smp case too so that a kernel module
can work the same regardless of whether or not it is loaded on a SMP
kernel or not.
2008-05-23 04:05:26 +00:00
John Birrell
75d94ef6ca Add the CTF source file which gets shared with link_elf.c and link_elf_obj.c. 2008-05-23 03:04:27 +00:00
John Birrell
a2024a3edf Add hooks for the Compact C Type Format (CTF) data to be attached to
the elf files. This is complicated by the fact that the actual CTF
parsing has to be done in CDDL'd code, so the BSD licensed code only
knows about the opaque data which it must be able to free.
2008-05-23 00:49:39 +00:00
John Birrell
91dd776cd2 Add support for the DTrace malloc provider which can enable probes
on a per-malloc type basis.
2008-05-23 00:43:36 +00:00
Robert Watson
17c2fc0cc7 When sendto(2) is called with an explicit destination address
argument, call mac_socket_check_connect() on that address before
proceeding with the send.  Otherwise policies instrumenting the
connect entry point for the purposes of checking destination
addresses will not have the opportunity to check implicit
connect requests.

MFC after:	3 weeks
Sponsored by:	nCircle Network Security, Inc.
2008-05-22 07:18:54 +00:00
Konstantin Belousov
82f4d64035 Implement the per-open file data for the cdev.
The patch does not change the cdevsw KBI. Management of the data is
provided by the functions
int	devfs_set_cdevpriv(void *priv, cdevpriv_dtr_t dtr);
int	devfs_get_cdevpriv(void **datap);
void	devfs_clear_cdevpriv(void);
All of the functions are supposed to be called from the cdevsw method
contexts.

- devfs_set_cdevpriv assigns the priv as private data for the file
  descriptor which is used to initiate currently performed driver
  operation. dtr is the function that will be called when either the
  last refernce to the file goes away, the device is destroyed  or
  devfs_clear_cdevpriv is called.
- devfs_get_cdevpriv is the obvious accessor.
- devfs_clear_cdevpriv allows to clear the private data for the still
  open file.

Implementation keeps the driver-supplied pointers in the struct
cdev_privdata, that is referenced both from the struct file and struct
cdev, and cannot outlive any of the referee.

Man pages will be provided after the KPI stabilizes.

Reviewed by:	jhb
Useful suggestions from:	jeff, antoine
Debugging help and tested by:	pho
MFC after:	1 month
2008-05-21 09:31:44 +00:00
Pawel Jakub Dawidek
988f0e193a Be more friendly for DDB pager.
Educated by:	jhb's BSDCan presentation
2008-05-18 21:08:12 +00:00
John Birrell
80544aebe3 Add support for the DTrace struct proc and struct thread extended
data via ctor and dtor event handlers.

The size of the extra data is allocated opaquely and this file
contains a function which the dtrace module can call to check
that the kernel supports at least the amount of data that it needs.

This file is optionally compiled into nthe kernel if the KDTRACE_HOOKS
kernel option is defined.
2008-05-18 19:43:52 +00:00
John Birrell
5572901b33 Add kernel support for the Statically Defined Trace provider.
This is BSD licensed code written specifically for FreeBSD.

It initialises using SYSINIT so that the SDT provider, probe and
argument description linkage is done whenever a module is loaded,
regardless of whether the DTrace modules are loaded or not.

This file is optionally compiled into the kernel if the KDTRACE_HOOKS
option is defined.
2008-05-18 19:32:36 +00:00
Rui Paulo
221351b7a5 devctl_process_running(): Check for devsoftc.inuse == 1 instead of
devsoftc.async_proc != NULL because the latter might not be true
sometimes.
This way /etc/rc.suspend gets executed.

Reviwed	by:	njl
Submitted by:	Mitsuru IWASAKI <iwasaki at jp.FreeBSD.org>
Tested also by:	Andreas Wetzel <mickey242 at gmx.net>
MFC after:	1 week
2008-05-18 13:55:51 +00:00
Robert Watson
8e230e30b7 Attempt to improve convergence of POSIX semaphore code with style(9).
MFC after:	3 days
2008-05-16 18:10:07 +00:00
George V. Neville-Neil
49f287f8c5 Update the kernel to count the number of mbufs and clusters
(all types) used per socket buffer.

Add support to netstat to print out all of the socket buffer
statistics.

Update the netstat manual page to describe the new -x flag
which gives the extended output.

Reviewed by:	rwatson, julian
2008-05-15 20:18:44 +00:00
Attilio Rao
90356491d7 - Embed the recursion counter for any locking primitive directly in the
lock_object, using an unified field called lo_data.
- Replace lo_type usage with the w_name usage and at init time pass the
  lock "type" directly to witness_init() from the parent lock init
  function.  Handle delayed initialization before than
  witness_initialize() is called through the witness_pendhelp structure.
- Axe out LO_ENROLLPEND as it is not really needed.  The case where the
  mutex init delayed wants to be destroyed can't happen because
  witness_destroy() checks for witness_cold and panic in case.
- In enroll(), if we cannot allocate a new object from the freelist,
  notify that to userspace through a printf().
- Modify the depart function in order to return nothing as in the current
  CVS version it always returns true and adjust callers accordingly.
- Fix the witness_addgraph() argument name prototype.
- Remove unuseful code from itismychild().

This commit leads to a shrinked struct lock_object and so smaller locks,
in particular on amd64 where 2 uintptr_t (16 bytes per-primitive) are
gained.

Reviewed by:	jhb
2008-05-15 20:10:06 +00:00
John Baldwin
ccd3953e5f Go back to using the process command name (p_comm) for the file name and
command line arguments stored in the note at the beginning of a core dump
instead of the current thread name.

Reviewed by:	julian
2008-05-15 03:07:34 +00:00
Konstantin Belousov
48504cc25b Add the devctl notifications for the cdev create/destroy events.
Based on the submission by: Andriy Gapon <avg icyb net ua>
MFC after:	2 weeks
2008-05-14 14:29:54 +00:00
Julian Elischer
681e40627d fix typo in runz_fuzz
noticed by:Elijah Buck
2008-05-12 06:42:06 +00:00
Alan Cox
3202ed7523 Introduce a new parameter "superpage_align" to kmem_suballoc() that is
used to request superpage alignment for the submap.

Request superpage alignment for the kmem_map.

Pass VMFS_ANY_SPACE instead of TRUE to vm_map_find().  (They are currently
equivalent but VMFS_ANY_SPACE is the new preferred spelling.)

Remove a stale comment from kmem_malloc().
2008-05-10 21:46:20 +00:00
Konstantin Belousov
e15864efd8 Kqueue_scan() may sleep when encountered the influx knotes. On the other
hand, it may cause other threads to sleep since kqueue_scan() may mark
some knotes as infux. This could lead to the deadlock.

Before kqueue_scan() sleeps, wakeup the threads that are waiting for the
influx knotes produced by this thread.

Tested by:	pho (previous version)
Reviewed by:	jmg
MFC after:	2 weeks
2008-05-10 11:37:05 +00:00
Konstantin Belousov
2e711e4d0d The kqueue_close() encountering the KN_INFLUX knotes on the kq being
closed is the legitimate situation. For instance, filedescriptor with
registered events may be closed in parallel with closing the kqueue.
Properly handle the case instead of asserting that this cannot happen.

Reported and tested by:	pho
Reviewed by:	jmg
MFC after:	2 weeks
2008-05-10 11:35:32 +00:00
Julian Elischer
8b07e49a00 Add code to allow the system to handle multiple routing tables.
This particular implementation is designed to be fully backwards compatible
and to be MFC-able to 7.x (and 6.x)

Currently the only protocol that can make use of the multiple tables is IPv4
Similar functionality exists in OpenBSD and Linux.

From my notes:

-----

  One thing where FreeBSD has been falling behind, and which by chance I
  have some time to work on is "policy based routing", which allows
  different
  packet streams to be routed by more than just the destination address.

  Constraints:
  ------------

  I want to make some form of this available in the 6.x tree
  (and by extension 7.x) , but FreeBSD in general needs it so I might as
  well do it in -current and back port the portions I need.

  One of the ways that this can be done is to have the ability to
  instantiate multiple kernel routing tables (which I will now
  refer to as "Forwarding Information Bases" or "FIBs" for political
  correctness reasons). Which FIB a particular packet uses to make
  the next hop decision can be decided by a number of mechanisms.
  The policies these mechanisms implement are the "Policies" referred
  to in "Policy based routing".

  One of the constraints I have if I try to back port this work to
  6.x is that it must be implemented as a EXTENSION to the existing
  ABIs in 6.x so that third party applications do not need to be
  recompiled in timespan of the branch.

  This first version will not have some of the bells and whistles that
  will come with later versions. It will, for example, be limited to 16
  tables in the first commit.
  Implementation method, Compatible version. (part 1)
  -------------------------------
  For this reason I have implemented a "sufficient subset" of a
  multiple routing table solution in Perforce, and back-ported it
  to 6.x. (also in Perforce though not  always caught up with what I
  have done in -current/P4). The subset allows a number of FIBs
  to be defined at compile time (8 is sufficient for my purposes in 6.x)
  and implements the changes needed to allow IPV4 to use them. I have not
  done the changes for ipv6 simply because I do not need it, and I do not
  have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it.

  Other protocol families are left untouched and should there be
  users with proprietary protocol families, they should continue to work
  and be oblivious to the existence of the extra FIBs.

  To understand how this is done, one must know that the current FIB
  code starts everything off with a single dimensional array of
  pointers to FIB head structures (One per protocol family), each of
  which in turn points to the trie of routes available to that family.

  The basic change in the ABI compatible version of the change is to
  extent that array to be a 2 dimensional array, so that
  instead of protocol family X looking at rt_tables[X] for the
  table it needs, it looks at rt_tables[Y][X] when for all
  protocol families except ipv4 Y is always 0.
  Code that is unaware of the change always just sees the first row
  of the table, which of course looks just like the one dimensional
  array that existed before.

  The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
  are all maintained, but refer only to the first row of the array,
  so that existing callers in proprietary protocols can continue to
  do the "right thing".
  Some new entry points are added, for the exclusive use of ipv4 code
  called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(),
  which have an extra argument which refers the code to the correct row.

  In addition, there are some new entry points (currently called
  rtalloc_fib() and friends) that check the Address family being
  looked up and call either rtalloc() (and friends) if the protocol
  is not IPv4 forcing the action to row 0 or to the appropriate row
  if it IS IPv4 (and that info is available). These are for calling
  from code that is not specific to any particular protocol. The way
  these are implemented would change in the non ABI preserving code
  to be added later.

  One feature of the first version of the code is that for ipv4,
  the interface routes show up automatically on all the FIBs, so
  that no matter what FIB you select you always have the basic
  direct attached hosts available to you. (rtinit() does this
  automatically).

  You CAN delete an interface route from one FIB should you want
  to but by default it's there. ARP information is also available
  in each FIB. It's assumed that the same machine would have the
  same MAC address, regardless of which FIB you are using to get
  to it.

  This brings us as to how the correct FIB is selected for an outgoing
  IPV4 packet.

  Firstly, all packets have a FIB associated with them. if nothing
  has been done to change it, it will be FIB 0. The FIB is changed
  in the following ways.

  Packets fall into one of a number of classes.

  1/ locally generated packets, coming from a socket/PCB.
     Such packets select a FIB from a number associated with the
     socket/PCB. This in turn is inherited from the process,
     but can be changed by a socket option. The process in turn
     inherits it on fork. I have written a utility call setfib
     that acts a bit like nice..

         setfib -3 ping target.example.com # will use fib 3 for ping.

     It is an obvious extension to make it a property of a jail
     but I have not done so. It can be achieved by combining the setfib and
     jail commands.

  2/ packets received on an interface for forwarding.
     By default these packets would use table 0,
     (or possibly a number settable in a sysctl(not yet)).
     but prior to routing the firewall can inspect them (see below).
     (possibly in the future you may be able to associate a FIB
     with packets received on an interface..  An ifconfig arg, but not yet.)

  3/ packets inspected by a packet classifier, which can arbitrarily
     associate a fib with it on a packet by packet basis.
     A fib assigned to a packet by a packet classifier
     (such as ipfw) would over-ride a fib associated by
     a more default source. (such as cases 1 or 2).

  4/ a tcp listen socket associated with a fib will generate
     accept sockets that are associated with that same fib.

  5/ Packets generated in response to some other packet (e.g. reset
     or icmp packets). These should use the FIB associated with the
     packet being reponded to.

  6/ Packets generated during encapsulation.
     gif, tun and other tunnel interfaces will encapsulate using the FIB
     that was in effect withthe proces that set up the tunnel.
     thus setfib 1 ifconfig gif0 [tunnel instructions]
     will set the fib for the tunnel to use to be fib 1.

  Routing messages would be associated with their
  process, and thus select one FIB or another.
  messages from the kernel would be associated with the fib they
  refer to and would only be received by a routing socket associated
  with that fib. (not yet implemented)

  In addition Netstat has been edited to be able to cope with the
  fact that the array is now 2 dimensional. (It looks in system
  memory using libkvm (!)). Old versions of netstat see only the first FIB.

  In addition two sysctls are added to give:
  a) the number of FIBs compiled in (active)
  b) the default FIB of the calling process.

  Early testing experience:
  -------------------------

  Basically our (IronPort's) appliance does this functionality already
  using ipfw fwd but that method has some drawbacks.

  For example,
  It can't fully simulate a routing table because it can't influence the
  socket's choice of local address when a connect() is done.

  Testing during the generating of these changes has been
  remarkably smooth so far. Multiple tables have co-existed
  with no notable side effects, and packets have been routes
  accordingly.

  ipfw has grown 2 new keywords:

  setfib N ip from anay to any
  count ip from any to any fib N

  In pf there seems to be a requirement to be able to give symbolic names to the
  fibs but I do not have that capacity. I am not sure if it is required.

  SCTP has interestingly enough built in support for this, called VRFs
  in Cisco parlance. it will be interesting to see how that handles it
  when it suddenly actually does something.

  Where to next:
  --------------------

  After committing the ABI compatible version and MFCing it, I'd
  like to proceed in a forward direction in -current. this will
  result in some roto-tilling in the routing code.

  Firstly: the current code's idea of having a separate tree per
  protocol family, all of the same format, and pointed to by the
  1 dimensional array is a bit silly. Especially when one considers that
  there is code that makes assumptions about every protocol having the
  same internal structures there. Some protocols don't WANT that
  sort of structure. (for example the whole idea of a netmask is foreign
  to appletalk). This needs to be made opaque to the external code.

  My suggested first change is to add routing method pointers to the
  'domain' structure, along with information pointing the data.
  instead of having an array of pointers to uniform structures,
  there would be an array pointing to the 'domain' structures
  for each protocol address domain (protocol family),
  and the methods this reached would be called. The methods would have
  an argument that gives FIB number, but the protocol would be free
  to ignore it.

  When the ABI can be changed it raises the possibilty of the
  addition of a fib entry into the "struct route". Currently,
  the structure contains the sockaddr of the desination, and the resulting
  fib entry. To make this work fully, one could add a fib number
  so that given an address and a fib, one can find the third element, the
  fib entry.

  Interaction with the ARP layer/ LL layer would need to be
  revisited as well. Qing Li has been working on this already.

  This work was sponsored by Ironport Systems/Cisco

Reviewed by:    several including rwatson, bz and mlair (parts each)
Obtained from:  Ironport systems/Cisco
2008-05-09 23:03:00 +00:00
Doug Rabson
06c85cef9d When blocking on an F_FLOCK style lock request which is upgrading a
shared lock to exclusive, drop the shared lock before deadlock
detection.

MFC after: 2 days
2008-05-09 10:34:23 +00:00
Pawel Jakub Dawidek
b109dd74fe - Export HZ value via kern.hz sysctl (this is the same name as for the
loader tunable).
- Document other sysctls in this file and also mark them as loader tunable
  via CTLFLAG_RDTUN flag.

Reviewed by:	roberto
2008-05-09 07:42:02 +00:00
Attilio Rao
688b98135c Add a new witness sysctl which returns the relations between any lock
and its children in the form:
"parent","child"
so that head and bottom of an oriented graph can be easilly detected and
various form of diagrams can be build.
The sysctl is called debug.witness.graphs and it is read-only; in order
to get the list of relations, a simple:
#sysctl debug.witness.graphs
will do the trick.

This approach has been choosen in order to support easilly things like
the DOT format and such.  Soon, an auto-explicative awk script, which
filters simple informations returned by the sysctl and converts them into
a real DOT script, will be committed to the repository between examples.

Discussed with:	rwatson
2008-05-07 21:41:36 +00:00
Kip Macy
c8c7ad9260 add malloc flag to blist so that it can be used in ithread context
Reviewed by: alc, bsdimp
2008-05-05 19:48:54 +00:00
John Baldwin
be00f6053b Fix a few edge cases with error handling in cpufreq(4)'s CPUFREQ_GET()
method:
- If the last of the child cpufreq drivers returns an error while trying to
  fetch its list of supported frequencies but an earlier driver found the
  requested frequency, don't return an error to the caller.
- If all of the child cpufreq drivers fail and the attempt to match the
  frequency based on 'cpu_est_clockrate()' fails, return ENXIO rather than
  returning success and returning a frequency of CPUFREQ_VAL_UNKNOWN.

MFC after:	3 days
PR:		kern/121433
Reported by:	Eugene Grosbein  eugen ! kuzbass dot ru
2008-05-05 19:13:52 +00:00
Peter Wemm
43d7128c14 Expand kdb_alt_break a little, most commonly used with the option
ALT_BREAK_TO_DEBUGGER.  In addition to "Enter ~ ctrl-B" (to enter the
debugger), there is now "Enter ~ ctrl-P" (force panic) and
"Enter ~ ctrl-R" (request clean reboot, ala ctrl-alt-del on syscons).

We've used variations of this at work.  The force panic sequence is
best used with KDB_UNATTENDED for when you just want it to dump and
get on with it.

The reboot request is a safer way of getting into single user than
a power cycle.  eg: you've hosed the ability to log in (pam, rtld, etc).
It gives init the reboot signal, which causes an orderly reboot.

I've taken my best guess at what the !x86 and non-sio code changes
should be.

This also makes sio release its spinlock before calling KDB/DDB.
2008-05-04 23:29:38 +00:00
Attilio Rao
60e2edce55 sync_vnode() has some messy code about locking in order to deal with
mount fs needing Giant to be held when processing bufobjs.
Use a different subqueue for pending workitems on filesystems requiring
Giant. This simplifies the code notably and also reduces the number of
Giant acquisitions (and the whole processing cost).

Suggested by:	jeff
Reviewed by:	kib
Tested by:	pho
2008-05-04 13:54:55 +00:00
Julian Elischer
2182c0cfbf Attempt to make the print types more friendly to other architectures.
Prodded by: Max Laier
Help from: BMS, jhb
2008-04-30 20:00:30 +00:00
Julian Elischer
c59b9a7659 Document the kproc_kthread_add() call
and fix a small detail of its implementation.
MFC after: 1 week
2008-04-29 22:43:15 +00:00
Roman Divacky
d6891277a4 Lock filedesc exclusively when modifying fd_[cr]dir.
This is probably harmless but it's better to lock it
correctly.

Approved by:	kib (mentor)
2008-04-29 21:40:11 +00:00
Julian Elischer
6eeac1d921 Add an option (compiled out by default)
to profile outoing packets for a number of mbuf chain
related parameters
e.g. number of mbufs, wasted space.
probably will do with further work later.

Reviewed by: various
2008-04-29 21:23:21 +00:00
David Xu
3bba58f287 Fix compiling problem. 2008-04-29 05:48:05 +00:00
David Xu
727158f6f6 Introduce command UMTX_OP_WAIT_UINT_PRIVATE and UMTX_OP_WAKE_PRIVATE
to allow userland to specify that an address is not shared by multiple
processes.
2008-04-29 03:48:48 +00:00
Robert Watson
ae11a989e6 When writing trailers in sendfile(2), don't call kern_writev()
while holding the socket buffer lock.  These leads to an
immediate panic due to recursing the socket buffer lock.  This
bug was introduced in uipc_syscalls.c:1.240, but masked by
another bug until that was fixed in uipc_syscalls.c:1.269.

Note that the current fix isn't perfect, but better than
panicking: normally we guarantee that simultaneous invocations
of a system call to write on a stream socket won't be
interlaced, which is ensured by use of the socket buffer sleep
lock.  This is guaranteed for the sendfile headers, but not
trailers.  In practice, this is likely not a problem, but
should be fixed.

MFC after:	3 days
Pointy hat to:	andre (1.240), cperciva (1.269)
2008-04-27 15:50:00 +00:00
Kris Kennaway
5894445dad * Correct a mis-merge that leaked the PROC_LOCK [1]
* Return ENOENT on error instead of 0 [2]

Submitted by: rdivacky [1], kib [2]
2008-04-26 13:16:55 +00:00
Pawel Jakub Dawidek
3800322fe2 Implement 'show mount' command in DDB. Without argument, it prints short
info about all currently mounted file systems. When an address is given
as an argument, prints detailed info about the given mount point.

MFC after:	2 weeks
2008-04-26 13:04:48 +00:00
Jeff Roberson
6c47aaae12 - Add an integer argument to idle to indicate how likely we are to wake
from idle over the next tick.
 - Add a new MD routine, cpu_wake_idle() to wakeup idle threads who are
   suspended in cpu specific states.  This function can fail and cause the
   scheduler to fall back to another mechanism (ipi).
 - Implement support for mwait in cpu_idle() on i386/amd64 machines that
   support it.  mwait is a higher performance way to synchronize cpus
   as compared to hlt & ipis.
 - Allow selecting the idle routine by name via sysctl machdep.idle.  This
   replaces machdep.cpu_idle_hlt.  Only idle routines supported by the
   current machine are permitted.

Sponsored by:	Nokia
2008-04-25 05:18:50 +00:00
Kris Kennaway
b1ba81d948 fdhold can return NULL, so add the one remaining missing check for this
condition.

Reviewed by:    attilio
MFC after:      1 week
2008-04-24 22:08:36 +00:00
Konstantin Belousov
12e79a9bbc Allow the vnode zone to return the unused memory. The vnode reference
count is/shall be properly maintained for the long time, and VFS
shall be safe against the vnode memory reclamation.

Proposed by:	jeff
Tested by:	pho
2008-04-24 09:58:33 +00:00
Poul-Henning Kamp
9b4a8ab7ba Now that all platforms use genclock, shuffle things around slightly
for better structure.

Much of this is related to <sys/clock.h>, which should really have
been called <sys/calendar.h>, but unless and until we need the name,
the repocopy can wait.

In general the kernel does not know about minutes, hours, days,
timezones, daylight savings time, leap-years and such.  All that
is theoretically a matter for userland only.

Parts of kernel code does however care: badly designed filesystems
store timestamps in local time and RTC chips almost universally
track time in a YY-MM-DD HH:MM:SS format, and sometimes in local
timezone instead of UTC.  For this we have <sys/clock.h>

<sys/time.h> on the other hand, deals with time_t, timeval, timespec
and so on.  These know only seconds and fractions thereof.

Move inittodr() and resettodr() prototypes to <sys/time.h>.
Retain the names as it is one of the few surviving PDP/VAX references.

Move startrtclock() to <machine/clock.h> on relevant platforms, it
is a MD call between machdep.c/clock.c.  Remove references to it
elsewhere.

Remove a lot of unnecessary <sys/clock.h> includes.

Move the machdep.disable_rtc_set sysctl to subr_rtc.c where it belongs.
XXX: should be kern.disable_rtc_set really, it's not MD.
2008-04-22 19:38:30 +00:00
Pawel Jakub Dawidek
d90d4eb28c Back-out previous revision. For now I can use _ddb() variants of stack(9) KPI,
as I use it for debugging only. Once someone will need it for more production
features, the change should be reconsider.

Requested by:	rwatson
2008-04-21 17:22:35 +00:00
Robert Watson
8501a69cc9 Convert pcbinfo and inpcb mutexes to rwlocks, and modify macros to
explicitly select write locking for all use of the inpcb mutex.
Update some pcbinfo lock assertions to assert locked rather than
write-locked, although in practice almost all uses of the pcbinfo
rwlock main exclusive, and all instances of inpcb lock acquisition
are exclusive.

This change should introduce (ideally) little functional change.
However, it lays the groundwork for significantly increased
parallelism in the TCP/IP code.

MFC after:	3 months
Tested by:	kris (superset of committered patch)
2008-04-17 21:38:18 +00:00
Pawel Jakub Dawidek
f55f27f862 Allow linker_search_symbol_name() to be called with KLD lock held.
The linker_search_symbol_name() function is used by stack_print()
and stack_print() can be called from kernel module unload method.

MFC after:	1 week
2008-04-17 19:19:40 +00:00
Jeff Roberson
1690c6c1be - Add a metric to describe how busy a processor has been over the last
two ticks by counting the number of switches and the load when
   sched_clock() is called.
 - If the busy metric exceeds a threshold allow the idle thread to spin
   waiting for new work for a brief period to avoid using IPIs.  This
   reduces the cost on the sender and receiver as well as reducing wakeup
   latency considerably when it works.

Sponsored by:	Nokia
2008-04-17 09:56:01 +00:00
Jeff Roberson
8df78c41d6 - Make SCHED_STATS more generic by adding a wrapper to create the
variables and sysctl nodes.
 - In reset walk the children of kern_sched_stats and reset the counters
   via the oid_arg1 pointer.  This allows us to add arbitrary counters to
   the tree and still reset them properly.
 - Define a set of switch types to be passed with flags to mi_switch().
   These types are named SWT_*.  These types correspond to SCHED_STATS
   counters and are automatically handled in this way.
 - Make the new SWT_ types more specific than the older switch stats.
   There are now stats for idle switches, remote idle wakeups, remote
   preemption ithreads idling, etc.
 - Add switch statistics for ULE's pickcpu algorithm.  These stats include
   how much migration there is, how often affinity was successful, how
   often threads were migrated to the local cpu on wakeup, etc.

Sponsored by:	Nokia
2008-04-17 04:20:10 +00:00
Doug Rabson
a365ea5fba Fix compilation with LOCKF_DEBUG. 2008-04-16 14:08:12 +00:00
Konstantin Belousov
eab626f110 Move the head of byte-level advisory lock list from the
filesystem-specific vnode data to the struct vnode. Provide the
default implementation for the vop_advlock and vop_advlockasync.
Purge the locks on the vnode reclaim by using the lf_purgelocks().
The default implementation is augmented for the nfs and smbfs.
In the nfs_advlock, push the Giant inside the nfs_dolock.

Before the change, the vop_advlock and vop_advlockasync have taken the
unlocked vnode and dereferenced the fs-private inode data, racing with
with the vnode reclamation due to forced unmount. Now, the vop_getattr
under the shared vnode lock is used to obtain the inode size, and
later, in the lf_advlockasync, after locking the vnode interlock, the
VI_DOOMED flag is checked to prevent an operation on the doomed vnode.

The implementation of the lf_purgelocks() is submitted by dfr.

Reported by:	kris
Tested by:	kris, pho
Discussed with:	jeff, dfr
MFC after:	2 weeks
2008-04-16 11:33:32 +00:00
David Xu
d61f3de656 Implement POSIX function tcgetsid() which returns session id.
PR: stand/107561
2008-04-15 08:33:32 +00:00
Marcel Moolenaar
495168ba8d Support and switch to the ULE scheduler:
o  Implement IPI_PREEMPT,
o  Set td_lock for the thread being switched out,
o  For ULE & SMP, loop while td_lock points to blocked_lock for
   the thread being switched in,
o  Enable ULE by default in GENERIC and SKI,
2008-04-15 05:02:42 +00:00
Randall Stewart
cf71e4381a Add pru_flush routine so a transport can
flush itself during Shutdown

MFC after:	1 week
2008-04-14 18:06:04 +00:00
Alan Cox
e384d8a89b Initialize the vm object's flags to include OBJ_NOSPLIT, just like the
vm objects that are used by System V shared memory segments.
2008-04-13 21:08:34 +00:00
Attilio Rao
22dd228d5d Use a "rel" memory barrier for disowning the lock as it cames from an
exclusive locking operation.
2008-04-13 01:21:56 +00:00
Attilio Rao
0b0100db88 struct lock_instance and struct lock_list_entry don't need to be in the
public namespace for WITNESS as they are only used internally so just
move them in the private namespace for the subsystem (with all related
supporting definitions).
2008-04-13 01:20:47 +00:00
Poul-Henning Kamp
8d24f82310 fix printf type confusion on amd64 2008-04-12 21:51:54 +00:00
Poul-Henning Kamp
c9ad6040dd Emit summaries of struct c(alender)t(ime) <-> struct timespec conversions
under bootverbose.

Struct ct is used for setting/reading real time clocks and I'm about
to Do Things to some of those, so a bit of preemptive debugging is
in order.

Remove a pointless __inline.
2008-04-12 20:35:56 +00:00
Attilio Rao
e5f94314ad - Re-introduce WITNESS support for lockmgr. About the old implementation
the only one difference is that lockmgr*() functions now accept
  LK_NOWITNESS flag which skips ordering for the instanced calling.
- Remove an unuseful stub in witness_checkorder() (because the above check
  doesn't allow ever happening) and allow witness_upgrade() to accept
  non-try operation too.
2008-04-12 19:57:30 +00:00
Attilio Rao
872b7289fd - Remove a stale comment.
- Add an extra assertion in order to catch malformed requested operations.
2008-04-12 13:56:17 +00:00
Attilio Rao
1859cffaef Add missing stubs for spinlocks cpuset and intrcnt.
Submitted by:	kris
2008-04-12 13:51:18 +00:00
Xin LI
31c50f53da Instead of rolling our own jail number allocation procedure, use
alloc_unr() to do it.

Submitted by:	Ed Schouten <ed 80386 nl>
PR:		kern/122270
MFC after:	1 month
2008-04-11 21:31:15 +00:00
John Baldwin
03c7442d75 Use kthread_exit() to terminate a taskqueue thread rather than kproc_exit()
now that the taskqueue threads are kthreads rather than kprocs.

Reported by:	kris
2008-04-11 17:35:54 +00:00