Commit Graph

9637 Commits

Author SHA1 Message Date
David Xu
aff5bcb1b2 INT_MAX is defined in file sys/limits.h, include the file now. 2006-08-02 07:34:51 +00:00
Robert Watson
c0e1415d51 Move updated of 'numopensockets' from bottom of sodealloc() to the top,
eliminating a second set of identical mutex operations at the bottom.
This allows brief exceeding of the max sockets limit, but only by
sockets in the last stages of being torn down.
2006-08-02 00:45:27 +00:00
John Baldwin
03e161fdb1 Make system call modules a bit more robust:
- If we fail to register the system call during MOD_LOAD, then note that
  so that we don't try to deregister it or invoke the chained event handler
  during the subsequent MOD_UNLOAD event.  Doing the deregister when the
  register failed could result in trashing system call entries.
- Add a SI_SUB_SYSCALLS just before starting up init and use that to
  register syscall modules instead of SI_SUB_DRIVERS.  Registering system
  calls as late as possible increases the chances that any other module
  event handlers or SYSINITs in a module are executed to initialize the
  data in a kld before a syscall dependent on that data is able to be
  invoked.

MFC after:	3 days
2006-08-01 16:32:20 +00:00
John Baldwin
38affe135a Don't lock each of the processes while looking for a pid. The allproc and
proctree locks that we already hold provide sufficient protection.
2006-08-01 15:30:56 +00:00
Robert Watson
eaa6dfbcc2 Reimplement socket buffer tear-down in sofree(): as the socket is no
longer referenced by other threads (hence our freeing it), we don't need
to set the can't send and can't receive flags, wake up the consumers,
perform two levels of locking, etc.  Implement a fast-path teardown,
sbdestroy(), which flushes and releases each socket buffer.  A manual
dom_dispose of the receive buffer is still required explicitly to GC
any in-flight file descriptors, etc, before flushing the buffer.

This results in a 9% UP performance improvement and 16% SMP performance
improvement on a tight loop of socket();close(); in micro-benchmarking,
but will likely also affect CPU-bound macro-benchmark performance.
2006-08-01 10:30:26 +00:00
Robert Watson
b5ff091431 Close a race that occurs when using sendto() to connect and send on a
UNIX domain socket at the same time as the remote host is closing the
new connections as quickly as they open.  Since the connect() and
send() paths are non-atomic with respect to another, it is possible
for the second thread's close() call to disconnect the two sockets
as connect() returns, leading to the consumer (which plans to send())
with a NULL kernel pointer to its proposed peer.  As a result, after
acquiring the UNIX domain socket subsystem lock, we need to revalidate
the connection pointers even though connect() has technically succeed,
and reurn an error to say that there's no connection on which to
perform the send.

We might want to rethink the specific errno number, perhaps ECONNRESET
would be better.

PR:		100940
Reported by:	Young Hyun <youngh at caida dot org>
MFC after:	2 weeks
MFC note:	Some adaptation will be required
2006-07-31 23:00:05 +00:00
John Baldwin
53c9158f24 Trim an obsolete comment. ktrgenio() stopped doing crazy gymnastics when
ktrace was redone to be mostly synchronous again.
2006-07-31 15:31:43 +00:00
John Baldwin
91ce2694d1 Regen for MPSAFE flag removal. 2006-07-28 19:08:37 +00:00
John Baldwin
af5bf12239 Now that all system calls are MPSAFE, retire the SYF_MPSAFE flag used to
mark system calls as being MPSAFE:
- Stop conditionally acquiring Giant around system call invocations.
- Remove all of the 'M' prefixes from the master system call files.
- Remove support for the 'M' prefix from the script that generates the
  syscall-related files from the master system call files.
- Don't explicitly set SYF_MPSAFE when registering nfssvc.
2006-07-28 19:05:28 +00:00
John Baldwin
e0b4add8d8 Various fixes to comments in the syscall master files including removing
cruft from the audit import and adding mention of COMPAT4 to freebsd32.
2006-07-28 18:55:18 +00:00
John Baldwin
764e4d54e9 Adjust td_locks for non-spin mutexes, rwlocks, and sx locks so that it is
a count of all non-spin locks, not just lockmgr locks.  This can give us a
much cheaper way to see if we have any locks held (such as when returning
to userland via userret()) without requiring WITNESS.

MFC after:	1 week
2006-07-27 21:45:55 +00:00
John Baldwin
ea175645b4 Hold the reference on the mountpoint slightly longer in kern_statfs() and
kern_fstatfs() so that it is still held when prison_enforce_statfs() is
called (since that function likes to poke and prod at the mountpoint
structure).

MFC after:	3 days
2006-07-27 20:00:27 +00:00
John Baldwin
186abbd727 Write a magic value into mtx_lock when destroying a mutex that will force
all other mtx_lock() operations to block.  Previously, when the mutex was
destroyed, it would still have a valid value in mtx_lock(): either the
unowned cookie, which would allow a subsequent mtx_lock() to succeed, or a
pointer to the thread who destroyed the mutex if the mutex was locked when
it was destroyed.

MFC after:	3 days
2006-07-27 19:58:18 +00:00
John Baldwin
f30e89ced3 Fix a file descriptor race I reintroduced when I split accept1() up into
kern_accept() and accept1().  If another thread closed the new file
descriptor and the first thread later got an error trying to copyout the
socket address, then it would attempt to close the wrong file object.  To
fix, add a struct file ** argument to kern_accept().  If it is non-NULL,
then on success kern_accept() will store a pointer to the new file object
there and not release any of the references.  It is up to the calling code
to drop the references appropriately (including a call to fdclose() in case
of error to safely handle the aforementioned race).  While I'm at it, go
ahead and fix the svr4 streams code to not leak the accept fd if it gets an
error trying to copyout the streams structures.
2006-07-27 19:54:41 +00:00
Robert Watson
0075d85869 Remove call to soisdisconnected() in uipc_detach(), since it will already
have been invoked by uipc_close() or uipc_abort(), and the socket is in a
state of being torn down by the time we get to this point, so kqueue
state frobbed by soisdisconnected() is not available, so frobbing it will
result in a panic.

Reported by:	Munehiro Matsuda <haro at h4 dot dion dot ne dot jp>
2006-07-26 19:16:34 +00:00
Robert Watson
f14cce87dc Remove non-socket buffer routines from uipc_sockbuf.c, and socket buffer
specific routines from uipc_socket2.c following repo-copy.  We might
rethink the location of one or two at some point, but the division was
relatively clean.  uipc_sockbuf.c is now the home of routines that
manipulate socket buffers.
2006-07-24 16:21:31 +00:00
Robert Watson
b0668f7151 soreceive_generic(), and sopoll_generic(). Add new functions sosend(),
soreceive(), and sopoll(), which are wrappers for pru_sosend,
pru_soreceive, and pru_sopoll, and are now used univerally by socket
consumers rather than either directly invoking the old so*() functions
or directly invoking the protocol switch method (about an even split
prior to this commit).

This completes an architectural change that was begun in 1996 to permit
protocols to provide substitute implementations, as now used by UDP.
Consumers now uniformly invoke sosend(), soreceive(), and sopoll() to
perform these operations on sockets -- in particular, distributed file
systems and socket system calls.

Architectural head nod:	sam, gnn, wollman
2006-07-24 15:20:08 +00:00
Robert Watson
ca948c5e93 Remove duplicate 'or'.
Submitted by:	ru
2006-07-23 21:01:09 +00:00
Robert Watson
809c2b789c Update various uipc_socket.c comments, and reformat others. 2006-07-23 20:36:04 +00:00
Robert Watson
f23929fbc5 Add additional comments to the top of the UNIX domain socket implementation
providing some high level pointers regarding the implementation.
2006-07-23 20:06:45 +00:00
Robert Watson
4b19d603c4 Remove old kern.malloc sysctl, which generated a text representation of
the kernel malloc(9) state for vmstat -m.  libmemstat is now used to
generate a machine-readable version which is converged by vmstat -m
into a human-readable version.

Not for MFC.
2006-07-23 19:55:41 +00:00
Robert Watson
0ce3f16dbb Expand comments for malloc(9) to better describe the design and
statistics / memory types model.
2006-07-23 19:51:39 +00:00
Robert Watson
fb6d736d14 Update and reformat comments for POSIX.1e ACL utility routines. 2006-07-23 19:35:10 +00:00
Robert Watson
4f1f0ef523 Add two new unpcb flags, UNP_BINDING and UNP_CONNECTING, which will be
used to mark UNIX domain sockets as being in the process of binding or
connecting.  Use these to prevent simultaneous bind or connect
operations by multiple threads or processes on the same socket at the
same time, which closes race conditions present in the UNIX domain
socket implementation since inception.
2006-07-23 12:01:14 +00:00
Robert Watson
dd47f5ca9c Merge unp_bind() into uipc_bind(), as it is called only from uipc_bind(). 2006-07-23 11:02:12 +00:00
Robert Watson
6d32873c29 Since unp_attach() and unp_detach() are now called only from uipc_attach()
and uipc_detach(), merge them into their calling functions.
2006-07-23 10:25:28 +00:00
Robert Watson
7e711c3aae Move various UNIX socket global variables and sysctls from the middle of
the file to the top.
2006-07-23 10:19:04 +00:00
Robert Watson
f3f49bbbe8 In uipc_send() and uipc_rcvd(), store unp->unp_conn pointer in unp2
while working with the second unpcb to make the code more clear.
2006-07-22 18:41:42 +00:00
Robert Watson
1c381b19ff Re-wrap and other minor formatting and punctuation fixes for UNIX domain
socket comments.
2006-07-22 17:24:55 +00:00
John Baldwin
b04aff773e Add a comment to explain what fdclose() does and what it's purpose is
since the subtlety eluded me when I looked at it last week.
2006-07-21 20:24:00 +00:00
Robert Watson
a152f8a361 Change semantics of socket close and detach. Add a new protocol switch
function, pru_close, to notify protocols that the file descriptor or
other consumer of a socket is closing the socket.  pru_abort is now a
notification of close also, and no longer detaches.  pru_detach is no
longer used to notify of close, and will be called during socket
tear-down by sofree() when all references to a socket evaporate after
an earlier call to abort or close the socket.  This means detach is now
an unconditional teardown of a socket, whereas previously sockets could
persist after detach of the protocol retained a reference.

This faciliates sharing mutexes between layers of the network stack as
the mutex is required during the checking and removal of references at
the head of sofree().  With this change, pru_detach can now assume that
the mutex will no longer be required by the socket layer after
completion, whereas before this was not necessarily true.

Reviewed by:	gnn
2006-07-21 17:11:15 +00:00
Alan Cox
af51d7bf57 Eliminate OBJ_WRITEABLE. It hasn't been used in a long time. 2006-07-21 06:40:29 +00:00
John Baldwin
9079458ad2 Add a mutex to protect the list of interrupt config hooks. We do assume
that the only remove hook operation that can occur while processing the
hooks is to remove the currently executing hook.  This should be safe as
the existing code has assumed this already for a long time now.

Reviewed by:	scottl
MFC after:	1 week
2006-07-19 18:53:56 +00:00
John Baldwin
2f198e899a Call change_dir() instead of duplicating the code in fchdir(). 2006-07-19 18:30:33 +00:00
John Baldwin
b33887ea31 Don't free the sockaddr in kern_bind() and kern_connect() as not all
callers pass a sockaddr allocated via malloc() from M_SONAME anymore.
Instead, free it in the callers when necessary.
2006-07-19 18:28:52 +00:00
Stefan Farfeleder
c6e0a843cf Separate functions with a newline. 2006-07-17 21:00:42 +00:00
Poul-Henning Kamp
9c499ad92f Remove the NDEVFSINO and NDEVFSOVERFLOW options which no longer exists in
DEVFS.

Remove the opt_devfs.h file now that it is empty.
2006-07-17 09:07:02 +00:00
Robert Watson
5cd1a27145 Change comment on soabort() to more accurately describe how/when
soabort() is used.  Remove trailing white space.
2006-07-16 23:09:39 +00:00
Alan Cox
27ea29536c Enable debug.mpsafevfs by default on arm. Since every architecture except
powerpc has debug.mpsafevfs enabled by default, it is shorter to enumerate
the architectures on which debug.mpsafevfs is off.

Tested by: cognet@
2006-07-15 06:44:27 +00:00
Jung-uk Kim
8120ddb4c4 Let native elf class be registered earlier. 2006-07-14 22:39:18 +00:00
Pawel Jakub Dawidek
338ae5268b Remove duplicated #include. 2006-07-14 17:55:36 +00:00
David Xu
24af5900eb Backout the feature which can change thread's scheduling option, I really
don't want to mix process and thread scheduling options together in these
functions, now the thread scheduling option is implemented in new thr
syscalls.
2006-07-13 06:41:26 +00:00
David Xu
ba493ceb6b regenerate. 2006-07-13 06:32:55 +00:00
David Xu
60088160c9 Add syscalls thr_setscheduler, thr_getscheduler, and thr_setschedparam,
these syscalls are designed to set thread's scheduling parameters and
policy, because each syscall contains a size parameter, it is possible
to support future scheduling option, e.g SCHED_SPORADIC, this option
needs other fields in structure sched_param, current they are not
avaiblable.
2006-07-13 06:26:43 +00:00
John Baldwin
fed7988436 Honor db_pager_quit in 'show threadchain', 'show allchains', and
'show lockchain'.  This is especially helpful for the first 2 as a
threadchain could get stuck in an infinite loop during a mutex deadlock.
2006-07-12 21:25:24 +00:00
John Baldwin
19e9205a23 Simplify the pager support in DDB. Allowing different db commands to
install custom pager functions didn't actually happen in practice (they
all just used the simple pager and passed in a local quit pointer).  So,
just hardcode the simple pager as the only pager and make it set a global
db_pager_quit flag that db commands can check when the user hits 'q' (or a
suitable variant) at the pager prompt.  Also, now that it's easy to do so,
enable paging by default for all ddb commands.  Any command that wishes to
honor the quit flag can do so by checking db_pager_quit.  Note that the
pager can also be effectively disabled by setting $lines to 0.

Other fixes:
- 'show idt' on i386 and pc98 now actually checks the quit flag and
  terminates early.
- 'show intr' now actually checks the quit flag and terminates early.
2006-07-12 21:22:44 +00:00
Konstantin Belousov
3097d55a39 Use proper format specifier for pointers in debug printfs (turned off
by default).

Approved by:	pjd (mentor)
MFC after:	2 weeks
2006-07-12 11:41:53 +00:00
David Xu
a94d3e1f8a Use newkg to check if SCHED_OTHER is already inherited. 2006-07-12 07:02:28 +00:00
David Xu
c3ab507fcd Return priority range 0..PRI_MAX_TIMESHARE-PRI_MIN_TIMESHARE for
SCHED_OTHER, the same range as rtprio() is using. In old code,
it returns nice range -20 .. 20, nice should be treated as process
weight, it is really managed by getpriority() and setpriority()
syscalls, they are different.
2006-07-12 05:54:17 +00:00
Robert Watson
5908c617bb Several protocol switch functions (pru_abort, pru_detach, pru_sosetlabel)
return void, so don't implement no-op versions of these functions.
Instead, consistently check if those switch pointers are NULL before
invoking them.
2006-07-11 23:18:28 +00:00
Robert Watson
f949ae9b31 When pru_attach() fails, call sodealloc() on the socket rather than
using sorele() and the full tear-down path.  Since protocol state
allocation failed, this is not required (and is arguably undesirable).
This matches the behavior of sonewconn() under the same circumstances.
2006-07-11 21:56:58 +00:00
Robert Watson
337cc6b60e Reduce periods of simultaneous acquisition of various socket buffer
locks and the unplock during uipc_rcvd() and uipc_send() by caching
certain values from one structure while its locks are held, and
applying them to a second structure while its locks are held.  If
done carefully, this should be correct, and will reduce the amount
of work done with the global unp lock held.

Tested by:	kris (earlier version)
2006-07-11 21:49:54 +00:00
John Baldwin
90aff9de2d Regen. 2006-07-11 20:55:23 +00:00
John Baldwin
be5747d5b5 - Add conditional VFS Giant locking to getdents_common() (linux ABIs),
ibcs2_getdents(), ibcs2_read(), ogetdirentries(), svr4_sys_getdents(),
  and svr4_sys_getdents64() similar to that in getdirentries().
- Mark ibcs2_getdents(), ibcs2_read(), linux_getdents(), linux_getdents64(),
  linux_readdir(), ogetdirentries(), svr4_sys_getdents(), and
  svr4_sys_getdents64() MPSAFE.
2006-07-11 20:52:08 +00:00
David Xu
2dca4ca723 Don't forget to check invalid policy! 2006-07-11 08:19:57 +00:00
David Xu
006faeb831 Oops, remove debugger line. 2006-07-11 06:15:46 +00:00
David Xu
65343c788c Extended the POSIX scheduler APIs to accept lwpid as well, we've already
done this in ptrace syscall, when a pid is large than PID_MAX, the syscall
will search a thread in current process. It permits 1:1 thread library to
get and set a thread's scheduler attributes.
2006-07-11 06:11:34 +00:00
David Xu
2f26f4c66c For SCHED_OTHER, we always inherit current thread's interactive priority
unless current thread is realtime thread, in such case, we set a new zero
priority for it, notice we don't have per-thread nice, the priority
passed by userland is ignored here.
2006-07-11 06:01:14 +00:00
David Xu
a0712c99d0 Add POSIX scheduler parameters support to thr_new syscall, this permits
privileged process to create realtime thread.
2006-07-11 05:34:35 +00:00
David Xu
adc9c950af Create thread in separated ksegrp, so they always get correct user level
priority.
2006-07-10 23:14:07 +00:00
John Baldwin
c870740e09 - Split out kern_accept(), kern_getpeername(), and kern_getsockname() for
use by ABI emulators.
- Alter the interface of kern_recvit() somewhat.  Specifically, go ahead
  and hard code UIO_USERSPACE in the uio as that's what all the callers
  specify.  In place, add a new uioseg to indicate what type of pointer
  is in mp->msg_name.  Previously it was always a userland address, but
  ABI emulators may pass in kernel-side sockaddrs.  Also, remove the
  namelenp field and instead require the two places that used it to
  explicitly copy mp->msg_namelen out to userland.
- Use the patched kern_recvit() to replace svr4_recvit() and the stock
  kern_sendit() to replace svr4_sendit().
- Use kern_bind() instead of stackgap use in ti_bind().
- Use kern_getpeername() and kern_getsockname() instead of stackgap in
  svr4_stream_ti_ioctl().
- Use kern_connect() instead of stackgap in svr4_do_putmsg().
- Use kern_getpeername() and kern_accept() instead of stackgap in
  svr4_do_getmsg().
- Retire the stackgap from SVR4 compat as it is no longer used.
2006-07-10 21:38:17 +00:00
John Baldwin
0f8e0c3dd4 Explicitly use STAILQ_REMOVE_HEAD() when we know we are removing the head
element to avoid confusing Coverity.  It's now also easier for humans to
parse as well.

Found by:	Coverity Prevent(tm)
CID:		1201
2006-07-10 19:28:57 +00:00
John Baldwin
0bf8969c60 Fix two more instances of using a linker_file_t object in TAILQ() macros
after free'ing it.

Found by:	Coverity Prevent(tm)
CID:		1435
2006-07-10 19:13:45 +00:00
John Baldwin
6b5b470aea Don't try to reuse the linker_file structure after we've freed it when
throwing out the kld's loaded by the loader that didn't successfully link.

Found by:	Coverity Prevent(tm)
CID:		1435
2006-07-10 19:06:01 +00:00
Scott Long
e3546a7549 Use a sleep mutex instead of an sx lock for the kernel environment. This
allows greater flexibility for drivers that want to query the environment.

Reviewed by: jhb, mux
2006-07-09 21:42:58 +00:00
John Baldwin
d9f4623307 - Split ioctl() up into ioctl() and kern_ioctl(). The kern_ioctl() assumes
that the 'data' pointer is already setup to point to a valid KVM buffer
  or contains the copied-in data from userland as appropriate (ioctl(2)
  still does this).  kern_ioctl() takes care of looking up a file pointer,
  implementing FIONCLEX and FIOCLEX, and calling fi_ioctl().
- Use kern_ioctl() to implement xenix_rdchk() instead of using the stackgap
  and mark xenix_rdchk() MPSAFE.
2006-07-08 20:12:14 +00:00
John Baldwin
c1cccebe8b Add a kern_close() so that the ABIs can close a file descriptor w/o having
to populate a close_args struct and change some of the places that do.
2006-07-08 20:03:39 +00:00
John Baldwin
b1ee5b654d Rework kern_semctl a bit to always assume the UIO_SYSSPACE case. This
mostly consists of pushing a few copyin's and copyout's up into
__semctl() as all the other callers were already doing the UIO_SYSSPACE
case.  This also changes kern_semctl() to set the return value in a passed
in pointer to a register_t rather than td->td_retval[0] directly so that
callers can only set td->td_retval[0] if all the various copyout's succeed.

As a result of these changes, kern_semctl() no longer does copyin/copyout
(except for GETALL/SETALL) so simplify the locking to acquire the semakptr
mutex before the MAC check and hold it all the way until the end of the
big switch statement.  The GETALL/SETALL cases have to temporarily drop it
while they do copyin/malloc and copyout.  Also, simplify the SETALL case to
remove handling for a non-existent race condition.
2006-07-08 19:51:38 +00:00
Warner Losh
db2bc1bb82 Create bus_enumerate_hinted_children. This routine will allow drivers
to use the hinted child system.  Bus drivers that use this need to
implmenet the bus_hinted_child method, where they actually add the
child to their bus, as they see fit.  The bus is repsonsible for
getting the attribtues for the child, adding it in the right order,
etc.  ISA hinting will be updated to use this method.

MFC After: 3 days
2006-07-08 17:06:15 +00:00
Robert Watson
e4256d1e8d Move POSIX.1e-specific utility routines from kern_acl.c to
subr_acl_posix1e.c, leaving kern_acl.c containing only ACL system
calls and utility routines common across ACL types.

Add subr_acl_posix1e.c to the build.

Obtained from:	TrustedBSD Project
2006-07-06 23:37:39 +00:00
John Baldwin
398c993b2a - Explicitly acquire Giant around SYSINIT's and SYSUNINIT's since they are
not all known to be MPSAFE yet.
- Actually remove Giant from the kernel linker by taking it out of the
  KLD_LOCK() and KLD_UNLOCK() macros.

Pointy hat to:	jhb (2)
2006-07-06 21:39:39 +00:00
John Baldwin
3cb83e714d Add kern_setgroups() and kern_getgroups() and use them to implement
ibcs2_[gs]etgroups() rather than using the stackgap.  This also makes
ibcs2_[gs]etgroups() MPSAFE.  Also, it cleans up one bit of weirdness in
the old setgroups() where it allocated an entire credential just so it had
a place to copy the group list into.  Now setgroups just allocates a
NGROUPS_MAX array on the stack that it copies into and then passes to
kern_setgroups().
2006-07-06 21:32:20 +00:00
Wayne Salamon
65ee602e0c Audit the remaining parameters to the extattr system calls. Generate
the audit records for those calls.

Obtained from: TrustedBSD Project
Approved by: rwatson (mentor)
2006-07-06 19:33:38 +00:00
Robert Watson
6435cdafa3 Remove now unneeded opt_mac.h and mac.h includes. 2006-07-06 13:25:51 +00:00
Wayne Salamon
761aed363f Regen the system calls files, picking up the extended attr events, and some
mount-related changes done previously.

Approved by: rwatson (mentor)
2006-07-05 19:24:14 +00:00
Konstantin Belousov
c8d3bc1fa3 Back out my rev. 1.674. The better fix (rev. 1.637) is already in tree.
Approved by:	kan (mentor)
2006-07-05 16:33:25 +00:00
Wayne Salamon
bbe5d0318d Add audit events for the extended attribute system calls.
Obtained from: TrustedBSD Project
Approved by: rwatson (mentor)
2006-07-05 15:46:02 +00:00
Maxim Konovalov
1f36c876a1 o Fix grammar in the comment, indent macros. No functional changes. 2006-07-02 20:53:52 +00:00
Maxim Konovalov
75d960eb2e o Remove rev. 1.57 leftover, not reached code. 2006-07-02 20:49:46 +00:00
Maxim Konovalov
e2668f5563 o Fix typo in the comment.
PR:		kern/99632
Submitted by:	clsung
2006-06-30 08:10:55 +00:00
David E. O'Brien
2e4db89cfc Fix building with GCC 4.2: define data types before referring to them. 2006-06-29 19:37:31 +00:00
John Baldwin
fe95c76276 Fix semctl(2) breakage from the previous commit. Previously __semctl() had
a local 'semid' variable which was the array index and used uap->semid
as the original IPC id.  During the kern_semctl() conversion those two
variables were collapsed into a single 'semid' variable breaking the
places that needed the original IPC ID.  To fix, add a new 'semidx'
variable to hold the array index and leave 'semid' unmolested as the IPC
id.  While I'm here, explicitly document that the (undocumented, at least
in semctl(2)) SEM_STAT command curiously expects an array index in the
'semid' parameter rather than an IPC id.

Submitted by:	maxim
2006-06-29 13:58:36 +00:00
David Xu
5151eeb194 Fix a bug when accumulating run time, if a thread calls yield() syscall,
its run time may be lost.
2006-06-29 12:29:20 +00:00
David Xu
d29a8ce69b Fix system load count (noticed by dephij). Remove incorrect
comment.
2006-06-29 09:49:00 +00:00
David Xu
0922ef0c42 Remove unused function declaration.
Add else statement in sched_calc_pri.
Fix a bug when checking interrupt thread in sched_add.
2006-06-29 05:59:36 +00:00
David Xu
d60003a2e4 Remove load balancer code, since it has serious priority inversion problem
which really hurts performance on FreeBSD.
2006-06-29 05:36:34 +00:00
John Baldwin
49d409a108 - Add a kern_semctl() helper function for __semctl(). It accepts a pointer
to a copied-in copy of the 'union semun' and a uioseg to indicate which
  memory space the 'buf' pointer of the union points to.  This is then used
  in linux_semctl() and svr4_sys_semctl() to eliminate use of the stackgap.
- Mark linux_ipc() and svr4_sys_semsys() MPSAFE.
2006-06-27 18:28:50 +00:00
John Baldwin
597d608f86 - Expand the scope of Giant some in mount(2) to protect the vfsp structure
from going away.  mount(2) is now MPSAFE.
- Expand the scope of Giant some in unmount(2) to protect the mp structure
  (or rather, to handle concurrent unmount races) from going away.
  umount(2) is now MPSAFE, as well as linux_umount() and linux_oldumount().
- nmount(2) and linux_mount() were already MPSAFE.
2006-06-27 14:46:31 +00:00
Pawel Jakub Dawidek
0bd645ae0c Compress direct cr_ruid comparsion and jailed() call to suser_cred(9).
Reviewed by:	rwatson
2006-06-27 11:32:08 +00:00
Pawel Jakub Dawidek
8838c27693 Use suser_cred(9) instead of checking cr_uid directly.
Reviewed by:	rwatson
2006-06-27 11:29:38 +00:00
Pawel Jakub Dawidek
2905ade228 - Use suser_cred(9) instead of checking cr_ruid directly.
- For privileged processes safe two mutex operations.

We may want to consider if this is good idea to use SUSER_ALLOWJAIL here,
but for now I didn't wanted to change the original behaviour.

Reviewed by:	rwatson
2006-06-27 11:28:50 +00:00
Sergey Babkin
d81175c738 Backed out the change by request from rwatson.
PR:		kern/14584
2006-06-26 22:03:22 +00:00
John Baldwin
c94ce032df Address a problem I missed in removing Giant from the kernel linker. Not
all of the module event handlers are MP safe yet, so always acquire Giant
for now when invoking module event handlers.  Eventually we can add an
MPSAFE flag or some such and add appropriate locking to all module event
handlers.
2006-06-26 18:34:45 +00:00
John Baldwin
322fb40cbf Remove duplicate security checks already performed in kern_kldload(). 2006-06-26 18:33:32 +00:00
Robert Watson
e83b30bdcb Trim basically unused 'unp' in uipc_connect(). 2006-06-26 16:18:22 +00:00
Sergey Babkin
7a799f1ef0 The common UID/GID space implementation. It has been discussed on -arch
in 1999, and there are changes to the sysctl names compared to PR,
according to that discussion. The description is in sys/conf/NOTES.
Lines in the GENERIC files are added in commented-out form.
I'll attach the test script I've used to PR.

PR:		kern/14584
Submitted by:	babkin
2006-06-25 18:37:44 +00:00
Ian Dowse
450ec4ed45 If linker_release_module() fails then we still hold a reference on
the linker_file, so record this by restoring the linker_file pointer
in fp->file.
2006-06-25 12:36:21 +00:00
Pawel Jakub Dawidek
92c0849935 Simplify the code and remove two mutex operations.
MFC after:	2 weeks
2006-06-24 22:55:43 +00:00
John Baldwin
70f3778827 Replace the kld_mtx mutex with a kld_sx sx lock and expand it's scope to
protect all linker-related data structures including the contents of
linker file objects and the any linker class data as well. Considering how
rarely the linker is used I just went with the simple solution of
single-threading the whole thing rather than expending a lot of effor on
something more fine-grained and complex.  Giant is still explicitly
acquired while registering and deregistering sysctl's as well as in the
elf linker class while calling kmupetext().  The rest of the linker runs
without Giant unless it has to acquire Giant while loading files from a
non-MPSAFE filesystem.
2006-06-21 20:42:08 +00:00
John Baldwin
cbda6f950b - Push down Giant in kldfind() and kldsym().
- Remove several goto's by either using direct return's or else clauses.
2006-06-21 20:15:36 +00:00
John Baldwin
d36e739a0c Whoops, revert accidental commit. 2006-06-21 17:48:59 +00:00
John Baldwin
9dd44bd79e Fix two comments and a style fix. 2006-06-21 17:48:03 +00:00
John Baldwin
0df2972736 Various whitespace fixes. 2006-06-21 17:47:45 +00:00
John Baldwin
62d615d508 Conditionally acquire Giant around VFS operations. 2006-06-20 21:31:38 +00:00
John Baldwin
aeeb017bd6 - Push Giant down into linker_reference_module().
- Add a new function linker_release_module() as a more intuitive complement
  to linker_reference_module() that wraps linker_file_unload().
  linker_release_module() can either take the module name and version info
  passed to linker_reference_module() or it can accept the linker file
  object returned by linker_reference_module().
2006-06-20 20:54:13 +00:00
John Baldwin
f462ce3edd Make linker_find_file_by_name() and linker_find_file_by_id() static to
simplify linker locking.  The only external consumers now use
linker_file_foreach().
2006-06-20 20:41:15 +00:00
John Baldwin
932151064a - Add a new linker_file_foreach() function that walks the list of linker
file objects calling a user-specified predicate function on each object.
  The iteration terminates either when the entire list has been iterated
  over or the predicate function returns a non-zero value.
  linker_file_foreach() returns the value returned by the last invocation
  of the predicate function.  It also accepts a void * context pointer that
  is passed to the predicate function as well.  Using an iterator function
  avoids exposing linker internals to the rest of the kernel making locking
  simpler.
- Use linker_file_foreach() instead of walking the list of linker files
  manually to lookup ndis files in ndis(4).
- Use linker_file_foreach() to implement linker_hwpmc_list_objects().
2006-06-20 20:37:17 +00:00
John Baldwin
aaf3170501 Make linker_file_add_dependency() and linker_load_module() static since
only the linker uses them.
2006-06-20 20:18:42 +00:00
John Baldwin
e767366f99 Don't check if malloc(M_WAITOK) returns NULL. 2006-06-20 20:11:00 +00:00
John Baldwin
e5bb3a01d7 Use 'else' to remove another goto. 2006-06-20 19:49:28 +00:00
John Baldwin
73a2437a83 - Remove some useless variable initializations.
- Make some conditional free()'s where the condition was always true
  unconditional.
2006-06-20 19:32:10 +00:00
George V. Neville-Neil
fb11be62a2 Properly cast the values of valsize (the size of the value passed in)
in setsockopt so that they can be compared correctly against negative
values.  Passing in a negative value had a rather negative effect
on our socket code, making it impossible to open new sockets.

PR:		98858
Submitted by:	James.Juran@baesystems.com
MFC after:	1 week
2006-06-20 12:36:40 +00:00
Robert Watson
721150ad8f When retrieving SO_ERROR via getsockopt(), hold the socket lock around
the retrieval and replacement with 0.

MFC after:	1 week
2006-06-18 19:02:49 +00:00
Yaroslav Tykhiy
42ccd54fec Add a funny sysctl: debug.kdb.trap_code .
It is similar to debug.kdb.trap, except for it tries to cause a page fault
via a call to an invalid pointer.  This can highlight differences between
a fault on data access vs. a fault on code call some CPUs might have.

This appeared as a test for a work \
Sponsored by: RiNet (Cronyx Plus LLC)
2006-06-18 12:27:59 +00:00
Robert Watson
cd3a3a269f Remove sbinsertoob(), sbinsertoob_locked(). They violate (and have
basically always violated) invariannts of soreceive(), which assume
that the first mbuf pointer in a receive socket buffer can't change
while the SB_LOCK sleepable lock is held on the socket buffer,
which is precisely what these functions do.  No current protocols
invoke these functions, and removing them will help discourage them
from ever being used.  I should have removed them years ago, but
lost track of it.

MFC after:	1 week
Prodded almost by accident by:	peter
2006-06-17 22:48:34 +00:00
Ed Maste
374875fa56 Add a description for sysctl -d. 2006-06-17 02:58:18 +00:00
Robert Watson
9a44cbf19c Remove unused (and ifdef'd) unp_abort() and unp_drain().
MFC after:	1 month
2006-06-16 22:11:49 +00:00
David Malone
93ef14a74b Add a kern.timecounter.tc sysctl tree that contains the mask,
frequency, quality and current value of each available time counter.

At the moment all of these are read-only, but it might make sense to
make some of these read-write in the future.

MFC after:	3 months
2006-06-16 20:29:05 +00:00
Yaroslav Tykhiy
be70abccba Kill an XXX remark that has been untrue since rev. 1.150 of this file. 2006-06-16 07:36:18 +00:00
Christian S.J. Peron
4f0840f348 Axe Giant from vn_fullpath(9). The vnode -> pathname lookup should be
filesystem agnostic. We are not touching any file system specific functions
in this code path. Since we have a cache lock, there is really no need to
keep Giant around here.

This eliminates Giant acquisitions for any syscall which is auditing pathnames.

Discussed with:	jeff
2006-06-16 05:09:28 +00:00
Maxim Konovalov
059d68dea6 o Expand an exclusive lock scope to prevent a race between two
simultaneous module_register().

Original work done by:	Alex Lyashkov
Reviewed by:		jhb
MFC after:		2 weeks
2006-06-15 08:53:09 +00:00
David Xu
7bb561fbb9 Use scheduler API sched_relinquish() to implement yield() syscall. 2006-06-15 06:41:57 +00:00
David Xu
36ec198bd5 Add scheduler API sched_relinquish(), the API is used to implement
yield() and sched_yield() syscalls. Every scheduler has its own way
to relinquish cpu, the ULE and CORE schedulers have two internal run-
queues, a timesharing thread which calls yield() syscall should be
moved to inactive queue.
2006-06-15 06:37:39 +00:00
David Xu
c2c1ab1858 Clear ke_runq before calling maybe_preempt, this avoids a
KASSERT(ke->ke_runq == NULL) panic when the sched_add is recursively
called by maybe_preempt.

Reported by: Wojciech A. Koszek < dunstan at freebsd dot czest dot pl >
2006-06-14 03:46:03 +00:00
Xin LI
6ad26d8376 Unexpand an instance of TAILQ_EMPTY() 2006-06-14 03:14:26 +00:00
Marcel Moolenaar
e1684acf38 Unbreak 64-bit architectures. The 3rd argument to kern_kldload() is
a pointer to an integer and td->td_retval[0] is of type register_t.
On 64-bit architectures register_t is wider than an integer.
2006-06-14 03:01:06 +00:00
David Xu
2c7cae8042 Fox a typo in sched_is_timeshare. 2006-06-13 23:45:59 +00:00
David Xu
e15abbf251 Pass boolean value to __predict_false. Try to keep KSE slot count
correct for migrating thread, the count is a bit mess.
2006-06-13 23:01:50 +00:00
John Baldwin
edd32c2da2 Use kern_kldload() and kern_kldunload() to load and unload modules when
we intend for the user to be able to unload them later via kldunload(2)
instead of calling linker_load_module() and then directly adjusting the
ref count on the linker file structure.  This makes the resulting
consumer code simpler and cleaner and better hides the linker internals
making it possible to sanely lock the linker.
2006-06-13 21:36:23 +00:00
John Baldwin
b21c9288ce A couple of minor style tweaks. 2006-06-13 21:34:12 +00:00
John Baldwin
d53885879d - Add a kern_kldload() that is most of the previous kldload() and push
Giant down in it.
- Push Giant down in kern_kldunload() and reorganize it slightly to avoid
  using gotos.  Also, expose this function to the rest of the kernel.
2006-06-13 21:28:18 +00:00
John Baldwin
6b3d277ad4 - Push down Giant some in kldstat().
- Use a 'struct kld_file_stat' on the stack to read data under the lock
  and then do one copyout() w/o holding the lock at the end to push the
  data out to userland.
2006-06-13 21:11:12 +00:00
John Baldwin
b904477c68 Unexpand TAILQ_FOREACH() and TAILQ_FOREACH_SAFE(). 2006-06-13 20:49:07 +00:00
John Baldwin
3a600aeabc Remove some more pointless goto's and don't check to see if
malloc(M_WAITOK) returns NULL.
2006-06-13 20:27:23 +00:00
John Baldwin
2fa6cc80d7 Handle the simple case of just dropping a reference near the start of
linker_file_unload() instead of in the middle of a bunch of code for
the case of dropping the last reference to improve readability and sanity.
While I'm here, remove pointless goto's that were just jumping to a
return statement.
2006-06-13 19:45:08 +00:00
Maxim Konovalov
70df31f4de o There are two methods to get a process credentials over the unix
sockets:

1) A sender sends SCM_CREDS message to a reciever, struct cmsgcred;
2) A reciever sets LOCAL_CREDS socket option and gets sender
credentials in control message, struct sockcred.

Both methods use the same control message type SCM_CREDS with the
same control message level SOL_SOCKET, so they are indistinguishable
for the receiver.  A difference in struct cmsgcred and struct sockcred
layouts may lead to unwanted effects.

Now for sockets with LOCAL_CREDS option remove all previous linked
SCM_CREDS control messages and then add a control message with
struct sockcred so the process specifically asked for the peer
credentials by LOCAL_CREDS option always gets struct sockcred.

PR:		kern/90800
Submitted by:	Andrey Simonenko
Regres. tests:	tools/regression/sockets/unix_cmsg/
MFC after:	1 month
2006-06-13 14:33:35 +00:00
David Xu
b41f1452d9 Add scheduler CORE, the work I have done half a year ago, recent,
I picked it up again. The scheduler is forked from ULE, but the
algorithm to detect an interactive process is almost completely
different with ULE, it comes from Linux paper "Understanding the
Linux 2.6.8.1 CPU Scheduler", although I still use same word
"score" as a priority boost in ULE scheduler.

Briefly, the scheduler has following characteristic:
1. Timesharing process's nice value is seriously respected,
   timeslice and interaction detecting algorithm are based
   on nice value.
2. per-cpu scheduling queue and load balancing.
3. O(1) scheduling.
4. Some cpu affinity code in wakeup path.
5. Support POSIX SCHED_FIFO and SCHED_RR.
Unlike scheduler 4BSD and ULE which using fuzzy RQ_PPQ, the scheduler
uses 256 priority queues. Unlike ULE which using pull and push, the
scheduelr uses pull method, the main reason is to let relative idle
cpu do the work, but current the whole scheduler is protected by the
big sched_lock, so the benefit is not visible, it really can be worse
than nothing because all other cpu are locked out when we are doing
balancing work, which the 4BSD scheduelr does not have this problem.
The scheduler does not support hyperthreading very well, in fact,
the scheduler does not make the difference between physical CPU and
logical CPU, this should be improved in feature. The scheduler has
priority inversion problem on MP machine, it is not good for
realtime scheduling, it can cause realtime process starving.
As a result, it seems the MySQL super-smack runs better on my
Pentium-D machine when using libthr, despite on UP or SMP kernel.
2006-06-13 13:12:56 +00:00
John Baldwin
5c69ad8374 Use fget() in kqueue_register() instead of doing all the work by hand. 2006-06-12 21:46:23 +00:00
Warner Losh
ccdc8d9bff Add a convenience function rman_init_from_resource for initializing
a rman from a resource.

Also, include _bus.h since the implementation of bus_space isn't
needed here, just the definitions of the types.
2006-06-12 04:06:21 +00:00
Ian Dowse
eb1030c4fd Keep firmware images on the list until they have been unregistered
with firmware_unregister(). Previously when the last driver reference
had been dropped we would clear the list entry under the assumption
that the firmware module was about to be unloaded, but this was not
true if the firmware image had been loaded manually with kldload.

This makes it possible to manually kldload firmware images as a
workaround for drivers such as ipw that attempt to load firmware
while resuming after a suspend.

Reviewed by:	mlaier (an earlier version of the patch)
2006-06-10 17:04:07 +00:00
Robert Watson
b37ffd3189 Move some functions and definitions from uipc_socket2.c to uipc_socket.c:
- Move sonewconn(), which creates new sockets for incoming connections on
  listen sockets, so that all socket allocate code is together in
  uipc_socket.c.

- Move 'maxsockets' and associated sysctls to uipc_socket.c with the
  socket allocation code.

- Move kern.ipc sysctl node to uipc_socket.c, add a SYSCTL_DECL() for it
  to sysctl.h and remove lots of scattered implementations in various
  IPC modules.

- Sort sodealloc() after soalloc() in uipc_socket.c for dependency order
  reasons.  Statisticize soalloc() and sodealloc() as they are now
  required only in uipc_socket.c, and are internal to the socket
  implementation.

After this change, socket allocation and deallocation is entirely
centralized in one file, and uipc_socket2.c consists entirely of socket
buffer manipulation and default protocol switch functions.

MFC after:	1 month
2006-06-10 14:34:07 +00:00
Robert Watson
e02421f3fb Rearrange code in soalloc() so that it's less indented by returning
early if uma_zalloc() from the socket zone fails.  No functional
change.

MFC after:	1 week
2006-06-08 22:33:18 +00:00
Konstantin Belousov
55aef2632f Fix the LOR that occurs when the MAC compiled into the kernel
and vnode is destroyed.

Reviewed by:	rwatson
LOR:		189
MFC after:	2 weeks
Approved by:	kan (mentor)
2006-06-08 07:55:10 +00:00
David Xu
0ae716e5ee Make ke_rqindex unsigned. 2006-06-06 12:26:17 +00:00
Robert Watson
7ebfc8df78 Audit some arguments to nmount(), mount(), umount().
Submitted by:	wsalamon
Obtained from:	TrustedBSD Project
2006-06-05 15:32:07 +00:00
Robert Watson
6e79e6f805 Audit command, uid arguments for quotactl().
Audit the mode argument to mkfifo().
Audit the target path passed to symlink().

Submitted by:	wsalamon
Obtained from:	TrustedBSD Project
2006-06-05 13:34:23 +00:00
Robert Watson
d3778141bf Audit path passed to the acct() system call.
Obtained from:	TrustedBSD Project
2006-06-05 13:02:34 +00:00
John Baldwin
49b94bfc54 Bah, fix fat finger in last. Invert the ~ on MTX_FLAGMASK as it's
non-intuitive for the ~ to be built into the mask.  All the users now
explicitly ~ the mask.  In addition, add MTX_UNOWNED to the mask even
though it technically isn't a flag.  This should unbreak mtx_owner().

Quickly spotted by:	kris
2006-06-03 21:11:33 +00:00
John Baldwin
3ce3f44293 In the case of reentering the debugger due to an attempt to perform a
context switch while in the debugger, reenter the debugger sooner before
performing any statistics updates.
2006-06-03 20:49:44 +00:00
John Baldwin
315ce35f7b Simplify mtx_owner() so it only reads m->mtx_lock once. 2006-06-03 20:45:00 +00:00
John Baldwin
f781b5a4bb Style fix to be more like _mtx_lock_sleep(): use 'while (!foo) { ... }'
instead of 'for (;;) { if (foo) break; ... }'.
2006-06-03 20:44:01 +00:00
Pawel Jakub Dawidek
1f58dd4956 Fix a problem introduced in revision 1.220. On mount(2) failure, don't
forget to unbusy file system before its destruction.

This fixes the following warning on mount failure:

	Mount point <X> had 1 dangling refs

Tested by:	wkoszek
2006-06-02 20:29:02 +00:00
Doug Ambrisko
51e37c7f37 Make lio ident more consistant with aio ident. 2006-06-02 17:45:48 +00:00
Pawel Jakub Dawidek
f420242b2b Don't forget to unlock kq lock in low memory situations.
OK'ed by:	jmg
2006-06-02 13:23:39 +00:00
Pawel Jakub Dawidek
8ebab14c70 Remove confusing done_noglobal label. The KQ_GLOBAL_UNLOCK() macro know
how to handle both situations - when kq_global lock is and is not held.

OK'ed by:	jmg
2006-06-02 13:21:21 +00:00
Pawel Jakub Dawidek
241321abc0 Use SLIST_FOREACH_SAFE() macro, because knote_drop() can free an element
which can be then used to find next element in the list.

OK'ed by:	jmg
2006-06-02 13:18:59 +00:00
Olivier Houchard
4bb0f51d1d sched_rem() already sets ke->ke_state to KES_THREAD, so there's no need
to redo it.
2006-06-01 22:45:56 +00:00
Diomidis Spinellis
23efd78d03 Remove two locking assertion entries that:
a) were incorrectly written and therefore never compiled into
assertions, and
b) were incorrectly specified and when compiled resulted in a
failed assertion.
2006-05-31 14:06:06 +00:00
Diomidis Spinellis
f69ec7af12 Assertion code specifications are introduced using special character
sequences that are distinct from comments. %% is used for argument
locks; %! for pre- and post-conditions.
2006-05-30 20:49:54 +00:00
Diomidis Spinellis
b1b4282160 Remove incorrect lock validation specifications that caused
failed assertions with DEBUG_VFS_LOCKS.
We should reinstate them with correct specifications, possibly
after extendng vnode_if.awk

Noted by: truckman@
2006-05-30 20:21:51 +00:00
Tor Egge
57051fdc4b Close race between vmspace_exitfree() and exit1() and races between
vmspace_exitfree() and vmspace_free() which could result in the same
vmspace being freed twice.

Factor out part of exit1() into new function vmspace_exit().  Attach
to vmspace0 to allow old vmspace to be freed earlier.

Add new function, vmspace_acquire_ref(), for obtaining a vmspace
reference for a vmspace belonging to another process.  Avoid changing
vmspace refcount from 0 to 1 since that could also lead to the same
vmspace being freed twice.

Change vmtotal() and swapout_procs() to use vmspace_acquire_ref().

Reviewed by:	alc
2006-05-29 21:28:56 +00:00
Xin LI
56e26c3e7e Unexpand TAILQ_FIRST(foo) == NULL to TAILQ_EMPTY(foo). 2006-05-29 05:43:26 +00:00
Kris Kennaway
80a8e5da94 Correct typos
MFC after:	2 weeks
2006-05-28 22:15:28 +00:00
Robert Watson
4bb260ad78 In execve(), audit the path name being executed. In the future, it
would also be good to audit the interpreter pathname, if any.

Obtained from:	TrustedBSD Project
2006-05-28 08:28:47 +00:00
Diomidis Spinellis
0e1c7fb8ea Add missing % signs in the lock annotations of the functions:
lookup, rename, strategy, islocked
The missing % sign meant that the lines were processed as plain
comments and the corresponding assertions were never generated.
2006-05-28 07:24:12 +00:00
Xin LI
e38c7f3ef3 extlen and cpp is not used here in linker_search_kld(), so nuke them.
Reported by:	Mingyan Guo <guomingyan at gmail dot com>
MFC After:	2 weeks
2006-05-27 09:21:41 +00:00
Poul-Henning Kamp
9dd2370db6 If the console has no cncheckc method, use cngetc instead. 2006-05-26 11:00:20 +00:00
Poul-Henning Kamp
8aed7613bd Don't use CONS_DRIVER() macro to insert dummy element in cons_set 2006-05-26 10:46:38 +00:00
Poul-Henning Kamp
16b1613a31 GC the cn_dbctl_t hook for consoles, it is unused.
This used to make syscons switch to vty0 when we entered DDB but this
was lost in the KDB shuffle.  We may want to bring it back down the road
but it should be done by calling cn_init_t/cn_term_t instead, possibly
with a flag argument saying "Debugger!"
2006-05-26 10:24:00 +00:00
Craig Rodrigues
0c89bb0a02 Add "update" mount option to global_opts array,
for use with vfs_filteropt().
2006-05-26 02:38:48 +00:00
Craig Rodrigues
5eb304a91a Remove calls to vfs_export() for exporting a filesystem for NFS mounting
from individual filesystems.  Call it instead in vfs_mount.c,
after we call VFS_MOUNT() for a specific filesystem.
2006-05-26 00:32:21 +00:00
Robert Watson
20bdac8a4f Use getsock() and fput() instead of fgetsock() and fputsock() in
sendfile().  This causes sendfile() to use the file descriptor
reference to the socket instead of bumping the socket reference
count, which avoids an additional refcount operation, as well as a
potential expensive socket refcount drop, which can lead to
contention on the accept mutex.  This change also has the side
effect of further reducing the number of cases where an in-progress
I/O operation can occur on a socket after close, as using the file
descriptor refcount prevents the socket from closing while in use.

MFC after:	3 months
2006-05-25 15:10:13 +00:00
Stephan Uphoff
dcf67e65d2 Do not set B_NOCACHE on buffers when releasing them in flushbuflist().
If B_NOCACHE is set the pages of vm backed buffers will be invalidated.
However clean buffers can be backed by dirty VM pages so invalidating them
can lead to data loss.
Add support for flush dirty page in the data invalidation function
of some network file systems.

This fixes data losses during vnode recycling (and other code paths
using invalbuf(*,V_SAVE,*,*)) for data written using an mmaped file.

Collaborative effort by: jhb@,mohans@,peter@,ps@,ups@
Reviewed by:	tegge@
MFC after:	7 days
2006-05-25 01:00:35 +00:00
Sam Leffler
75b773ae3d When starting up threads in taskqueue_start_threads create them
stopped before adjusting their priority and setting them on the run
q so they cannot race for resources (pointed out by njl).

While here add a console printf on thread create fails; otherwise
noone may notice (e.g. return value is always 0 and caller has no
way to verify).

Reviewed by:	jhb, scottl
MFC after:	2 weeks
2006-05-24 22:11:07 +00:00
David Xu
f705bbe8b1 Don't allow non-root user to set a scheduler policy, otherwise this could
be a local DOS.

Submitted by: Diane Bruce at db at db.net
2006-05-21 00:40:38 +00:00
David Xu
f6c040a2c5 Style fixes.
Submitted by: Diane Bruce < db at db dot net >
2006-05-19 06:37:24 +00:00
David Xu
7b8d821268 Move flag TDF_UMTXQ into structure umtxq, this eliminates the requirement
of scheduler lock in some umtx code.
2006-05-18 08:43:46 +00:00
Poul-Henning Kamp
d595182f0b Make the printfs relating to purging threads from a device less intrusive. 2006-05-17 06:37:14 +00:00
Poul-Henning Kamp
c40da00ca3 Since DELAY() was moved, most <machine/clock.h> #includes have been
unnecessary.
2006-05-16 14:37:58 +00:00
Paul Saab
6befa6ae1b Allow concurrent read(2)/readv(2) access to a file.
Lock file offset against multiple read calls.

Submitted by:	ups
Obtained from:	Yahoo!
MFC after:	2 weeks
2006-05-16 07:50:54 +00:00
Kelly Yancey
c9ad8a67af Restore the ability to mount procfs and fdescfs filesystems via the
mount(2) system call:

  * Add cmount hook to fdescfs and pseudofs (and, by extension, procfs and
    linprocfs).  This (mostly) restores the ability to mount these
    filesystems using the old mount(2) system call (see below for the
    rest of the fix).

  * Remove not-NULL check for the data argument from the mount(2) entry
    point.  Per the mount(2) man page, it is up to the individual
    filesystem being mounted to verify data.  Or, in the case of procfs,
    etc. the filesystem is free to ignore the data parameter if it does
    not use it.  Enforcing data to be not-NULL in the mount(2) system call
    entry point prevented passing NULL to filesystems which ignored the
    data pointer value.  Apparently, passing NULL was common practice
    in such cases, as even our own mount_std(8) used to do it in the
    pre-nmount(2) world.

All userland programs in the tree were converted to nmount(2) long ago,
but I've found at least one external program which broke due to this
(presumably unintentional) mount(2) API change.  One could argue that
external programs should also be converted to nmount(2), but then there
isn't much point in keeping the mount(2) interface for backward
compatibility if it isn't backward compatible.
2006-05-15 19:42:10 +00:00
Benno Rice
77fe443878 The VERBOSE_SYSINIT stuff sees the DDB define a lot better if we include
opt_ddb.h.

Spotted by:	benno
Pointy hat to:	benno
2006-05-14 07:11:28 +00:00
Craig Rodrigues
5250012a1d For nmount(), if "rw" is specified as a mount option,
add "noro" to the list of mount options.  This allows
a read-only mount to be converted to read-write via:
mount -u -o rw

Requested by:	kris
2006-05-14 01:51:38 +00:00
John Baldwin
73dbd3da73 Remove various bits of conditional Alpha code and fixup a few comments. 2006-05-12 05:04:46 +00:00
Benno Rice
26ab616fdc Add a new kernel config option, VERBOSE_SYSINIT.
When porting FreeBSD to a new platform, one of the more useful things to do is
get mi_startup() to let you know which SYSINIT it's up to.  Most people tend to
whack a printf in the SYSINIT loop to print the address of the function it's
about to call.  Going one better, jhb made a version that uses DDB to look up
the name of the function and print that instead.  This version is essentially
his with the addition of some ifdeffery to make it optional and to allow it to
work (although using only the function address, not the symbol) if you forgot
to enable DDB.

All the cool bits by:	jhb
Approved by:		scottl, rink, cognet, imp
2006-05-12 02:01:38 +00:00
Poul-Henning Kamp
99ab8292c7 Remove more straggling CPU_ macro references 2006-05-11 17:53:26 +00:00
David Xu
005efcdb0e Use wakeup_one to avoid thundering herd.
Tested by: kris
2006-05-09 13:00:46 +00:00
David Xu
759ccccadb Use a dedicated mutex to protect aio queues, the movation is to reduce
lock contention with other parts.
2006-05-09 00:10:11 +00:00
Tor Egge
11991ab418 Call vn_finished_write() before calling the coredump handler which will
indirectly call vn_start_write() as necessary for each write.
2006-05-07 22:50:22 +00:00
Tor Egge
d302786c87 Temporarily unlock vnode for new image being executed to avoid lock order
reversals that can lead to deadlocks.  Normally vn_close(), namei() or vrele()
should not be called while holding vnode locks.
2006-05-05 20:25:05 +00:00
Pawel Jakub Dawidek
643df192de vn_start_write()/vn_finished_write() is not needed here, because
vn_start_write() is always called earlier in the code path and calling
the function recursively may lead to a deadlock.

Confirmed by:	tegge
MFC after:	2 weeks
2006-04-29 21:57:38 +00:00
Kris Kennaway
cef31ff7d9 Lock giant when assigning ni_vp and keep vfslocked state valid.
Committed for:	jeff
2006-04-29 07:13:49 +00:00
Pawel Jakub Dawidek
122410eea2 vn_start_write() is called only when v_type != VCHR, so corresponding
vn_finished_write() should also be called only then.

BTW. I fixed two functions here: vn_rdwr() and vn_write(). The latter seems
to be unused.

MFC after:	3 weeks
2006-04-28 21:54:05 +00:00
Robert Watson
3bf14fd5e9 Also check use_pty in the ptmx clone lookup; this means that when ptmx
support is turned off using the sysctl, we no longer even allow the
ptmx device to be looked up.

Foot provided by:	peter
2006-04-28 21:39:57 +00:00
Marcel Moolenaar
8f405ed335 Remove the puc-specific hacks. The puc(4) driver now properly uses
the rman(9) interface.
2006-04-28 21:23:09 +00:00
Jeff Roberson
6ca9fcc586 - Add a BO_NEEDSGIANT flag to the bufobj. This flag forces all child
buffers to go on the buf daemon's DIRTYGIANT queue.
 - Set BO_NEEDSGIANT on ffs's devvp since the ffs_copyonwrite handler
   runs in the context of the buf daemon and may require Giant.
2006-04-28 01:05:31 +00:00
Jeff Roberson
4b5b86816c - Consistently track ni_dvp and ni_vp with dvfslocked and vfslocked rather
than trying to optimize it into a single lock.  This adds more calls to
   lock giant with non smpsafe filesystems but is the only way to reliably
   hold the correct lock.
 - Remove an invalid assert in the mountedhere case in lookup and fix the
   code to properly deal with the scenario.  We can actually have a lookup
   that returns dp == dvp with mountedhere set with certain unmount races.

Tested by:	kris
Reported by:	kris/mohans
2006-04-28 00:59:48 +00:00
John-Mark Gurney
5c06d111b8 back out for now... revert ccpu to being kern.ccpu... 2006-04-27 17:57:59 +00:00
John-Mark Gurney
c71ce6a445 move remaining sysctl into the kern.sched tree... 2006-04-26 19:42:38 +00:00
John Baldwin
ae110b53d1 Add some new commands to hopefully make it easier to diagnose lock-related
problems in ddb:
- "show threadchain [thread]" will start with the specified thread (or the
  current kdb thread by default) and show it's state.  If it is blocked on
  a lock, it will find the owner of the lock and show its state, etc.
- "show allchains" will find all of the threads that are blocked on a
  lock (but do not have any threads blocked on a lock they hold) and show
  the resulting thread chain.
- "show lockchain <lock>" takes a pointer to a lock_object (such as a
  mutex or rwlock).  If there is a turnstile for that lock, then it will
  display all the threads blocked on the lock.  In addition, for each
  thread blocked on the lock, it will display any contested locks they
  hold, and recurse on those locks to show any threads blocked on those
  locks, etc.
2006-04-25 20:28:17 +00:00
John Baldwin
de833b7c0c Use db_lookup_thread() to lookup the thread for the passed in address
and change 'show locks' to only list the locks for a given thread
rather than for all the threads in the process containing a specified
thread.
2006-04-25 20:24:23 +00:00
Marius Strobl
fa63296aba Remove last vestiges of sab(4). 2006-04-25 19:43:53 +00:00
Robert Watson
102ea03373 Extend getsock() to return the struct file flags read while holding the
file lock, in the style of fgetsock().

Modify accept1() to use getsock() instead of fgetsock(), relying on the
file descriptor reference rather than an acquired socket reference to
prevent the listen socket from being destroyed during accept().  This
avoids additional reference count operations, which should improve
performance, and also avoids accept1() operating on a socket whose file
descriptor has been torn down, which may have resulted in protocol
shutdown starting.

MFC after:	3 months
2006-04-25 11:48:16 +00:00
Maxim Konovalov
481f8fe85f Inherit LOCAL_CREDS option from listen socket for sockets returned
by accept(2).

PR:		kern/90644
Submitted by:	Andrey Simonenko
OK'ed by:	mdodd
Tested by:	NetBSD regress/sys/kern/unfdpass/unfdpass.c
MFC after:	1 month
2006-04-24 19:09:33 +00:00
Marcel Moolenaar
845652dd28 MFp4: Add the ipend() method to the serdev I/F to allow umbrella
drivers to obtain pending interrupt status from subordinate
	drivers.
2006-04-23 22:12:39 +00:00
Robert Watson
0cec9959e8 Assert that sockets passed into soabort() not be SQ_COMP or SQ_INCOMP,
since that removal should have been done a layer up.

MFC after:	3 months
2006-04-23 18:15:54 +00:00
Robert Watson
28ea180136 Add missing 'not' to SQ_COMP comment.
MFC after:	3 months
2006-04-23 15:37:23 +00:00
Robert Watson
6ca35d4b81 Move handling of SQ_COMP exception case in sofree() to the top of the
function along with the remainder of the reference checking code.  Move
comment from body to header with remainder of comments.  Inclusion of a
socket in a completed connection queue counts as a true reference, and
should not be handled as an under-documented edge case.

MFC after:	3 months
2006-04-23 15:33:38 +00:00
John Baldwin
f9ab2f134f Print td_name instead of p_comm if td_name is non-empty for
'show turnstile' and 'show sleepq'.
2006-04-21 20:40:43 +00:00
Paul Saab
95f16c1e2c Don't try to kill embryonic processes in killpg1(). This prevents
a race condition between fork() and kill(pid,sig) with pid < 0 that
can cause a kernel panic.

Submitted by:	up
MFC after:	3 weeks
2006-04-21 19:26:21 +00:00
Paul Saab
4f590175b7 Allow for nmbclusters and maxsockets to be increased via sysctl.
An eventhandler is used to update all the various zones that depend
on these values.
2006-04-21 09:25:40 +00:00
John-Mark Gurney
be4db476a6 const'ify resource_spec to note that we won't be changing anything while
releasing resources... also, NULL out the resources as we free them...
2006-04-20 01:44:16 +00:00
Warner Losh
0385d64761 r_spare1 and r_spare2 aren't needed. They aren't used. They can't be
accessed from outside of subr_rman.c.  Remove them.

Reviewed by: jmg (in theory)
2006-04-19 21:25:55 +00:00
John Baldwin
fea3efe5bf Implement rw_try_upgrade() and rw_downgrade(). rw_try_upgrade() makes a
single attempt at upgrading a read lock to a write lock, and rw_downgrade()
converts curthread's write lock into a read lock.
2006-04-19 21:06:52 +00:00
Wojciech A. Koszek
5884c1a098 'owner' is not used without SMP. Fix kernel build for such kernel
configurations.

Approved by:	jhb
2006-04-18 20:32:42 +00:00
John Baldwin
efa86db61d Adaptively spin before blocking on the turnstile if an rwlock is write
locked.  In general the adaptive spinning is similar to the same code
for mutexes with some extra trickiness in rw_wunlock_hard().  Specifically,
even though both wait bits might be set and we might have a turnstile with
at least one waiting thread, there might not be any threads blocked on the
queue we are not waking up (they might all be spinning), and we should
only preserve the waiting flag for the queue we aren't waking up if there
are in fact threads blocked on that queue.  Secondly, there might not be
any threads blocked on the queue we have chosen to waken threads from
(there might only be threads blocked on the other queue and the threads
for this queue are all spinning) in which case we disown the turnstile
instead of doing a braodcast and unpend.
2006-04-18 18:27:54 +00:00
John Baldwin
f1a4b852dc - Bring back turnstile_empty() which can check to see if an individual
queue on a turnstile is empty.
- Add a turnstile_disown() function that allows a thread to give up
  ownership of a turnstile w/o waking up any waiters.
2006-04-18 18:16:54 +00:00
Xin LI
4207c279d4 In vfs_hash_get(): mount point should never be changed
so explicitly constify the mp parameter.

Reviewed by:	phk
2006-04-18 08:05:08 +00:00
John Baldwin
38bf165fa1 - Add a rw_wowner() macro that just returns the owner of a write lock and
use it in places that only care about the write owner instead of
  rw_owner() as a baby step towards limited read-lock owner.
- Tidy the code that sets the WAITER flag bits to not duplicate a test
  around the atomic operation and the KTR trace in both of the lock
  functions.
2006-04-17 21:11:01 +00:00
John Baldwin
32553b153e Add a 'show sleepqueue' alias for 'show sleepq' in DDB. 2006-04-17 20:16:32 +00:00
John Baldwin
964b557211 Trim trailing whitespace. 2006-04-17 20:14:51 +00:00
John Baldwin
2971c36136 Add a new module_file() function that returns the linker_file_t associated
with a given module_t.  I use this in some the MOD_LOAD event handler for
some test kernel modules to ask the kernel linker to look up the linker
sets in my test modules. (I use linker sets to generate the list of
possible events that I then signal to execute via a sysctl.  On non-amd64,
ld(8) would resolve the entire linker set, but on amd64 I have to ask the
kernel linker to do it for me, and having the kernel linker do it works on
all archs.)
2006-04-17 19:44:44 +00:00
John Baldwin
0f180a7cce Change msleep() and tsleep() to not alter the calling thread's priority
if the specified priority is zero.  This avoids a race where the calling
thread could read a snapshot of it's current priority, then a different
thread could change the first thread's priority, then the original thread
would call sched_prio() inside msleep() undoing the change made by the
second thread.  I used a priority of zero as no thread that calls msleep()
or tsleep() should be specifying a priority of zero anyway.

The various places that passed 'curthread->td_priority' or some variant
as the priority now pass 0.
2006-04-17 18:20:38 +00:00
John-Mark Gurney
e98b5a89de remove duplicate sizeof vnode entry (debug.sizeof.vnode already existed)...
move ncsize into debug.sizeof and rename to namecache...
2006-04-16 18:38:30 +00:00
Scott Long
bb141be10a Take a better stab at making this compile. 2006-04-15 18:54:56 +00:00
Scott Long
83bc5d54c8 Take a stab at making this compile. 2006-04-15 18:04:04 +00:00
John Baldwin
76447e5618 Mark the thread pointer used during an adaptive spin volatile so that the
compiler doesn't decide to cache td_state.  Cachine the state would cause
the spinning thread to not notice when the owning thread stopped executing
(if it was preempted for example) which could result in livelock.
2006-04-14 19:51:50 +00:00
John Baldwin
a29b4f6eec Drop the kqueue global mutex as soon as we are finished with it rather
than keeping it locked until we exit the function to optimize the case
where the lock would be dropped and later reacquired.  The optimization
was broken when kevent's were moved from UFS to VFS and the knote list
lock for a vnode kevent became the lockmgr vnode lock.  If one tried
to use a kqueue that contained events for a kqueue fd followed by a vnode,
then the kq global lock would end up being held when the vnode lock was
acquired which could result in sleeping with a mutex held (and subsequent
panics) if the vnode lock was contested.

Reviewed by:	jmg
Tested by:	ps (on 6.x)
MFC after:	3 days
2006-04-14 14:27:28 +00:00
David Xu
cfd6f8cd6c Clear TDF_SINTR in sleepq_resume_thread, also sleepq_catch_signal does
not need to clear it now, this should fix panic when msleep is recursivly
called. Patch is slightly adjusted after review.

Reviewed by: jhb
Tested by: Csaba Henk, csaba-ml at creo.hu
MFC after: 3 days
2006-04-13 23:29:25 +00:00
John Baldwin
9477358d00 Turn on ithread_destroy() and call it from intr_event_destroy() to tear
down an interrupt event's associated thread (if it has one).
2006-04-13 17:29:04 +00:00
Christian S.J. Peron
d5e5634075 Kill the last Giant acquisition in the exit(2) code. This Giant acquisition
doesn't appear to be protecting anything. Most of consumers funsetownlst(9)
do not appear to be picking up Giant anywhere. This was originally a part
of my Giant exit(2) clean up revision 1.272 but I thought it was a good idea
to leave it out until we were able to analyze it better.

Tested by:	kris
MFC after:	3 weeks
2006-04-10 14:07:28 +00:00
Pawel Jakub Dawidek
0909f38a3c On shutdown try to turn off all swap devices. This way GEOM providers are
properly closed on shutdown.

Requested by:	ru
Reviewed by:	alc
MFC after:	2 weeks
2006-04-10 10:03:41 +00:00
David Xu
e631cff309 Use proc lock to prevent a thread from exiting, Giant was no longer used to
protect thread list.
2006-04-10 04:55:59 +00:00
Robert Watson
d37b79a00f Remove UNIX domain socket raw socket support. This feature is documented
as being undocumented in Stevens, and was broken in 1997 during network
stack infrastructure work.  It is the one remaining (and incorrect)
direct protocol reference to raw_usrreq.pru_attach; this is incorrect
because the raw socket code assumes that raw_uattach is called only after
the protocol has allocated a PCB.

MFC after:	3 months
2006-04-09 16:29:47 +00:00
Marcel Moolenaar
07c8931358 Add the scc_hwmtx spin mutex, defined by scc(4). 2006-04-07 22:15:54 +00:00
John-Mark Gurney
1c4ca5e5fe spell unlock correctly, this is relatively minor as it's rare someone would
provide a lock method, and want the default unlock, but it is a bug...

PR:		95356
Submitted by:	Stephen Corteselli
MFC after:	3 days
2006-04-07 17:21:27 +00:00
Jeff Roberson
b53bf1269c - VFS_LOCK_GIANT when recycling a vnode via getnewvnode. We may be
recycling for an unrelated filesystem.  I really don't like potentially
   acquiring giant in the context of a giantless filesystem but there
   are reasonable objections to removing the recycling from this path.

Sponsored by:	Isilon Systems, Inc.
2006-04-04 06:46:10 +00:00
Jeff Roberson
4b24e4210e - Properly check against B_DELWRI and B_NEEDSGIANT. This check was
incorrectly written and caused some !NEEDSGIANT buffers to be put in
   the NEEDSGIANT queue.

Sponsored by:	Isilon Systems, Inc.
2006-04-04 06:44:21 +00:00
Marcel Moolenaar
39eb1d1263 Increment kdb_active after we stopped the other CPUs and decrement
kdb_active before we restart them. This avoids false positives on
restarted CPUs when they test for kdb_active while kdb_trap() is
still finishing up.
2006-04-04 00:40:20 +00:00
Marcel Moolenaar
bfcdefd8aa Eliminate HAVE_STOPPEDPCBS. On ia64 the PCPU holds a pointer to the
PCB in which the context of stopped CPUs is stored. To access this
PCB from KDB, we introduce a new define, called KDB_STOPPEDPCB. The
definition, when present, lives in <machine/kdb.h> and abstracts
where MD code saves the context. Define KDB_STOPPEDPCB on i386,
amd64, alpha and sparc64 in accordance to previous code.
2006-04-03 22:51:47 +00:00
Peter Wemm
b9eee07e36 Remove the unused sva and eva arguments from pmap_remove_pages(). 2006-04-03 21:16:10 +00:00
Marcel Moolenaar
5991a4f811 In kdb_trap(), change the type of the local variable 'intr' from int
to register_t, as intr_disable() returns the latter and register_t
may be wider than int.

Pointed out by: marius@
2006-04-03 20:55:52 +00:00
Marcel Moolenaar
2fae8f5aed Replace critical_enter() and critical_exit() in kdb_trap() with
intr_disable() and intr_restore() resp. Previously, critical
regions would have interrupts disabled, but that was changed.
Consequently, the debugger could run with interrupts enabled.
This could cause problems for the low-level console code where
received characters would trigger an interrupt that causes
the interrupt handler to read the character instead of the
cngetc() function.
2006-04-03 17:48:09 +00:00
John-Mark Gurney
5e6125891f mask out any action when copying the flags from the event to the knote..
Pointed out by:	Václav Haisman
Submitted by:	Dan Nelson (slightly modifed patch)
MFC after:	3 days
2006-04-01 20:15:39 +00:00
Robert Watson
bc725eafc7 Chance protocol switch method pru_detach() so that it returns void
rather than an error.  Detaches do not "fail", they other occur or
the protocol flags SS_PROTOREF to take ownership of the socket.

soclose() no longer looks at so_pcb to see if it's NULL, relying
entirely on the protocol to decide whether it's time to free the
socket or not using SS_PROTOREF.  so_pcb is now entirely owned and
managed by the protocol code.  Likewise, no longer test so_pcb in
other socket functions, such as soreceive(), which have no business
digging into protocol internals.

Protocol detach routines no longer try to free the socket on detach,
this is performed in the socket code if the protocol permits it.

In rts_detach(), no longer test for rp != NULL in detach, and
likewise in other protocols that don't permit a NULL so_pcb, reduce
the incidence of testing for it during detach.

netinet and netinet6 are not fully updated to this change, which
will be in an upcoming commit.  In their current state they may leak
memory or panic.

MFC after:	3 months
2006-04-01 15:42:02 +00:00
Robert Watson
ac45e92ff2 Change protocol switch pru_abort() API so that it returns void rather
than an int, as an error here is not meaningful.  Modify soabort() to
unconditionally free the socket on the return of pru_abort(), and
modify most protocols to no longer conditionally free the socket,
since the caller will do this.

This commit likely leaves parts of netinet and netinet6 in a situation
where they may panic or leak memory, as they have not are not fully
updated by this commit.  This will be corrected shortly in followup
commits to these components.

MFC after:      3 months
2006-04-01 15:15:05 +00:00
Robert Watson
fa4c5373ce Add comment to accept1() that it should use getsock() instead of fgetsock()
to avoid additional mutex operations, and also to avoid use of soref/sorele
which are now not preferred.

MFC after:	3 months
2006-04-01 11:14:56 +00:00
Robert Watson
197b35d717 Mark fgetsock() and fputsock() as depcrecated: callers should rely on
the file descriptor reference, rather than paying additional lock
operations to acquire a socket reference from the file descriptor.
This will also help to ensure that file descriptor based socket
requests are not delivered to a socket after close.  Most consumers
have already been converted to this model.

MFC after:	3 months
2006-04-01 11:09:54 +00:00
Robert Watson
7f689de232 Assert so->so_pcb is NULL in sodealloc() -- the protocol state should not
be present at this point.  We will eventually remove this assert because
the socket layer should never look at so_pcb, but for now it's a useful
debugging tool.

MFC after:	3 months
2006-04-01 10:45:52 +00:00
Robert Watson
220c1357ed Add a somewhat sizable comment documenting the semantics of various kernel
socket calls relating to the creation and destruction of sockets.  This
will eventually form the foundation of socket(9), but is currently in too
much flux to do so.

MFC after:	3 months
2006-04-01 10:43:02 +00:00