Commit Graph

9295 Commits

Author SHA1 Message Date
Jeff Roberson
c5dcb84008 - Revert r1.406 until a solution can be found that doesn't break nfs. The
statfs handler in nfs will lock vnodes which may lead to deadlock or
   recursion.

Found by:	kris
Pointy hat to:	me
2006-02-22 09:52:25 +00:00
Jeff Roberson
a4aeaefe5a - We can not hold a vnode lock while we do a lookup. Search for and load
modules prior to looking up the directory which we will cover to avoid
   this problem in mount.
 - We must hold the coveredvp locked before we can busy the mountpoint to
   prevent a lock order reversal with the vfs_busy() in lookup which holds
   the directory lock prior to doing a vfs_busy().  The directory lock is
   required to safely clear the v_mountedhere field on the directory.

MFC After:	1 week
2006-02-22 06:29:55 +00:00
Jeff Roberson
8a7cd2fdfb - Grab a mnt ref in vfs_busy() before dropping the interlock. This will
prevent the mount point from going away while we're waiting on the lock.
   The ref does not need to persist once we have the lock because the
   lock prevents the mount point from being unmounted.

MFC After:	1 week
2006-02-22 06:20:12 +00:00
Jeff Roberson
05b6a20a66 - Hold the vnode used in the statfs related functions until we're done with
the VFS_STATFS call to prevent the mount from disappearing while we're
   stating.
 - Convert these routines to use MPSAFE namei semantics.

MFC After:	1 week
2006-02-22 06:19:08 +00:00
David Xu
ba0360b135 Abstract function mqfs_create_node() to create a mqueue node. 2006-02-22 02:38:25 +00:00
David Xu
ad8de0f243 If block size is zero, use normal file operations to do I/O,
this eliminates a divided-by-zero fault.

Recommended by: phk
2006-02-22 00:05:12 +00:00
John Baldwin
bd106be404 Move the ruadd() in kern_exit() to save our final stats in our child
stats even further down in exit1() so that it includes the runtime and
tick counts from the final time slice for the dying thread.

Reviewed by:	phk
2006-02-21 21:48:42 +00:00
John Baldwin
6fc6433ecd Split calcru() back into a calcru1() function shared with calccru() and
a calcru() wrapper that passes a local rusage_ext on the stack that is
a snapshot to do the calculations on.  Now we can pass p->p_crux to
calcru1() in calccru() again which fixes the issues with runtime going
backwards messages when dead processes are harvested by init.

Reviewed by:	phk
Tested by:	Stefan Ehmann shoesoft at gmx dot net
2006-02-21 21:47:46 +00:00
Andre Oppermann
80444f8803 The sysctls kern.ipc.[max_linkhdr|max_protohdr|max_hdr|max_datalen]
can't be changed from userland.  Make them read-only and provide
descriptions.

kern.ipc.max_datalen must never be less than one byte.  Enforce this
with a panic in net_init_domain().

Sponsored by:	TCP/IP Optimization Fundraise 2005
MFC after:	3 days
2006-02-18 17:16:18 +00:00
Andre Oppermann
ec63cb90a3 Replace the 4k fixed sized jumbo mbuf clusters with PAGE_SIZE sized
jumbo mbuf clusters.  To make the variable size clear they are named
MJUMPAGESIZE.

Having jumbo clusters with the native PAGE_SIZE is more useful than
a fixed 4k size according the device driver writers using this API.

The 9k and 16k jumbo mbuf clusters remain unchanged.

Requested by:	glebius, gallatin
Sponsored by:	TCP/IP Optimization Fundraise 2005
MFC after:	3 days
2006-02-17 14:14:15 +00:00
Andre Oppermann
a4684d742d Make sysctl_msec_to_ticks(SYSCTL_HANDLER_ARGS) generally available instead
of being private to tcp_timer.c.

Sponsored by:	TCP/IP Optimization Fundraise 2005
MFC after:	3 days
2006-02-16 15:40:36 +00:00
David Xu
94f0972bec Fix a long standing race between sleep queue and thread
suspension code. When a thread A is going to sleep, it calls
sleepq_catch_signals() to detect any pending signals or thread
suspension request, if nothing happens, it returns without
holding process lock or scheduler lock, this opens a race
window which allows thread B to come in and do process
suspension work, however since A is still at running state,
thread B can do nothing to A, thread A continues, and puts
itself into actually sleeping state, but B has never seen it,
and it sits there forever until B is woken up by other threads
sometimes later(this can be very long delay or never
happen). Fix this bug by forcing sleepq_catch_signals to
return with scheduler lock held.
Fix sleepq_abort() by passing it an interrupted code, previously,
it worked as wakeup_one(), and the interruption can not be
identified correctly by sleep queue code when the sleeping
thread is resumed.
Let thread_suspend_check() returns EINTR or ERESTART, so sleep
queue no longer has to use SIGSTOP as a hack to build a return
value.

Reviewed by:	jhb
MFC after:	1 week
2006-02-15 23:52:01 +00:00
Wayne Salamon
085a0d43ca Audit the arguments to the ptrace(2) system call.
Obtained from: TrustedBSD Project
Approved by: rwatson (mentor)
2006-02-14 01:18:31 +00:00
Wayne Salamon
bfd7575a39 Audit the arguments to the kill(2) and killpg(2) system calls.
Obtained from: TrustedBSD Project
Approved by: rwatson (mentor)
2006-02-14 01:17:03 +00:00
David Xu
d8267df729 In order to speed up process suspension on MP machine, send IPI to
remote CPU. While here, abstract thread suspension code into a function
called sig_suspend_threads, the function is called when a process received
a STOP signal.
2006-02-13 03:16:55 +00:00
Robert Watson
13f322c2fc Improve consistency of return() style.
MFC after:	3 days
2006-02-12 15:00:27 +00:00
Poul-Henning Kamp
e8444a7e6f CPU time accounting speedup (step 2)
Keep accounting time (in per-cpu) cputicks and the statistics counts
in the thread and summarize into struct proc when at context switch.

Don't reach across CPUs in calcru().

Add code to calibrate the top speed of cpu_tickrate() for variable
cpu_tick hardware (like TSC on power managed machines).

Don't enforce monotonicity (at least for now) in calcru.  While the
calibrated cpu_tickrate ramps up it may not be true.

Use 27MHz counter on i386/Geode.

Use TSC on amd64 & i386 if present.

Use tick counter on sparc64
2006-02-11 09:33:07 +00:00
David Xu
42925630b6 Test before modifying p_sflag to avoid unconditionally cache line
ping-pong on SMP.
2006-02-10 14:59:16 +00:00
David Xu
71b7afb2b4 Call thread_stopped in thr_exit to notify parent that the child process
is now fully stopped, this was already in kse_exit().
2006-02-10 03:34:29 +00:00
Poul-Henning Kamp
eb2da9a51f Simplify system time accounting for profiling.
Rename struct thread's td_sticks to td_pticks, we will need the
other name for more appropriately named use shortly.  Reduce it
from uint64_t to u_int.

Clear td_pticks whenever we enter the kernel instead of recording
its value as reference for userret().  Use the absolute value of
td->pticks in userret() and eliminate third argument.
2006-02-08 08:09:17 +00:00
Poul-Henning Kamp
5b1a8eb397 Modify the way we account for CPU time spent (step 1)
Keep track of time spent by the cpu in various contexts in units of
"cputicks" and scale to real-world microsec^H^H^H^H^H^H^H^Hclock_t
only when somebody wants to inspect the numbers.

For now "cputicks" are still derived from the current timecounter
and therefore things should by definition remain sensible also on
SMP machines.  (The main reason for this first milestone commit is
to verify that hypothesis.)

On slower machines, the avoided multiplications to normalize timestams
at every context switch, comes out as a 5-7% better score on the
unixbench/context1 microbenchmark.  On more modern hardware no change
in performance is seen.
2006-02-07 21:22:02 +00:00
John Baldwin
222fdf4bff Provide some anti-footshooting. Don't allow the user to set the interval
for acctwatch() runs to be negative or zero as this could result in either
a possible hang (or panic if INVARIANTS is on).  Previously the accounting
code handled the <= 0 case by calling acctwatch on every clock tick (eww!)
due to an implementation detail of callout_reset().  (Tick counts of
<= 0 are converted to 1).

MFC after:	3 days
2006-02-07 18:59:47 +00:00
John Baldwin
505a14934e - Add a kthread to periodically call acctwatch() when accounting is active
instead of calling acctwatch() from softclock.  The acctwatch() function
  needs to hold an sx lock and also makes a VFS call, and neither of these
  are good things (or safe) to do from a callout.  The kthread only exists
  and is running when accounting is turned on; it is started and stopped
  as needed.  I didn't run acctwatch() via the thread taskqueue at Robert's
  request as he was worried that if the accounting file was over NFS the
  VFS_STAT() calls might stall other work on the taskqueue.
- Add an acct_disable() function to take care of closing the accounting
  vnode and cleaning up so we don't duplicate the same code in two
  different places.

MFC after:	3 days
2006-02-07 16:04:03 +00:00
John Baldwin
8917b8d28c - Always call exec_free_args() in kern_execve() instead of doing it in all
the callers if the exec either succeeds or fails early.
- Move the code to call exit1() if the exec fails after the vmspace is
  gone to the bottom of kern_execve() to cut down on some code duplication.
2006-02-06 22:06:54 +00:00
John Baldwin
809f984b21 Add a kern_eaccess() function and use it to implement xenix_eaccess()
rather than kern_access().

Suggested by:	rwatson
2006-02-06 22:00:53 +00:00
John Baldwin
934ba9b2cf - Move the wakeup() for exiting kthreads out of exit1() and into
kthread_exit() as that is cleaner and less obscured.  It also does the
  wakeup sooner.
- Add some comments to kthread_exit().
2006-02-06 21:56:13 +00:00
John Baldwin
2c9d9d392a We don't need the proc lock to check P_KTHREAD on curthread since it is
only set before the kthread starts executing and is never cleared.
2006-02-06 21:54:47 +00:00
Olivier Houchard
2a3b10658d rwlock expects the struct thread to be aligned on 8 bytes, so make sure
thread0 is.
2006-02-06 16:03:10 +00:00
Jeff Roberson
04f6d3effa - Add a ref count to the mount structure. Sleep for up to 3 seconds in
vfs_mount_destroy waiting for this ref to hit 0.  We don't print an
   error if we are rebooting as the root mount always retains some refernces
   by init proc.
 - Acquire a mnt ref for every vnode allocated to a mount point.  Drop this
   ref only once vdestroy() has been called and the mount has been freed.
 - No longer NULL the v_mount pointer in delmntque() so that we may release
   the ref after vgone() has been called.  This allows us to guarantee
   that the mount point structure will be valid until the last vnode has
   lost its last ref.
 - Fix a few places that rely on checking v_mount to detect recycling.

Sponsored by:	Isilon Systems, Inc.
MFC After:	1 week
2006-02-06 10:19:50 +00:00
Jeff Roberson
2f0bca553a - Don't check v_mount for NULL to determine if a vnode has been recycled.
Use the more appropriate VI_DOOMED flag instead.

Sponsored by:	Isilon Systems, Inc.
MFC After:	1 week
2006-02-06 10:15:27 +00:00
Jeff Roberson
36a52c3cae - Add the global 'rebooting' variable that is used to detect when
boot() has been called.

Sponsored by:	Isilon Systems, Inc.
MFC After:	1 week
2006-02-06 10:12:00 +00:00
David Xu
ea8e65b0fa Add members pl_sigmask and pl_siglist into ptrace_lwpinfo to get lwp's
signal mask and pending signals.
2006-02-06 09:41:56 +00:00
Robert Watson
9653775b18 Regenerate. 2006-02-06 02:00:32 +00:00
Robert Watson
c983324ef5 Prefer AUE_FOO audit identifiers to AUE_O_FOO, which are largely left
over from the Darwin implementation.

When we implement a system call as a wrapper to sysctl(), audit it as
AUE_SYSCTL.  This leads to greater compatibility with Solaris audit
trails as sysctl() argument tokens are not the same as the ones for
the originaly system calls (i.e., setdomainname()).

Replace references to AUE_ events that are equivilent to AUE_NULL with
AUE_NULL.  In the case of process signal configuration, this is
because these events do not require auditing.

Move from the Darwin spelling of getsockopt() to the FreeBSD/Solaris
one.

Audit nmount().

Obtained from:	TrustedBSD Project
2006-02-06 02:00:06 +00:00
Robert Watson
89964dd284 When exiting a thread, submit any pending record. Today, we don't
audit thread exit, but should that happen, this will prevent
unhappiness, as the thread exit system call will never return, and
hence not commit the record.

Pointed out by/with:	cognet
Obtained from:		TrustedBSD Project
2006-02-06 01:51:08 +00:00
Wayne Salamon
2f8a46d5ff Audit the arguments (user/group IDs) for the system calls that set these IDs.
Obtained from: TrustedBSD Project
Approved by: rwatson (mentor)
2006-02-06 00:32:33 +00:00
Wayne Salamon
ad20c8f325 Audit the args to rfork(), and the child PID for all fork system calls.
Obtained from: TrustedBSD Project
Approved by: rwatson (mentor)
2006-02-06 00:28:50 +00:00
Wayne Salamon
de3007e8f3 Audit the pid being requested in wait4().
Obtained from: TrustedBSD Project
Approved by: rwatson (mentor)
2006-02-06 00:19:09 +00:00
Wayne Salamon
a750d0b2a2 Add auditing of arguments to the close() and fstat() system calls. Much more
argument auditing yet to come, for remaining system calls in this file.

Obtained from: TrustedBSD Project
Approved by: rwatson (mentor)
2006-02-05 23:57:32 +00:00
Robert Watson
00c28d9678 On process exit, audit the return value of the process, and commit the
record immediately, as this system call never returns.

Obtained from:	TrustedBSD Project
2006-02-05 21:08:25 +00:00
Robert Watson
6e8525ce84 When GC'ing a thread, assert that it has no active audit record.
This should not happen, but with this assert, brueffer and I would
not have spent 45 minutes trying to figure out why he wasn't
seeing audit records with the audit version in CVS.

Obtained from:	TrustedBSD Project
2006-02-05 21:06:09 +00:00
Robert Watson
95fea57c65 Add AUDITVNODE[12] flags to namei(), which cause namei() to audit path
and vnode attribute information for looked up vnodes during the lookup
operation.  This will allow consumers of namei() to specify that this
information be added to the in-process audit record.

Submitted by:	wsalamon
Obtained from:	TrustedBSD Project
2006-02-05 15:42:01 +00:00
David Xu
25c926f1b0 Regenerate. 2006-02-05 02:23:41 +00:00
David Xu
9e7d72246f Implement thr_set_name to set a name for thread.
Reviewed by: julian
2006-02-05 02:18:46 +00:00
David Xu
7f96995ebd Create childproc_jobstate function to report job control state, this
also fixes a bug in childproc_continued which ignored PS_NOCLDSTOP.
2006-02-04 14:10:57 +00:00
David Xu
a99f7ca21e Axe unused code. 2006-02-04 06:36:39 +00:00
John Baldwin
37f84a6018 Add a comment. 2006-02-03 21:09:40 +00:00
John Baldwin
b0864d13ab Sort includes. 2006-02-03 16:37:55 +00:00
Robert Watson
59428b0bad In fchdir(), Giant must be separately acquired and dropped if the old
vnode is from a file system that is not MPSAFE, as vrele() expects
Giant to be held when it is called on a non-MPSAFE vnode.

Spotted by:	kris
Tested by:	glebius
2006-02-03 15:42:16 +00:00
Robert Watson
d7bd3313e2 Regenerate. 2006-02-03 11:51:19 +00:00
Robert Watson
62646c07f6 Assign audit event identifiers to many system calls.
Much work by:	wsalamon
Obtained from:	TrustedBSD Project
2006-02-03 11:48:37 +00:00
Tor Egge
c78226329a For low memory situations, non-VMIO buffers didnt't release pages back to
the system when brelse() was called with B_RELBUF set on the buffer.  This
could be a problem when the system was low on memory, had many buffers on
QUEUE_EMPTYKVA and started to traverse directories.  For each getnewbuf(),
pages were allocated from the system, driving the free reserve downwards.
For each brelse(), the system put the buffer on QUEUE_CLEAN, with B_INVAL
set.

This commit changes the semantics of B_RELBUF to also free pages from
non-VMIO buffers.

Reviewed by:	alc
2006-02-02 21:37:39 +00:00
Olivier Houchard
56db7f4cc6 Don't destroy the slave /dev entry until someone figures out why devfs seems
to behave badly when we do so.
2006-02-02 20:35:45 +00:00
John Baldwin
f6b457923d Whitespace fix.
Submitted by:	Wojciech A. Koszek <dunstan at zsno ids czest pl>
2006-02-02 20:14:52 +00:00
Jeff Roberson
68ce4375c4 - textvp may have been from a different mountpoint than ndp->ni_vp and
we may need to acquire giant to vrele it.

Found by:	mjacob
MFC After:	3 days
2006-02-02 08:39:39 +00:00
Robert Watson
06f2859f6d Regenerate. 2006-02-02 01:45:01 +00:00
Robert Watson
35d29f5091 Map audit-related system calls to audit event identifiers.
Much work by:	wsalamon
Obtained from:	TrustedBSD Project
2006-02-02 01:44:30 +00:00
Robert Watson
fcf7f27a36 Hook up audit to fork() and exit() events. These changes manage the
audit state on processes, not auditing of these events.

Much work by:	wsalamon
Obtained from:	TrustedBSD Project
2006-02-02 01:32:58 +00:00
Robert Watson
3683665bbd Hook up audit to the initial process creation events (proc0, proc1).
Much help from:	wsalamon
Obtained from:	TrustedBSD Project
2006-02-02 01:16:31 +00:00
Robert Watson
911b84b08d Add new fields to process-related data structures:
- td_ar to struct thread, which holds the in-progress audit record during
  a system call.

- p_au to struct proc, which holds per-process audit state, such as the
  audit identifier, audit terminal, and process audit masks.

In the earlier implementation, td_ar was added to the zero'd section of
struct thread.  In order to facilitate merging to RELENG_6, it has been
moved to the end of the data structure, requiring explicit
initalization in the thread constructor.

Much help from:	wsalamon
Obtained from:	TrustedBSD Project
2006-02-02 00:37:05 +00:00
Jeff Roberson
9157b485f0 - Solve a problem where a vput could be called on an outgoing directory
without Giant held.  Do this by tracking the vfslocked state for
   the directory seperate from the child.  This is only important
   in the case where we cross a mountpoint.

Sponsored by:	Isilon Systems, Inc.
MFC After:	3 days
2006-02-01 09:34:32 +00:00
Jeff Roberson
0ac72424f0 - chroot and chdir need to lock giant as appropriate for the outgoing vp
as well as the new vp.

Sponsored by:	Isilon Systems, Inc.
MFC After:	3 days
2006-02-01 09:30:44 +00:00
Scott Long
803e980d03 Fix another compile problem. If I find any more, this file is going in the
Attic until it is properly fixed.
2006-02-01 04:18:07 +00:00
Jeff Roberson
b099db5881 - Solve a race where we could lose a call to VOP_INACTIVE. If vget() waiting
on a lock held the last usecount ref on a vnode and the lock failed we
   would not call INACTIVE.  Solve this by only holding a holdcnt to prevent
   the vnode from disappearing while we wait on vn_lock.  Other callers
   may now VOP_INACTIVE while we are waiting on the lock, however this race
   is acceptable, while losing INACTIVE is not.

Discussed with:	kan, pjd
Tested by:	kkenn
Sponsored by:	Isilon Systems, Inc.
MFC After:	1 week
2006-02-01 00:30:05 +00:00
Jeff Roberson
89b0e10910 - Reorder calls to vrele() after calls to vput() when the vrele is a
directory.  vrele() may lock the passed vnode, which in these cases would
   give an invalid lock order of child -> parent.  These situations are
   deadlock prone although do not typically deadlock because the vrele
   is typically not releasing the last reference to the vnode.  Users of
   vrele must consider it as a call to vn_lock() and order it appropriately.

MFC After: 	1 week
Sponsored by:	Isilon Systems, Inc.
Tested by:	kkenn
2006-02-01 00:25:26 +00:00
Christian S.J. Peron
b4e12c03e9 Allow root to open prison pts devices too.
Pointed out by:	rwatson
2006-01-31 22:19:37 +00:00
Christian S.J. Peron
f737c45c91 Allow root in the host environment to open ptys within jailed environments.
This logic change was introduced in revision 1.74:

Correct an oversight in jail() that allowed processes in jail to access
ptys in ways that might be unethical, especially towards processes not in
jail, or in other jails.

It should be fine to allow root in the host environment to do this. This
allows for more effective monitoring of prisons from the host environment.

Discussed with:	rwatson
MFC after:	1 week
2006-01-31 17:17:45 +00:00
Pawel Jakub Dawidek
847a2a1716 Add buffer corruption protection (RedZone) for kernel's malloc(9).
It detects both: buffer underflows and buffer overflows bugs at runtime
(on free(9) and realloc(9)) and prints backtraces from where memory was
allocated and from where it was freed.

Tested by:	kris
2006-01-31 11:09:21 +00:00
Scott Long
019a2f40ae Regroup order of operations to better reflect what was probably intended.
Submitted by: Peter Jeremy
2006-01-30 19:25:52 +00:00
Gleb Smirnoff
75ee267c22 Merge the //depot/user/yar/vlan branch into CVS. It contains some collective
work by yar, thompsa and myself. The checksum offloading part also involves
work done by Mihail Balikov.

The most important changes:

o   Instead of global linked list of all vlan softc use a per-trunk
  hash. The size of hash is dynamically adjusted, depending on
  number of entries. This changes struct ifnet, replacing counter
  of vlans with a pointer to trunk structure. This change is an
  improvement for setups with big number of VLANs, several interfaces
  and several CPUs. It is a small regression for a setup with a single
  VLAN interface.
    An alternative to dynamic hash is a per-trunk static array with
  4096 entries, which is a compile time option - VLAN_ARRAY. In my
  experiments the array is not an improvement, probably because such
  a big trunk structure doesn't fit into CPU cache.
o   Introduce an UMA zone for VLAN tags. Since drivers depend on it,
  the zone is declared in kern_mbuf.c, not in optional vlan(4) driver.
  This change is a big improvement for any setup utilizing vlan(4).
o   Use rwlock(9) instead of mutex(9) for locking. We are the first
  ones to do this! :)
o   Some drivers can do hardware VLAN tagging + hardware checksum
  offloading. Add an infrastructure for this. Whenever vlan(4) is
  attached to a parent or parent configuration is changed, the flags
  on vlan(4) interface are updated.

In collaboration with:	yar, thompsa
In collaboration with:	Mihail Balikov <mihail.balikov interbgc.com>
2006-01-30 13:45:15 +00:00
Robert Watson
4c0b19957f Move pts master devices into /dev/pty/ instead of littering /dev with them;
this is more consistent with the placement of slaves in /dev/pts.  The
actual name doesn't matter as it's not part of the exposed API or used by
libc.  In some sense, it would be nice if these device nodes didn't have to
have names in devfs at all.

Suggested by:	Stephen McKay <smckay at internode dot on dot net>
2006-01-30 11:59:19 +00:00
Gleb Smirnoff
61fb9bd80c - In pipe() return the error returned by pipe_create(), rather then
hardcoded ENFILES, which is incorrect. pipe_create() can fail due
  to ENOMEM.
- Update manual page, describing ENOMEM return code.

Reviewed by:	arch
2006-01-30 08:25:04 +00:00
Jeff Roberson
608c95d341 - Add a comment warning about an anomalous condition where we VOP_UNLOCK
and then vrele rather than vput because we would like to VOP_UNLOCK with
   a specific thread.
2006-01-30 08:21:23 +00:00
Jeff Roberson
033eb86e52 - Lock access to vrele() with VFS_LOCK_GIANT() rather than mtx_lock(&Giant).
Sponsored by:	Isilon Systems, Inc.
2006-01-30 08:19:01 +00:00
Scott Long
8ad6b7ab7c Take a stab at making this compile when WITNESS is not defined. gcc can't
figure out the order of operations at line 519, and neither can I, but this
is my best guess.  Also correct a number of typos and syntax errors.
2006-01-29 20:48:25 +00:00
Max Laier
6aec1278dc firmware(9) is a subsystem to load binary data into the kernel via a
specially crafted module.  There are several handrolled sollutions to this
problem in the tree already which will be replaced with this.  They include
iwi(4), ipw(4), ispfw(4) and digi(4).

No objection from:	arch
MFC after:		2 weeks
X-MFC after:		some drivers have been converted
2006-01-29 02:52:42 +00:00
Max Laier
69e99c5d4c Unbreak on archs where %d doesn't print uintptr_t arithmetic. 2006-01-29 02:35:22 +00:00
Robert Watson
5276d7471f Rename use_old_pty variable to use_pts, as this more accurately reflects
the sense of the variable.

Suggested by:	dwhite
2006-01-28 23:31:19 +00:00
Suleiman Souhlal
c270875f7c Don't try to load KLDs if we're mounting the root. We'd otherwise panic.
Tested by:	kris
MFC after:	3 days
2006-01-28 22:58:39 +00:00
Kris Kennaway
d5e5528afe Back out r1.653; it turns out that the race (or at least the printf) is
actually not hard to trigger, and it can cause a lot of console spam.

Approved by:	kan
2006-01-28 03:06:35 +00:00
Warner Losh
6229621e2c lock unused when INVARIANTS not defined, so don't declare it then 2006-01-28 00:49:31 +00:00
John Baldwin
3f08bd8bce Add a basic reader/writer lock implementation to the kernel. This
implementation is by no means perfect as far as some of the algorithms
that it uses and the fact that it is missing some functionality (try
locks and upgrades/downgrades are not there yet), however it does seem
to work in my local testing.  There is more detail in the comments in the
code, but the short version follows.

A reader/writer lock is very much like a regular mutex: it cannot be held
across a voluntary sleep; it can be acquired in an interrupt thread; if
the lock is held by a writer then the priority of any threads that block
on the lock will be lent to the owner; the simple case lock operations all
are done in a single atomic op.  It also shares some similiarities
with sx locks: it supports reader/writer semantics (multiple readers,
but single writers); readers are allowed to recurse, but writers are not.

We can extend this implementation further by either improving algorithms
or adding new functionality, but this should at least give us a base to
work with now.

Reviewed by:	arch (in theory)
Tested on:	i386 (4 cpu box with a kernel module that used 4 threads
		that randomly chose between read locks and write locks
		that ran w/o panicing for over a day solid.  It usually
		panic'd within a few seconds when there were bugs during
		testing. :)  The kernel module source is available on
		request.)
2006-01-27 23:13:26 +00:00
John Baldwin
135161049e Whitespace. 2006-01-27 23:06:08 +00:00
John Baldwin
7aa4f6852a - Add support for having both a shared and exclusive queue of threads in
each turnstile.  Also, allow for the owner thread pointer of a turnstile
  to be NULL.  This is needed for the upcoming reader/writer lock
  implementation.
- Add a new ddb command 'show turnstile' that will look up the turnstile
  associated with the given lock argument and display useful information
  like the list of threads blocked on each queue, etc.  If there isn't an
  active turnstile for a lock at the specified address, then the function
  will see if there is an active turnstile at the specified address and
  display info about it if so.
- Adjust the mutex code to handle the turnstile API changes.

Tested on:	i386 (all), alpha, amd64, sparc64 (1 and 3)
2006-01-27 22:42:12 +00:00
John Baldwin
f126e754e0 Add a new ddb command 'show sleepq'. It takes a wait channel as an
argument and looks for a sleep queue associated with that wait channel.
If it finds one it will display information such as the list of threads
sleeping on that queue.  If it can't find a sleep queue for that wait
channel, then it will see if that address matches any of the active
sleep queues.  If so, it will display information about the sleepq at the
specified address.
2006-01-27 22:24:07 +00:00
John Baldwin
bef4bf1adf Add a new sysctl, debug.ktr.clear. If you write a non-zero value to this
sysctl then it will clear the KTR buffer.  Note that if you have active
KTR traces at the same time as a clear operation the behavior is undefined,
though it shouldn't panic.
2006-01-27 22:17:31 +00:00
Olivier Houchard
23c15e6437 Merge a bunch of changes that where done in tty_pty.c after tty_pts.c was
forked from it, but missed from some reason.
2006-01-27 15:13:40 +00:00
Pawel Jakub Dawidek
f220f7afa6 Grr. Backout previous change. vn_open_cred() will call NDFREE() on failure. 2006-01-27 11:25:06 +00:00
Pawel Jakub Dawidek
970c7ca2ef Don't forget to call NDFREE(9) in case of vn_open_cred() failure.
MFC after:	3 days
2006-01-27 11:19:53 +00:00
David Xu
6d53aa6297 Just like dofilewrite(), call bwillwrite before fo_write. 2006-01-27 08:02:25 +00:00
David Xu
03d66b36c7 return final error code in aio_return rather than a hardcoded 0. 2006-01-27 04:14:16 +00:00
Olivier Houchard
f94cf2b10b Take into account that bits 0x0000ff00 can't be used for minor. 2006-01-27 00:21:48 +00:00
Olivier Houchard
169c44907a Don't attempt to re-create the /dev entry for the slave part if it already
exist when opening the master. This can happen if one open the master, then
open the slave, then close and re-open the master.

Reported by:	Peter Holm
2006-01-26 20:54:49 +00:00
David Xu
55a122bf28 in aio_aqueue, store same return code into job->_aiocb_private.error.
in aio_return, unlock proc lock before suword.
2006-01-26 08:37:02 +00:00
Olivier Houchard
12af2a0f4f Bring in a sysv-style pts implementation, as found in the rwatson_pts perforce branch. It works the same as its SysV/linux counterpart : You obtain a fd to the master pseudo terminal by opening /dev/ptmx, which craetes a node for the master as /dev/pty[num] and a node for the slave as /dev/pts/[num].
It should play nicely with the existing BSD ptys.
By default, the system will use the BSD ptys, one can set the sysctl
kern.pts.enable to 1 to make it use the new pts system.
The max number of pty that can be allocated on a system can be changed with the
sysctl kern.pts.max. It defaults to 1000, and can be increased, but it is not
recommanded, as any pty with a number > 999 won't be handled by whatever uses
utmp(5).
2006-01-26 01:30:34 +00:00
John Baldwin
6b81555744 Axe KTR_ALQ_MASK now that KTR_WITNESS is off unless you hack an #ifdef
in subr_witness.c.  I did add a comment in subr_witness.c noting that
KTR_WITNESS is incompatible with KTR_ALQ.
2006-01-25 14:57:23 +00:00
Stephan Uphoff
6807424d19 Back out changes made in rev. 1.151.
They were bogus.

Cluebat applied by: jhb@
2006-01-25 02:05:47 +00:00
Don Lewis
f4af687a3b Touch all the pages wired by sysctl_wire_old_buffer() to avoid PTE
modified bit emulation traps on Alpha while holding locks in the
sysctl handler.

A better solution would be to pass a hint to the Alpha pmap code to
tell mark these pages as modified when they as they are being wired,
but that appears to be more difficult to implement.

Suggested by: jhb
MFC after:	3 days
2006-01-25 01:03:34 +00:00
John Baldwin
67f7fe8c01 Whitespace fix. 2006-01-24 22:24:05 +00:00
John Baldwin
2b604e82b2 - Add a new KTR_SUBSYS in place of KTR_SPARE1 to serve as a subsystem
placeholder similar to KTR_DEV.  Explain the use of KTR_DEV and
  KTR_SUBSYS in a comment as well.
- Retire KTR_WITNESS and instead have KTR_WITNESS default to off but use
  KTR_SUBSYS if it is enabled.
2006-01-24 22:23:45 +00:00
David Xu
1aa4c324ee Add locking annotation and comments about socket, pipe, fifo problem.
Temporarily fix a locking problem for socket I/O.
2006-01-24 07:24:24 +00:00
David Xu
e6bdc05ff7 Er, rescure a deleted comment line. 2006-01-24 02:50:42 +00:00
David Xu
bd793be3c6 More cleanup for aio code:
1) unregsiter kqueue filter for EVFILT_LIO.
2) free uma_zones.
3) call setsid directly to enter another session rather than
   implementing by itself.

Submitted by: jhb
2006-01-24 02:46:15 +00:00
David Xu
7f34b521c7 Add bracket. 2006-01-23 23:46:30 +00:00
John Baldwin
704c9f00fb Fix a vnode reference leak in the ktrace code. We always grab a reference
to the vnode at the start of ktr_writerequest() but were missing the
corresponding vrele() after we finished the write operation.

Reported by:	jasone
2006-01-23 21:45:32 +00:00
Stephan Uphoff
03001f59c8 Hopefully fix the "calcru: runtime went backwards from ..." problem by
keeping the resource values locked (where needed) while we use them
for calculations.

MFC after:	3 days
2006-01-23 19:15:13 +00:00
Andre Oppermann
fd2413398c In mb_zinit_pack() explicitly ignore the return value of uma_zalloc_arg().
The success of the cluster allocation is checked through a field in the
mbuf structure.  This change is non-functional but properly silences code
inspection tools.

Found by:	Coverity Prevent(tm)
Coverity ID:	CID807
Sponsored by:	TCP/IP Optimization Fundraise 2005
2006-01-23 15:49:01 +00:00
David Xu
68d7111884 Verify all supported notification types. 2006-01-23 10:27:15 +00:00
David Xu
a9bf5e37ae 1) Merge _aio_aqueue and aio_aqueue, check quota in aio_aqueue,
so that lio_listio won't exceed the quota.
2) Remove lio_ref_count, it is no longer used.
2006-01-23 02:49:34 +00:00
Alan Cox
bb53e2bf27 Remove an unnecessary call to pmap_remove_all(). The given page is not
mapped because its contents are invalid.

Reviewed by: tegge
2006-01-23 00:00:45 +00:00
Don Lewis
1dd5fc0fde Tweak previous vfs_lookup.c commit to return an EINVAL error from
lookup() instead of EPERM when a DELETE or RENAME operation is
attempted on "..".

In kern_unlink(), remap EINVAL errors returned from namei() to EPERM
to match existing (and POSIX required) behaviour.

Discussed with:	bde
MFC after:	3 days
2006-01-22 19:37:02 +00:00
David Xu
8c0d9af5bf Fix a bogus panic. 2006-01-22 09:39:59 +00:00
David Xu
9b84335c84 Decrease kaio_active_count first, because user process may go away
after we notified it.
2006-01-22 09:25:52 +00:00
David Xu
4ca4c9ee68 Regen. 2006-01-22 06:01:48 +00:00
David Xu
1ce9182407 Make aio code MP safe. 2006-01-22 05:59:27 +00:00
Nate Lawson
a2b31c5b4f Add a devd(8) event that is sent after the system resumes. This can be
used by utilities to reset moused(8), for example.  The syntax is:

    !system=kern subsystem=power type=resume

Note that it would be nice to have notification of suspend, but it's more
difficult since there would have to be a method of doing request/ack
to userland and automatically timing out if no response.  apm(4) has a
similar mechanism.

MFC after:	2 weeks
2006-01-22 01:06:25 +00:00
Robert Watson
c1250af683 Convert remaining functions to ANSI C function declarations.
MFC after:	1 week
2006-01-22 00:30:46 +00:00
Alan Cox
e5e6093ba9 Avoid a vm object reference leak in a rarely used code path.
An executable contains at most one PT_INTERP program header.  Therefore,
the loop that searches for it can terminate after it is found rather than
iterating over the entire set of program headers.

Eliminate an unneeded initialization.

Reviewed by: tegge
2006-01-21 20:11:49 +00:00
Don Lewis
bea7a8d75c Return EPERM from lookup() if cn_nameiop is DELETE or RENAME and
the last component of the path name is "..".  This keeps VOP_LOOKUP()
from locking vnodes in reverse order.

Tested by:	Denis Shaposhnikov <dsh AT vlink DOT ru>
MFC after:	3 days
2006-01-21 19:57:56 +00:00
Robert Watson
6be2c41a22 Convert remaining functions in vfs_subr.c from K&R prototypes to ANSI C
prototypes, as the majority of new functions added have been in this
style.  Changing prototype style now results in gcc noticing that the
implementation of vn_pollrecord() has a 'short' argument instead of
'int' as prototyped in vnode.h, so correct that definition.  In practice
this didn't matter as only poll flags in the lower 16 bits are used.

MFC after:	1 week
2006-01-21 19:42:10 +00:00
John Baldwin
267ec43593 When loading a driver that is a subclass of another driver don't set the
devclass's parent pointer if the two drivers share the same devclass.  This
can happen if the drivers use the same new-bus name.  For example, we
currently have 3 drivers that use the name "pci": the generic PCI bus
driver, the ACPI PCI bus driver, and the OpenFirmware PCI bus driver.  If
the ACPI PCI bus driver was defined as a subclass of the generic PCI bus
driver, then without this check the "pci" devclass would point to itself
as its parent and device_probe_child() would spin forever when it
encountered the first PCI device that did have a matching driver.

Reviewed by:	dfr, imp, new-bus@
2006-01-20 21:59:13 +00:00
Julian Elischer
11f4763dd4 Return the thread name in the kinfo_proc structure.
Also correct the comment describing what the value is.
2006-01-18 20:27:43 +00:00
John Baldwin
25e498b4b0 Always include the lock_classes[] array in the kernel. The
"is it a spinlock" test in mtx_destroy() needs it even in non-debug
kernels.

Reported by:	danfe
2006-01-18 18:02:50 +00:00
Juli Mallett
b241b0a239 Since p_cansee will end up dereferencing p_ucred, don't check for p_ucred
equal to NULL several times later.  p_ucred "should probably not" be NULL
if the process isn't PRS_NEW anyway.  This is strongly reinforced by the fact
that we don't see frequent crashes here.  Remove the checks after p_cansee and
add a KASSERT right before it.

Found by:	Coverity Prevent (tm)

Also trim one nearby trailing space.
2006-01-17 20:25:01 +00:00
John Baldwin
6ef970a972 Bah. Fix 'show lock' to actually be compiled in. I had just fixed this in
p4 but had an older subr_lock.c on the machine I committed to CVS from.
2006-01-17 16:58:32 +00:00
John Baldwin
83a81bcb14 Add a new file (kern/subr_lock.c) for holding code related to struct
lock_obj objects:
- Add new lock_init() and lock_destroy() functions to setup and teardown
  lock_object objects including KTR logging and registering with WITNESS.
- Move all the handling of LO_INITIALIZED out of witness and the various
  lock init functions into lock_init() and lock_destroy().
- Remove the constants for static indices into the lock_classes[] array
  and change the code outside of subr_lock.c to use LOCK_CLASS to compare
  against a known lock class.
- Move the 'show lock' ddb function and lock_classes[] array out of
  kern_mutex.c over to subr_lock.c.
2006-01-17 16:55:17 +00:00
John Baldwin
550d1c9392 Initialize thread0.td_contested in init_turnstiles() rather than
mutex_init() as it is used by the turnstile code and is not mutex-specific.
2006-01-17 16:47:42 +00:00
John Baldwin
3eb9cab0c6 Garbage collect turnstile_empty() since it is unused. 2006-01-17 16:40:20 +00:00
Poul-Henning Kamp
25a14196dd Fix an 11 year old mistake: Let the hash functions take a void* instead
of unsigned char* argument.
2006-01-17 15:35:57 +00:00
Tor Egge
dffaf91aa3 Set flag in needsbuffer while still holding bqlock to avoid lost wakeup. 2006-01-16 22:09:47 +00:00
Christian S.J. Peron
323203d389 vfs_busy can only return something useful if MNTK_UNMOUNT has been set.
Since we are using vfs_busy() on a freshly allocated mount structure, use
(void) to show that we do not care about the return value.

Found with:	Coverity Prevent (tm)
MFC after:	2 weeks
2006-01-15 20:14:11 +00:00
Robert Watson
6994eebcab Cast VFS_STATFS() in vfs_domount() to (void) to indicate that ignoring the
return value is intentional: this is simply an attempt to pre-cache the
statfs state.

Found with:	Coverity Prevent (tm)
MFC after:	3 days
2006-01-15 20:01:05 +00:00
Christian S.J. Peron
8213baf002 Initialize ki to p->p_aioinfo after we know it's going to be referencing
a valid kaioinfo structure. This avoids a potential NULL pointer dereference.

Found with:	Coverity Prevent(tm)
MFC after:	2 weeks
2006-01-15 01:55:45 +00:00
Ruslan Ermilov
6a61c14ee1 AMD64 also supports disk slices. 2006-01-14 20:47:11 +00:00
Poul-Henning Kamp
c9df826b0a Correct STAILQ usage in purge of resourcelist.
Found with:   Coverity Prevent(tm)
2006-01-14 09:41:35 +00:00
Scott Long
0f92108d32 Add the following to the taskqueue api:
taskqueue_start_threads(struct taskqueue **, int count, int pri,
			const char *name, ...);

This allows the creation of 1 or more threads that will service a single
taskqueue.  Also rework the taskqueue_create() API to remove the API change
that was introduced a while back.  Creating a taskqueue doesn't rely on
the presence of a process structure, and the proc mechanics are much better
encapsulated in taskqueue_start_threads().  Also clean up the
taskqueue_terminate() and taskqueue_free() functions to safely drain
pending tasks and remove all associated threads.

The TASKQUEUE_DEFINE and TASKQUEUE_DEFINE_THREAD macros have been changed
to use the new API, but drivers compiled against the old definitions will
still work.  Thus, recompiling drivers is not a strict requirement.
2006-01-14 01:55:24 +00:00
Robert Watson
bc03ea7f49 When calling bioq_first() to see if a queue is empty in bioq_disksort(),
don't save the return value as we won't use it.

Noticed by:	Coverity Prevent analysis tool
MFC after:	3 days
2006-01-13 23:27:12 +00:00
Robert Watson
b8ae1cd619 Add sosend_dgram(), a greatly reduced and simplified version of sosend()
intended for use solely with atomic datagram socket types, and relies
on the previous break-out of sosend_copyin().  Changes to allow UDP to
optionally use this instead of sosend() will be committed as a
follow-up.
2006-01-13 10:22:01 +00:00
Robert Watson
d7dca9034c XXX a comment in uipc_usrreq.c that requires updating. 2006-01-13 00:00:32 +00:00
Alfred Perlstein
7d7e053c21 Novel idea, don't print a string if it is NULL!
This protects people from loading _really_ old modules, like say from
5.x to a 6.x or 7.x system, like for instance right after an upgrade.
2006-01-12 19:15:14 +00:00
Scott Long
1c3a3b0bd0 The interlock in taskqueue_terminate() is completely wrong for taskqueues
that use spinlocks.  Remove it for now.
2006-01-11 00:37:13 +00:00
Poul-Henning Kamp
d3e64681d6 Move the old BSD4.3 tty compatibility from (!BURN_BRIDGES && COMPAT_43)
to COMPAT_43TTY.

Add COMPAT_43TTY to NOTES and */conf/GENERIC

Compile tty_compat.c only under the new option.

Spit out
	#warning "Old BSD tty API used, please upgrade."
if ioctl_compat.h gets #included from userland.
2006-01-10 09:19:10 +00:00
Scott Long
9df1a6dd61 Add functions and macros and refactor code to make it easier to manage
fast taskqueues.  The following have been added:

TASKQUEUE_FAST_DEFINE() - create a global task queue.
    an arbitrary execution context.
TASKQUEUE_FAST_DEFINE_THREAD() - create a global taskqueue that uses a
    dedicated kthread.
taskqueue_create_fast() - create a local/private taskqueue.

These are all complimentary of the standard taskqueue functions.  They are
primarily useful for fast interrupt handlers that can only use spinlock for
synchronization.

I personally think that the taskqueue API is starting to get too narrow and
hairy, but fixing it will require a major redesign on the API.  Such a
redesign would be good but would break compatibility with FreeBSD 6.x, so
it really isn't desirable at this time.

Submitted by: sam
2006-01-10 06:31:12 +00:00
Tor Egge
82be0a5a24 Add marker vnodes to ensure that all vnodes associated with the mount point are
iterated over when using MNT_VNODE_FOREACH.

Reviewed by:	truckman
2006-01-09 20:42:19 +00:00
Scott Long
861a23087b If destroying a spinlock, make sure that it is exited properly.
Submitted by: jhb
MFC After: 3 days
2006-01-08 00:18:34 +00:00
John Baldwin
3b783acd2a Revert an untested local change that crept in with the lo_class changes
and subsequently broke the build.  This change is supposed to fix the
case where doing a mtx_destroy() off a spin mutex while you hold it fails.
If it had been tested I would just leave it in, but it hasn't been tested
yet, so it will have to wait until later.
2006-01-07 14:03:15 +00:00
David Xu
0a5cd498bb Add a new feature to thr_kill, if thread ID argument is -1, send
signals to all threads except current sender.
2006-01-07 03:15:21 +00:00
Tai-hwa Liang
75d6a87fa3 Trying to fix compilation bustage introduced in rev1.160 by converting
a missing lo_class to LO_CLASSINDEX().
2006-01-07 02:07:08 +00:00
John Baldwin
3c6decc327 Trim another pointer from struct lock_object (and thus from struct mtx and
struct sx).  Instead of storing a direct pointer to a our lock_class
struct in lock_object, reserve 4 bits in the lo_flags field to serve as an
index into a global lock_classes array that contains pointers to the lock
classes.  Only debugging code such as WITNESS or INVARIANTS checks and KTR
logging need to access the lock_class member, so this shouldn't add any
overhead to production kernels.  It might add some slight overhead to
kernels using those debug options however.

As with the previous set of changes to lock_object, this is going to
completely obliterate the kernel ABI, so be sure to recompile all your
modules.
2006-01-06 18:07:32 +00:00
John Baldwin
af56abaab5 Return error from fget_write() rather than hardcoding EBADF now that
fget_write() DTRT.

Requested by:	bde
2006-01-06 16:34:22 +00:00
John Baldwin
38f63f7e47 Return EBADF rather than EINVAL for FWRITE failure as per POSIX.
MFC after:	1 week
2006-01-06 16:30:30 +00:00
John Baldwin
e730167f16 Remove XXX comments complaining that write(2) on a read-only descriptor
returns EBADF.  That errno is correct and is mandated by POSIX.  It also
goes back to revision 1.1 of our CVS history (i.e. 4.4BSD).

The _fget() function should probably also be upated as it currently returns
EINVAL in that case rather than EBADF.  (It does return EBADF for reads
on a write-only descriptor without any XXX comments oddly enough.)

Discussed with:	scottl, grog, mjacob, bde
2006-01-05 22:20:31 +00:00
Bjoern A. Zeeb
ba0b6851b4 Minor whitespace cleanup. 2006-01-04 17:40:54 +00:00
Poul-Henning Kamp
d5f1e0d1ef Deorbit ttymalloc() in preference for ttyalloc() 2006-01-04 09:59:07 +00:00
Poul-Henning Kamp
8607f52b66 Use ttyalloc() instead of ttymalloc() 2006-01-04 09:09:46 +00:00
Poul-Henning Kamp
246b8d448a Use MTX_SYSINIT to set up the tty list mutex. 2006-01-04 08:22:39 +00:00
Diomidis Spinellis
c3d78136c9 Fix style bug.
Prompted by:	bde
2006-01-04 07:50:54 +00:00
Diomidis Spinellis
f8ccc6ceb9 Replace tv_usec normalization with the return of EINVAL.
This addresses two objections to the previous behavior,
and unbreaks the alpha tinderbox build.

TODO: update the utimes(2) man page.
2006-01-04 00:47:13 +00:00
Diomidis Spinellis
51339e8593 Normalize the tv_usec part of the utimes(2) arguments to ensure
that a file's atime and mtime are only set to correct fractional
second values (0-999999000ns with the current interface).
Prior to this change users could create files with values outside
that range.  Moreover, on 32-bit machines tv_usec offsets larger than
4.3s would result in an unnormalized AND wrong timestamp value,
due to overflow.

MFC after:	1 week
2006-01-03 21:58:21 +00:00
Alexander Leidinger
ef39c05baa MI changes:
- provide an interface (macros) to the page coloring part of the VM system,
   this allows to try different coloring algorithms without the need to
   touch every file [1]
 - make the page queue tuning values readable: sysctl vm.stats.pagequeue
 - autotuning of the page coloring values based upon the cache size instead
   of options in the kernel config (disabling of the page coloring as a
   kernel option is still possible)

MD changes:
 - detection of the cache size: only IA32 and AMD64 (untested) contains
   cache size detection code, every other arch just comes with a dummy
   function (this results in the use of default values like it was the
   case without the autotuning of the page coloring)
 - print some more info on Intel CPU's (like we do on AMD and Transmeta
   CPU's)

Note to AMD owners (IA32 and AMD64): please run "sysctl vm.stats.pagequeue"
and report if the cache* values are zero (= bug in the cache detection code)
or not.

Based upon work by:	Chad David <davidc@acns.ab.ca> [1]
Reviewed by:		alc, arch (in 2004)
Discussed with:		alc, Chad David, arch (in 2004)
2005-12-31 14:39:20 +00:00
Pawel Jakub Dawidek
d362c40d3a Improve memguard a bit:
- Provide tunable vm.memguard.desc, so one can specify memory type without
  changing the code and recompiling the kernel.
- Allow to use memguard for kernel modules by providing sysctl
  vm.memguard.desc, which can be changed to short description of memory
  type before module is loaded.
- Move as much memguard code as possible to memguard.c.
- Add sysctl node vm.memguard. and move memguard-specific sysctl there.
- Add malloc_desc2type() function for finding memory type based on its
  short description (ks_shortdesc field).
- Memory type can be changed (via vm.memguard.desc sysctl) only if it
  doesn't exist (will be loaded later) or when no memory is allocated yet.
  If there is allocated memory for the given memory type, return EBUSY.
- Implement two ways of memory types comparsion and make safer/slower the
  default.
2005-12-30 11:45:07 +00:00
Pawel Jakub Dawidek
e7736557d6 Print a warning when we miss vinactive() call, because of race in vget().
The race is very real, but conditions needed for triggering it are rather
hard to meet now.
When gjournal will be committed (where it is quite easy to trigger) we need
to fix it.

For now, verify if it is really hard to trigger.

Discussed with:	kan
2005-12-29 22:52:09 +00:00
John Baldwin
8963150678 patch(1) and I aren't friends today. Axe a duplicate copy of
the msleep_spin() function definition.

Spotted by:	pjd
2005-12-29 21:15:32 +00:00
John Baldwin
0cb7e6aec8 Add a new function msleep_spin() which is a slightly stripped down version
of msleep().  msleep_spin() doesn't support changing the priority of the
thread while it is asleep nor does it support interruptible sleeps (PCATCH)
or the PDROP flag.  It does support timeouts however.  It differs from
msleep() in that the passed in mutex is a spin mutex.  This means one can
use msleep_spin() and wakeup() with a spin mutex similar to msleep() and
wakeup() with a regular mutex.  Note that the spin mutex in question needs
to come before sched_lock and the sleepq locks in lock order.
2005-12-29 20:57:45 +00:00
John Baldwin
b0e9883e2f Teach WITNESS_SAVE() and WITNESS_RESTORE() to work with spin locks instead
of only sleep locks.
2005-12-29 20:54:25 +00:00
John Baldwin
0a46ed7d56 Fix a deadlock I introduced with the recently added printf to warn about
spin locks that are not in the static order list.  It is not safe to call
printf while holding the witness spin mutex since the console drivers that
back printf may need to use their own spin locks which would try to talk
to witness when they were locked.  Given this, it is possible for one
CPU to lock a console driver lock (such as sio) which then tries to lock
the witness lock while another CPU is doing the printf while holding the
witness lock.  Fix this by moving the printf outside of the witness lock.
All other printf's in witness are already correct.

MFC after:	3 days
2005-12-29 20:53:01 +00:00
John Baldwin
42b6a681bc Increment kobj_lookup_misses on a miss rather than decrementing it.
Otherwise, the miss count is actually -kobj_lookup_misses.  Mostly a
pedantic change as KOBJ_STATS isn't on by default.
2005-12-29 18:00:42 +00:00
David Xu
3357835a46 Add code to report zombie state.
PR: threads/91044
MFC after: 3 days
2005-12-29 13:00:42 +00:00
Alexander Kabaev
3f34977614 Trim trailing whitespace. 2005-12-28 17:13:31 +00:00
Pawel Jakub Dawidek
619f284195 In realloc(9), determine size of the original block based on
UMA_SLAB_MALLOC flag.
In some circumstances (I observed it when I was doing a lot of reallocs)
UMA_SLAB_MALLOC can be set even if us_keg != NULL.

If this is the case we have wonderful, silent data corruption, because less
data is copied to the newly allocated region than should be.

I'm not sure when this bug was introduced, it could be there undetected
for years now, as we don't have a lot of realloc(9) consumers and it was
hard to reproduce it...
...but what I know for sure, is that I don't want to know who introduce
the bug:) It took me two/three days to track it down (of course most of
the time I was looking for the bug in my own code).
2005-12-28 01:53:13 +00:00
David Xu
9f8eb3cb52 Use variable i instead of variable cpus as an index to get correct kseq. 2005-12-27 12:02:03 +00:00
Maxim Sobolev
d49b21093c Fix breakage introduced in the previous commit. 2005-12-26 22:32:52 +00:00
Maxim Sobolev
900b28f9f6 Remove kern.elf32.can_exec_dyn sysctl. Instead extend Brandinfo structure
with flags bitfield and set BI_CAN_EXEC_DYN flag for all brands that usually
allow executing elf dynamic binaries (aka shared libraries). When it is
requested to execute ET_DYN elf image check if this flag is on after we
know the elf brand allowing execution if so.

PR:		kern/87615
Submitted by:	Marcin Koziej <creep@desk.pl>
2005-12-26 21:23:57 +00:00
Alan Cox
60bb39431a Maintain the lock on the vnode for most of exec_elfN_imgact().
Specifically, it is required for the I/O that may be performed by
elfN_load_section().

Avoid an obscure deadlock in the a.out, elf, and gzip image
activators.  Add a comment describing why the deadlock does not occur
in the common case and how it might occur in less usual circumstances.

Eliminate an unused variable from exec_aout_imgact().

In collaboration with: tegge
2005-12-24 04:57:50 +00:00
David Xu
d7bc12b096 Avoid kernel panic when attaching a process which may not be stopped
by debugger, e.g process is dumping core. Only access p_xthread if
P_STOPPED_TRACE is set, this means thread is ready to exchange signal
with debugger, print a warning if P_STOPPED_TRACE is not set due to
some bugs in other code, if there is.

The patch has been tested by Anish Mistry mistry.7 at osu dot edu, and
is slightly adjusted.
2005-12-24 02:59:29 +00:00
Jeff Roberson
49bdcff518 - Remove and unused include.
Submitted by:	Antoine Brodin <antoine.brodin@laposte.net>
2005-12-23 21:32:40 +00:00
Poul-Henning Kamp
25f6e35a05 Regenerate sysent with new abort2 system call.
Implement abort2(const char *reason, int narg, void **args);

Submitted by:	"Wojciech A. Koszek" <dunstan@freebsd.czest.pl>
2005-12-23 11:58:42 +00:00
Poul-Henning Kamp
5a56b437ec Add abort2() systemcall. 2005-12-23 11:54:11 +00:00
Poul-Henning Kamp
49091c48d5 Make sbuf_copyin() return the number of bytes copied on success.
Submitted by:	"Wojciech A. Koszek" <dunstan@freebsd.czest.pl>
2005-12-23 11:49:53 +00:00
Scott Long
d2a401cb70 Create the taskqueue_fast handler with INTR_MPSAFE so that it doesn't run
with Giant.

MFC After: 3 days
2005-12-23 06:18:33 +00:00
John Baldwin
b439e431bf Tweak how the MD code calls the fooclock() methods some. Instead of
passing a pointer to an opaque clockframe structure and requiring the
MD code to supply CLKF_FOO() macros to extract needed values out of the
opaque structure, just pass the needed values directly.  In practice this
means passing the pair (usermode, pc) to hardclock() and profclock() and
passing the boolean (usermode) to hardclock_cpu() and hardclock_process().
Other details:
- Axe clockframe and CLKF_FOO() macros on all architectures.  Basically,
  all the archs were taking a trapframe and converting it into a clockframe
  one way or another.  Now they can just extract the PC and usermode values
  directly out of the trapframe and pass it to fooclock().
- Renamed hardclock_process() to hardclock_cpu() as the latter is more
  accurate.
- On Alpha, we now run profclock() at hz (profhz == hz) rather than at
  the slower stathz.
- On Alpha, for the TurboLaser machines that don't have an 8254
  timecounter, call hardclock() directly.  This removes an extra
  conditional check from every clock interrupt on Alpha on the BSP.
  There is probably room for even further pruning here by changing Alpha
  to use the simplified timecounter we use on x86 with the lapic timer
  since we don't get interrupts from the 8254 on Alpha anyway.
- On x86, clkintr() shouldn't ever be called now unless using_lapic_timer
  is false, so add a KASSERT() to that affect and remove a condition
  to slightly optimize the non-lapic case.
- Change prototypeof  arm_handler_execute() so that it's first arg is a
  trapframe pointer rather than a void pointer for clarity.
- Use KCOUNT macro in profclock() to lookup the kernel profiling bucket.

Tested on:	alpha, amd64, arm, i386, ia64, sparc64
Reviewed by:	bde (mostly)
2005-12-22 22:16:09 +00:00
Alan Cox
373d1a3f8c Maintain the vnode lock throughout elfN_load_file() rather than releasing
it and reacquiring it in vrele().  Consequently, there is no reason to
increase the reference count on the vm object caching the file's pages.
Reviewed by: tegge

Eliminate unused parameters to elfN_load_file().
2005-12-21 18:58:40 +00:00
Alan Cox
ff6f03c7cd Eliminate an unneeded (vm_prot_t) parameter from two functions. Eliminate
unnecessary uses of a local variable.

Reviewed by: tegge
2005-12-20 23:42:18 +00:00
Pawel Jakub Dawidek
c505fe7a0f Reduce Giant scope a bit, as fdrop() is believed to be MPSAFE.
The purpose of this change is consistency (not performance improvement:)),
as it was hard to tell if fdrop() is MPSAFE or not when I saw it sometimes
under the Giant and sometimes without it.

Glanced at by:	ssouhlal, kan
2005-12-20 00:49:59 +00:00
Pawel Jakub Dawidek
ade9b797a0 vfs_mount_alloc() always returns 0, but what we really want is newly
allocated 'struct mount *' pointer, so simplify code a bit and return
the pointer directly.

Reviewed by:	ssouhlal
2005-12-20 00:43:51 +00:00
Pawel Jakub Dawidek
003ba8a000 Use 'td' instead of 'curthread'. 2005-12-19 16:27:13 +00:00
David Xu
a1d4fe69d2 Fix a bug in slice calculation code, current code uses hz but
sched_clock() is called by state clock.

Submitted by: taku at tackymt dot homeip dot net
2005-12-19 08:26:09 +00:00
Nate Lawson
bd6b217753 Remove the KTR for hardclock completely. It seems to not be useful.
Requested by:	jhb
2005-12-18 18:11:55 +00:00
Nate Lawson
1335c4df32 Restore KTR_CRITICAL but conditionally compile it in as KTR_SCHED.
Requested by:	scottl, jhb
2005-12-18 18:10:57 +00:00
Marcel Moolenaar
757686b115 Make our ELF64 type definitions match standards. In particular this
means:
o  Remove Elf64_Quarter,
o  Redefine Elf64_Half to be 16-bit,
o  Redefine Elf64_Word to be 32-bit,
o  Add Elf64_Xword and Elf64_Sxword for 64-bit entities,
o  Use Elf_Size in MI code to abstract the difference between
   Elf32_Word and Elf64_Word.
o  Add Elf_Ssize as the signed counterpart of Elf_Size.

MFC after: 2 weeks
2005-12-18 04:52:37 +00:00
Alan Cox
044bbbb523 Correct a long-standing problem in elfN_map_insert(): In order to copy a
page to user space, the user space mapping must allow write access.

In collaboration with: tegge@
MFC after: 3 weeks
2005-12-17 19:40:47 +00:00
Nate Lawson
8615fd8696 Clean up unused or poorly utilized KTR values. Remove KTR_FS, KTR_KGDB,
and KTR_IO as they were never used.  Remove KTR_CLK since it was only
used for hardclock firing and use KTR_INTR there instead.  Remove
KTR_CRITICAL since it was only used for crit enter/exit and use
KTR_CONTENTION instead.
2005-12-17 03:57:10 +00:00
John Baldwin
5c8b444153 - Use uintfptr_t rather than int for the kernel profiling index (though it
really should be a fptrdiff_t if we had that) in profclock().
- Don't try to profile kernel pc's that are >= the kernel lowpc to avoid
  underflows when computing a profiling index.
- Use the PC_TO_I() macro to compute the kernel profiling index rather than
  doing it inline.

Discussed with:	bde
2005-12-16 22:11:52 +00:00
John Baldwin
cb49fcd145 Change the addupc_*() functions to use the uintfptr_t type for pc rather
than uintptr_t as that is technically more correct.
2005-12-16 22:08:32 +00:00
Alan Cox
584716b08a Style: The second argument to vm_map_find() should be NULL instead of 0. 2005-12-16 19:14:25 +00:00
Alan Cox
da61b9a69e Use sf_buf_alloc() instead of vm_map_find() on exec_map to create the
ephemeral mappings that are used as the source for three copy
operations from kernel space to user space.  There are two reasons for
making this change: (1) Under heavy load exec_map can fill up causing
vm_map_find() to fail.  When it fails, the nascent process is aborted
(SIGABRT).  Whereas, this reimplementation using sf_buf_alloc()
sleeps.  (2) Although it is possible to sleep on vm_map_find()'s
failure until address space becomes available (see kmem_alloc_wait()),
using sf_buf_alloc() is faster.  Furthermore, the reimplementation
uses a CPU private mapping, avoiding a TLB shootdown on
multiprocessors.

Problem uncovered by: kris@
Reviewed by: tegge@
MFC after: 3 weeks
2005-12-16 18:34:14 +00:00
Xin LI
6ba9ec2d09 In pipe_write(): when uiomove() fails, do not spin on it forever.
Submitted by:	Kostik Belousov <kostikbel at gmail.com> on -current@
Message-ID:	<20051216151016.GE84442@deviant.zoral.local>
MFC After:	3 weeks
2005-12-16 18:32:39 +00:00
David Xu
03f70aec67 Replace selwakeuppri with selwakeup, let scheduler figure out
appropriate thread priority.
2005-12-16 15:01:16 +00:00
Ed Maste
63e6f39011 When using m_dup(9) to copy more than MHLEN bytes of data, don't create an
mbuf chain that starts with a cluster containing just MHLEN bytes.  This
happened because m_dup called m_get or m_getcl depending on the amount of
data to copy, but then always set the size available in the first mbuf to
MHLEN.

Submitted by:	Matt Koivisto <mkoivisto at sandvine dot com>
Approved by:	jmg
Silence from:	rwatson (mentor)
2005-12-14 23:34:26 +00:00
Maxime Henrion
e59898ff36 Fix a bunch of SYSCTL_INT() that should have been SYSCTL_ULONG() to
match the type of the variable they are exporting.

Spotted by:	Thomas Hurst <tom@hur.st>
MFC after:	3 days
2005-12-14 22:27:48 +00:00
Dag-Erling Smørgrav
0430a5e289 Eradicate caddr_t from the VFS API. 2005-12-14 00:49:52 +00:00
John Baldwin
d272fe53a4 Add a new 'show lock' command to ddb. If the argument has a valid lock
class, then it displays various information about the lock and calls a
new function pointer in lock_class (lc_ddb_show) to dump class-specific
information about the lock as well (such as the owner of a mutex or
xlock'ed sx lock).  This is easier than staring at hex dumps of locks to
figure out who owns the lock, etc.  Note that extending lock_class doesn't
affect the ABI for any kernel modules as the only code that deals with
lock_class structures directly is kern_mutex.c, kern_sx.c, and witness.

MFC after:	1 week
2005-12-13 23:14:35 +00:00
David Xu
dd1a6f53ac Stop fiddling thread priority with msleep, eliminating unnecessary
context switching. This improves performance about 30% on UP machine.
2005-12-12 05:04:56 +00:00
Craig Rodrigues
92f44a3f3c Contributions from XFS for FreeBSD project:
- Implement cv_wait_unlock() method which has semantics compatible
  with the sv_wait() method in IRIX.  For cv_wait_unlock(), the lock
  must be held before entering the function, but is not held when the
  function is exited.

- Implement the existing cv_wait() function in terms of cv_wait_unlock().

Submitted by:	kan
Feedback from:	jhb, trhodes, Christoph Hellwig <hch at infradead dot org>
2005-12-12 00:02:22 +00:00
Alan Cox
05406e6f33 Remove unneeded calls to pmap_remove_all(). The given page is not mapped.
Reviewed by: tegge
2005-12-11 22:06:57 +00:00
Andre Oppermann
36ae3fd3c3 Hide the 4k mbuf clusters if the normal clusters are defined to be
4k already.

This unbreaks tinderbox.

Submitted by:	ru
2005-12-10 15:21:04 +00:00
David Xu
3e70c6f047 Fix compiling warning on 64 bits system. 2005-12-09 13:16:48 +00:00
David Xu
f71a882f15 Add a sysctl to force a process to sigexit if a trap signal is
being hold by current thread or ignored by current process,
otherwise, it is very possible the thread will enter an infinite loop
and lead to an administrator's nightmare.
2005-12-09 08:29:29 +00:00
David Xu
d26b1a1fb9 Register itimers_event_hook as a kernel event handler, so I don't
have to duplicate code to call it in exec() and exit1().
2005-12-09 05:43:26 +00:00
David Xu
102178d0df Comment out mqfs_create_link. Inline some small functions. 2005-12-09 02:38:29 +00:00
David Xu
5c4745177f Now SIGCHLD is always queued. 2005-12-09 02:27:55 +00:00
David Xu
761a4d9423 Cleanup sigqueue sysctl. 2005-12-09 02:26:44 +00:00
Andre Oppermann
d5269a636b Add an API for jumbo mbuf cluster allocation and also provide
4k clusters in addition to 9k and 16k ones.

 struct mbuf *m_getjcl(int how, short type, int flags, int size)
 void *m_cljget(struct mbuf *m, int how, int size)

m_getjcl() returns an mbuf with a cluster of the specified size attached
like m_getcl() does for 2k clusters.

m_cljget() is different from m_clget() as it can allocate clusters
without attaching them to an mbuf.  In that case the return value
is the pointer to the cluster of the requested size.  If an mbuf was
specified, it gets the cluster attached to it and the return value
can be safely ignored.

For size both take MCLBYTES, MJUM4BYTES, MJUM9BYTES, MJUM16BYTES.

Reviewed by:	glebius
Tested by:	glebius
Sponsored by:	TCP/IP Optimization Fundraise 2005
2005-12-08 13:13:06 +00:00
Craig Rodrigues
d5989f64cf In devfs_first(), set mp->mnt_opt to a valid empty list of mount options
instead of leaving it NULL.  This eliminates a kernel panic
when trying to do a mount -o update of /dev.

Noticed by:	cjsp
Reviewed by:	phk
2005-12-08 04:27:53 +00:00
Craig Rodrigues
8539ca4cde Add "errmsg" to list of global mount options. 2005-12-08 04:09:29 +00:00
Craig Rodrigues
6951bea6c8 Changes imported from XFS for FreeBSD project:
- add fields to struct buf (needed by XFS)
    - 3 private fields: b_fsprivate1, b_fsprivate2, b_fsprivate3
    - b_pin_count, count of pinned buffer

- add new B_MANAGED flag
- add breada() function to initiate asynchronous I/O on read-ahead blocks.
- add bufdone_finish(), bpin(), bunpin_wait() functions

Patches provided by:	kan
Reviewed by:		phk
Silence on:		arch@
2005-12-07 03:39:08 +00:00
Alan Cox
8ad398d089 Reduce the scope of the page queues lock in exec_map_first_page(). The vm
object lock is sufficient for reading a page's PG_BUSY and busy flags.

MFC after: 1 week
2005-12-06 07:39:36 +00:00
David Xu
9da8a32aae o Turn on MPSAFE flag for mqueuefs.
o Reuse si_mqd field in siginfo_t, this also gives userland
  information about which descriptor is notified.
2005-12-06 06:22:12 +00:00
David Xu
027f760408 Fix a lock leak in childproc_continued(). 2005-12-06 05:30:13 +00:00
John Baldwin
5d2162b2f8 Tweak witness handling of lock object to shave 2 pointers off of each
lock object (and thus off of each mutex and sx lock):
- Rename the all_locks list to pending_locks and only put locks initialized
  before SI_SUB_WITNESS on the list so that the SI_SUB_WITNESS can add them
  to witness once it starts up.
- Now that pending_locks is only used during early startup, change it from
  a TAILQ to an STAILQ.  This removes a pointer from the STAILQ_ENTRY in
  struct lock_object.
- Since the pending_locks list is only used during the single-threaded
  early boot it no longer needs to be protected by a mutex, so remove
  all_mtx.
- Since the lo_list member of struct lock_object is now only used during
  early boot before witness is running, collapse lo_list and lo_witness
  into a union.  This shaves the second pointer off of struct lock_object.
- Axe lock_cur_cnt and lock_max_cnt.

With these changes, struct mtx shrinks from 36 to 28 bytes on 32-bit
platforms and from 72 to 56 bytes on 64-bit platforms.  Note that this
commit will completely and utterly destroy the kernel ABI, so no MFC.

Tested on:	alpha, amd64, i386, sparc64
2005-12-05 20:45:24 +00:00
David Xu
052ea11c71 After reading some documents, I realized SIGEV_NONE != NULL, also
fix code in mqueue_send_notification to handle SIGEV_NONE.
2005-12-05 04:41:32 +00:00
David Xu
9947b45978 Handle SIGEV_NONE, if notification is SIGEV_NONE, error status and
return status will be set, but no notification will be registered.
Increase hard limit of maxmsg to 100, so posixtestsuite ports can run.
2005-12-05 03:23:27 +00:00
Ruslan Ermilov
f4e9888107 Fix -Wundef. 2005-12-04 02:12:43 +00:00
Craig Rodrigues
1245b3433e Add "rdonly" to global_opts, and parse it in vfs_donmount().
Requested by:	rwatson
2005-12-03 12:04:20 +00:00
Craig Rodrigues
ec528a3472 - Add "rw" mount option to global_opts.
- In vfs_donmount(), parse "ro", "noro", and "rw", in order to set or
  unset the MNT_RDONLY filesystem flag.
2005-12-03 01:26:27 +00:00
David Xu
5ee2d4ac5a 1. Cleanup including.
2. Set configuration value for CTL_P1003_1B_MESSAGE_PASSING.
2005-12-02 14:09:32 +00:00
David Xu
a6de716d7e 1. Check if message priority is less than MQ_PRIO_MAX.
2. Use getnanotime instead of getnanouptime.
3. Don't free message in _mqueue_send, mqueue_send will free it.
2005-12-02 08:23:49 +00:00
David Xu
77e718f773 1. Set timer configuration values for sysconf().
2. Set overrun limit to INT_MAX, report ERANGE error if overrun will be
   greater than INT_MAX.
2005-12-01 07:56:15 +00:00
David Xu
b51d237a67 set signal queue values for sysconf(). 2005-12-01 00:25:50 +00:00
David Xu
b2f92ef96b Last step to make mq_notify conform to POSIX standard, If the process
has successfully attached a notification request to the message queue
via a queue descriptor, file closing should remove the attachment.
2005-11-30 05:12:03 +00:00
John Baldwin
398293a8de Fix snderr() to not leak the socket buffer lock if an error occurs in
sosend().  Robert accidentally changed the snderr() macro to jump to the
out label which assumes the lock is already released rather than the
release label which drops the lock in his previous change to sosend().
This should fix the recent panics about returning from write(2) with the
socket lock held and the most recent LOR on current@.
2005-11-29 23:07:14 +00:00
Robert Watson
66dd8a6f99 Move zero copy statistics structure before sosend_copyin().
MFC after:	1 month
Reported by:	tinderbox, sam
2005-11-28 21:45:36 +00:00
John Baldwin
ef627e7da0 When checking to see if a process has exceeded its time limit, flag the
process as over the limit when its time is >= to the limit rather than >
the limit.  Technically, if p->p_rux.rux_runtime.sec == p->p_pcpulimit
and p->p_rux.rux_runtime.frac == 0, the process hasn't exceeded the limit
yet.  However, having the fraction exactly equal to 0 is rather rare, and
it is not worth the overhead to handle that edge case.  With just the >
comparison, the process would have to exceed its limit by almost a second
before it was killed.

PR:		kern/83192
Submitted by:	Maciej Zawadzinski mzawadzinski at gmail dot com
Reviewed by:	bde
MFC after:	1 week
2005-11-28 19:09:08 +00:00
Robert Watson
a725629cf8 Break out functionality in sosend() responsible for building mbuf
chains and copying in mbufs from the body of the send logic, creating
a new function sosend_copyin().  This changes makes sosend() almost
readable, and will allow the same logic to be used by tailored socket
send routines.

MFC after:	1 month
Reviewed by:	andre, glebius
2005-11-28 18:09:03 +00:00
David Xu
f72b11a40c Fix a stupid compiler warining, remove a redundant line. 2005-11-27 22:59:47 +00:00
David Xu
47bf2cf9fe Change filesystem name from mqueue to mqueuefs for style consistent.
Suggested by: rwatson
2005-11-27 08:30:12 +00:00
David Xu
6829585c43 Regen. 2005-11-27 01:23:31 +00:00
David Xu
94e1294b06 Don't use OpenBSD syscall numbers, instead, use new syscall numbers
for POSIX message queue.

Suggested by: rwatson
2005-11-27 01:13:00 +00:00
Robert Watson
5e758b9561 Add several aliases for existing clockid_t names to indicate that the
application wishes to request high precision time stamps be returned:

Alias                           Existing

CLOCK_REALTIME_PRECISE          CLOCK_REALTIME
CLOCK_MONOTONIC_PRECISE         CLOCK_MONOTONIC
CLOCK_UPTIME_PRECISE            CLOCK_UPTIME

Add experimental low-precision clockid_t names corresponding to these
clocks, but implemented using cached timestamps in kernel rather than
a full time counter query.  This offers a minimum update rate of 1/HZ,
but in practice will often be more frequent due to the frequency of
time stamping in the kernel:

New clockid_t name              Approximates existing clockid_t

CLOCK_REALTIME_FAST             CLOCK_REALTIME
CLOCK_MONOTONIC_FAST            CLOCK_MONOTONIC
CLOCK_UPTIME_FAST               CLOCK_UPTIME

Add one additional new clockid_t, CLOCK_SECOND, which returns the
current second without performing a full time counter query or cache
lookup overhead to make sure the cached timestamp is stable.  This is
intended to support very low granularity consumers, such as time(3).

The names, visibility, and implementation of the above are subject
to change, and will not be MFC'd any time soon.  The goal is to
expose lower quality time measurement to applications willing to
sacrifice accuracy in performance critical paths, such as when taking
time stamps for the purpose of rescheduling select() and poll()
timeouts.  Future changes might include retrofitting the time counter
infrastructure to allow the "fast" time query mechanisms to use a
different time counter, rather than a cached time counter (i.e.,
TSC).

NOTE: With different underlying time mechanisms exposed, using
different time query mechanisms in the same application may result in
relative non-monoticity or the appearance of clock stalling for a
single clockid_t, as a cached time stamp queried after a precision
time stamp lookup may be "before" the time returned by the earlier
live time counter query.
2005-11-27 00:55:18 +00:00
David Xu
7023331e59 Regen. 2005-11-26 12:45:22 +00:00
David Xu
655291f2ae Bring in experimental kernel support for POSIX message queue. 2005-11-26 12:42:35 +00:00
Craig Rodrigues
5e6b93a014 In nmount() and vfs_donmount(), do not strcmp() the options in the iovec
directly.  We need to copyin() the strings in the iovec before
we can strcmp() them.  Also, when we want to send the errmsg back
to userspace, we need to copyout()/copystr() the string.

Add a small helper function vfs_getopt_pos() which takes in the
name of an option, and returns the array index of the name in the iovec,
or -1 if not found.  This allows us to locate an option in
the iovec without actually manipulating the iovec members. directly via
strcmp().

Noticed by:	kris on sparc64
2005-11-23 20:51:15 +00:00
John Polstra
ba3612cd5c Fix a bug in the loop in sonewconn that makes room on the incomplete
connection queue for a new connection.  It was removing connections
from the wrong list.

Submitted by:	Paul Mikesell
Sponsored by:	Isilon Systems
MFC after:	1 week
2005-11-22 01:55:29 +00:00
Marcel Moolenaar
60b7823989 Fix bug introduced in revision 1.186:
When all file systems have a time stamp of zero, which is the case
for example when the root file system is on a read-only medium, we
ended up not calling inittodr() at all.  A potential uncleanliness
existed as well. If multiple file systems had a non-zero time stamp,
we would call inittodr() multiple times. While this should not be
harmful, it's definitely not ideal.
Fix both issues by iterating over the mounted file systems to find
the largest time stamp and call inittodr() exactly once with that
time stamp. This could of course be a zero time stamp if none of the
mounted file systems have a non-zero time stamp. In that case the
annoying errors mentioned in the commit log for revision 1.186 still
haven't been avoided. The bottom line is that inittodr() should not
complain when it gets a time base of zero. At the time of this
commit only alpha seems to have that problem.

Reported by: Dario Freni (saturnero at freesbie dot org)
MFC after: 1 week
2005-11-19 21:51:45 +00:00
Craig Rodrigues
425e5b6268 Parse more mount options in vfs_donmount(), before vfs_domount()
is called.  It looks like there are lots of different mount flags checked
in vfs_domount(), so we need to do the parsing for these particular
mount flags earlier on.  The new flags parsed are:
async, force, multilabel, noasync, noatime, noclusterr, noclusterw,
noexec, nosuid, nosymfollow, snapshot, suiddir, sync, union.

Existing code which uses mount() to mount UFS filesystems is not
affected, but new code which uses nmount() to mount UFS filesystems
should behave better.
2005-11-19 21:22:21 +00:00
Andre Oppermann
5eefd88949 Add CLOCK_UPTIME to clock_gettime(2) reporting the current
uptime measured in SI seconds.

Sponsored by:	TCP/IP Optimization Fundraise 2005
2005-11-18 16:51:13 +00:00
Craig Rodrigues
8fd860cfa1 In vfs_nmount(), check to see if "update" mount option was passed
in, and if so, set MNT_UPDATE filesystem flag.
vfs_nmount() calls vfs_domount(), and there is special logic
inside vfs_domount() if MNT_UPDATE is set.  This is very important
when we want to do an update mount of the root filesystem, using nmount().
2005-11-18 01:31:10 +00:00
Pyun YongHyeon
dc22aef2ae Prefer NULL to 0.
Add missing lock/unlock in sysctl handler.
Protect accessing NULL pointer when resource allocation was failed.
style(9)

Reviewed by:	scottl
MFC after:	1 week
2005-11-17 08:56:21 +00:00
Olivier Houchard
481a1fe19e Add a new sysctl, kern.elf[32|64].can_exec_dyn. When set to 1, one can
execute a ET_DYN binary (shared object).
This does not make much sense, but some linux scripts expect to be able to
execute /lib/ld-linux.so.2 (ldd comes to mind).
The sysctl defaults to 0.

MFC after:	3 days
2005-11-14 22:24:00 +00:00
Robert Watson
c5c9bd5b72 In ktr_getrequest(), acquire ktrace_mtx earlier -- while the race
currently present is minor and offers no real semantic issues, it also
doesn't make sense since an earlier lockless check has already
occurred.  Also hold the mutex longer, over a manipulation of
per-process ktrace state, which requires synchronization.

MFC after:	1 month
Pointed out by:	jhb
2005-11-14 19:30:09 +00:00