Commit Graph

9890 Commits

Author SHA1 Message Date
jeff
7038c5de35 - Change types for necent runq additions to u_char rather than int.
- Fix these types in ULE as well.  This fixes bugs in priority index
   calculations in certain edge cases. (int)-1 % 64 != (uint)-1 % 64.

Reported by:	kkenn using pho's stress2.
2007-02-08 01:52:25 +00:00
alc
c1270b41ec Remove the vm page queue free mutex from the CDEV order. 2007-02-07 05:43:31 +00:00
rwatson
a7eaaf4149 Push UNIX domain socket locking further into uipc_ctloutput() in order to
avoid holding the UNIX domain socket subsystem lock over soooptcopyin()
and sooptcopyout().  This problem was introduced when LOCAL_CREDS, and
LOCAL_CONNWAIT support were added.

Reviewed by:	mdodd
2007-02-06 14:31:37 +00:00
mpp
f010375878 The change to the vm_page_queue_freelist lock from a spin lock to a
sleep lock missed the witness code, and the system will panic
immediately on boot if WITNESS is enabled.

Changed the witness definition to the new type.
2007-02-06 05:51:55 +00:00
mlaier
4bd0763c38 Add a small informative printf under bootverbose to firmware_register to
track problems when loading firmware from loader.
2007-02-03 16:01:46 +00:00
bms
2b8498ff24 Diff reduction with RELENG_6, style(9):
Remove unnecessary brace; && should be on end of line.
No functional changes.
2007-02-03 03:57:45 +00:00
bms
a6c57fe6a9 Use int instead of u_int for the 'extra' argument to the
clone_create() KPI.
This fixes a signedness bug in unit number comparisons.

Submitted by:	imp, Landon Fuller
PR:		kern/105228
MFC after:	2 weeks
2007-02-02 22:27:45 +00:00
kib
a816abd565 Record kqueue -> struct mount mtx -> vnode interlock lock order to
catch the places where reverse lock order is instantiated.

OKed by:	jeff
2007-02-02 09:02:18 +00:00
julian
743211870f Move the seting of the idle_mask bits to a place where they
can't be wrong.
Also use the IDLETD bit in the thread mask to test if its an idle thread
rather than doing a PCPU access.
2007-02-02 05:14:22 +00:00
andre
ad9bb7722c Generic socket buffer auto sizing support, header defines, flag inheritance.
MFC after:	1 month
2007-02-01 17:53:41 +00:00
mlaier
e3327eddd2 In case we are supplied with an imagename that matches a module, but not a
firmware in that module (eventhough this is a programming error) - drop the
reference to the module again.

Submitted by:	Benjamin Close
MFC after:	3 days
2007-01-27 19:52:08 +00:00
jeff
0f05ca9b5b - Implement much more intelligent ipi sending. This algorithm tries to
minimize IPIs and rescheduling when scheduling like tasks while keeping
   latency low for important threads.
   1) An idle thread is running.
   2) The current thread is worse than realtime and the new thread is
      better than realtime.  Realtime to realtime doesn't preempt.
   3) The new thread's priority is less than the threshold.
2007-01-25 23:51:59 +00:00
jeff
94085f7612 - Get rid of the unused DIDRUN flag. This was really only present to
support sched_4bsd.
 - Rename the KTR level for non schedgraph parsed events.  They take event
   space from things we'd like to graph.
 - Reset our slice value after we sleep.  The slice is simply there to
   prevent starvation among equal priorities.  A thread which had almost
   exhausted it's slice and then slept doesn't need to be rescheduled a
   tick after it wakes up.
 - Set the maximum slice value to a more conservative 100ms now that it is
   more accurately enforced.
2007-01-25 19:14:11 +00:00
mohans
83064ec323 Fix for problems that occur when all mbuf clusters migrate to the mbuf packet
zone. Cluster allocations fail when this happens. Also processes that may have
blocked on cluster allocations will never be woken up. Thanks to rwatson for
an overview of the issue and pointers to the mbuma paper and his tool to dump
out UMA zones.

Reviewed by: andre@
2007-01-25 01:05:23 +00:00
jeff
743ea48fbc - With a sleep time over 2097 seconds hzticks and slptime could end up
negative.  Use unsigned integers for sleep and run time so this doesn't
   disturb sched_interact_score().  This should fix the invalid interactive
   priority panics reported by several users.
2007-01-24 18:18:43 +00:00
rrs
ba4b733a7c Fixes the MSG_PEEK for sctp_generic_recvmsg() the msg_flags
were not being copied in properly so PEEK and any other
msg_flags input operation were not being performed right.
Approved by:	gnn
2007-01-24 12:59:56 +00:00
kib
fdd50404d1 Cylinder group bitmaps and blocks containing inode for a snapshot
file are after snaplock, while other ffs device buffers are before
snaplock in global lock order. By itself, this could cause deadlock
when bdwrite() tries to flush dirty buffers on snapshotted ffs. If,
during the flush, COW activity for snapshot needs to allocate block
and ffs_alloccg() selects the cylinder group that is being written
by bdwrite(), then kernel would panic due to recursive buffer lock
acquision.

Avoid dealing with buffers in bdwrite() that are from other side of
snaplock divisor in the lock order then the buffer being written. Add
new BOP, bop_bdwrite(), to do dirty buffer flushing for same vnode in
the bdwrite(). Default implementation, bufbdflush(), refactors the code
from bdwrite(). For ffs device buffers, specialized implementation is
used.

Reviewed by:	tegge, jeff, Russell Cattelan (cattelan xfs org, xfs changes)
Tested by:	Peter Holm
X-MFC after:	3 weeks (if ever: it changes ABI)
2007-01-23 10:01:19 +00:00
jeff
8fd8265087 - Catch up to setrunqueue/choosethread/etc. api changes.
- Define our own maybe_preempt() as sched_preempt().  We want to be able
   to preempt idlethread in all cases.
 - Define our idlethread to require preemption to exit.
 - Get the cpu estimation tick from sched_tick() so we don't have to worry
   about errors from a sampling interval that differs from the time
   domain.  This was the source of sched_priority prints/panics and
   inaccurate pctcpu display in top.
2007-01-23 08:50:34 +00:00
jeff
474b917526 - Remove setrunqueue and replace it with direct calls to sched_add().
setrunqueue() was mostly empty.  The few asserts and thread state
   setting were moved to the individual schedulers.  sched_add() was
   chosen to displace it for naming consistency reasons.
 - Remove adjustrunqueue, it was 4 lines of code that was ifdef'd to be
   different on all three schedulers where it was only called in one place
   each.
 - Remove the long ifdef'd out remrunqueue code.
 - Remove the now redundant ts_state.  Inspect the thread state directly.
 - Don't set TSF_* flags from kern_switch.c, we were only doing this to
   support a feature in one scheduler.
 - Change sched_choose() to return a thread rather than a td_sched.  Also,
   rely on the schedulers to return the idlethread.  This simplifies the
   logic in choosethread().  Aside from the run queue links kern_switch.c
   mostly does not care about the contents of td_sched.

Discussed with:	julian

 - Move the idle thread loop into the per scheduler area.  ULE wants to
   do something different from the other schedulers.

Suggested by:	jhb

Tested on:	x86/amd64 sched_{4BSD, ULE, CORE}.
2007-01-23 08:46:51 +00:00
rodrigc
4c351de443 When exiting vfs_export(), delete the "export" option from
the mount options list with vfs_deleteopt().  At this point, the export
information is saved in mp->mnt_export, so we can delete
the "export" mount option from mp->mnt_optnew and mp->mnt_opt.

This fixes read-write/read-only update mounts (mount -u -o rw, mount -u -o ro)
of NFS exported directories.

For some reason, I could only reproduce the problem with a configuration
supplied by Andre:
- "options QUOTA" enabled in kernel config
- "/ -maproot=root 10.0.1.105" in /etc/exports

Reported by:	kris, Andre Guibert de Bruet <andy siliconlandmark com>,
            	Andrzej Tobola <ato iem pw edu pl>
Tested by:	Andre Guibert de Bruet
2007-01-23 06:19:16 +00:00
andre
4a22f82e6c Unbreak writes of 0 bytes. Zero byte writes happen when only ancillary
control data but no payload data is passed.

Change m_uiotombuf() to return at least one empty mbuf if the requested
length was zero.  Add comment to sosend_dgram and sosend_generic().

Diagnoses by:		jhb
Regression test by:	rwatson
Pointy hat to.		andre
2007-01-22 14:50:28 +00:00
kib
79752b63e1 Below is slightly edited description of the LOR by Tor Egge:
--------------------------
[Deadlock] is caused by a lock order reversal in vfs_lookup(), where
[some] process is trying to lock a directory vnode, that is the parent
directory of covered vnode) while holding an exclusive vnode lock on
covering vnode.

A simplified scenario:

root fs					var fs
/    		A			/    (/var)	D
/var		B			/log (/var/log) E
vfs lock	C			vfs lock	F

Within each file system, the lock order is clear: C->A->B and F->D->E

When traversing across mounts, the system can choose between two lock orders,
but everything must then follow that lock order:

      L1: C->A->B
		|
	        +->F->D->E

      L2: F->D->E
	     |
             +->C->A->B

The lookup() process for namei("/var") mixes those two lock orders:

    VOP_LOOKUP() obtains B while A is held
    vfs_busy() obtains a shared lock on F while A and B are held (follows L1,
    violates L2)
    vput() releases lock on B
    VOP_UNLOCK() releases lock on A
    VFS_ROOT() obtains lock on D while shared lock on F is held
    vfs_unbusy() releases shared lock on F
    vn_lock() obtains lock on A while D is held (violates L1, follows L2)

dounmount() follows L1 (B is locked while F is drained).

Without unmount activity, vfs_busy() will always succeed without blocking
and the deadlock isn't triggered (the system behaves as if L2 is followed).

With unmount, you can get 4 processes in a deadlock:

     p1: holds D, want A (in lookup())
     p2: holds shared lock on F, want D (in VFS_ROOT())
     p3: holds B, want drain lock on F (in dounmount())
     p4: holds A, want B (in VOP_LOOKUP())

You can have more than one instance of p2.

The reversal was introduced in revision 1.81 of src/sys/kern/vfs_lookup.c and
MFCed to revision 1.80.2.1, probably to avoid a cascade of vnode locks when nfs
servers are dead (VFS_ROOT() just hangs) spreading to the root fs root vnode.

- Tor Egge

To fix the LOR, ups@ noted that when crossing the mount point, ni_dvp
is actually not used by the callers of namei. Thus, placeholder deadfs
vnode vp_crossmp is introduced that is filled into ni_dvp.

Idea by:	ups
Reviewed by:	tegge, ups, jeff, rwatson (mac interaction)
Tested by:	Peter Holm
MFC after:	2 weeks
2007-01-22 11:25:22 +00:00
jeff
5fd995e14a - Disable the long-term load balancer. I believe that steal_busy works
better and gives more predictable results.
2007-01-20 21:24:05 +00:00
jeff
a1996060b3 - We do need to IPI the idlethread on some systems. It may be stuck in
a power saving mode otherwise.
 - If the thread is already bound in sched_bind() unbind it before
   re-binding it to a new cpu.  I don't like these semantics but they are
   expected by some code in the tree.  Patch by jkoshy.
2007-01-20 17:03:33 +00:00
jeff
3f693f3417 - In tdq_transfer() always set NEEDRESCHED when necessary regardless of
the ipi settings.  If NEEDRESCHED is set and an ipi is later delivered
   it will clear it rather than cause extra context switches.  However, if
   we miss setting it we can have terrible latency.
 - In sched_bind() correctly implement bind.  Also be slightly more
   tolerant of code which calls bind multiple times.  However, we don't
   change binding if another call is made with a different cpu.  This
   does not presently work with hwpmc which I believe should be changed.
2007-01-20 09:03:43 +00:00
jeff
a5cccc05cb Major revamp of ULE's cpu load balancing:
- Switch back to direct modification of remote CPU run queues.  This added
   a lot of complexity with questionable gain.  It's easy enough to
   reimplement if it's shown to help on huge machines.
 - Re-implement the old tdq_transfer() call as tdq_pickidle().  Change
   sched_add() so we have selectable cpu choosers and simplify the logic
   a bit here.
 - Implement tdq_pickpri() as the new default cpu chooser.  This algorithm
   is similar to Solaris in that it tries to always run the threads with
   the best priorities.  It is actually slightly more complex than
   solaris's algorithm because we also tend to favor the local cpu over
   other cpus which has a boost in latency but also potentially enables
   cache sharing between the waking thread and the woken thread.
 - Add a bunch of tunables that can be used to measure effects of different
   load balancing strategies.  Most of these will go away once the
   algorithm is more definite.
 - Add a new mechanism to steal threads from busy cpus when we idle.  This
   is enabled with kern.sched.steal_busy and kern.sched.busy_thresh.  The
   threshold is the required length of a tdq's run queue before another
   cpu will be able to steal runnable threads.  This prevents most queue
   imbalances that contribute the long latencies.
2007-01-19 21:56:08 +00:00
delphij
2e20bff54b Use FOREACH_PROC_IN_SYSTEM instead of using its unrolled form. 2007-01-17 14:58:53 +00:00
ssouhlal
d4434aa6e9 Remove hptlock from the static witness table, now that it's a regular sleep
mutex.
2007-01-16 22:56:28 +00:00
rrs
e614960c33 Removes useless (flags | ) KASSERT. The ^ one that actually
does what we want.

Submitted by:	Li Xin delphij@delphij.net
Reviewed by:	rrs
Approved by:	gnn
2007-01-16 11:40:55 +00:00
kmacy
4da320a732 Fix warning by adding extra parentheses 2007-01-16 00:09:58 +00:00
rrs
af870dbd2e Reviewed by: rwatson
Approved by:	gnn

Add a new function hashinit_flags() which allows NOT-waiting
for memory (or waiting). The old hashinit() function now
calls hashinit_flags(..., HASH_WAITOK);
2007-01-15 15:06:28 +00:00
rwatson
3dd2666ab7 Re-wrap comments to wider margins now that they have been relocated from
within functions.
2007-01-12 22:01:03 +00:00
imp
faa5da2dc1 When ntp_gettime() was converted from a sysctl + wrapper to a system
call, its semantics were unintentionally changed.  It went from
returning the time state to returning 0 or -1.  Since 0 means time
normal, and non-zero effectively only shows up around leap seconds,
this went unnoticed until now.  At least unnoticed until someone was
trying to run a binary they didn't have source for and it was
misbehaving...

Submitted by: Judah Levine
MFC After: 2 weeks
2007-01-12 07:40:30 +00:00
jhb
496f904eab Wrap propagate_priority() in a critical section to prevent unwanted
preemptions when adjusting the priority of a thread that is on a run
queue.  This was only observed when FULL_PREEMPTION was enabled.

Reported by:	kris
Diagnosed by:	ups
MFC after:	1 week
2007-01-11 19:13:27 +00:00
rwatson
37fe9cfef4 Sort copyrights together.
MFC after:	3 days
2007-01-08 20:37:02 +00:00
rwatson
0ef16b090a Resort copyrights and licenses in kern_acct.c: per UCB letter,
the UCB license now excludes the advertising clause.  I'm not
interested in it either, so move my copyright.  This leaves
only a CGD copyright with the advertising clause.

MFC after:      3 days
2007-01-08 20:35:13 +00:00
rwatson
fd7dad9d6e Canonicalize copyrights in some files I hold copyrights on:
- Sort by date in license blocks, oldest copyright first.
- All rights reserved after all copyrights, not just the first.
- Use (c) to be consistent with other entries.

MFC after:	3 days
2007-01-08 17:49:59 +00:00
jeff
0f9511e94e - Don't let SCHED_TICK_TOTAL() return less than hz. This can cause integer
divide faults in roundup() later if it is able to return 0.  For some
   reason this bug only shows up on my laptop and not my testboxes.
2007-01-06 12:33:43 +00:00
jeff
b19ed2c7b0 - Fix the sched_priority() invalid priority bugs. Use roundup() instead
of max() when computing the divisor in SCHED_TICK_PRI().  This prevents
   cases where rounding down would allow the quotient to exceed
   SCHED_PRI_RANGE.
 - Garbage collect some unused flags and fields.
 - Replace TDF_HOLD with sched_pin_td()/sched_unpin_td() since it simply
   duplicated this functionality.
 - Re-enable the rebalancer by default and fix the sysctl so it can be
   modified.
2007-01-06 08:44:13 +00:00
jeff
80a97a8d5e - Don't IPI unless we're going to interrupt something exiting in the kernel.
otherwise we can afford the latency.  This makes a significant performance
   improvement.
2007-01-06 02:34:23 +00:00
jeff
08aa1698e7 - Fix a comparison in sched_choose() that caused cpus to be constantly
marked idle, thus breaking cpu load balancing.
 - Change sched_interact_update() to fix cases where the stored history
   has expanded significantly rather than handling them in the callers.  This
   fixes a case where sched_priority() could compute a bad value.
 - Add a sysctl to disable the global load balancer for experimentation.
2007-01-05 23:45:38 +00:00
jhb
256d3cdbaf - Close a race between enumerating UNIX domain socket pcb structures via
sysctl and socket teardown by adding a reference count to the UNIX domain
  pcb object and fixing the sysctl that enumerates unpcbs to grab a
  reference on each unpcb while it builds the list to copy out to userland.
- Close a race between UNIX domain pcb garbage collection (unp_gc()) and
  file descriptor teardown (fdrop()) by adding a new garbage collection
  flag FWAIT.  unp_gc() sets FWAIT while it walks the message buffers
  in a UNIX domain socket looking for nested file descriptor references
  and clears the flag when it is finished.  fdrop() checks to see if the
  flag is set on a file descriptor whose refcount just dropped to 0 and
  waits for unp_gc() to clear the flag before completely destroying the
  file descriptor.

MFC after:	1 week
Reviewed by:	rwatson
Submitted by:	ups
Hopefully makes the panics go away:	mx1
2007-01-05 19:59:46 +00:00
jeff
47d8080afa - ftick was initialized to -1 for init and any of it's children. Fix this by
setting ftick = ltick = ticks in schedinit().
 - Update the priority when we are pulled off of the run queue and when we
   are inserted onto the run queue so that it more accurately reflects our
   present status.  This is important for efficient priority propagation
   functioning.
 - Move the frequency test into sched_pctcpu_update() so we don't repeat it
   each time we'd like to call it.
 - Put some temporary work-around code in sched_priority() in case the tick
   mechanism produces a bad priority.  Eventually this should revert to an
   assert again.
2007-01-05 08:50:38 +00:00
jeff
ae96850d62 - Only allow the tdq_idx to increase by one each tick rather than up to
the most recently chosen index.  This significantly improves nice
   behavior.  This allows a lower priority thread to run some multiple of
   times before the higher priority thread makes it to the front of
   the queue.  A nice +20 cpu hog now only gets ~5% of the cpu when running
   with a nice 0 cpu hog and about 1.5% with a nice -20 hog.  A nice
   difference of 1 makes a 4% difference in cpu usage between two hogs.
 - Track a seperate insert and removal index.  When the removal index is
   empty it is updated to point at the current insert index.
 - Don't remove and re-add a thread to the runq when it is being adjusted
   down in priority.
 - Pull some conditional code out of sched_tick().  It's looking a bit
   large now.
2007-01-04 12:16:19 +00:00
jeff
60a15a9f22 - Don't pass a pointer into runq_choose_from(). The caller can adjust the
index if it chooses to.
2007-01-04 12:10:58 +00:00
jeff
2c3282f28a ULE 2.0:
- Remove the double queue mechanism for timeshare threads.  It was slow
   due to excess cache lines in play, caused suboptimal scheduling behavior
   with niced and other non-interactive processes, complicated priority
   lending, etc.
 - Use a circular queue with a floating starting index for timeshare threads.
   Enforces fairness by moving the insertion point closer to threads with
   worse priorities over time.
 - Give interactive timeshare threads real-time user-space priorities and
   place them on the realtime/ithd queue.
 - Select non-interactive timeshare thread priorities based on their cpu
   utilization over the last 10 seconds combined with the nice value.  This
   gives us more sane priorities and behavior in a loaded system as
   compared to the old method of using the interactivity score.  The
   interactive score quickly hit a ceiling if threads were non-interactive
   and penalized new hog threads.
 - Use one slice size for all threads.  The slice is not currently
   dynamically set to adjust scheduling behavior of different threads.
 - Add some new sysctls for scheduling parameters.

Bug fixes/Clean up:
 - Fix zeroing of td_sched after initialization in sched_fork_thread() caused
   by recent ksegrp removal.
 - Fix KSE interactivity issues related to frequent forking and exiting of
   kse threads.  We simply disable the penalty for thread creation and exit
   for kse threads.
 - Cleanup the cpu estimator by using tickincr here as well.  Keep ticks and
   ltick/ftick in the same frequency.  Previously ticks were stathz and
   others were hz.
 - Lots of new and updated comments.
 - Many many others.

Tested on:	up x86/amd64, 8way amd64.
2007-01-04 08:56:25 +00:00
jeff
78c3275ce1 - Add three new functions to support circular run queues.
- runq_add_pri allows the caller to position the thread at any rqindex
   regardless of priority.
 - runq_choose_from() chooses the lowest priority thread starting from a given
   index.  The index is updated with the rqindex of the chosen thread.  This
   routine is used to pick the lowest priority relative to a given index.
 - runq_remove_idx() updates the index if the run queue that held the removed
   thread is now empty.
2007-01-04 08:39:58 +00:00
jeff
b5c5ce5407 - Fix schedgraph output with KSE threads. Call thread_switchout() after
calling CTR() so we don't confuse a new kse thread with a real preemption.
2007-01-03 02:38:41 +00:00
davidxu
eafca3f075 Fix compiling. 2007-01-02 04:14:01 +00:00
rwatson
2a3330b81a Prefer a more traditional spelling of inhibited in comments and panic
messages.
2006-12-31 15:56:04 +00:00
jeff
9c815f4892 - More search and replace prettying. 2006-12-29 12:55:32 +00:00
jeff
e74edb3876 - Clean up a bit after the most recent KSE restructuring. 2006-12-29 10:37:07 +00:00
rwatson
4a9f23955f Break contents of kern_mac.c out into two files following a repo-copy:
mac_framework.c   Contains basic MAC Framework functions, policy
                  registration, sysinits, etc.

mac_syscalls.c    Contains implementations of various MAC system calls,
                  including ENOSYS stubs when compiling without options
                  MAC.

Obtained from:	TrustedBSD Project
2006-12-28 20:52:02 +00:00
rwatson
e7f843dc94 Update MAC Framework general comments, referencing various interfaces it
consumes and implements, as well as the location of the framework and
policy modules.

Refactor MAC Framework versioning a bit so that the current ABI version can
be exported via a read-only sysctl.

Further update comments relating to locking/synchronization.

Update copyright to take into account these and other recent changes.

Obtained from:	TrustedBSD Project
2006-12-28 17:25:57 +00:00
davidxu
70875d94ab break loop early if we know that there are at least two signals. 2006-12-25 03:00:15 +00:00
davidxu
f51b738f57 Fix typo, p_slptime should be td_slptime. 2006-12-24 01:52:27 +00:00
bms
1a77168e4f Drop all received data mbufs from a socket's queue if the MT_SONAME
mbuf is dropped, to preserve the invariant in the PR_ADDR case.

Add a regression test to detect this condition, but do not hook it
up to the build for now.

PR:             kern/38495
Submitted by:   James Juran
Reviewed by:    sam, rwatson
Obtained from:  NetBSD
MFC after:      2 weeks
2006-12-23 21:07:07 +00:00
rwatson
a1911a8513 Update comments to reflect changes in the extattrctl() code.
Clean up comment formatting.

Obtained from:	TrustedBSD Project
2006-12-23 00:30:03 +00:00
rwatson
520b5875d2 Following a repo-copy of vfs_syscalls.c to vfs_extattr.c, remove
non-extattr functions from vfs_extattr.c, and extattr functions from
vfs_syscalls.c.

Change copyright/license on vfs_extattr.c to my copyright/license on
the extended attribute implementation (from extattr.h).

Clean up includes a bit.

Obtained from:	TrustedBSD Project
2006-12-23 00:10:36 +00:00
rwatson
ae9ef07995 Move src/sys/sys/mac_policy.h, the kernel interface between the MAC
Framework and security modules, to src/sys/security/mac/mac_policy.h,
completing the removal of kernel-only MAC Framework include files from
src/sys/sys.  Update the MAC Framework and MAC policy modules.  Delete
the old mac_policy.h.

Third party policy modules will need similar updating.

Obtained from:	TrustedBSD Project
2006-12-22 23:34:47 +00:00
rrs
c427816562 The prepend function did not handle non-pkthdr's correctly.
It always called MH_ALIGN for small lengths being
prepended (less than MHLEN). This meant that if you did
a prepend on a non M_PKTHDR the system would panic with
the KASSERT in MH_ALIGN. Instead we are not aware of
this and do a MH_ALIGN or M_ALIGN as appropriate.

Reviewed by:	andre
Approved by:	gnn
2006-12-21 19:58:04 +00:00
rwatson
6fa1425be4 Remove mac_enforce_subsystem debugging sysctls. Enforcement on
subsystems will be a property of policy modules, which may require
access control check entry points to be invoked even when not actively
enforcing (i.e., to track information flow without providing
protection).

Obtained from:	TrustedBSD Project
Suggested by:	Christopher dot Vance at sparta dot com
2006-12-21 09:51:34 +00:00
rwatson
5749ecccba Expand commenting on label slots, justification for the MAC Framework locking
model, interactions between locking and policy init/destroy methods.

Rewrap some comments to 77 character line wrap.

Obtained from:	TrustedBSD Project
2006-12-20 20:38:44 +00:00
jkim
0099defbac MFP4: (part of) 110058
copyin()/copyout() for message type is separated from msgsnd()/msgrcv() and
it is done from its wrapper functions to support 32-bit emulations.  After I
implemented this, I have briefly referenced NetBSD and Darwin.  NetBSD passes
copyin()/copyout() function pointers from wrappers.  Darwin passes size of
message type as an argument, which is actually similar to my first
implementation (P4 109706).  We may revisit these implementations later.
2006-12-20 19:26:30 +00:00
kib
9311fcbc5d In rev. 1.514, iodone on async buffer may happen before code checks the
vnode v_flag. For cluster buffers this would result in dereferencing NULL
b_vp. To prevent the panic, cache relevant vnode flag before calling
bstrategy.

Reported by:	Peter Holm, kris
Tested by:	Peter Holm
Reviewed by: tegge
Pointy hat to:	kib
2006-12-20 09:22:31 +00:00
davidxu
5a984630fa Add a lwpid field into per-cpu structure, the lwpid represents current
running thread's id on each cpu. This allow us to add in-kernel adaptive
spin for user level mutex. While spinning in user space is possible,
without correct thread running state exported from kernel, it hardly
can be implemented efficiently without wasting cpu cycles, however
exporting thread running state unlikely will be implemented soon as
it has to design and stablize interfaces. This implementation is
transparent to user space, it can be disabled dynamically. With this
change, mutex ping-pong program's performance is improved massively on
SMP machine. performance of mysql super-smack select benchmark is increased
about 7% on Intel dual dual-core2 Xeon machine, it indicates on systems
which have bunch of cpus and system-call overhead is low (athlon64, opteron,
and core-2 are known to be fast), the adaptive spin does help performance.

Added sysctls:
    kern.threads.umtx_dflt_spins
        if the sysctl value is non-zero, a zero umutex.m_spincount will
        cause the sysctl value to be used a spin cycle count.
    kern.threads.umtx_max_spins
        the sysctl sets upper limit of spin cycle count.

Tested on: Athlon64 X2 3800+, Dual Xeon 5130
2006-12-20 04:40:39 +00:00
mbr
a2c03bf6cb Back out rev. 1.266. The real cause for the recent panics has been fixed
in rev. 1.267 and there is no need to keep this test.
2006-12-20 02:49:59 +00:00
mbr
37965664cc Giant might have been temporarily dropped while waiting for proctree_lock, allowing for an
intervening tty_close() that cleared tp->t_session.

Submitted by:	tegge
MFC:		1 day
2006-12-19 22:34:32 +00:00
mbr
ccabbc6486 Add the tp->t_refcnt validity check back. There are still some race
conditions where tp->t_refcnt can go to zero.
2006-12-19 16:46:13 +00:00
davidxu
b0b74f9bd3 Remove unused sysctls. 2006-12-19 13:06:01 +00:00
pjd
6cc6a8d100 Use pipe_direct_write() optimization only if the data is in process' memory.
This fixes sending data through pipe from the kernel.

Fix suggested by:	rwatson
2006-12-19 12:52:22 +00:00
kmacy
af645e118f ktrace_cv is no longer used - remove
Submitted by: Attilio Rao
2006-12-17 00:16:09 +00:00
kmacy
bb69932355 Cleaner fix for handling declaration of loop variable under INVARIANTS
- in trying to avoid nested brackets and #ifdef INVARIANTS around i at the
  top, I broke booting for INVARIANTS all together :-(
- the cleanest fix is to simply assign to sq twice if INVARIANTS is enabled
- tested both with and without INVARIANTS :-/
2006-12-17 00:14:20 +00:00
ache
aebc61a22f Don't intermix assignments and variable declarations in prev. commit 2006-12-16 21:17:27 +00:00
ache
84d03f55f7 Fix NULL pointer reference for INVARIANTS case
Submitted by:   Yuriy Tsibizov <Yuriy.Tsibizov@gfk.ru>
2006-12-16 20:33:26 +00:00
rodrigc
10e4664552 In vfs_export(), if we specify MNT_DELEXPORT in the struct export_args,
after we perform the operations to delete the export,
call vfs_deleteopt() to delete the "export" mount option from
the linked list of mount options associated with that mount point.

This fixes one scenario:
- put a filesystem in /etc/exports to export it
- remove the filesystem from /etc/exports to delete the export and restart
  mountd
- try to do a "mount -u -o ro" or "mount -u -o rw" on that filesystem
  now that it is no  longer exported.
2006-12-16 15:50:36 +00:00
rodrigc
ccbffdda2c Add a function vfs_deleteopt() which searches through the vfsoptlist
linked list of mount options by name, and deletes the option if it finds it.
2006-12-16 15:44:03 +00:00
rodrigc
2aade96a59 Convert to ANSI-style function prototypes. 2006-12-16 12:06:59 +00:00
rwatson
4d1fe0d425 For now, back out sysv_ipc.c:1.30, which caused shmget() with odd mode
arguments to fail.  The mode field for shmget() appears to have undefined
meaning in the context of an already-present IPC object, but applications
appear to assume any arbitrary passed value will be ignored.  I had hoped
to revisit this more quickly, but am removing the change for now to
prevent toe-stubbing.

Reported by:	JAroslav Suchanek <jarda at grisoft dot cz>
PR:		kern/106078
2006-12-16 11:30:54 +00:00
kmacy
4e5f5353fb correct name of number of sleep queues 2006-12-16 07:50:39 +00:00
kmacy
7327d346fc Add second sleep queue so that sx and lockmgr can have separate sleep
queues for shared and exclusive acquisitions

Submitted by: Attilio Rao
Approved by: jhb
2006-12-16 06:54:09 +00:00
kmacy
ad3abace0c - Fix some gcc warnings in lock_profile.h
- add cnt_hold cnt_lock support for spin mutexes
- make sure contested is initialized to zero to only bump contested when appropriate
- move initialization function to kern_mutex.c to avoid cyclic dependency between
  mutex.h and lock_profile.h
2006-12-16 02:37:58 +00:00
n_hibma
c98f016084 Align the interfaces for the various watchdogs and make the interface
behave as expected.

Also:
- Return an error if WD_PASSIVE is passed in to the ioctl as only
  WD_ACTIVE is implemented at the moment. See sys/watchdog.h for an
  explanation of the difference between WD_ACTIVE and WD_PASSIVE.
- Remove the I_HAVE_TOTALLY_LOST_MY_SENSE_OF_HUMOR define. If you've
  lost your sense of humor, than don't add a define.

Specific changes:

i80321_wdog.c
  Don't roll your own passive watchdog tickle as this would defeat the
  purpose of an active (userland) watchdog tickle.

ichwd.c / ipmi.c:
  WD_ACTIVE means active patting of the watchdog by a userland process,
  not whether the watchdog is active. See sys/watchdog.h.

kern_clock.c:
  (software watchdog) Remove a check for WD_ACTIVE as this does not make
  sense here. This reverts r1.181.
2006-12-15 21:44:49 +00:00
kib
e7cdcb3240 Resolve two deadlocks that could be caused by busy md device backed
by vnode. Allow for md thread and the thread that owns lock on vnode
backing the md device to do the write even when runningbufspace is
exhausted.

Tested by:	Peter Holm
Reviewed by:	tegge
MFC after:	2 weeks
2006-12-14 11:34:07 +00:00
jhb
65d8bd30a0 Add a function to return the MD interrupt source cookie associated with
an interrupt event.  Use this in the x86 code to fixup the intrcnt names
when an interrupt handler is removed.
2006-12-12 19:20:19 +00:00
jhb
7106027433 Add a comment and fix a whitespace nit. 2006-12-12 19:19:22 +00:00
julian
541d02c2d4 Fix a potential point of confusion. Art Ironport we've seen this end up
with an infinite loop in and out of the kernel during process shutdown.
2006-12-12 08:01:55 +00:00
rodrigc
fbf224913d Use vfs_mount_error() to log mount errors in a few places with human
readable strings which can be retrieved if an "errmsg" parameter is
passed into nmount().
2006-12-07 02:57:00 +00:00
julian
948c671f4a Changes to try fix sched_ule.c courtesy of David Xu. 2006-12-06 06:55:59 +00:00
julian
396ed947f6 Threading cleanup.. part 2 of several.
Make part of John Birrell's KSE patch permanent..
Specifically, remove:
Any reference of the ksegrp structure. This feature was
never fully utilised and made things overly complicated.
All code in the scheduler that tried to make threaded programs
fair to unthreaded programs.  Libpthread processes will already
do this to some extent and libthr processes already disable it.

Also:
Since this makes such a big change to the scheduler(s), take the opportunity
to rename some structures and elements that had to be moved anyhow.
This makes the code a lot more readable.

The ULE scheduler compiles again but I have no idea if it works.

The 4bsd scheduler still reqires a little cleaning and some functions that now do
ALMOST nothing will go away, but I thought I'd do that as a separate commit.

Tested by David Xu, and Dan Eischen using libthr and libpthread.
2006-12-06 06:34:57 +00:00
kmacy
ff9c89ea11 Bug fix for obscenely large wait times on uncontested locks
if waittime was zero (the lock was uncontested) l->lpo_waittime
in the hash table would not get initialized.

Inspection prompted by questions from: Attilio Rao
2006-12-04 22:15:50 +00:00
jhb
12956f1daa Fix an edge case in rman_manage_region() where it didn't handle a resource
ending at ULONG_MAX properly.  While here, use TAILQ_FOREACH_SAFE().

Tested by:	"Stephane E. Potvin" <sepotvin at videotron-ca>
MFC after:	1 week
2006-12-04 16:45:23 +00:00
davidxu
22a81fc246 if a thread blocked on userland condition variable is
pthread_cancel()ed, it is expected that the thread will not
consume a pthread_cond_signal(), therefor, we use thr_wake()
to mark a flag, the flag tells a thread calling do_cv_wait()
in umtx code to not block on a condition variable.
Thread library is expected that once a thread detected itself
is in pthread_cond_wait, it will call the thr_wake() for itself
in its SIGCANCEL handler.
2006-12-04 14:15:12 +00:00
davidxu
f6f7f2a20e Introduce userspace condition variable, since we have already POSIX
priority mutex implemented, it is the time to introduce this stuff,
now we can use umutex and ucond together to implement pthread's
condition wait/signal.
2006-12-03 01:49:22 +00:00
kib
29a6b4a0eb Linker set support depends on the magic __start_<section> and
__stop_<section> symbols generated by the static linker for elf
sections. This is done only for the final link, and not for ld -r.
Augment elf_obj in-kernel linker by recognizing such special symbols,
and resolving them to the start and end of the section automatically.

As result, linker sets on amd64 could be used in the same way as on
other architectures, without explicit calls to linker_file_lookup_set().

Requested by:	rdivacky
No objections from:	peter, jhb
2006-11-30 10:50:29 +00:00
phk
b911f6e6f0 Only grab the sched_lock if we actually need to modify the thread priority.
During a buildworld only 2/3 of the calls to msleep actually changed
the priority.
2006-11-30 08:27:38 +00:00
jb
01bbbf558e Flushing the buffer is conditional on actually using the buffer. Oops. 2006-11-30 07:25:52 +00:00
jb
da35e3e55f Turn console printf buffering into a kernel option and only on
by default for sun4v where it is absolutely required.

This change moves the buffer from struct pcpu to the stack to avoid
using the critical section which created a LOR in a couple of cases
due to interaction with the tty code and kqueue. The LOR can't be
fixed with the critical section and the pcpu buffer can't be used
without the critical section.

Putting the buffer on the stack was my initial solution, but it was
pointed out that the stress on the stack might cause problems
depending on the call path. We don't have a way of creating tests
for those possible cases, so it's best to leave this as an option
for the time being. In time we may get enough data to enable this
option more generally.
2006-11-30 04:17:05 +00:00
davidxu
0c2ab2f920 - Remove third parameter of itimer_find, the parameter is always zero.
- Call callout_drain on deleting POSIX timer.
- Use kern_timer_delete in exiting hook.
2006-11-28 03:24:34 +00:00
mohans
38630c101d Fix a race in soclose() where connections could be queued to the
listening socket after the pass that cleans those queues. This
results in these connections being orphaned (and leaked). The fix
is to clean up the so queues after detaching the socket from the
protocol. Thanks to ups and jhb for discussions and a thorough code
review.
2006-11-22 23:54:29 +00:00
jhb
80327896bd Save exit status of an exiting process in kn_data in the knote.
Submitted by:	Jared Yanovich ^phirerunner at comcast.net^
MFC after:	2 weeks
2006-11-20 22:17:50 +00:00
julian
14dac92354 whitespace fix only 2006-11-20 16:13:02 +00:00
davidxu
a25887447d Use scheduler API sched_user_prio() to adjust thread's userland priority,
use td_base_user_prio to get real userland priority since POSIX priority
mutex may adjust td_user_pri which is an effective priority.
2006-11-20 05:50:59 +00:00
alc
d93a445ea9 Add vm map and object locking to each_writable_segment().
Noticed by: jhb@
MFC after: 3 weeks
2006-11-19 23:38:59 +00:00
jkim
edc10a6695 Fix msgsnd(3)/msgrcv(3) deadlock under heavy resource pressure by timing out
msgsnd and rechecking resources.  This problem was found while I was running
Linux Test Project test suite (test cases: msgctl08, msgctl09).
Change `msgwait' to `msgsnd' and `msgrcv' to distinguish its sleeping
conditions.  Few cosmetic changes to debugging messages.
2006-11-17 20:43:01 +00:00
pjd
63d82b700d Change sleepq_add(9) argument from 'struct mtx *' to 'struct lock_object *',
which allows to use it with different kinds of locks. For example it allows
to implement Solaris conditions variables which will be used in ZFS port on
top of sx(9) locks.

Reviewed by:	jhb
2006-11-16 01:02:00 +00:00
jhb
fa8eeee427 Adjust assertions to allow for magical properties of the 'lbolt' wait
channel for tsleep():
- Allow tsleep() on &lbolt without Giant with a timeout 0 since &lbolt has
  an implied timeout.
- If &lbolt is used with msleep() pass NULL to sleepq_add() for the lock
  object.  Unlike other sleepq channels, &lbolt doesn't have an associated
  owning lock.
2006-11-15 20:44:07 +00:00
davidxu
c3c0231226 Fix a copy-paste bug in NON-KSE case. 2006-11-14 05:48:27 +00:00
kmacy
0c00ea16db change vop_lock handling to allowing tracking of callers' file and line for
acquisition of lockmgr locks

Approved by: scottl (standing in for mentor rwatson)
2006-11-13 05:51:22 +00:00
kmacy
ec9503cd04 track lock class name in a way that doesn't break WITNESS 2006-11-13 05:41:46 +00:00
kmacy
e9b95a7ee6 Unbreak witness 2006-11-12 23:23:38 +00:00
andre
855a36d5a0 In kern_sendfile() fix the calculation of sbytes (the total number of bytes
written to the socket).  The rewrite in revision 1.240 got confused by the
FreeBSD 4.x bug compatibility code.

For some reason lighttpd, that was used for testing the new sendfile code,
was not affected by the problem but apache and others using headers/trailers
in the sendfile call received incorrect sbytes values after return from non-
blocking sockets.  This then lead to restarts with wrong offsets and thus
mixed up file contents when the socket was writeable again.  All programs
not using headers/trailers, like ftpd, were not affected by the bug.

Reported by:	Pawel Worach <pawel.worach-at-gmail.com>
Tested by:	Pawel Worach <pawel.worach-at-gmail.com>
2006-11-12 20:57:00 +00:00
davidxu
c2eeccb29d Copy base user priority in NO_KSE case. 2006-11-12 11:48:37 +00:00
trhodes
892aec9079 Fix mispatch of includes list; allows my kernel to build successfully. 2006-11-12 03:34:03 +00:00
kmacy
b082a8c58a show lock class in profiling output for default case where type is not specified when initializing the lock
Approved by: scottl (standing in for mentor rwatson)
2006-11-12 03:30:01 +00:00
davidxu
0c1c40362e Use mi_switch, this should fix loadavg calculation problem in NO_KSE case. 2006-11-12 03:18:22 +00:00
trhodes
e4ae9746ec Update includes for sys/posix4 move.
Approved by:	silence on -arch and -standards
2006-11-11 16:46:31 +00:00
trhodes
58cca8458a Merge posix4/* into normal kernel hierarchy.
Reviewed by:	glanced at by jhb
Approved by:	silence on -arch@ and -standards@
2006-11-11 16:26:58 +00:00
trhodes
2d45fdc244 Update #includes list. 2006-11-11 16:19:12 +00:00
davidxu
8cc5d384d8 Unbreak userland priority inheriting in NO_KSE case. 2006-11-11 13:11:29 +00:00
kmacy
ae161ce255 tinderbox fix 2006-11-11 07:38:48 +00:00
kmacy
2aebfa38b5 remove lingering call to rd(tick) 2006-11-11 07:28:45 +00:00
kmacy
d73c3c8f6e missed nits replacing mutex with lock 2006-11-11 06:28:47 +00:00
kmacy
9eefcf3161 MUTEX_PROFILING has been generalized to LOCK_PROFILING. We now profile
wait (time waited to acquire) and hold times for *all* kernel locks. If
the architecture has a system synchronized TSC, the profiling code will
use that - thereby minimizing profiling overhead. Large chunks of profiling
code have been moved out of line, the overhead measured on the T1 for when
it is compiled in but not enabled is < 1%.

Approved by: scottl (standing in for mentor rwatson)
Reviewed by: des and jhb
2006-11-11 03:18:07 +00:00
maxim
75bc36366a o Fix a couple of obvious typos. 2006-11-08 09:09:07 +00:00
andre
aa817971d1 Style cleanups to the sctp_* syscall functions. 2006-11-07 21:28:12 +00:00
jhb
25a000d49f Simplify operations with sync_mtx in sched_sync():
- Don't drop the lock just to reacquire it again to check rushjob, this
  only wastes time.
- Use msleep() to drop the mutex while sleeping instead of explicitly
  unlocking around tsleep.

Reviewed by:	pjd
2006-11-07 19:45:05 +00:00
jhb
f4782279b7 Fix comment typo and function declaration. 2006-11-07 19:07:33 +00:00
tegge
4fdb31ed24 Don't drop reference to tty in tty_close() if TS_ISOPEN is already cleared.
Reviewed by:	bde
2006-11-06 22:12:43 +00:00
andre
98af8dbef9 Handle early errors in kern_sendfile() by introducing a new goto 'out'
label after the sbunlock() part.

This correctly handles calls to sendfile(2) without valid parameters
that was broken in rev. 1.240.

Coverity error:	272162
2006-11-06 21:53:19 +00:00
rwatson
10d0d9cf47 Sweep kernel replacing suser(9) calls with priv(9) calls, assigning
specific privilege names to a broad range of privileges.  These may
require some future tweaking.

Sponsored by:           nCircle Network Security, Inc.
Obtained from:          TrustedBSD Project
Discussed on:           arch@
Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri,
                        Alex Lyashkov <umka at sevcity dot net>,
                        Skip Ford <skip dot ford at verizon dot net>,
                        Antoine Brodin <antoine dot brodin at laposte dot net>
2006-11-06 13:42:10 +00:00
rwatson
7288104e20 Add a new priv(9) kernel interface for checking the availability of
privilege for threads and credentials.  Unlike the existing suser(9)
interface, priv(9) exposes a named privilege identifier to the privilege
checking code, allowing more complex policies regarding the granting of
privilege to be expressed.  Two interfaces are provided, replacing the
existing suser(9) interface:

suser(td)                 ->   priv_check(td, priv)
suser_cred(cred, flags)   ->   priv_check_cred(cred, priv, flags)

A comprehensive list of currently available kernel privileges may be
found in priv.h.  New privileges are easily added as required, but the
comments on adding privileges found in priv.h and priv(9) should be read
before doing so.

The new privilege interface exposed sufficient information to the
privilege checking routine that it will now be possible for jail to
determine whether a particular privilege is granted in the check routine,
rather than relying on hints from the calling context via the
SUSER_ALLOWJAIL flag.  For now, the flag is maintained, but a new jail
check function, prison_priv_check(), is exposed from kern_jail.c and used
by the privilege check routine to determine if the privilege is permitted
in jail.  As a result, a centralized list of privileges permitted in jail
is now present in kern_jail.c.

The MAC Framework is now also able to instrument privilege checks, both
to deny privileges otherwise granted (mac_priv_check()), and to grant
privileges otherwise denied (mac_priv_grant()), permitting MAC Policy
modules to implement privilege models, as well as control a much broader
range of system behavior in order to constrain processes running with
root privilege.

The suser() and suser_cred() functions remain implemented, now in terms
of priv_check() and the PRIV_ROOT privilege, for use during the transition
and possibly continuing use by third party kernel modules that have not
been updated.  The PRIV_DRIVER privilege exists to allow device drivers to
check privilege without adopting a more specific privilege identifier.

This change does not modify the actual security policy, rather, it
modifies the interface for privilege checks so changes to the security
policy become more feasible.

Sponsored by:		nCircle Network Security, Inc.
Obtained from:		TrustedBSD Project
Discussed on:		arch@
Reviewed (at least in part) by:	mlaier, jmg, pjd, bde, ceri,
			Alex Lyashkov <umka at sevcity dot net>,
			Skip Ford <skip dot ford at verizon dot net>,
			Antoine Brodin <antoine dot brodin at laposte dot net>
2006-11-06 13:37:19 +00:00
pjd
c524521d2f Typo, 'from' vnode is locked here, not 'to' vnode. 2006-11-04 23:57:02 +00:00
rrs
f19063382b This commits the remake in kern/ make sysent to get
the correct syscalls.master's $FreeBSD$ tag record and
a make sysent in sys/compat/freebsd32. Thanks Ruslan
for pointing out the steps I missed :-0
Approved by:	gnn
2006-11-03 18:57:49 +00:00
rrs
3d3e3f2242 Ok, here it is, we finally add SCTP to current. Note that this
work is not just mine, but it is also the works of Peter Lei
and Michael Tuexen. They both are my two key other developers
working on the project.. and they need ata-boy's too:
****
peterlei@cisco.com
tuexen@fh-muenster.de
****
I did do a make sysent which updated the
syscall's and sysproto.. I hope that is correct... without
it you don't build since we have new syscalls for SCTP :-0

So go out and look at the NOTES, add
option SCTP (make sure inet and inet6 are present too)
and play with SCTP.

I will see about comitting some test tools I have after I
figure out where I should place them. I also have a
lib (libsctp.a) that adds some of the missing socketapi
functions that I need to put into lib's.. I will talk
to George about this :-)

There may still be some 64 bit issues in here, none of
us have a 64 bit processor to test with yet.. Michael
may have a MAC but thats another beast too..

If you have a mac and want to use SCTP contact Michael
he maintains a web site with a loadable module with
this code :-)

Reviewed by:	gnn
Approved by:	gnn
2006-11-03 15:23:16 +00:00
jb
9658161dc0 Always init the console before trying to cnadd it to
avoid the case where the console name isn't set and
cnadd wants to use printf to complain about it.
2006-11-03 06:23:53 +00:00
andre
429c8e9145 Use the improved m_uiotombuf() function instead of home grown sosend_copyin()
to do the userland to kernel copying in sosend_generic() and sosend_dgram().

sosend_copyin() is retained for ZERO_COPY_SOCKETS which are not yet supported
by m_uiotombuf().

Benchmaring shows significant improvements (95% confidence):
 66% less cpu (or 2.9 times better) with new sosend vs. old sosend (non-TSO)
 65% less cpu (or 2.8 times better) with new sosend vs. old sosend (TSO)

(Sender AMD Opteron 852 (2.6GHz) with em(4) PCI-X-133 interface and receiver
DELL Poweredge SC1425 P-IV Xeon 3.2GHz with em(4) LOM connected back to back
at 1000Base-TX full duplex.)

Sponsored by:	TCP/IP Optimization Fundraise 2005
MFC after:	3 month
2006-11-02 17:45:28 +00:00
andre
d1cc5b22d7 Rename m_getm() to m_getm2() and rewrite it to allocate up to page sized
mbuf clusters.  Add a flags parameter to accept M_PKTHDR and M_EOR mbuf
chain flags.  Provide compatibility macro for m_getm() calling m_getm2()
with M_PKTHDR set.

Rewrite m_uiotombuf() to use m_getm2() for mbuf allocation and do the
uiomove() in a tight loop over the mbuf chain.  Add a flags parameter to
accept mbuf flags to be passed to m_getm2().  Adjust all callers for the
extra parameter.

Sponsored by:	TCP/IP Optimization Fundraise 2005
MFC after:	3 month
2006-11-02 17:37:22 +00:00
andre
42a69dcade Rewrite kern_sendfile() to work in two loops, the inner which turns as many
VM pages into mbufs as it can -- up to the free send socket buffer space.
The outer loop then drops the whole mbuf chain into the send socket buffer,
calls tcp_output() on it and then waits until 50% of the socket buffer are
free again to repeat the cycle. This way tcp_output() gets the full amount
of data to work with and can issue up to 64K sends for TSO to chop up in
the network adapter without using any CPU cycles. Thus it gets very efficient
especially with the readahead the VM and I/O system do.

The previous sendfile(2) code simply looped over the file, turned each 4K
page into an mbuf and sent it off. This had the effect that TSO could only
generate 2 packets per send instead of up to 44 at its maximum of 64K.

Add experimental SF_MNOWAIT flag to sendfile(2) to return ENOMEM instead of
sleeping on mbuf allocation failures.

Benchmarking shows significant improvements (95% confidence):
 45% less cpu (or 1.81 times better) with new sendfile vs. old sendfile (non-TSO)
 83% less cpu (or 5.7 times better) with new sendfile vs. old sendfile (TSO)

(Sender AMD Opteron 852 (2.6GHz) with em(4) PCI-X-133 interface and receiver
DELL Poweredge SC1425 P-IV Xeon 3.2GHz with em(4) LOM connected back to back
at 1000Base-TX full duplex.)

Sponsored by:	TCP/IP Optimization Fundraise 2005
MFC after:	3 month
2006-11-02 16:53:26 +00:00
jhb
2a13022ca0 Increment nb_allocated while holding the pt_mtx lock to avoid races. 2006-11-01 16:50:13 +00:00
jhb
49d3deb803 Comment and style tweak. 2006-11-01 16:48:33 +00:00
jb
d2bd807356 Add a cnputs() function to write a string to the console with
a lock to prevent interspersed strings written from different CPUs
at the same time.

To avoid putting a buffer on the stack or having to malloc one,
space is incorporated in the per-cpu structure. The buffer
size if 128 bytes; chosen because it's the next power of 2 size
up from 80 characters.

String writes to the console are buffered up the end of the line
or until the buffer fills. Then the buffer is flushed to all
console devices.

Existing low level console output via cnputc() is unaffected by
this change. ithread calls to log() are also unaffected to avoid
blocking those threads.

A minor change to the behaviour in a panic situation is that
console output will still be buffered, but won't be written to
a tty as before. This should prevent interspersed panic output
as a number of CPUs panic before we end up single threaded
running ddb.

Reviewed by:	scottl, jhb
MFC after:	2 weeks
2006-11-01 04:54:51 +00:00
pjd
036e929548 Add gjournal specific code to the UFS file system:
- Add FS_GJOURNAL flag which enables gjournal support on a file system.
- Add cg_unrefs field to the cylinder group structure which holds
  number of unreferenced (orphaned) inodes in the given cylinder group.
- Add fs_unrefs field to the super block structure which holds
  total number of unreferenced (orphaned) inodes.
- When file or a directory is orphaned (last reference is removed, but
  object is still open), increase fs_unrefs and cg_unrefs fields,
  which is a hint for fsck in which cylinder groups looks for such
  (orphaned) objects.
- When file is last closed, decrease {fs,cg}_unrefs fields.
- Add VV_DELETED vnode flag which points at orphaned objects.

Sponsored by:	home.pl
2006-10-31 21:48:54 +00:00
pjd
d5cc909451 Add a new I/O request - BIO_FLUSH, which basically tells providers below to
flush their caches. For now will mostly be used by disks to flush their
write cache.

Sponsored by:	home.pl
2006-10-31 21:11:21 +00:00
alc
ce403a9391 Refactor vfs_setdirty(), creating vfs_setdirty_locked_object().
Call vfs_setdirty_locked_object() from vfs_busy_pages() instead of
vfs_setdirty(), thereby eliminating a second acquisition and release
of the same vm object lock.
2006-10-29 00:04:39 +00:00
alc
b944aa0079 In bufdone_finish() restrict the acquisition and release of the page
queues lock to BIO_READ operations.  Recent changes to the implementation
of the per-page flags have eliminated the need for the page queues lock
in the other cases.
2006-10-28 19:16:57 +00:00
davidxu
ce90069697 Remove member p_procscopegrp which is no longer used by libthr. 2006-10-27 05:45:44 +00:00
jb
f82c799735 Make KSE a kernel option, turned on by default in all GENERIC
kernel configs except sun4v (which doesn't process signals properly
with KSE).

Reviewed by:	davidxu@
2006-10-26 21:42:22 +00:00
kib
33638419ea The attempt to rename "." with MAC framework compiled in would cause attempt
to twice unlock the vnode. Check that ni_vp and ni_dvp are different before
doing second unlock.

Reviewed by:	rwatson
Approved by:	pjd (mentor)
MFC after:	1 week
2006-10-26 13:20:28 +00:00
rwatson
55067e6cf7 Increase usefulness of "show malloc" by moving from displaying the basic
counters of allocs/frees/use for each malloc type to calculating InUse,
MemUse, and Requests as displayed by the userspace vmstat -m.  This is
more useful when debugging malloc(9)-related memory leaks, where the
count of allocs/frees may not usefully reflect that current memory
allocation (i.e., when highly variable size allocations occur with the
same malloc type, such as with contigmalloc).

MFC after:			3 days
Limitations observed by:	scottl
2006-10-26 10:17:13 +00:00
davidxu
329f9b6294 Optimize umtx_lock_pi() a bit by moving some heavy code out of the loop,
make a fast path when a umtx_pi can be allocated without being blocked.
2006-10-26 09:33:34 +00:00
davidxu
c7a2917f38 In order to eliminate a branch, convert opcode to unsigned integer. 2006-10-25 06:38:46 +00:00
davidxu
6f6aaf471b Eliminate an unnecessary `if' statement. 2006-10-25 06:28:23 +00:00
davidxu
4889cc1ac4 Move sigqueue_take() call into proc_reparent(), this fixed bugs where
proc_reparent() is called but sigqueue_take() is forgotten.
2006-10-25 06:18:04 +00:00
davidxu
c117e9447c Protect sigqueue_take() call by child process's lock, it fixed a
potential race with ptrace 'attach' which changes parent of the
child process.
2006-10-24 12:04:21 +00:00
phk
e74f0ab140 Better naming of fattime conversion functions, they do convert to timespec
after all.

Add 'utc' argument to control if fattimestamps are on UTC or local timezone
calendar.
2006-10-24 10:27:23 +00:00
alc
5d9c66a3f8 The page queues lock is no longer required by vm_page_busy() or
vm_page_wakeup().  Reduce or eliminate its use accordingly.
2006-10-22 21:18:48 +00:00
phk
cc74fcabb4 Add two new functions to convert FAT filesystem format timestamps
to and from struct timespec, to replace the crummy conversion
function which have been copy&pasted into three different
filesystems already.

Apart from general crummyness as indicated by code like:

	for (year = 1970;; year++) {
		inc = year & 0x03 ? 365 : 366;
		if (days < inc)
			break;
		days -= inc;
	}

They also contain specialized crummyness which tries to compensate
for the general crummyness by caching recent conversion results,
with no regard for locking or consistency.

These replacement functions are smaller, O(1) and handle the Y2.1K
leap-year correctly.

Ideally, these functions should live in a module of their own,
which the three offending filesystems would depend on, but the
size is 877 bytes of code (on i386), so that would be false
economy.
2006-10-22 18:19:08 +00:00
rwatson
7beaaf5cd2 Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h
begun with a repo-copy of mac.h to mac_framework.h.  sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.

This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.

Obtained from:	TrustedBSD Project
Sponsored by:	SPARTA
2006-10-22 11:52:19 +00:00
alc
cbcb760109 Replace PG_BUSY with VPO_BUSY. In other words, changes to the page's
busy flag, i.e., VPO_BUSY, are now synchronized by the per-vm object
lock instead of the global page queues lock.
2006-10-22 04:28:14 +00:00
davidxu
5bbfc34004 Use macro TAILQ_FOREACH_SAFE instead of expanding it. 2006-10-22 00:09:41 +00:00
davidxu
df9c81e665 Since revision 1.333 of kern_sig.c no longer uses P_WEXIT, the change
opened a race window which can cause memory leak in signal queue.
Here we free memory for signal queue when process state is set to
PRS_ZOMBIE.
2006-10-21 23:59:15 +00:00
jhb
4cc0b2f46e Remove the check that prevented signals from being delivered to exiting
processes.  It was originally added back when support for Linux threads
(and thus shared sigacts objects) was added, but no one knows why.  My
guess is that at some point during the Linux threads patches, the sigacts
object was torn down during exit1(), so this check was added to prevent
a panic for that race.  However, the stuff that was actually committed to
the tree doesn't teardown sigacts until wait() making the above race moot.
Re-allowing signals here lets one interrupt a NFS request during process
teardown (such as closing descriptors) on an interruptible mount.

Requested by:	kib (long time ago)
MFC after:	1 week
2006-10-20 16:19:21 +00:00
kib
5f5bf9dadc Fix the race between devfs_fp_check and devfs_reclaim. Derefence the
vnode' v_rdev and increment the dev threadcount , as well as clear it
(in devfs_reclaim) under the dev_lock().

Reviewed by:	tegge
Approved by:	pjd (mentor)
2006-10-20 07:59:50 +00:00
bde
57b90fd011 kern_intr.c:
- Count (scheduling of) software interrupts (SWIs) as SWIs, not as
  hardware interrupts.
- Don't count (scheduling of) delayed SWIs as interrupts at all, since
  in the delayed case it is expected that there are many more scheduling
  calls than handling calls.  Perhaps all interrupts should be counted
  only when they are handled, but it is only counts of delayed SWIs that
  shouldn never be combined with the other counts.

subr_trap.c:
- Count (handling of) Asynchronous System Traps (ASTs) as traps, not as
  software interrupts.

Before these changes, the counter for SWIs only counted ASTs, and SWIs
weren't counted separately, but a subcounter for ASTs alone is less
needed than for most other exception sources.

4.4BSD-Lite uses the counters for similar things (actually matching
their names) on its main arches (hp300, ..., !i386) where more of the
exceptions are in hardware.
2006-10-18 04:48:09 +00:00
davidxu
9e7c324e6e Regenerate. 2006-10-17 02:28:58 +00:00
davidxu
bb5a3880aa o Add keyword volatile for user mutex owner field.
o Fix type consistent problem by using type long for old
  umtx and wait channel.
o Rename casuptr to casuword.
2006-10-17 02:24:47 +00:00
netchild
183bd5a34b MFP4 (with some minor changes):
Implement the linux_io_* syscalls (AIO). They are only enabled if the native
AIO code is available (either compiled in to the kernel or as a module) at
the time the functions are used. If the AIO stuff is not available there
will be a ENOSYS.

From the submitter:
---snip---
DESIGN NOTES:

1. Linux permits a process to own multiple AIO queues (distinguished by
   "context"), but FreeBSD creates only one single AIO queue per process.
   My code maintains a request queue (STAILQ of queue(3)) per "context",
   and throws all AIO requests of all contexts owned by a process into
   the single FreeBSD per-process AIO queue.

   When the process calls io_destroy(2), io_getevents(2), io_submit(2) and
   io_cancel(2), my code can pick out requests owned by the specified context
   from the single FreeBSD per-process AIO queue according to the per-context
   request queues maintained by my code.

2. The request queue maintained by my code stores contrast information between
   Linux IO control blocks (struct linux_iocb) and FreeBSD IO control blocks
   (struct aiocb). FreeBSD IO control block actually exists in userland memory
   space, required by FreeBSD native aio_XXXXXX(2).

3. It is quite troubling that the function io_getevents() of libaio-0.3.105
   needs to use Linux-specific "struct aio_ring", which is a partial mirror
   of context in user space. I would rather take the address of context in
   kernel as the context ID, but the io_getevents() of libaio forces me to
   take the address of the "ring" in user space as the context ID.

   To my surprise, one comment line in the file "io_getevents.c" of
   libaio-0.3.105 reads:

             Ben will hate me for this

REFERENCE:

1. Linux kernel source code:   http://www.kernel.org/pub/linux/kernel/v2.6/
   (include/linux/aio_abi.h, fs/aio.c)

2. Linux manual pages:         http://www.kernel.org/pub/linux/docs/manpages/
   (io_setup(2), io_destroy(2), io_getevents(2), io_submit(2), io_cancel(2))

3. Linux Scalability Effort:   http://lse.sourceforge.net/io/aio.html
   The design notes:           http://lse.sourceforge.net/io/aionotes.txt

4. The package libaio, both source and binary:
       http://rpmfind.net/linux/rpm2html/search.php?query=libaio
   Simple transparent interface to Linux AIO system calls.

5. Libaio-oracle:              http://oss.oracle.com/projects/libaio-oracle/
   POSIX AIO implementation based on Linux AIO system calls (depending on
   libaio).
---snip---

Submitted by:	Li, Xiao <intron@intron.ac>
2006-10-15 14:22:14 +00:00
ru
1bc0f5fc76 Prevent IOC_IN with zero size argument (this is only supported
if backward copatibility options are present) from attempting
to free memory that wasn't allocated.  This is an old bug, and
previously it would attempt to free a null pointer.  I noticed
this bug when working on the previous revision, but forgot to
fix it.

Security:	local DoS
Reported by:	Peter Holm
MFC after:	3 days
2006-10-14 19:01:55 +00:00
trhodes
e7d9ab0897 Close a race condition where num can be larger than tmp, giving the user
too large of a boundary.

Reported by:	Ilja Van Sprundel
2006-10-14 10:30:14 +00:00
tegge
2d3b850cd6 Wait for thread count to reach zero in destroy_devl() even when no purge
method is defined, to avoid memory being modified after free.

Temporarily increase refcount in destroy_devl() to avoid a double free
if dev_rel() is called while waiting for thread count to reach zero.
2006-10-13 20:49:24 +00:00
glebius
7fc0346961 Improve ktr(4) logging for callout(9) subsystem. Log all inserts and
removals, including failures, into the callwheel.

XXX: Most of the CTR() macros are called with callout_lock spin mutex
held, thus won't be logged into file, if KTR_ALQ is used. Moving the
CTR() macros out from the spinlocked code would require copying of all
arguments. I'm too lazy to do this.
2006-10-11 14:57:03 +00:00
davidxu
c5bda619e9 Implement 32bit umtx_lock and umtx_unlock system calls, these two system
calls are not used by libthr in RELENG_6 and HEAD, it is only used by
the libthr in RELENG-5, the _umtx_op system call can do more incremental
dirty works than these two system calls without having to introduce new
system calls or throw away old system calls when things are going on.
2006-10-06 08:22:08 +00:00
davidxu
0fa66af83d Move some declaration of 32-bit signal structures into file
freebsd32-signal.h, implement sigtimedwait and sigwaitinfo system calls.
2006-10-05 01:56:11 +00:00
mbr
7d21b9894a Back out part of rev. 1.149. While adding a workaround in ptcopen() to
avoid leaked ptys works fine, this opens a possible security hole.

Submitted by:	bde
MFC after:	3 days
2006-10-04 05:43:39 +00:00
rwatson
dafc0585f0 Regenerate. 2006-10-03 20:48:11 +00:00
rwatson
b54cd1fd3c Audit creat() system call (compat code), and change type for getpagesize(),
which isn't actually being audited anyway.

MFC after:	3 days
Obtained from:	TrustedBSD Project
2006-10-03 20:46:52 +00:00
kib
38b2c3b26f Fix the remaining race in the revs. 1.232, 1,233 that could occur during
unmount when mp structure is reused while waiting for coveredvp lock.
Introduce struct mount generation count, increment it on each reuse and
compare the generations before and after obtaining the coveredvp lock.

Reviewed by:	tegge, pjd
Approved by:	pjd (mentor)
MFC after:	2 weeks
2006-10-03 10:47:04 +00:00
phk
638e020bc6 Use utc_offset() where applicable, and hide the internals of it
as static variables.
2006-10-02 18:23:37 +00:00
phk
52fd275504 Introduce utc_offset() to capture a calculation currently done all over the
place.
2006-10-02 16:17:23 +00:00
phk
7a874f961d Move tz_minuteswest and tz_dsttime to subr_clock.c 2006-10-02 16:06:26 +00:00
phk
84cc0f277f Second part of a little cleanup in the calendar/timezone/RTC handling.
Split subr_clock.c in two parts (by repo-copy):
   subr_clock.c contains generic RTC and calendaric stuff. etc.
   subr_rtc.c contains the newbus'ified RTC interface.

Centralize the machdep.{adjkerntz,disable_rtc_set,wall_cmos_clock}
sysctls and associated variables into subr_clock.c.  They are
not machine dependent and we have generic code that relies on being
present so they are not even optional.
2006-10-02 15:42:02 +00:00
phk
50c81b8a9a First part of a little cleanup in the calendar/timezone/RTC handling.
Move relevant variables to <sys/clock.h> and fix #includes as necessary.

Use libkern's much more time- & spamce-efficient BCD routines.
2006-10-02 12:59:59 +00:00
kib
aa82c72808 Correct the comment: numvnodes is decreased on vdestroying the vnode.
OKed by:	tegge
Approved by:	pjd (mentor)
MFC after:	1 week
2006-10-02 07:25:58 +00:00
tegge
4d26d3d4d7 If the buffer lock has waiters after the buffer has changed identity then
getnewbuf() needs to drop the buffer in order to wake waiters that might
sleep on the buffer in the context of the old identity.
2006-10-02 02:06:27 +00:00
mbr
7e18a19928 Readd rev. 1.145 because of vfs bugs and races near revoke(). Until they
are fixed we can't free any slaves. Add a workaround to not to leak ptys
by number.
2006-09-30 22:51:05 +00:00
pjd
1358d3f1ae Remove duplicated $FreeBSD$. 2006-09-30 16:33:29 +00:00
mbr
b63df50b58 Any call of tty_close() with a tty refcount of <= 1 is wrong and we will
free the tty in this case. This is a workaround until the underlaying
devfs/tty problems are fixed.

MFC after:	1 day
2006-09-30 08:11:51 +00:00
mbr
06d2753424 Free tty struct after last close. This should fix the pty-leak by numbers.
Remove workarounds for tty_refcount beeing 0, this will be fixed differently
later.
2006-09-29 09:53:19 +00:00
mbr
ed48249ff4 Free tty struct after last close. This should fix the pty-leak by numbers.
Remove workarounds for tty_refcount beeing 0, this will be fixed differently
later.

Back out rev 1.145 since we initialize the tty struct from scratch and bad
things can't happen anymore.
2006-09-29 09:52:57 +00:00
ru
4ef62e4ca5 Fix our ioctl(2) implementation when the argument is "int". New
ioctls passing integer arguments should use the _IOWINT() macro.
This fixes a lot of ioctl's not working on sparc64, most notable
being keyboard/syscons ioctls.

Full ABI compatibility is provided, with the bonus of fixing the
handling of old ioctls on sparc64.

Reviewed by:	bde (with contributions)
Tested by:	emax, marius
MFC after:	1 week
2006-09-27 19:57:02 +00:00
mbr
24e5e6df3a Move Giant up even further since P_CONTROLT isn't really fully locked
yet (p_flag is, but P_CONTROLT isn't really).

Submitted by:	jhb
2006-09-27 16:42:10 +00:00
mbr
b07486e6c4 Use ctty instead of just returning. ctty just has a simple open that
returns ENXIO.

Submitted by:	jhb
2006-09-27 16:41:15 +00:00
tegge
89ea8a9b1b Reduce fluctuations of mnt_flag to allow unlocked readers to get a
slightly more consistent view.
2006-09-26 04:20:09 +00:00
tegge
a802c35eb4 Don't restore MNT_QUOTA bit in mnt_flag after a failed mount with
MNT_UPDATE flag, closing a race between nmount() and quotactl().
2006-09-26 04:18:36 +00:00
tegge
f42473d76b Add mnt_noasync counter to better handle interleaved calls to nmount(),
sync() and sync_fsync() without losing MNT_ASYNC.  Add MNTK_ASYNC flag
which is set only when MNT_ASYNC is set and mnt_noasync is zero, and
check that flag instead of MNT_ASYNC before initiating async io.
2006-09-26 04:15:59 +00:00
tegge
b500b0ae92 Don't restore mnt_kern_flag on failed MNT_UPDATE mount, it can race
with dounmount(), causing loss of MNTK_UNMOUNT flag.
2006-09-26 04:15:04 +00:00
tegge
83154f853d Use mount interlock to protect all changes to mnt_flag and mnt_kern_flag.
This eliminates a race where MNT_UPDATE flag could be lost when nmount()
raced against sync(), sync_fsync() or quotactl().
2006-09-26 04:12:49 +00:00
rwatson
aa39d5a21d SI_ORDER_THIRD + 2, not SI_ORDER_FOURTH + 2.
MFC after:	3 days
Submitted by:	mlaier
2006-09-26 00:15:56 +00:00
rwatson
ffe75b8d42 Add "FreeBSD" trademark statement to copyright section of boot messages.
MFC after:	3 days
Approved by:	core, board at FreeBSDFoundation dot org
2006-09-25 23:19:01 +00:00