Commit Graph

6990 Commits

phk
c92feb226d Various minor details:
	Give the HZ/overflow check a 10% margin.
	Eliminate bogus newline.
	If timecounters have equal quality, prefer higher frequency.

Some inspiration from:	bde
2003-11-13 10:03:58 +00:00
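
A minimal sketch of the selection rule described above: prefer the higher-quality timecounter, and on a quality tie prefer the higher frequency. The struct and names are illustrative stand-ins, not the kernel's timecounter API.

    #include <stdint.h>

    struct tc_sketch {
        const char *name;
        int         quality;    /* higher is better */
        uint64_t    frequency;  /* counter frequency in Hz */
    };

    static const struct tc_sketch *
    tc_prefer(const struct tc_sketch *a, const struct tc_sketch *b)
    {
        if (a->quality != b->quality)
            return (a->quality > b->quality ? a : b);
        /* Equal quality: the higher-frequency counter wins. */
        return (a->frequency >= b->frequency ? a : b);
    }
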
jhb
989e0408dd - Close a race where a thread on another CPU could release a contested lock
  and empty its turnstile while the blocking threads still pointed to the
  turnstile.  If the thread on the first CPU blocked on a lock owned by
  one of the threads blocked on the turnstile just woken up, then the
  first CPU could try to manipulate a bogus thread queue in the turnstile
  during priority propagation.
- Update locking notes for ts_owner and always clear ts_owner, not just
  under INVARIANTS.

Tested by:      sam (1)
2003-11-12 23:48:42 +00:00
mckusick
b2ca74654c At the request of several developers, restore the DIAGNOSTIC code
deleted in 1.81. Increase the initial timeout limit to 2ms to
eliminate spurious messages of excessive timeouts in the NFS
client code.

Requested by:	Poul-Henning Kamp <phk@phk.freebsd.dk>
Requested by:	Mike Silbersack <silby@silby.com>
Requested by:	Sam Leffler <sam@errno.com>
2003-11-12 22:28:27 +00:00
rwatson
3f5efde5af Mark __mac_get_pid() as MPSAFE in the comment, as it runs without
Giant and is also MPSAFE.

Push Giant further down into __mac_get_fd() and __mac_set_fd(),
grabbing it only for constrained regions dealing with VFS, and
dropping it entirely for operations related to labeling of pipes.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-11-12 22:19:15 +00:00
peter
e97c7f33cb MNAMELEN is back to an int again after Kirk's statfs commit
kern/vfs_mount.c:1305: warning: signed size_t format, different type arg (arg 4)
*** Error code 1
2003-11-12 17:09:12 +00:00
jhb
b996af9fb8 Fix a typo in a comment.
Submitted by:	das
2003-11-12 14:55:45 +00:00
phk
70388f65e9 Replace B_PHYS conditional assignment to bio_offset with KASSERT check
to see that the originating code already did it right.
2003-11-12 10:27:06 +00:00
mckusick
a7fc450278 Update the five files derived from /sys/kern/syscalls.master
after the additions made for the new statfs structure (version
1.157). These must be updated in a separate checkin after
syscalls.master has been checked in so that they reflect its
new CVS identity. As these are purely derived files, it is not
clear to me why they are under CVS at all. I presume that it has
something to do with having `make world' operate properly.
2003-11-12 08:09:19 +00:00
mckusick
6a4c30bccd Update the statfs structure with 64-bit fields to allow
accurate reporting of multi-terabyte filesystem sizes.

You should build and boot a new kernel BEFORE doing a `make world'
as the new kernel will know about binaries using the old statfs
structure, but an old kernel will not know about the new system
calls that support the new statfs structure. Running an old kernel
after a `make world' will cause programs such as `df' that do a
statfs system call to fail with a bad system call.

Reviewed by:	Bruce Evans <bde@zeta.org.au>
Reviewed by:	Tim Robbins <tjr@freebsd.org>
Reviewed by:	Julian Elischer <julian@elischer.org>
Reviewed by:	the hordes of <arch@freebsd.org>
Sponsored by:   DARPA & NAI Labs.
2003-11-12 08:01:40 +00:00
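
A sketch of the nature of the change, with two representative fields only (the real struct statfs has many more; the widened types are the point):

    #include <stdint.h>

    /* before: long (32-bit on ILP32) counters overflow on multi-TB volumes */
    struct statfs_old_sketch {
        long f_blocks;      /* total data blocks in filesystem */
        long f_bavail;      /* free blocks available to non-superuser */
    };

    /* after: 64-bit counters report multi-terabyte sizes accurately */
    struct statfs_new_sketch {
        uint64_t f_blocks;
        int64_t  f_bavail;  /* signed: may go negative with the reserve */
    };
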
rwatson
77ed6e2d1c Modify the MAC Framework so that instead of embedding a (struct label)
in various kernel objects to represent security data, we embed a
(struct label *) pointer, which now references labels allocated using
a UMA zone (mac_label.c).  This allows the size and shape of struct
label to be varied without changing the size and shape of these kernel
objects, which become part of the frozen ABI with 5-STABLE.  This opens
the door for boot-time selection of the number of label slots, and hence
changes to the bound on the number of simultaneous labeled policies
at boot-time instead of compile-time.  This also makes it easier to
embed label references in new objects as required for locking/caching
with fine-grained network stack locking, such as inpcb structures.

This change also moves us further in the direction of hiding the
structure of kernel objects from MAC policy modules, not to mention
dramatically reducing the number of '&' symbols appearing in both the
MAC Framework and MAC policy modules, and improving readability.

While this results in minimal performance change with MAC enabled, it
will observably shrink the size of a number of critical kernel data
structures (e.g., struct vnode, struct socket) for the !MAC case, and
should have a small (but measurable) performance benefit due to memory
conservation and the reduced cost of zeroing memory.

NOTE: Users of MAC must recompile their kernel and all MAC modules as a
result of this change.  Because this is an API change, third party
MAC modules will also need to be updated to make less use of the '&'
symbol.

Suggestions from:	bmilekic
Obtained from:		TrustedBSD Project
Sponsored by:		DARPA, Network Associates Laboratories
2003-11-12 03:14:31 +00:00
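
Schematically, the change replaces embedded label storage with a pointer (types trimmed to the relevant member; not the real struct vnode):

    /* before: storage embedded, so the label size is baked into the ABI */
    struct label_sketch { void *l_slots[4]; };

    struct vnode_before { struct label_sketch  v_label; /* ... */ };

    /* after: only a pointer is frozen into the ABI; the label itself
     * comes from a UMA zone, so the slot count can vary at boot time */
    struct vnode_after  { struct label_sketch *v_label; /* ... */ };
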
kan
9352a05d40 1. Consolidate mount struct allocation/destruction into common code in
the vfs_mount_alloc/vfs_mount_destroy functions and take care to completely
destroy the mount point along with its locks. The mount struct has grown
in complexity recently, and depending on each failure path to destroy it
completely isn't working anymore.

2. Eliminate the largely identical vfs_mount and vfs_nmount functions by
moving the code to handle both cases into a newly introduced vfs_domount
function.

3. Simplify nfs_mount_diskless to always expect an allocated mount
struct and never attempt an allocation/destruction itself. The
vfs_allocroot allocation was there to support 'magic' swap space
configuration for diskless clients that was already removed by PHK some
time ago.

4. Include vfs_buildopts cleanups by Peter Edwards to validate the
sanity of nmount parameters passed from userland.

Submitted by:  (4) Peter Edwards <peter.edwards@openet-telecom.com>
Reviewed by:    rwatson
2003-11-12 02:54:47 +00:00
jhb
6cc1f7e330 Add an implementation of turnstiles and change the sleep mutex code to use
turnstiles to implement blocking instead of implementing a thread queue
directly.  These turnstiles are somewhat similar to those used in Solaris 7
as described in Solaris Internals but are also different.

Turnstiles do not come out of a fixed-sized pool.  Rather, each thread is
assigned a turnstile when it is created that it frees when it is destroyed.
When a thread blocks on a lock, it donates its turnstile to that lock to
serve as a queue of blocked threads.  The queue associated with a given lock
is found by a lookup in a simple hash table.  The turnstile itself is
protected by a lock associated with its entry in the hash table.  This
means that sched_lock is no longer needed to contest on a mutex.  Instead,
sched_lock is only used when manipulating run queues or thread priorities.
Turnstiles also implement priority propagation inherently.

Currently turnstiles only support mutexes.  Eventually, however, turnstiles
may grow two queues to support a non-sleepable reader/writer lock
implementation.  For more details, see the comments in sys/turnstile.h and
kern/subr_turnstile.c.

The two primary advantages from the turnstile code include: 1) the size
of struct mutex shrinks by four pointers as it no longer stores the
thread queue linkages directly, and 2) less contention on sched_lock in
SMP systems including the ability for multiple CPUs to contend on different
locks simultaneously (not that this last detail is necessarily that much of
a big win).  Note that 1) means that this commit is a kernel ABI breaker,
so don't mix old modules with a new kernel and vice versa.

Tested on:	i386 SMP, sparc64 SMP, alpha SMP
2003-11-11 22:07:29 +00:00
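
A compilable sketch of the lock-to-turnstile lookup described above; the structure, hash, and table size are illustrative stand-ins for sys/turnstile.h, and the per-chain locking is omitted.

    #include <stddef.h>
    #include <stdint.h>

    struct ts_sketch {
        const void       *ts_lockobj;   /* the lock owning this queue */
        struct ts_sketch *ts_hash;      /* hash chain linkage */
    };

    #define TC_TABLESIZE  128           /* illustrative power of two */
    #define TC_HASH(lock) ((((uintptr_t)(lock)) >> 8) & (TC_TABLESIZE - 1))

    static struct ts_sketch *ts_chains[TC_TABLESIZE];

    /* Find the queue of threads blocked on a contested lock, if any. */
    static struct ts_sketch *
    ts_lookup(const void *lock)
    {
        struct ts_sketch *ts;

        for (ts = ts_chains[TC_HASH(lock)]; ts != NULL; ts = ts->ts_hash)
            if (ts->ts_lockobj == lock)
                return (ts);
        return (NULL);
    }
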
jkoshy
ac83b0ec2b Bound the number of iterations a thread can perform inside
ktr_resize_pool(); this eliminates a potential livelock.

Return ENOSPC only if we encountered an out-of-memory condition when
trying to increase the pool size.

Reviewed by:	jhb, bde (style)
2003-11-11 09:09:26 +00:00
jkoshy
edc6e45a50 Have utrace(2) return ENOMEM if malloc() fails. Document this error
return in its manual page.

Reviewed by:	jhb
2003-11-11 04:54:11 +00:00
alc
148f9321bd - Revision 1.469 of vfs_subr.c resulted in the buf's b_object field being
   consistently initialized.  Consequently, a number of conditionals that
   checked the validity of b_object before passing it to VM_OBJECT_LOCK()
   and VM_OBJECT_UNLOCK() are no longer needed.
2003-11-11 04:45:37 +00:00
rwatson
ce4ce483f9 Whitespace sync to MAC branch, expand comment at the head of the file. 2003-11-11 03:40:04 +00:00
alfred
8837bee815 Fix a bug where the taskqueue kproc was being parented by init
because RFNOWAIT was being passed to kproc_create.

The result was that shutdown took quite a bit longer because this
errant "child" would not respond to termination signals from init
at system shutdown.

RFNOWAIT disassociates itself from the caller by attaching to init
as a parent proc.  We could have had the taskqueue proc listen for
SIGKILL, but being able to SIGKILL a potentially critical system
process doesn't seem like a good idea.
2003-11-10 20:39:44 +00:00
tjr
3fb83070b6 When there are no free sem_undo structs available in semu_alloc(), only
free one sem_undo with un_cnt == 0 instead of all of them. This is a
temporary workaround until the SLIST_FOREACH_PREVPTR loop gets fixed so
that it doesn't cause cycles in semu_list when removing multiple adjacent
items. It might be easier to just use (doubly-linked) LISTs here instead
of complicated SLIST code to achieve O(1) removals.

This bug manifested itself as a complete lockup under heavy semaphore use
by multiple processes with the SEM_UNDO flag set.

PR:		58984
2003-11-10 07:22:41 +00:00
marcel
21340f30b3 Change the clear_ret argument of get_mcontext() to be a flags argument.
Since all callers either passed 0 or 1 for clear_ret, define bit 0 in
the flags for use as clear_ret. Reserve bits 1, 2 and 3 for use by MI
code for possible (but unlikely) future use. The remaining bits are for
use by MD code.

This change is triggered by a need on ia64 to have another knob for
get_mcontext().
2003-11-09 20:31:04 +00:00
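
The flag layout, sketched with hypothetical macro names (the commit only fixes bit 0 as the old clear_ret boolean and reserves bits 1-3 for MI code):

    #define GET_MC_CLEAR_RET   0x0001   /* bit 0: was clear_ret != 0 */
    #define GET_MC_MI_RESERVED 0x000e   /* bits 1-3: reserved for MI use */
                                        /* all higher bits: MD-defined */

    /* a caller that used to pass clear_ret = 1 would now pass:
     *   get_mcontext(td, &mc, GET_MC_CLEAR_RET);
     */
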
bde
67e14c35e9 Quick fix for scaling of statclock ticks in the SMP case. As explained
in the log message for kern_sched.c 1.83 (which should have been
repo-copied to preserve history for this file), the (4BSD) scheduler
algorithm only works right if stathz is nearly 128 Hz.  The old
commit log said 64 Hz; the scheduler actually wants nearly 16 Hz
but there was a scale factor of 4 to give the requirement of 64 Hz,
and rev.1.83 changed the scale factor so that the requirement became
128 Hz.  The change of the scale factor was incomplete in the SMP
case.  Then scheduling ticks are provided by smp_ncpu CPUs, and the
scheduler cannot tell the difference between this and 1 CPU providing
scheduling ticks smp_ncpu times faster, so we need another scale
factor of smp_ncpu or an algorithm change.

This quick fix uses the scale factor without even trying to optimize
the runtime divisions required for this as is done for the other
scale factor.

The main algorithmic problem is the clamp on the scheduling tick counts.
This was 295; it is now approximately 295 * smp_ncpu.  When the limit
is reached, threads get free timeslices and scheduling becomes very
unfair to the threads that don't hit the limit.  The limit can be
reached and maintained in the worst case if the load average is larger
than (limit / effective_stathz - 1) / 2 = 0.65 now (was just 0.08 with
2 CPUs before this change), so there are algorithmic problems even for
a load average of 1.  Fortunately, the worst case isn't common enough
for the problem to be very noticeable (it is mainly for niced CPU hogs
competing with less nice CPU hogs).
2003-11-09 13:45:54 +00:00
tanimura
7eade05dfa - Implement selwakeuppri() which allows raising the priority of a
thread being woken up.  The thread woken up can run at a priority as
  high as after tsleep().

- Replace selwakeup()s with selwakeuppri()s and pass appropriate
  priorities.

- Add cv_broadcastpri() which raises the priority of the broadcast
  threads.  Used by selwakeuppri() if collision occurs.

Not objected in:	-arch, -current
2003-11-09 09:17:26 +00:00
sam
7f3b205cb8 o add a flags parameter to netisr_register that is used to specify
whether or not the isr needs to hold Giant when running; Giant-less
  operation is also controlled by the setting of debug_mpsafenet
o mark all netisr's except NETISR_IP as needing Giant
o add a GIANT_REQUIRED assertion to the top of netisr's that need Giant
o pickup Giant (when debug_mpsafenet is 1) inside ip_input before
  calling up with a packet
o change netisr handling so swi_net runs w/o Giant; instead we grab
  Giant before invoking handlers based on whether the handler needs Giant
o change netisr handling so that netisr's that are marked MPSAFE may
  have multiple instances active at a time
o add netisr statistics for packets dropped because the isr is inactive

Supported by:	FreeBSD Foundation
2003-11-08 22:28:40 +00:00
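
A sketch of the resulting dispatch policy (the stub declarations stand in for kernel facilities, and the names are illustrative, not the committed interface):

    struct mtx;                         /* opaque stand-in */
    extern struct mtx Giant;
    void mtx_lock(struct mtx *);
    void mtx_unlock(struct mtx *);

    #define NETISR_MPSAFE 0x0001        /* handler may run without Giant */

    struct netisr_sketch {
        void (*ni_handler)(void *m);
        int    ni_flags;
    };

    /* swi_net runs without Giant; acquire it per handler as required. */
    static void
    netisr_dispatch_sketch(struct netisr_sketch *ni, void *m)
    {
        int need_giant = (ni->ni_flags & NETISR_MPSAFE) == 0;

        if (need_giant)
            mtx_lock(&Giant);
        ni->ni_handler(m);
        if (need_giant)
            mtx_unlock(&Giant);
    }
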
davidxu
42bda3c853 Return a reasonable number for top or ps to display for an M:N thread.
Since there is no direct association between an M:N thread and a KSE,
a thread sometimes does not have a KSE; in that case, return the pctcpu
from its last KSE. This is not perfect, but it gives a good number to
display.
2003-11-08 03:03:17 +00:00
jhb
b6b663a274 Regen. 2003-11-07 20:30:30 +00:00
jhb
0ec5596a30 Mark ptrace(), ktrace(), utrace(), sysarch(), and issetugid() as MP safe.
The parts of these calls that are not yet MP safe acquire Giant explicitly.
2003-11-07 20:23:23 +00:00
rwatson
f74e2ef8ef Slight whitespace consistency improvement:
Trim trailing whitespace.
  Remove unmatched " " before ")".
2003-11-07 04:47:14 +00:00
jeff
6fc9686d6a - Somehow I botched my last commit. Add an extra ( to fix things up. I'm
still not sure how this happened.

Reported by:	ps
2003-11-06 07:56:01 +00:00
alc
adf47c347a - Delay the allocation of memory for the pipe mutex until we need it.
This avoids the need to free said memory in various error cases along
   the way.
2003-11-06 05:58:26 +00:00
alc
ddec4754a2 - Simplify pipespace() by eliminating the explicit creation of vm objects.
Instead, let the vm objects be lazily instantiated at fault time.  This
   results in the allocation of fewer vm objects and vm map entries due to
   aggregation in the vm system.
2003-11-06 05:08:12 +00:00
rwatson
c7fff281b1 Remove the flags argument from mac_externalize_*_label(), as it's not
passed into policies or used internally to the MAC Framework.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-11-06 03:42:43 +00:00
jeff
e150e86d4b - Remove the local definition of sched_pin and unpin. They are provided in
sched.h now.
 - Respect the td pin count.
2003-11-06 03:09:51 +00:00
sam
0927f68a45 o make debug_mpsafenet globally visible
o move it from subr_bus.c to netisr.c where it more properly belongs
o add NET_PICKUP_GIANT and NET_DROP_GIANT macros that will be used to
  grab Giant as needed when MPSAFE operation is enabled

Supported by:	FreeBSD Foundation
2003-11-05 23:42:51 +00:00
imp
535923b1ec Minor style(9) nit 2003-11-05 06:14:48 +00:00
jeff
9f266b0941 - It's ok if sched_runnable() has races in it; we don't need the sched_lock
here unless we have something on the assigned queue.
2003-11-05 05:30:12 +00:00
kan
36d60f3bb7 Remove mntvnode_mtx and replace it with a per-mountpoint mutex.
Introduce two new macros MNT_ILOCK(mp)/MNT_IUNLOCK(mp) to
operate on this mutex transparently.

Eventually the new mutex will protect more fields in
struct mount, not only the vnode list.

Discussed with: jeff
2003-11-05 04:30:08 +00:00
fjoe
160f13825c Back out the following revisions:
1.36      +73 -60    src/sys/compat/linux/linux_ipc.c
1.83      +102 -48   src/sys/kern/sysv_shm.c
1.8       +4 -0      src/sys/sys/syscallsubr.h

That change was intended to support vmware3, but the wantrem parameter
is useless because vmware3 uses SysV shared memory to talk to the X
server, and the X server is a native application. The patch worked
because the check for wantrem was not valid (wantrem and SHMSEG_REMOVED
were never checked for SHMSEG_ALLOCATED segments).

Add a kern.ipc.shm_allow_removed (integer, rw) sysctl (default 0) which,
when set to 1, allows removed segments to be returned by
shm_find_segment_by_shmid() and shm_find_segment_by_shmidx().

MFC after:	1 week
2003-11-05 01:53:10 +00:00
mckusick
609a7ca3c1 Get rid of DIAGNOSTIC that gives false positives on slow CPUs. 2003-11-04 08:03:11 +00:00
jeff
23d5b83434 - Add initial support for pinning and binding. 2003-11-04 07:45:41 +00:00
mckusick
993e4dcc97 Allow the bufdaemon and update daemon processes to skip the
waitrunningbufspace() calls so that they are always able to
proceed and clean up buffer space.

Submitted by:	Brian Fundakowski Feldman <green@freebsd.org>
2003-11-04 06:30:00 +00:00
sam
0faeaa0329 disable MPSAFE network drivers; we aren't ready yet 2003-11-04 02:01:42 +00:00
cognet
8a3fa16a0a I believe kbyanc@ really meant this in rev 1.58.
Use zpfind() to see if the process became a zombie if pfind() doesn't find it
and if the caller wants to know about process death, so that the caller knows
the process died even if it happened before the kevent was actually registered.

MFC after:	1 week
2003-11-04 01:41:47 +00:00
cognet
a3a383926f Do not attempt to report proc event if NOTE_EXIT has already been received.
This fixes a race condition (specifically with signal events) that could
lead to the kn being re-inserted into the list after it has been destroyed,
which is not something we want to happen.

PR:		kern/58258
2003-11-04 01:14:58 +00:00
jhb
da7c66355b Don't require INTR_FAST handlers to be exclusive in the MI layer. Instead,
let the MD code choose whether or not to implement such a policy.  The new
i386 interrupt code allows multiple FAST handlers for a given source for
example.  However, the code does not allow FAST and non-FAST handlers to be
mixed.
2003-11-03 22:42:58 +00:00
jhb
1d8b4454d7 Update spin lock order list for new i386 interrupt and SMP code. 2003-11-03 22:38:30 +00:00
rwatson
95e56a059a Unlock pipe mutex when failing MAC pipe ioctl access control check.
Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-11-03 17:58:23 +00:00
jeff
19f727b426 - Remove kseq_find(), we no longer scan other cpu's run queues when we go
idle.  They figure out that we're idle fast enough that the cache pollution
   introduced by scanning their run queue is more expensive than waiting
   a little longer.
 - Add kseq_setidle() to mark us as being idle.  Use this in place of
   kseq_find().
 - Remove kseq_load_highest(), kseq_find() was the only consumer of this
   interface.  kseq_balance() has its own customized version that finds the
   lowest and highest loads simultaneously.

Continuously told that this would be faster by:	terry
2003-11-03 03:27:22 +00:00
jeff
40aa6873ef - Remove the ksq_loads[] array. We are only interested in three counts,
the total load, the timeshare load, and the number of threads that can
   be migrated to another cpu.  Account for these separately.
 - Introduce a KSE_CAN_MIGRATE() macro which determines whether or not a KSE
   can be migrated to another CPU.  Currently, this only checks to see if
   we're an interrupt handler.  Eventually this will also be used to support
   CPU binding.
2003-11-02 10:56:48 +00:00
kan
618baf4714 Take care not to call vput if the thread used in the corresponding vget
wasn't curthread, i.e. when we receive a thread pointer to use
as a function argument. Use VOP_UNLOCK/vrele in these cases.

The only case known at the moment where td != curthread is
boot() calling sync with the thread0 pointer.

This fixes the panic on shutdown people have reported.
2003-11-02 04:52:53 +00:00
jeff
e8941020fa - In sched_prio() only force us onto the current queue if our priority is
being elevated (numerically smaller).
2003-11-02 04:25:59 +00:00
jeff
5c293e491d - Rename SCHED_PRI_NTHRESH to SCHED_SLICE_NTHRESH since it is only used in
slice assignment.  Add a comment describing what it does.
 - Remove a stale XXX comment, the nice should not impact the interactivity,
   nice adjustments only affect non-interactive tasks in ULE.
 - Don't allow nice -20 tasks to totally starve nice 0 tasks.  Give them at
   least SCHED_SLICE_MIN ticks.  We still allow nice 0 tasks to starve nice
   +20 tasks as intended.
2003-11-02 04:10:15 +00:00
jeff
4455a41346 - Remove uses of PRIO_TOTAL and replace them with SCHED_PRI_NRESV
- SCHED_PRI_NRESV does not have the off by one error in PRIO_TOTAL so we
   do not have to account for it in the few places that we use it.

Requested by:	bde
2003-11-02 03:49:32 +00:00
jeff
5fcee8c60b - Change sched_interact_update() to only accept slp+runtime values between
0 and SCHED_SLP_RUN_MAX * 2.  This allows us to simplify the algorithm
   quite a bit.  Before, it dealt with arbitrary values which required us
   to do nasty integer division tricks that didn't quite work out correctly.
 - Change sched_wakeup() to detect conditions where the slp+runtime could
   exceed SCHED_SLP_RUN_MAX * 2.  This can happen if we go to sleep for
   longer than 6 seconds.  In this case, we'll just clear the runtime and
   set the sleep time to the max.
 - Define a new function, sched_interact_fork() which updates the slp+runtime
   of a newly forked thread.  We want to limit the amount of history retained
   from the parent so that we learn the child's behavior quickly.  We don't,
   however want to decay it to nothing.  Previously, we would simply divide
   each parameter by 100 whenever we forked.  After a few forks the values
   would reach 0 and tasks would not be considered interactive.
 - Add another KTR entry, cleanup some existing entries.
 - Remove a useless sched_interact_update() from sched_priority().  This is
   already done by the callers that require it.
2003-11-02 03:36:33 +00:00
kan
bc70c0727c Temporarily undo parts of the struct mount locking commit by jeff.
It is unsafe to hold a mutex across vput/vrele calls.

This will be redone when a better locking strategy is agreed upon.

Discussed with: jeff
2003-11-01 05:51:54 +00:00
jeff
5cf7d9c84d - Add static to local functions and data where it was missing.
- Add an IPI based mechanism for migrating kses.  This mechanism is
   broken down into several components.  This is intended to reduce cache
   thrashing by eliminating most cases where one cpu touches another's
   run queues.
 - kseq_notify() appends a kse to a lockless singly linked list and
   conditionally sends an IPI to the target processor.  Right now this is
   protected by sched_lock but at some point I'd like to get rid of the
   global lock.  This is why I used something more complicated than a
   standard queue.
 - kseq_assign() processes our list of kses that have been assigned to us
   by other processors.  This simply calls sched_add() for each item on the
   list after clearing the new KEF_ASSIGNED flag.  This flag is used to
   indicate that we have been appended to the assigned queue but not
   added to the run queue yet.
 - In sched_add(), instead of adding a KSE to another processor's queue we
   use kseq_notify() so that we don't touch their queue.  Also in sched_add(),
   if KEF_ASSIGNED is already set return immediately.  This can happen if
   a thread is removed and readded so that the priority is recorded properly.
 - In sched_rem() return immediately if KEF_ASSIGNED is set.  All callers
   immediately readd simply to adjust priorities etc.
 - In sched_choose(), if we're running an IDLE task or the per cpu idle thread
   set our cpumask bit in 'kseq_idle' so that other processors may know that
   we are idle.  Before this, make a single pass through the run queues of
   other processors so that we may find work more immediately if it is
   available.
 - In sched_runnable(), don't scan each processor's run queue, they will IPI
   us if they have work for us to do.
 - In sched_add(), if we're adding a thread that can be migrated and we have
   plenty of work to do, try to migrate the thread to an idle kseq.
 - Simplify the logic in sched_prio() and take the KEF_ASSIGNED flag into
   consideration.
 - No longer use kseq_choose() to steal threads, it can lose its last
   argument.
 - Create a new function runq_steal() which operates like runq_choose() but
   skips threads based on some criteria.  Currently it will not steal
   PRI_ITHD threads.  In the future this will be used for CPU binding.
 - Create a kseq_steal() that checks each run queue with runq_steal(), use
   kseq_steal() in the places where we used kseq_choose() to steal with
   before.
2003-10-31 11:16:04 +00:00
jhb
0a3e524af3 Ensure that mp_ncpus is set to 1 if mp_cpu_probe() fails. 2003-10-30 21:44:01 +00:00
kan
ddb3e24c6c Relock mntvnode_mtx if vget fails in vfs_stdsync. The loop
should always be entered with the mutex locked.
2003-10-30 16:22:51 +00:00
davidxu
91407731e5 Try to fetch the thread mailbox address in the page fault trap, so that
when a thread blocks in the page fault handler, an upcall thread can be
scheduled. This is useful if a process is doing lots of mmap-based I/O.
2003-10-30 02:55:43 +00:00
sam
1beb7cc4f2 Add a temporary mechanism to disable INTR_MPSAFE from network interface
drivers.  This is preparatory to running more parts of the network system
w/o Giant.
2003-10-29 18:29:50 +00:00
bde
f03fdf4e6f Removed mostly-dead code for setting switchtime after the idle loop
clobbers this variable.  Long ago, when the idle loop wasn't in a
process, it set switchtime.tv_sec to zero to indicate that the time
needs to be read after the idle loop finishes.  The special case for
this isn't needed now that there is an idle process (for each CPU).
The time is read in the normal way when the idle process is switched
away from.  The seconds component of the time is only zero for the
first second after the uptime is set, and the mostly-dead code was only
executed during this time.  (This was slightly broken by using uptimes
instead of times relative to the Epoch -- in the original version the
seconds component of the time was only 0 for the first second after
the Epoch.)

In mi_switch(), moved the setting of switchticks to just after the
first (and now only) setting of switchtime.  This setting used to be
delayed since a late setting was needed for the idle case and an early
setting was not needed.  Now the early setting is needed so that
fork_exit() doesn't need to set either switchtime or switchticks.
Removed now-completely-rotted comment attached to this.  Most of the
code described by the comment had already moved to sched_switch().
2003-10-29 15:23:09 +00:00
bde
6bce6afbe7 Removed sched_nest variable in sched_switch(). Context switches always
begin with sched_lock held but not recursed, so this variable was
always 0.

Removed fixup of sched_lock.mtx_recurse after context switches in
sched_switch().  Context switches always end with this variable in the
same state that it began in, so there is no need to fix it up.  Only
sched_lock.mtx_lock really needs a fixup.

Replaced fixup of sched_lock.mtx_recurse in fork_exit() by an assertion
that sched_lock is owned and not recursed after it is fixed up.  This
assertion must match the one in mi_switch(), and if sched_lock were
recursed then a non-null fixup of sched_lock.mtx_recurse would probably
be needed again, unlike in sched_switch(), since fork_exit() doesn't
return to its caller in the normal way.
2003-10-29 14:40:41 +00:00
sam
409cf5f514 Introduce the notion of "persistent mbuf tags"; these are tags that stay
with an mbuf until it is reclaimed.  This is in contrast to tags that
vanish when an mbuf chain passes through an interface.  Persistent tags
are used, for example, by MAC labels.

Add an m_tag_delete_nonpersistent function to strip non-persistent tags
from mbufs and use it to strip such tags from packets as they pass through
the loopback interface and when turned around by icmp.  This fixes problems
with "tag leakage".

Pointed out by:	Jonathan Stone
Reviewed by:	Robert Watson
2003-10-29 05:40:07 +00:00
sam
39ba2e1c90 speed up stream socket recv handling by tracking the tail of
the mbuf chain instead of walking the list for each append

Submitted by:	ps/jayanth
Obtained from:	netbsd (jason thorpe)
2003-10-28 05:47:40 +00:00
jeff
7742522f99 - Only change the run queue in sched_prio() if the kse is non-null.  Threads
can be in the TD_ON_RUNQ state and not have an associated kse.
 - Remove the PRI_IDLE special case from sched_clock(), it was not actually
   necessary.
2003-10-28 03:28:48 +00:00
jeff
d74e8d0250 - Don't set td_priority directly here, use sched_prio(). 2003-10-27 07:15:47 +00:00
jeff
729434c982 - Use a better algorithm in sched_pctcpu_update()
Contributed by:	Thomaswuerfl@gmx.de

 - In sched_prio(), adjust the run queue for threads which may need to move
   to the current queue due to priority propagation.
 - In sched_switch(), fix style bug introduced when the KSE support went in.
   Columns are 80 chars wide, not 90.
 - In sched_switch(), fix the comparison in the idle case and explicitly
   re-initialize the runq in the not propagated case.
 - Remove dead code in sched_clock().
 - In sched_clock(), If we're an IDLE class td set NEEDRESCHED so that threads
   that have become runnable will get a chance to.
 - In sched_runnable(), if we're not the IDLETD, we should not consider
   curthread when examining the load.  This mimics the 4BSD behavior of
   returning 0 when the only runnable thread is running.
 - In sched_userret(), remove the code for setting NEEDRESCHED entirely.
   This is not necessary and is not implemented in 4BSD.
 - Use the correct comparison in sched_add() when checking to see if an idle
   prio task has had its priority temporarily elevated.
2003-10-27 06:47:05 +00:00
alfred
4ba4eee48e constify the second args to timevaladd() and timevalsub(). 2003-10-26 02:19:00 +00:00
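
The const-qualified signature, with a sketch of the addition itself (the carry step mirrors the usual timevalfix() idiom; illustrative, not the committed source):

    #include <sys/time.h>

    void
    timevaladd(struct timeval *t1, const struct timeval *t2)
    {
        t1->tv_sec += t2->tv_sec;
        t1->tv_usec += t2->tv_usec;
        if (t1->tv_usec >= 1000000) {   /* carry microseconds over */
            t1->tv_sec++;
            t1->tv_usec -= 1000000;
        }
    }
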
rwatson
b43632b153 Check (locked) before performing an advisory unlock following a failure
of vn_start_write().  Otherwise, we may inconsistently attempt to release
the advisory lock.

Pointed out by:	teggej
2003-10-25 16:43:50 +00:00
rwatson
e4935eb9ae When generating a core dump, use advisory locking in an advisory way:
if we do acquire an advisory lock, great!  We'll release it later.
However, if we fail to acquire a lock, we perform the coredump
anyway.  This problem became particularly visible with NFS after
the introduction of rpc.lockd: if the lock manager isn't running,
then locking calls will fail, aborting the core dump (resulting in
a zero-byte dump file).

Reported by:	Yogeshwar Shenoy <ynshenoy@alumni.cs.ucsb.edu>
2003-10-25 16:14:09 +00:00
rwatson
723804b261 Allow MAC policies to block/revoke kern_alq write access to a file.
Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
Reviewed by:	jeff
2003-10-25 16:10:41 +00:00
imp
a73d1ed688 Convenience functions to generate notifications from the kernel. The ACPI
code will start using these shortly.

Reviewed by: njl
2003-10-24 22:41:54 +00:00
jmg
abe50d83e6 don't allow reading from files that haven't been open'd for reading. 2003-10-24 21:07:53 +00:00
jhb
7d3a70bca2 - Add a DDB command 'show intrcnt' to show the non-zero interrupt counts.
- Add a DDB function to dump the contents of an ithread and optionally
  details about each handler in that ithread.  This function can be used
  by MD code to implement DDB commands that display information about
  interrupt sources and their registered handlers.
2003-10-24 21:05:30 +00:00
jhb
aee5e4f914 Writes to p_flag in __setugid() no longer need Giant. 2003-10-23 21:20:34 +00:00
jhb
84fcdc7f5a Move the P_COWINPROGRESS flag from being a per-process p_flag to being a
per-thread td_pflag which doesn't require any locks to read or write as it
is only read or written by curthread on itself.

Glanced at by:	mckusick
2003-10-23 21:14:08 +00:00
wollman
dabc1a332d Add appropriate const poisoning to the assert_*locked() family so that I can
call ASSERT_VOP_LOCKED(vp, __func__) without a diagnostic.

Inspired by:	the evil and rude OpenAFS cache manager code
2003-10-23 18:17:36 +00:00
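
What the const poisoning buys, sketched (the signature is assumed for illustration, not copied from the tree):

    struct vnode;                       /* opaque here */

    /* With 'str' const-qualified, passing __func__ (a const char [])
     * no longer triggers a discarded-qualifier diagnostic: */
    void assert_vop_locked(struct vnode *vp, const char *str);

    /* typical call site:  assert_vop_locked(vp, __func__);  */
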
rwatson
966f1ed0b5 Finish break-out of kern_mac.c into parts:
Include src/sys/security/mac/mac_internal.h in kern_mac.c.

  Remove redundant defines from the include: SYSCTL_DECL(), debug macros,
    composition macros.

  Unstaticize various bits now exposed to the remainder of the kernel:
    mac_init_label(), mac_destroy_label().

  Remove all the functions now implemented in mac_process/mac_vfs/mac_net/
    mac_pipe.  Also remove debug counters, sysctls exporting debug
    counters, enforcement flags, sysctls exporting enforcement flags.

  Leave module declaration, sysctl nodes, mactemp malloc type, system
    calls.

This should conclude MAC/LINT/NOTES breakage from the break-out process,
but I'm running builds now to make sure I caught everything.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-10-22 20:59:31 +00:00
rwatson
bae92ad3e6 Variable cleanup following break-out of kern_mac.c into sys/security/mac:
Unstaticize mac_late.
  Remove ea_warn_once, now in mac_vfs.c.
  Unstaticize mac_policy_list, mac_static_policy_list, use
    struct mac_policy_list_head instead of LIST_HEAD() directly.
  Unstaticize and un-inline MAC policy locking functions so they can
    be referenced from mac_*.c.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-10-22 20:47:41 +00:00
rwatson
e36fe77ad7 Rename error_select() to mac_error_select(), and unstaticize so it
can be used from src/sys/security/mac/mac_*.c.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-10-22 20:42:22 +00:00
silby
f0e686a675 Change all SYSCTLS which are readonly and have a related TUNABLE
from CTLFLAG_RD to CTLFLAG_RDTUN so that sysctl(8) can provide
more useful error messages.
2003-10-21 18:28:36 +00:00
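
The pattern in question, shown with a made-up knob (kern.sketch_limit is hypothetical; the macros are the FreeBSD sysctl/tunable idiom of the time):

    #include <sys/param.h>
    #include <sys/kernel.h>
    #include <sys/sysctl.h>

    static int sketch_limit = 64;
    TUNABLE_INT("kern.sketch_limit", &sketch_limit);

    /* CTLFLAG_RDTUN: read-only via sysctl(8) but settable as a loader
     * tunable, letting sysctl(8) point users at the loader instead of
     * failing with a bare permission error. */
    SYSCTL_INT(_kern, OID_AUTO, sketch_limit, CTLFLAG_RDTUN,
        &sketch_limit, 0, "Example read-only tunable");
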
simokawa
aeab60718e We need to initialize bp->b_offset and bp->b_iooffset
because bp->b_blkno is ignored now.
2003-10-21 13:18:19 +00:00
scottl
0564a95894 Don peril-sensitive sunglasses and mark pipe(2) as MPSAFE. I've beaten up
on it for the last 15 hours with no signs of problems.  It gives a small
(1%) gain on buildworld since pipe_read/pipe_write are already free of Giant.
2003-10-21 07:03:27 +00:00
phk
01dad440c7 Remove KASSERTS on B_PHYS for vmapbuf() and vunmapbuf(), B_PHYS is going
away.
2003-10-21 06:53:10 +00:00
marcel
7df6e35964 Remove md_bspstore from the MD fields of struct thread. Now that
the backing store is at a fixed address, there's no need for a
per-thread variable.
2003-10-21 01:13:49 +00:00
sam
b058e8665f revert default for idle polling to zero until we can resolve the
livelock problem
2003-10-20 21:14:24 +00:00
jeff
d477fdf956 - If a thread is not bound to a kse return 0 from sched_pctcpu().
Reported by:	 pawel.worach@nordea.com
2003-10-20 19:55:21 +00:00
alc
512489f301 Initialize the buf's b_object in pbgetvp(). Clear it in pbrelvp(). (This
facilitates synchronization of the vm page's valid field using the
vm object's lock.)

Suggested by:	tegge
2003-10-20 18:24:38 +00:00
dwmalone
72e9866f3d Mark dup as MPSAFE. Giant was pushed into dup ages ago, but it looks
like it was missed in syscalls.master.

Spotted by:	alc
2003-10-20 16:16:03 +00:00
alc
ba669f6199 - Synchronize access to a vm page's valid field using the containing
vm object's lock.
2003-10-20 05:57:55 +00:00
marcel
87282b0dcb Put the RSE backing store at a fixed address. This change is triggered
by libguile that needs to know the base of the RSE backing store. We
currently do not export the fixed address to userland by means of a
sysctl so user code needs to hardcode it for now. This will be revisited
later.

The RSE backing store is now at the bottom of region 4. The memory stack
is at the top of region 4. This means that the whole region is usable
for the stacks, giving a 61-bit stack space.

Port: lang/guile (depended on by x11/gnome2)
2003-10-20 05:34:10 +00:00
dwmalone
be405a4cbd falloc allocates a file structure and adds it to the file descriptor
table, acquiring the necessary locks as it works. It usually returns
two references to the new descriptor: one in the descriptor table
and one via a pointer argument.

As falloc releases the FILEDESC lock before returning, there is a
potential for a process to close the reference in the file descriptor
table before falloc's caller gets to use the file. I don't think this
can happen in practice at the moment, because Giant indirectly protects
closes.

To stop the file being completely closed in this situation, this change
makes falloc set the refcount to two when both references are returned.
This makes life easier for several of falloc's callers, because the
first thing they previously did was grab an extra reference on the
file.

Reviewed by:	iedowse
Idea run past:	jhb
2003-10-19 20:41:07 +00:00
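
Schematically (struct trimmed to one member), the change moves the caller's extra reference inside falloc:

    struct file_sketch { int f_count; /* ... */ };

    /* before: f_count == 1 on return, and callers raced to fhold()
     * their reference after the FILEDESC lock was dropped.
     * after: both references exist before the lock is released. */
    static void
    falloc_sketch(struct file_sketch *fp)
    {
        fp->f_count = 2;    /* one for the fd table, one for the caller */
        /* ... install in the fd table, release the FILEDESC lock ... */
    }
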
alc
676085d671 - Add vm object locking to vfs_clean_pages() and vfs_bio_set_validclean().
This is to synchronize access to the vm page's valid field by
   vm_page_set_validclean().
2003-10-19 20:39:06 +00:00
peter
7b3a8f1308 Tidy up loose ends in the idle process. Call the MI cpu_idle() function
for all platforms now.

XXX alpha/sparc64/powerpc should fill in the function.

Submitted by:  bde
2003-10-19 02:43:57 +00:00
phk
50ecf36141 Initialize b_iooffset before calling VOP_[SPEC]STRATEGY 2003-10-18 19:49:46 +00:00
phk
4b7ade98cd Initialize b_iooffset before calling strategy 2003-10-18 19:48:21 +00:00
phk
80b79c02da Don't report b_pblkno, it is going away. 2003-10-18 17:59:02 +00:00
phk
adc5cdd07a Report bio_pblkno instead of bio_blkno. 2003-10-18 17:27:10 +00:00
phk
b60969a35e Make bioq_disksort() sort on the bio_offset field instead of bio_pblkno. 2003-10-18 15:50:56 +00:00
phk
4c2cb3f397 DuH!
bp->b_iooffset (the spot on the disk), not bp->b_offset (the offset in
the file)
2003-10-18 14:10:28 +00:00
phk
d004fc1e31 I think rwatson got the sign wrong here... 2003-10-18 12:16:17 +00:00
phk
bd00da531d Initialize bp->b_offset before calling VOP_STRATEGY() 2003-10-18 11:13:31 +00:00
phk
6b528c7911 Convert some if(bla) panic("foo") to KASSERTS to improve grep-ability. 2003-10-18 09:32:39 +00:00
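
The shape of the conversion ('bla' and 'foo' are the commit's own placeholders; kernel-only headers):

    #include <sys/param.h>
    #include <sys/systm.h>              /* KASSERT() */

    static void
    example(int bla)
    {
        /* before:
         *      if (bla)
         *              panic("foo");
         * after -- greppable, and compiled out without INVARIANTS:
         */
        KASSERT(!bla, ("foo"));
    }
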
phk
3dfc831549 The size and contents of the DEV_STRATEGY() macro have progressed to
the point where it being a macro is no longer sensible, and it will
only be more so in days to come.

BIO_STRATEGY() is now only used from DEV_STRATEGY() and should not
be used directly anymore.

Put the contents of both in the new function dev_strategy() and
make DEV_STRATEGY() call that function.

In addition, this allows us to make the rather magic bufdonebio()
helper function static.

This also saves a hundred-and-some bytes of code in a typical kernel.
2003-10-18 09:03:15 +00:00
rwatson
9be7eda223 Wrap db_active check in #ifdef DDB, as db_active is not defined ifndef
DDB.
2003-10-18 02:23:57 +00:00
rwatson
bf32888a2a Add a new cn_flags field to struct consdev, the low-level console
definition structure.  Define one flag, CN_FLAG_NODEBUG, which
indicates the console driver cannot be used in the context of the
debugger.  This may be used, for example, if the console device
interacts with kernel services that cannot be used from the
debugger context, such as the network stack.  These drivers are
skipped over for calls to cn_checkc() and cn_putc(), and the
calling function simply moves on to the next available console.
2003-10-18 02:13:39 +00:00
jeff
1869c3be82 - Remove the correct thread from the run queue in setrunqueue(). This
fixes ULE + KSE.
2003-10-17 20:53:04 +00:00
phk
888092f317 Simplify count_dev() 2003-10-17 11:56:48 +00:00
peter
c7a92905e0 Halt the cpu on amd64 as well. For some strange reason, this makes
a fair bit of difference to the power consumption and lets my cpu cool
down enough for the temperature sensitive fan controller to completely
stop the cpu fan at times.
2003-10-17 03:49:03 +00:00
marcel
1d3178bbc0 Implement cpu_idle() on ia64. We put the processor in a lightweight
halt state that minimizes power consumption while still preserving
cache and TLB coherency. Halting the processor is not conditional at
this time. Tested with UP and SMP kernels.
2003-10-17 02:24:59 +00:00
jeff
7a27809dea - The kse may be null in sched_pctcpu().
Reported by:	kris
2003-10-16 21:13:14 +00:00
jeff
fcda5c4f74 - Only kse_reassign() in the !running case.
Reported by:	kris
2003-10-16 20:32:57 +00:00
jeff
3b6db25186 - Call sched_add() with the correct argument on SMP.
Reported by:	Valentin Chopov <valentin@valcho.net>
2003-10-16 20:06:19 +00:00
jeff
6e062906a7 - Fix a minor problem with my last commit, we don't want to return from
sched_switch if the thread is running; we want to fall through and pick
   a new thread because we have been preempted.
2003-10-16 10:04:54 +00:00
dfr
3dac505582 * Add multiple inheritance to kobj. Each class can have zero or more base
classes and if a method is not found in a given class, its base classes
  are searched (in the order they were declared). This search is recursive,
  i.e. a method may be defined in a base class of a base class.
* Change the kobj method lookup algorithm to one which is SMP-safe. This
  relies only on the constraint that an observer of a sequence of writes
  of pointer-sized values will see exactly one of those values, not a
  mixture of two or more values. This assumption holds for all processors
  which FreeBSD supports.
* Add locking to kobj class initialisation.
* Add a simpler form of 'inheritance' for devclasses. Each devclass can
  have a parent devclass. Searches for drivers continue up the chain of
  devclasses until either a matching driver is found or a devclass is
  reached which has no parent. This can allow, for instance, pci drivers
  to match cardbus devices (assuming that cardbus declares pci as its
  parent devclass).
* Increment __FreeBSD_version.

This preserves the driver API entirely except for one minor feature used
by the ISA compatibility shims. A workaround for ISA compatibility will
be committed separately. The kobj and newbus ABI has changed - all modules
must be recompiled.
2003-10-16 09:16:28 +00:00
jeff
4aea3a9433 - Collapse sched_switchin() and sched_switchout() into sched_switch(). Now
mi_switch() calls sched_switch() which calls cpu_switch().  This is
   actually one less function call than it had been.
2003-10-16 08:53:46 +00:00
jeff
991febf6dd - Update the sched api. sched_{add,rem,clock,pctcpu} now all accept a td
argument rather than a kse.
2003-10-16 08:39:15 +00:00
jeff
bf29a9dd12 - The non-iterative algorithm for interact_update was broken due to
rounding errors.  This was the source of the majority of the
   interactivity problems.  Reintroduce the old algorithm and its XXX.
 - Up the interactivity threshold to 30.  It really could stand to be even
   a tiny bit higher.
 - Let the sleep and run time accumulate up to 5 seconds of history rather
   than two.  This helps stop XFree86 from becoming non-interactive during
   bursts of activity.
2003-10-16 08:17:43 +00:00
jeff
72dab909cc - If our user_pri doesn't match our actual priority, our priority has been
elevated either due to priority propagation or because we're in the
   kernel; in either case, put us on the current queue so that we don't
   stop others from using important resources.  At some point the priority
   elevations from sleeping in the kernel should go away.
 - Remove an optimization in sched_userret().  Before we would only set
   NEEDRESCHED if there was something of a higher priority available.  This
   is a trivial optimization and it breaks priority propagation because it
   doesn't take threads which we may be blocking into account.  Notice that
   the thread which is blocking others gets up to one tick of cpu time before
   we honor this NEEDRESCHED in sched_clock().
2003-10-15 07:47:06 +00:00
peter
8cbefc5894 The KERN_PROC_PROC sysctl took 4 args in 5.0-REL and 5.1-REL. We need to
accept this for a bit longer.  Requiring the new order of 3 args only
was not very helpful.
2003-10-15 03:11:46 +00:00
sam
4e958f714d Change default for kern.polling.idle_poll back to 1. This was set to 0
because Luigi observed livelock, but in recent testing it did not occur,
so I'm re-enabling it by default.

Reviewed by:	luigi
2003-10-14 18:39:36 +00:00
phk
637477251d Made use of 'error' argument, which was unused (by mistake) before.
Submitted by:    Pawel Jakub Dawidek <nick@garage.freebsd.pl>
2003-10-14 08:09:43 +00:00
imp
e34b12b9cd With DIAGNOSTICS, sometimes we get weird crashes when some driver
accesses softc after it is freed.  Use a different malloc type for
softc than the rest of the bus code to make it more clear when these
things happen that it is the driver that's at fault, not the bus code.

Suggested by: sam and/or phk (I think)
2003-10-14 06:22:07 +00:00
jeff
ce562d35ad - Add a missing vn_finished_write()
Pointy hat:     jeff
Found by:       robert
Obtained from:  kirk
2003-10-14 00:38:34 +00:00
davidxu
b24bb74b9e Don't clear the signal mask in execsig(). RELENG_4 does not clear it,
and POSIX asks for the signal mask to be inherited across execv.
2003-10-13 14:03:08 +00:00
jeff
bd83534d11 - In SCHED_CURR() add holding Giant to the list of criteria that will keep
you on the current queue.  In the future, it would be nice if priority
   propagation could deterministically pluck a thread off of the next queue
   and put it on the current queue.  Until then this hack stops us from
   holding up our entire current queue, including interrupt handlers, while
   a thread on the next queue is blocked while holding Giant.
 - Inherit our pctcpu information from our parent.
2003-10-12 21:07:31 +00:00
alc
bfc309eda8 In vfs_bio_clrbuf(), ignore the state of the object lock if the page is the
"bogus" page.

Found by:	tegge
2003-10-12 18:26:48 +00:00
phk
edaf9214c4 Simplify vn_isdisk() a bit. 2003-10-12 14:04:39 +00:00
jmg
f1d456150e fix a problem referencing freed memory. This is only a problem for
kqueue write events on a socket when you regularly create tons of pipes,
which overwrites the structure, causing a panic when removing the knote
from the list.  If the peer has gone away (and it's a write knote), then
don't bother trying to remove the knote from the list.

Submitted by:	Brian Buchanan and myself
Obtained from:	nCircle
2003-10-12 07:06:02 +00:00
jeff
5b9cc4b22e - Fix a typo, I meant & and not |. This was causing lockups from the syncer
looping forever due to list corruption.

Solved by:	tegge
2003-10-11 21:50:45 +00:00
alc
7f1718017c - Synchronize access to a page's valid field in vfs_bio_clrbuf()
by using the lock from its containing object.
 - Remove GIANT_REQUIRED from vm_hold_load_pages().
2003-10-10 07:26:21 +00:00
robert
8519aa2ff0 Implement preliminary support for the PT_SYSCALL command to ptrace(2). 2003-10-09 10:17:16 +00:00
tjr
1ac708bd28 Remove support for the unused 4th component of the KERN_PROC_PROC sysctl. 2003-10-06 01:26:11 +00:00
jeff
6be3f23255 - Add a missing vn_start_write() to flushbufqueues(). This could have
caused snapshot related problems.
 - The vp can not be NULL here or we would panic in vfs_bio_awrite().  Stop
   confusing the logic by checking for it in several places.

Submitted by:	kirk and then rototilled by me to remove vp == NULL checks.
2003-10-05 22:16:08 +00:00
bms
3960d861d4 Bring back sysctl_wire_old_buffer(). Fix a bug in sysctl_handle_opaque()
whereby the pointers would not get reset on a retried SYSCTL_OUT() call.

Noticed by:	bde
2003-10-05 13:31:33 +00:00
bms
44fa0fe4a2 Fix a security problem in sysctl() the long way round.
Use pre-emption detection to avoid the need for wiring a userland buffer
when copying opaque data structures.

sysctl_wire_old_buffer() is now a no-op. Other consumers of this
API should use pre-emption detection to notice update collisions.

vslock() and vsunlock() should no longer be called by any code
and should be retired in subsequent commits.

Discussed with:	pete, phk
MFC after:	1 week
2003-10-05 09:37:47 +00:00
bms
0f917818e6 Add a pre-emption counter, td_generation, so that threads can notice
when they have been pre-empted by other threads. This is bumped from
within mi_switch() every time a context switch takes place.

Discussed with:	pete
2003-10-05 09:35:08 +00:00
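
A sketch of how such a counter detects pre-emption around a copy (the field name mirrors the commit; the retry policy is illustrative):

    struct thread_sketch { unsigned td_generation; };

    /* mi_switch() bumps td_generation on every context switch, so equal
     * snapshots taken around a copy mean we were not pre-empted. */
    static int
    copy_until_stable(struct thread_sketch *td, int (*copy)(void *),
        void *arg)
    {
        unsigned gen;

        do {
            gen = td->td_generation;
            if (copy(arg) != 0)
                return (-1);            /* hard failure */
        } while (gen != td->td_generation); /* pre-empted: retry */
        return (0);
    }
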
bms
b463ed3eb2 Fold the vslock() and vsunlock() calls in this file with #if 0's; they will
go away in due course. Involuntary pre-emption means that we can't count
on wiring of pages alone for consistency when performing a SYSCTL_OUT()
bigger than PAGE_SIZE.

Discussed with:	pete, phk
2003-10-05 08:38:22 +00:00
jeff
d22d3e1bfb - Apply a big giant lock around the namecache. This has been sitting in
my tree since BSDcon.
2003-10-05 07:13:50 +00:00
jeff
2c3fea92c8 - Fix an XXX. Check the error of vn_lock() in vflush(). Don't specify
LK_RETRY either, we don't want this vnode if it turns into another.
 - Remove the code that checks the mount point after acquiring the lock;
   we are guaranteed to either fail or get the vnode that we wanted.
2003-10-05 07:12:38 +00:00
bms
c600018661 Remove magic numbers surrounding locking state in the sysctl module, and
replace them with more meaningful defines.
2003-10-05 05:38:30 +00:00
jeff
6b8324e841 - Rename vcanrecycle() to vtryrecycle() to reflect its new role.
- In vtryrecycle() try to vgonel the vnode if all of the previous checks
   passed.  We won't vgonel if someone has either acquired a hold or usecount
   or started the vgone process elsewhere.  This is because we may have been
   removed from the free list while we were inspecting the vnode for
   recycling.
 - The VI_TRYLOCK stops two threads from entering getnewvnode() and recycling
   the same vnode.  To further reduce the likelihood of this event, requeue
   the vnode on the tail of the list prior to calling vtryrecycle().  We can
   not actually remove the vnode from the list until we know that it's
   going to be recycled because other interlock holders may see the VI_FREE
   flag and try to remove it from the free list.
 - Kill a bogus XXX comment.  If XLOCK is set we shouldn't wait for it
   regardless of MNT_WAIT because the vnode does not actually belong to
   this filesystem.
2003-10-05 05:35:41 +00:00
jeff
e15704d590 - Don't cache_purge() in getnewvnode. It's done in vclean(). With this
purge, the purge in vclean, and the filesystem's purge, we had 3 purges
   per vnode.
 - Move the insmntque(vp, 0) to vclean() so that we may remove it from the
   two vgone() functions and reduce the number of lock operations required.
2003-10-05 02:48:04 +00:00
jeff
8d0f78003f - Solve a LOR with the sync_mtx by using the VI_ONWORKLST flag to determine
whether or not the sync failed.  This could potentially get set between
   the time that we VOP_UNLOCK and VI_LOCK() but the race would harmlessly
   lead to the sync being delayed by an extra 30 seconds.  If we do not move
   the vnode it could cause an endless loop if it continues to fail to sync.
 - Use vhold and vdrop to stop the vnode from changing identities while we
   have it unlocked.  Other internal vfs lists are likely to follow this
   scheme.
2003-10-05 00:35:41 +00:00
jeff
8d72c43763 - Move the xlock 'locking' code into vx_lock() and vx_unlock().
- Create a new function, vgonechrl(), which performs vgone for an in-use
   character device.  Move the code from vflush() that did this into
   vgonechrl().
 - Hold the xlock across the entirety of vgonel() and vgonechrl() so that
   at no point will an invalid vnode exist on any list without XLOCK set.
 - Move the xlock code out of vclean() now that it is in the vgone*()
   functions.
2003-10-05 00:02:41 +00:00
alc
74f19ddd0e Eliminate some unnecessary uses of the vm page queues lock around the
vm page's valid field.  This field is being synchronized using the
containing vm object's lock.
2003-10-04 22:47:20 +00:00
alc
7c11eaebac - Extend the scope the vm object lock to cover calls to
vm_page_is_valid().
 - Assert that the lock on the containing vm object is held in
   vm_page_is_valid().
2003-10-04 19:23:29 +00:00
jeff
55547647ec - In sched_sync() test our preconditions prior to dropping the sync_mtx.
This is so that we may grab the interlock while still holding the
   sync_mtx.  We have to VI_TRYLOCK() because in all other cases the lock
   order runs the other way.
 - If we don't meet any of the preconditions, reinsert the vp into the
   list for the next second.
 - We don't need to panic if we fail to sync here because each FSYNC
   function handles this case.  Removing this redundant code also
   simplifies locking.
2003-10-04 18:03:53 +00:00
jeff
1fc4867692 - Change a lame iterative algorithm to a constant time algorithm. Remove
the XXX that complains about it as well.

Submitted by:	ThomasWuerfl@gmx.de
2003-10-04 17:41:13 +00:00
jeff
bb40013912 - In a Giantless world, the vn_lock() in vcanrecycle() could legitimately
fail.  Remove the panic from that case and document why it might fail.
 - Document the reason for calling cache_purge() on a newly created vnode.
 - In insmntque() order the operations so that we can call mtx_unlock()
   one fewer times.  This makes the code somewhat clearer as well.
 - Add XXX comments in sched_sync() and vflush().
 - In vget(), do not sleep while waiting for XLOCK to clear if LK_NOWAIT is
   set.
 - In vclean() we don't need to acquire a lock around a single TAILQ_FIRST
   call.  It's ok if we race here, the vinvalbuf will just do nothing.
 - Increase the scope of the lock in vgonel() to reduce the number of lock
   operations that are performed.
2003-10-04 15:10:40 +00:00
jeff
c1590e8666 - If we are called with LK_NOWAIT in vn_lock() we may be holding a mutex
and should not sleep while waiting for XLOCK to clear.  Care needs to be
   taken in functions that use this capability to avoid spinning.
2003-10-04 14:35:22 +00:00
nectar
1857c0891b Introduce a uiomove_frombuf helper routine that handles computing and
validating the offset within a given memory buffer before handing the
real work off to uiomove(9).

Use uiomove_frombuf in procfs to correct several issues with
integer arithmetic that could result in underflows/overflows.  As a
side-effect, the code is significantly simplified.

Add additional sanity checks when computing a memory allocation size
in pfs_read.

Submitted by:	rwatson  (original uiomove_frombuf -- bugs are mine :-)
Reported by:	Joost Pol <joost@pine.nl>  (integer underflows/overflows)
2003-10-02 15:00:55 +00:00
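
The helper's contract, sketched with stub types (the real implementation lives in the kernel; the error number is written literally to stay self-contained):

    #include <sys/types.h>

    struct uio_sketch { off_t uio_offset; /* ... */ };
    int uiomove(void *cp, int n, struct uio_sketch *uio); /* stand-in */

    static int
    uiomove_frombuf_sketch(void *buf, int buflen, struct uio_sketch *uio)
    {
        off_t off = uio->uio_offset;

        if (off < 0 || buflen < 0)
            return (22);                /* EINVAL */
        if (off >= buflen)
            return (0);                 /* past the end: report EOF */
        return (uiomove((char *)buf + off, buflen - (int)off, uio));
    }
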
rwatson
1c522512fd Remove the global variable 'cmask', which was used to initialize the
fd_cmask field in the file descriptor structure for the first process
indirectly from CMASK, and when an fd structure is initialized before
being filled in, and instead just use CMASK.  This appears to be an
artifact left over from the initial integration of quotas into BSD.

Suggested by:	peter
2003-10-02 03:57:59 +00:00
jeff
db2419a0a1 - On my Pentium4-M laptop, invalpg takes ~1100 cycles if the page is found in
   the TLB and ~1600 if it is not.  Therefore, it is more efficient to
   invalidate the TLB after operations that use CMAP rather than before.
 - So that the tlb is invalidated prior to switching off of a processor, we
   must change the switchin functions to switchout functions.
 - Remove td_switchout from the thread and move it to the x86 pcb.
 - Move the code that calls switchout into swtch.s.  These changes make this
   optimization truly x86 specific.
2003-09-30 08:11:36 +00:00
rwatson
0e5948bb6b If the struct mac copied into the kernel has a negative length, return
EINVAL rather than failing the following malloc due to the value being
too large.
2003-09-29 18:35:17 +00:00
phk
7e52b3ce80 Retire revoke_and_destroy_dev() with extreme prejudice. 2003-09-28 20:50:36 +00:00
marcel
186d3de8bf Remove the regstkpages sysctl variable. We have a growable register
stack now.
2003-09-27 23:07:47 +00:00
marcel
d75cf98307 Part 2 of implementing rstacks: add the ability to create rstacks and
use the ability on ia64 to map the register stack. The orientation of
the stack (i.e. its grow direction) is passed to vm_map_stack() in the
overloaded cow argument. Since the grow direction is represented by
bits, it is possible and allowed to create bi-directional stacks.
This is not an advertised feature, more of a side-effect.

Fix a bug in vm_map_growstack() that's specific to rstacks and which
we could only find by having the ability to create rstacks: when
the mapped stack ends at the faulting address, we have not actually
mapped the faulting address.  We need to include or cover the faulting
address.

Note that at this time mmap(2) has not been extended to allow the
creation of rstacks by processes. If such a need arises, this can
be done.

Tested on: alpha, i386, ia64, sparc64
2003-09-27 22:28:14 +00:00
phk
d0dceae232 Make life a little bit easier for cloning device drivers. 2003-09-27 21:50:00 +00:00
phk
74c6dfd454 Introduce no_poll() default method for device drivers. Have it
do exactly the same as vop_nopoll() for consistency and put a
comment in the two pointing at each other.

Retire seltrue() in favour of no_poll().

Create private default functions in kern_conf.c instead of public
ones.

Change default strategy to return the bio with ENODEV instead of
doing nothing, which would leave the bio stranded.

Retire public nullopen() and nullclose() as well as the entire band
of public no{read,write,ioctl,mmap,kqfilter,strategy,poll,dump}
functions; they are the default actions now.

Move the final two trivial functions from subr_xxx.c to kern_conf.c
and retire the now empty subr_xxx.c
2003-09-27 12:53:33 +00:00
phk
d92d95ded7 Don't use seltrue when that is not really what we mean. 2003-09-27 12:44:06 +00:00
phk
7099deadda The present defaults for the open and close methods of device drivers
which provide no methods do not make any sense, and are not used by
any driver.

It is pretty hard to come up with even a theoretical concept of
a device driver which would always fail open and close with ENODEV.

Change the defaults to be nullopen() and nullclose(), which simply
do nothing.

Remove explicit initializations to these from the drivers which
already used them.
2003-09-27 12:01:01 +00:00
phk
0c8bfb6d00 OK, I messed up /dev/console with what I had hoped would be compat
code.  Convert remaining console drivers and hope for the best.
2003-09-26 19:35:50 +00:00
robert
58f93096d9 Move some tracing related code into its own function as it will
be needed for system call related ptrace functionality I plan
to commit soon.
2003-09-26 15:09:46 +00:00
phk
45c1eba059 Update the list of CDROM device names to try for booting with RB_CDROM
flag set.
2003-09-26 09:07:27 +00:00
phk
af0dfcd07f Remove wrongly sized cnd_name field, we now store the name in the
consdev structure.

If the consdev name is not set and we have a cn_dev, set the name
from there.  Try to issue a printf about this, even though it may
not have a place to go.

Modify the sysctl related code to pick up the name from the consdev
instead.
2003-09-26 07:26:54 +00:00
peter
8ecb3577d8 Add sysentvec->sv_fixlimits() hook so that we can catch cases on 64 bit
systems where the data/stack/etc limits are too big for a 32 bit process.

Move the 5 or so identical instances of ELF_RTLD_ADDR() into imgact_elf.c.

Supply an ia32_fixlimits function.  Export the clip/default values to
sysctl under the compat.ia32 hierarchy.

Have mmap(0, ...) respect the current p->p_limits[RLIMIT_DATA].rlim_max
value rather than the sysctl tweakable variable.  This allows mmap to
place mappings at sensible locations when limits have been reduced.

Have the imgact_elf.c ld-elf.so.1 placement algorithm use the same
method as mmap(0, ...) now does.

Note that we cannot remove all references to the sysctl tweakable
maxdsiz etc variables because /etc/login.conf specifies a datasize
of 'unlimited'.  And that causes exec etc to fail since it can no
longer find space to mmap things.
2003-09-25 01:10:26 +00:00
fjoe
9957f857c4 Avoid NULL pointer dereferencing in modlist_lookup2().
PR:		56570
Submitted by:	Thomas Wintergerst <Thomas.Wintergerst@nord-com.net>
2003-09-23 14:42:38 +00:00
alc
9c61d65266 - vm_hold_free_pages() should lock the kernel object. (The pages being
freed belong to the kernel object.)
 - Increase the granularity of the vm object locking in vm_hold_load_pages()
   in order to reduce the number of times that we acquire and release the
   same lock.
2003-09-22 04:58:09 +00:00
dfr
8cfa3234b8 The method link_preload_finish is not static. 2003-09-20 17:39:32 +00:00
jeff
83b269493e - Somewhere along the line I stupidly removed critical logic from
sched_ptcpu_update().  This caused erroneous cpu times in TOP for
   processes that were asleep.  Replace the code that was removed.
2003-09-20 02:05:58 +00:00
jeff
517dcea6c8 - In reassignbuf() don't unlock vp and lock newvp if they are the same.
Doing so creates a race where the buf is on neither list.
 - Only vfree() in an error case in vclean() if VSHOULDFREE() thinks we
   should.
 - Convert the error case in vclean() to INVARIANTS from DIAGNOSTIC as this
   really should not happen and is fast to check.
2003-09-20 00:21:48 +00:00
jeff
45f3b1b270 - Remove spls(). The locking that has replaced them is in place and they
no longer serve as guidelines for future work.
2003-09-19 23:52:06 +00:00
kan
cf77f9f005 Eliminate one case of VI_UNLOCK followed by an immediate
VI_LOCK.
2003-09-19 19:13:54 +00:00
tjr
3bce48c27b Allow the KERN_PROC_PROC sysctl to be used without the useless 4th
name component, for consistency with KERN_PROC_ALL. Support for the
4-argument form will be removed some time before 5.2-R.
2003-09-19 14:16:50 +00:00
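A hedged userland sketch of the 3-component form (names hypothetical;
error handling via err(3)):

    #include <sys/types.h>
    #include <sys/sysctl.h>
    #include <sys/user.h>
    #include <err.h>
    #include <stdlib.h>

    /* Fetch the process list using only three name components. */
    static struct kinfo_proc *
    fetch_procs(size_t *lenp)
    {
        int mib[3];
        struct kinfo_proc *kp;

        mib[0] = CTL_KERN;
        mib[1] = KERN_PROC;
        mib[2] = KERN_PROC_PROC;    /* no useless 4th component */
        if (sysctl(mib, 3, NULL, lenp, NULL, 0) == -1)
            err(1, "sysctl (size)");
        if ((kp = malloc(*lenp)) == NULL)
            err(1, "malloc");
        if (sysctl(mib, 3, kp, lenp, NULL, 0) == -1)
            err(1, "sysctl (fetch)");
        return (kp);
    }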
jeff
52b3368d79 - Only use UMA to cache malloc requests up to PAGE_SIZE. Values larger than
this are requested very infrequently and waste memory when we cache
   spares.
2003-09-19 04:39:08 +00:00
alc
beb3ca1e4c Correct a typo in the previous revision. 2003-09-15 02:56:48 +00:00
rwatson
50888524ca Add a new sysctl, security.bsd.conservative_signals, to disable
special signal-delivery protections for setugid processes.  In the
event that a system is relying on "unusual" signal delivery to
processes that change their credentials, this can be used to work
around application problems.

Also, add SIGALRM to the set of signals permitted to be delivered to
setugid processes by unprivileged subjects.

Reported by:	Joe Greco <jgreco@ns.sol.net>
2003-09-14 07:22:38 +00:00
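For example, an administrative tool might flip the knob from C with
sysctlbyname(3) (a sketch, assuming the knob defaults to on):

    #include <sys/types.h>
    #include <sys/sysctl.h>
    #include <err.h>

    /* Disable the setugid signal-delivery protections (needs privilege). */
    static void
    disable_conservative_signals(void)
    {
        int off = 0;

        if (sysctlbyname("security.bsd.conservative_signals",
            NULL, NULL, &off, sizeof(off)) == -1)
            err(1, "sysctlbyname");
    }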
nectar
f158e368c2 sched_setscheduler: Return EINVAL when an invalid policy is specified,
thus complying with POLA and the man page.  (Previously, no error was
returned for this case.)
2003-09-13 18:46:24 +00:00
nectar
54f60400ec Correct mostly harmless off-by-one error in getdomainname().
Reviewed by:	imp
2003-09-13 17:12:22 +00:00
alc
ee4ef644cf Convert vmapbuf() from using pmap_extract() to using
pmap_extract_and_hold().  Note, however, that GIANT_REQUIRED should not be
removed until all platforms fully implement the "prot" parameter to
pmap_extract_and_hold().

Reviewed by:	tegge
2003-09-13 04:29:55 +00:00
alc
6808836f35 pipe_build_write_buffer() only requires read access of the page that it
obtains from pmap_extract_and_hold().
2003-09-12 07:13:15 +00:00
marcel
5b71626790 Introduce BUS_CONFIG_INTR(). The method allows devices to tell parents
about interrupt trigger mode and interrupt polarity. This allows ACPI
for example to pass interrupt resource information up the hierarchy.
The default implementation of the method therefore is to pass the
request to the parent.

Reviewed by: jhb, njl
2003-09-10 21:37:10 +00:00
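The new method has the shape BUS_CONFIG_INTR(dev, irq, trigger,
polarity); a sketch (hypothetical values) of a driver describing a
level-triggered, active-low interrupt it discovered:

    #include <sys/param.h>
    #include <sys/bus.h>

    /*
     * Tell the parent bus how IRQ 10 is wired.  The default
     * implementation of the method just forwards the request up
     * the device hierarchy.
     */
    static int
    example_describe_irq(device_t dev)
    {
        return (BUS_CONFIG_INTR(dev, 10,
            INTR_TRIGGER_LEVEL, INTR_POLARITY_LOW));
    }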
simokawa
c286d7e22f Fix asynchronous physio breakage introduced in rev 1.163.
We cannot use bp->b_caller2 because DEV_STRATEGY will overwrite it.
2003-09-10 15:48:51 +00:00
jhb
68ae42e041 Update the license on this file to be a bit more sane. 2003-09-10 01:09:32 +00:00
iedowse
2e1d99cc8a In the !MNT_BYFSID case, return EINVAL from unmount(2) when the
specified directory is not found in the mount list. Before the
MNT_BYFSID changes, unmount(2) used to return ENOENT for a nonexistent
path and EINVAL for a non-mountpoint, but we can no longer distinguish
between these cases. Of the two error codes, EINVAL was more likely
to occur in practice, and it was the only one of the two that was
documented.

Update the manual page to match the current behaviour.

Suggested by:	tjr
Reviewed by:	tjr
2003-09-08 16:23:21 +00:00
alc
81a5dc108d Use pmap_extract_and_hold() in pipe_build_write_buffer(). Consequently,
pipe_build_write_buffer() no longer requires Giant on entry.

Reviewed by:	tegge
2003-09-08 04:58:32 +00:00
tjr
29332e48cc Return EINVAL if the contested bit is not set on the umtx passed to
_umtx_unlock() instead of firing a KASSERT.
2003-09-07 11:14:52 +00:00
alc
390b07844e msync(2) should be declared MP-safe. 2003-09-07 05:42:07 +00:00
sam
23e7708b76 add fast swi taskqueue spinlock to the order_list so witness doesn't complain
Submitted by:	Tor Egge <Tor.Egge@cvsup.no.freebsd.org>
2003-09-06 21:06:08 +00:00
sam
403b1d4a6e correct fast swi taskqueue spinlock name to be different from the sleep lock
Submitted by:	Tor Egge <Tor.Egge@cvsup.no.freebsd.org>
2003-09-06 21:05:18 +00:00
alc
0c9dc7eaa4 Giant is no longer required by pipe_destroy_write_buffer(). Reduce
unnecessary white space from pipe_destroy_write_buffer().
2003-09-06 21:02:10 +00:00
sam
546fd338df "fast swi" taskqueue support.  This is a taskqueue that uses spinlocks,
making it useful for dispatching swi tasks from fast interrupt handlers.

Sponsored by:	FreeBSD Foundation
2003-09-05 23:09:22 +00:00
sam
27b68e0947 Print a message at boot for interrupt handlers created with INTR_MPSAFE
and/or INTR_FAST.  This belongs elsewhere and perhaps under bootverbose;
I'm committing it for now as it's useful to know which drivers have
been converted and which have not.
2003-09-05 22:51:18 +00:00
peter
f79f1784c9 Log involuntary context switches correctly. 2003-09-05 22:15:26 +00:00
phk
d999957bf0 Put the message about msgbuf cksum mismatch under bootverbose and tell
people what the consequence is.
2003-09-05 11:12:00 +00:00
phk
158d08d6fb Use the quality to disable timecounters for which we deem Hz too low. 2003-09-03 08:14:16 +00:00
ken
03d0445c16 Move dynamic sysctl(8) variable creation for the cd(4) and da(4) drivers
out of cdregister() and daregister(), which are run from interrupt context.

The sysctl code does blocking mallocs (M_WAITOK), which causes problems
if malloc(9) actually needs to sleep.

The eventual fix for this issue will involve moving the CAM probe process
inside a kernel thread.  For now, though, I have fixed the issue by moving
dynamic sysctl variable creation for these two drivers to a task queue
running in a kernel thread.

The existing task queues (taskqueue_swi and taskqueue_swi_giant) run in
software interrupt handlers, which wouldn't fix the problem at hand.  So I
have created a new task queue, taskqueue_thread, that runs inside a kernel
thread.  (It also runs outside of Giant -- clients must explicitly acquire
and release Giant in their taskqueue functions.)

scsi_cd.c:	Remove sysctl variable creation code from cdregister(), and
		move it to a new function, cdsysctlinit().  Queue
		cdsysctlinit() to the taskqueue_thread taskqueue once we
		have fully registered the cd(4) driver instance.

scsi_da.c:	Remove sysctl variable creation code from daregister(), and
		move it to a new function, dasysctlinit().
		Queue dasysctlinit() to the taskqueue_thread taskqueue once
		we have fully registered the da(4) instance.

taskqueue.h:	Declare the new taskqueue_thread taskqueue, update some
		comments.

subr_taskqueue.c:
		Create the new kernel thread taskqueue.  This taskqueue
		runs outside of Giant, so any functions queued to it would
		need to explicitly acquire/release Giant if they need it.

cd.4:		Update the cd(4) man page to talk about the minimum command
		size sysctl/loader tunable.  Also note that the changer
		variables are available as loader tunables as well.

da.4:		Update the da(4) man page to cover the retry_count,
		default_timeout and minimum_cmd_size sysctl variables/loader
		tunables.  Remove references to /dev/r???, they aren't used
		any longer.

cd.9:		Update the cd(9) man page to describe the CD_Q_10_BYTE_ONLY
		quirk.

taskqueue.9:	Update the taskqueue(9) man page to describe the new thread
		task queue, and the taskqueue_swi_giant queue.

MFC after:	3 days
2003-09-03 04:46:28 +00:00
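A sketch of the pattern (hypothetical driver names): initialize a task
and hand it to the new thread-backed queue, where blocking allocations
are safe:

    #include <sys/param.h>
    #include <sys/kernel.h>
    #include <sys/taskqueue.h>

    static struct task example_task;

    /*
     * Runs in the taskqueue_thread kernel thread: M_WAITOK
     * allocations may sleep safely here.  Giant is NOT held;
     * acquire and release it explicitly if the work needs it.
     */
    static void
    example_sysctl_init(void *context, int pending)
    {
        /* ... create dynamic sysctl variables, etc. ... */
    }

    static void
    example_register(void *softc)
    {
        TASK_INIT(&example_task, 0, example_sysctl_init, softc);
        taskqueue_enqueue(taskqueue_thread, &example_task);
    }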
sam
8c368dfa99 move domain list mutex initialization to earlier in the boot sequence so
statically configured modules like netgraph can call net_init_domain

Noticed by:	D.Rock@t-online.de (D. Rock)
2003-09-02 20:59:23 +00:00
silby
75c663cdc7 Implement MBUF_STRESS_TEST mark II.
Changes from the original implementation:

- Fragmentation is handled by the function m_fragment, which can
be called from wherever fragmentation is needed.  Note that this
function is wrapped in #ifdef MBUF_STRESS_TEST to discourage non-testing
use.

- m_fragment works slightly differently from the old fragmentation
code in that it allocates a separate mbuf cluster for each fragment.
This defeats dma_map_load_mbuf/buffer's feature of coalescing adjacent
fragments.  While that is a nice feature in practice, it nerfed the
usefulness of mbuf_stress_test.

- Add two modes of random fragmentation.  Chains with fragments all of
the same random length and chains with fragments that are each uniquely
random in length may now be requested.
2003-09-01 05:55:37 +00:00
sam
f13a722652 o interlock domain list when adding domains
o remove irrelevant spl

Notes:

1. We don't lock domain list traversals as this is safe until we start
   removing domains.
2. The calculation of max_datalen in net_init_domain appears safe as
   no one depends on max_hdr and max_datalen having consistent values.
3. Giant is still held for fast and slow timeouts; this must stay until
   each timeout routine is properly locked (coming soon).

Sponsored by:	FreeBSD Foundation
2003-09-01 05:01:55 +00:00
jeff
86f70ead21 - Define a new flag for getblk(): GB_NOCREAT. This flag causes getblk() to
bail out if the buffer is not already present.
 - The buffer returned by incore() is not locked and should not be sent to
   brelse().  Use getblk() with the new GB_NOCREAT flag to preserve the
   desired semantics.
2003-08-31 08:50:11 +00:00
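A sketch of the corrected lookup pattern: with GB_NOCREAT, getblk(9)
returns NULL rather than instantiating a missing buffer, and the
buffer it does return is locked, so it is safe to brelse():

    #include <sys/param.h>
    #include <sys/bio.h>
    #include <sys/buf.h>

    /*
     * Release a block only if it is already cached.  Using incore()
     * here would be wrong: its result is unlocked and must not be
     * passed to brelse().
     */
    static void
    drop_cached_block(struct vnode *vp, daddr_t blkno, int size)
    {
        struct buf *bp;

        bp = getblk(vp, blkno, size, 0, 0, GB_NOCREAT);
        if (bp != NULL)
            brelse(bp);
    }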
jeff
5e7832253c - If there is no vp assume that BKGRDINPROG is not set and set RELPBUF in
brelse().
2003-08-31 01:07:45 +00:00
jeff
0008f2bb1d - In some cases bp->b_vp can be NULL in brelse, don't try to lock the
interlock in that case.

Found by:	alc
2003-08-31 00:06:07 +00:00
alc
8b0114def1 Migrate the sf_buf allocator that is used by sendfile(2) and zero-copy
sockets into machine-dependent files.  The rationale for this
migration is illustrated by the modified amd64 allocator.  It uses the
amd64's direct map to avoid ephemeral mappings in the kernel's
address space.  On an SMP, the ephemeral mappings result in an IPI
for TLB shootdown for each transmitted page.  Yuck.

Maintainers of other 64-bit platforms with direct maps should be able
to use the amd64 allocator as a reference implementation.
2003-08-29 20:04:10 +00:00
marcel
b121bea1f9 In bufdone(), change the format specifier for m->valid and m->dirty to
a long type and explicitly cast m->valid and m->dirty to unsigned long.
When PAGE_SIZE is 32K, these fields are in fact unsigned long.
2003-08-28 19:58:11 +00:00
kan
96bae694ed Do not return with vnode interlock held.
Reviewed by:	rwatson
2003-08-28 15:48:15 +00:00
jeff
fc1a2c4016 - Move BX_BKGRDWAIT and BX_BKGRDINPROG to BV_ and the b_vflags field.
- Surround all accesses of the BKGRD{WAIT,INPROG} flags with the vnode
   interlock.
 - Don't use the B_LOCKED flag and QUEUE_LOCKED for background write
   buffers.  Check for the BKGRDINPROG flag before recycling or throwing
   away a buffer.  We do this instead because it is not safe for us to move
   the original buffer to a new queue from the callback on the background
   write buffer.
 - Remove the B_LOCKED flag and the locked buffer queue.  They are no longer
   used.
 - The vnode interlock is used around checks for BKGRDINPROG where it may
   not be strictly necessary.  If we hold the buf lock, a background
   write will not be started without our knowledge; one may only be
   completed while we're not looking.  Rather than remove the code, document
   two of the places where this extra locking is done.  A pass should be
   done to verify and minimize the locking later.
2003-08-28 06:55:18 +00:00
rwatson
c020f70195 Fix a mac_policy_list reference to be a mac_static_policy_list
reference: this fixes mac_syscall() for static policies when using
optimized locking.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-08-26 17:29:02 +00:00
davidxu
05b6d7c95a Let SA processes work under the ULE scheduler; originally this would panic the kernel.
Reviewed by: jeff
2003-08-26 11:33:15 +00:00
alc
5d6f66de90 Hold the page queues lock when performing vm_page_clear_dirty() and
vm_page_set_invalid().
2003-08-23 18:11:53 +00:00
tjr
8958714bb8 Fix a logic error in osethostid() that was introduced in rev. 1.34:
allow hostid to be set when suser() returns 0, not when it returns
an error. This would have allowed non-root users to set the host ID.
2003-08-23 15:45:57 +00:00
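In other words, the privileged path must run on success, not failure;
a sketch of the corrected logic (assuming the hostid global is in
scope; suser(9) returns 0 when the caller is privileged):

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/proc.h>

    static int
    example_sethostid(struct thread *td, long new_hostid)
    {
        int error;

        /* suser() returns 0 iff the thread may set the host ID. */
        if ((error = suser(td)) == 0)
            hostid = new_hostid;
        return (error);
    }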
marcel
0663329f6f On ia64 time_t is 64 bit. Explicitly cast tv_sec to long and change
the corresponding format specifier to %ld in a call to printf() in
function softclock(). The printf() is conditional upon DIAGNOSTIC.

Found by: LINT
2003-08-23 08:31:32 +00:00
rwatson
32ed1a62a8 Introduce two new MAC Framework and MAC policy entry points:
mac_reflect_mbuf_icmp()
  mac_reflect_mbuf_tcp()

These entry points permit MAC policies to do "update in place"
changes to the labels on ICMP and TCP mbuf headers when an ICMP or
TCP response is generated to a packet outside of the context of
an existing socket.  For example, in response to a ping, or for an RST
packet generated in reply to a SYN on a closed port.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-08-21 18:21:22 +00:00
eivind
ab2c97a462 Change description of kern.osreldate from "Operating system release date" to
"Kernel release date" - userland version is in /usr/include/osreldate.h
2003-08-21 14:47:08 +00:00
rwatson
6f522a9e52 Add mac_check_vnode_deleteextattr() and mac_check_vnode_listextattr():
explicit access control checks to delete and list extended attributes
on a vnode, rather than implicitly combining with the setextattr and
getextattr checks.  This reflects EA API changes in the kernel made
recently, including the move to explicit VOP's for both of these
operations.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-08-21 13:53:01 +00:00
rwatson
85df7c20ad Remove about 40 lines of #ifdef/#endif by using new macros
MAC_DEBUG_COUNTER_INC() and MAC_DEBUG_COUNTER_DEC() to maintain
debugging counter values rather than #ifdef'ing the atomic
operations to MAC_DEBUG.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-08-20 19:16:49 +00:00
imp
ee0d294c7e bde made a number of suggested improvements to the code. This commit
represents the purely stylistic changes and should have no net impact
on the rest of the code.

bde's more substantive changes will follow in a separate commit once
we've come to closure on them.

Submitted by: bde
2003-08-20 19:12:46 +00:00
imp
ef7e40c451 Fix an extreme edge case in leap second handling. We need to call
ntp_update_second twice when we have a large step in case that step
goes across a scheduled leap second.  The only way this could happen
would be if we didn't call tc_windup over the end of day on the day of
a leap second, which would only happen if timeouts were delayed for
seconds.  While it is an edge case, it is an important one to get
right for my employer.

Sponsored by: Timing Solutions Corporation
2003-08-20 05:34:27 +00:00
sam
59ff2ad5c7 Change instances of callout_init that specify MPSAFE behaviour to
use CALLOUT_MPSAFE instead of "1" for the second parameter.  This
does not change the behaviour; it just makes the intent more clear.
2003-08-19 17:51:11 +00:00
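That is (sketch), for a callout whose handler runs without Giant:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/callout.h>

    static struct callout example_callout;

    static void
    example_init(void)
    {
        /* Same behaviour as callout_init(&example_callout, 1),
           but the MPSAFE intent is now spelled out. */
        callout_init(&example_callout, CALLOUT_MPSAFE);
    }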
phk
ba5950210c It is not an error to have no devices in the kernel: Return the
generation number and start it from one instead of zero.
2003-08-17 12:06:19 +00:00
bmilekic
e3861386da Use the objsize variable throughout the code instead of hard-coded
constants.  This makes changing the size of an mbuf or cluster for
testing/debugging/whatever purposes easier.

Submitted by: sam
2003-08-16 19:48:52 +00:00
marcel
c1d4b42a69 Further cleanup <machine/cpu.h> and <machine/md_var.h>: move the MI
prototypes of cpu_halt(), cpu_reset() and swi_vm() from md_var.h to
cpu.h. This affects db_command.c and kern_shutdown.c.

ia64: move all MD prototypes from cpu.h to md_var.h. This affects
madt.c, interrupt.c and mp_machdep.c. Remove is_physical_memory().
It's not used (vm_machdep.c).

alpha: the MD prototypes have been left in cpu.h with a comment
that they should be there. Moving them is left for later. It was
expected that the impact would be significant enough to be done in
a separate commit.

powerpc: MD prototypes left in cpu.h. Comment added.

Suggested by: bde
Tested with: make universe (pc98 incomplete)
2003-08-16 16:57:57 +00:00
phk
34014d5261 Give timecounters a numeric quality field.
A timecounter will be selected when registered if its quality is
not negative and no less than the current timecounter's.

Add a sysctl to report all available timecounters and their qualities.

Give the dummy timecounter a solid negative quality of minus a million.

Give the i8254 zero and the ACPI 1000.

The TSC gets 800, unless APM or SMP forces it negative.

Other timecounters default to zero quality and thereby retain current
selection behaviour.
2003-08-16 08:23:53 +00:00
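A hedged sketch of how a driver of this vintage might register a
counter, including the new quality field (hypothetical hardware read;
field names per <sys/timetc.h>):

    #include <sys/param.h>
    #include <sys/timetc.h>

    extern u_int example_read_hw_counter(void);    /* hypothetical */

    static u_int
    example_get_timecount(struct timecounter *tc)
    {
        return (example_read_hw_counter());
    }

    static struct timecounter example_timecounter;

    static void
    example_tc_register(void)
    {
        example_timecounter.tc_get_timecount = example_get_timecount;
        example_timecounter.tc_counter_mask = ~0u;
        example_timecounter.tc_frequency = 1000000;    /* 1 MHz */
        example_timecounter.tc_name = "example";
        example_timecounter.tc_quality = 800;    /* cf. the TSC */
        tc_init(&example_timecounter);
    }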
jhb
837193af8e - Various style fixes in both code and comments.
- Update some stale comments.
- Sort a couple of includes.
- Only set 'newcpu' in updatepri() if we use it.
- No functional changes.

Obtained from:	bde (via an old diff I got a long time ago)
2003-08-15 21:29:06 +00:00
marcel
77c3cd3d30 Add or finish support for machine dependent ptrace requests. When we
check for permissions, do it for all requests, not just the known requests.
Later, when we actually service the request, we deal with the invalid
requests we caught earlier.

This commit changes the behaviour of the ptrace(2) interface for
boundary cases such as an unknown request without proper permissions.
Previously we would return EINVAL. Now we return EBUSY or EPERM.

Platforms need to define __HAVE_PTRACE_MACHDEP when they have MD
requests. This makes the prototype of cpu_ptrace() visible and
introduces a call to this function for all requests greater or
equal to PT_FIRSTMACH.

Silence on: audit
2003-08-15 05:25:06 +00:00
jmg
64bcd88750 if we got this far, we definitely don't have an EBADF.  Return a more
sane result of EPIPE.

Reported by:	nCircle dev team
MFC after:	3 days
2003-08-15 04:31:01 +00:00
cg
d647a00dc3 add a read-only sysctl to display the number of entries in the fixed size
kobj global method table; also kassert that the table has not overflowed
when defining a new method.

there are indications that the table is being overflowed in certain
situations as we gain more kobj consumers- this will allow us to check
whether kobj is at fault.  symptoms would be incorrect methods being called.
2003-08-14 21:16:46 +00:00
grehan
e6a3a6744e Update powerpc to use the (old thread,new thread) calling convention
for cpu_throw() and cpu_switch().
2003-08-14 03:56:24 +00:00
alc
7a81ace60d - The vm_object pointer in pipe_buffer is unused. Remove it.
- Check for successful initialization of pipe_zone in pipeinit()
   rather than every call to pipe(2).
2003-08-13 20:01:38 +00:00
imp
3bc162cfa3 Expand inline the relevant parts of src/COPYRIGHT for Matt Dillon's
copyrighted files.

Approved by: Matt Dillon
2003-08-12 23:24:05 +00:00
mux
43629d3ba9 Remove extra space. 2003-08-12 20:34:31 +00:00
jhb
1c016824f1 - Convert Alpha over to the new calling conventions for cpu_throw() and
cpu_switch() where both the old and new threads are passed in as
  arguments.  Only powerpc uses the old conventions now.
- Update comments in the Alpha swtch.s to reflect KSE changes.

Tested by:	obrien, marcel
2003-08-12 19:33:36 +00:00
alc
23ea8b5c7a Pipespace() no longer requires Giant. 2003-08-11 22:23:25 +00:00
kan
91297961f6 Drop Giant in recvit before returning an error to the caller to avoid
leaking Giant on syscall exit.
2003-08-11 19:37:11 +00:00
bms
44aa51e3ae Add the mlockall() and munlockall() system calls.
- All those diffs to syscalls.master for each architecture *are*
   necessary. This needed clarification; the stub code generation for
   mlockall() was disabled, which would prevent applications from
   linking to this API (suggested by mux)
 - Giant has been quashed. It is no longer held by the code, as
   the required locking has been pushed down within vm_map.c.
 - Callers must specify VM_MAP_WIRE_HOLESOK or VM_MAP_WIRE_NOHOLES
   to express their intention explicitly.
 - Inspected at the vmstat, top and vm pager sysctl stats level.
   Paging-in activity is occurring correctly, using a test harness.
 - The RES size for a process may appear to be greater than its SIZE.
   This is believed to be due to mappings of the same shared library
   page being wired twice. Further exploration is needed.
 - Believed to back out of allocations and locks correctly
   (tested with WITNESS, MUTEX_PROFILING, INVARIANTS and DIAGNOSTIC).

PR:             kern/43426, standards/54223
Reviewed by:    jake, alc
Approved by:    jake (mentor)
MFC after:	2 weeks
2003-08-11 07:14:08 +00:00
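From userland the new calls look like this (sketch; wiring may fail
with ENOMEM or EPERM depending on resource limits and privilege):

    #include <sys/mman.h>
    #include <err.h>

    /* Wire all current and future mappings, then release them. */
    static void
    wire_self(void)
    {
        if (mlockall(MCL_CURRENT | MCL_FUTURE) == -1)
            err(1, "mlockall");
        /* ... latency-sensitive work with no page faults ... */
        if (munlockall() == -1)
            err(1, "munlockall");
    }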
silby
bd71f7b671 More pipe changes:
From alc:
Move pageable pipe memory to a separate kernel submap to avoid awkward
vm map interlocking issues.  (Bad explanation provided by me.)

From me:
Rework pipespace accounting code to handle this new layout, and adjust
our default values to account for the fact that we now have a solid
limit on allocations.

Also, remove the "maxpipes" limit, as it no longer has a purpose.
(The limit on kva usage solves the problem of having too many pipes.)
2003-08-11 05:51:51 +00:00
alc
1625d6386b Use vm_page_hold() instead of vm_page_wire(). Otherwise, a multithreaded
application could cause a wired page to be freed.  In general,
vm_page_hold() should be preferred for ephemeral kernel mappings of pages
borrowed from a user-level address space.  (vm_page_wire() should really be
reserved for indefinite duration pinning by the "owner" of the page.)

Discussed with:	silby
Submitted by:	tegge
2003-08-11 00:17:44 +00:00
nectar
78ff87db8b panic() if we try to handle an out-of-range signal number in
psignal()/tdsignal().  The test was historically in psignal().  It was
changed into a KASSERT, and then later moved to tdsignal() when the
latter was introduced.

Reviewed by:	iedowse, jhb
2003-08-10 23:05:37 +00:00
nectar
f5b9f87e77 Add or correct range checking of signal numbers in system calls and
ioctls.

In the particular case of ptrace(), this commit more-or-less reverts
revision 1.53 of sys_process.c, which appears to have been erroneous.

Reviewed by:	iedowse, jhb
2003-08-10 23:04:55 +00:00
alc
c37c941215 Background: When proc_rwmem() wired and mapped a page, it also added
a reference to the containing object.  The purpose of the reference
being to prevent the destruction of the object and an attempt to free
the wired page.  (Wired pages can't be freed.)  Unfortunately, this
approach does not work.  Some operations, like fork(2) that call
vm_object_split(), can move the wired page to a different object,
thereby making the reference pointless and opening the possibility
of the wired page being freed.

A solution is to use vm_page_hold() in place of vm_page_wire().  Held
pages can be freed.  They are moved to a special hold queue until the
hold is released.

Submitted by:	tegge
2003-08-09 18:01:19 +00:00
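A hedged sketch of the hold pattern (locking per the VM interfaces of
this period, where the page queues lock covers the hold count):

    #include <sys/param.h>
    #include <vm/vm.h>
    #include <vm/vm_page.h>

    /*
     * Pin a borrowed user page across a short-lived kernel access.
     * Unlike a wired page, a held page may still be freed; it is
     * parked on the hold queue until the hold is dropped.
     */
    static void
    access_user_page(vm_page_t m)
    {
        vm_page_lock_queues();
        vm_page_hold(m);
        vm_page_unlock_queues();

        /* ... ephemeral mapping and copy ... */

        vm_page_lock_queues();
        vm_page_unhold(m);
        vm_page_unlock_queues();
    }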
alc
f5d5533b42 - Remove GIANT_REQUIRED from pipespace().
- Remove a duplicate initialization from pipe_create().
2003-08-08 22:38:15 +00:00
deischen
547619d0d3 Copyin the thread mailbox flags from the correct location
in the mailbox.
2003-08-08 20:23:10 +00:00
jhb
af302d132f td_dupfd just needs to be less than 0, it does not have to hold the
negative value of the index of the new file, so just use -1.
2003-08-07 17:08:26 +00:00
nectar
df9de6c5cd Update some argument-documenting comments to match reality.
Add an explicit range check to those same arguments to reduce risk of
cardiac arrest in future code readers.
2003-08-07 16:42:27 +00:00
jhb
37641f86f1 Consistently use the BSD u_int and u_short instead of the SYSV uint and
ushort.  In most of these files, there was a mixture of both styles and
this change just makes them self-consistent.

Requested by:	bde (kern_ktrace.c)
2003-08-07 15:04:27 +00:00
jhb
12f44bde5d The ktrace mutex does not need to be locked around the post of the ktrace
semaphore and doing so can lead to a possible reversal.  WITNESS would have
caught this if semaphores were used more often in the kernel.

Submitted by:	Ted Unangst <tedu@stanford.edu>, Dawson Engler
2003-08-07 13:58:13 +00:00
alc
6178e0ad16 - Remove GIANT_REQUIRED from pipe_free_kmem().
 - Remove the acquisition and release of Giant around pipe_free_kmem() and
   uma_zfree() in pipeclose().
2003-08-07 04:32:40 +00:00
yar
65e4901760 If connect(2) has been interrupted by a signal and therefore the
connection is to be established asynchronously, behave as in the
case of non-blocking mode:

- keep the SS_ISCONNECTING bit set thus indicating that
  the connection establishment is in progress, which is the case
  (clearing the bit in this case was just a bug);

- return EALREADY, instead of the confusing and unreasonable
  EADDRINUSE, upon further connect(2) attempts on this socket
  until the connection is established (this also brings our
  connect(2) into accord with IEEE Std 1003.1.)
2003-08-06 14:04:47 +00:00
davidxu
69df6d1c3b kse.h is not needed for these files. 2003-08-05 12:08:49 +00:00
davidxu
93e075cf7a Introduce a thread mailbox flag TMF_NOUPCALL. On some architectures other
than i386 or AMD64, the TP register points to the thread mailbox, and they
cannot atomically clear km_curthread in the kse mailbox. In this case, a
thread retrieves its thread pointer from the TP register and sets the flag
TMF_NOUPCALL in its thread mailbox to indicate a critical region.
2003-08-05 12:00:55 +00:00
hsu
fb82c18f66 Make the second argument to sooptcopyout() constant in order to
simplify the upcoming PIM patches.

Submitted by:   Pavlin Radoslavov <pavlin@icir.org>
2003-08-05 00:27:54 +00:00