Commit Graph

6782 Commits

Author SHA1 Message Date
John Baldwin
861a7db56f Fix a typo in a comment.
Submitted by:	das
2003-11-12 14:55:45 +00:00
Poul-Henning Kamp
1415a09d42 Replace B_PHYS conditional assignment to bio_offset with KASSERT check
to see that the originating code already did it right.
2003-11-12 10:27:06 +00:00
Kirk McKusick
1977597b34 Update the five files derived from /sys/kern/syscalls.master
after the additions made for the new statfs structure (version
1.157). These must be updated in a separate checkin after
syscalls.master has been checked in so that they reflect its
new CVS identity. As these are purely derived files, it is not
clear to me why they are under CVS at all. I presume that it has
something to do with having `make world' operate properly.
2003-11-12 08:09:19 +00:00
Kirk McKusick
fde81c7d8e Update the statfs structure with 64-bit fields to allow
accurate reporting of multi-terabyte filesystem sizes.

You should build and boot a new kernel BEFORE doing a `make world'
as the new kernel will know about binaries using the old statfs
structure, but an old kernel will not know about the new system
calls that support the new statfs structure. Running an old kernel
after a `make world' will cause programs such as `df' that do a
statfs system call to fail with a bad system call.

Reviewed by:	Bruce Evans <bde@zeta.org.au>
Reviewed by:	Tim Robbins <tjr@freebsd.org>
Reviewed by:	Julian Elischer <julian@elischer.org>
Reviewed by:	the hoards of <arch@freebsd.org>
Sponsored by:   DARPA & NAI Labs.
2003-11-12 08:01:40 +00:00
Robert Watson
eca8a663d4 Modify the MAC Framework so that instead of embedding a (struct label)
in various kernel objects to represent security data, we embed a
(struct label *) pointer, which now references labels allocated using
a UMA zone (mac_label.c).  This allows the size and shape of struct
label to be varied without changing the size and shape of these kernel
objects, which become part of the frozen ABI with 5-STABLE.  This opens
the door for boot-time selection of the number of label slots, and hence
changes to the bound on the number of simultaneous labeled policies
at boot-time instead of compile-time.  This also makes it easier to
embed label references in new objects as required for locking/caching
with fine-grained network stack locking, such as inpcb structures.

This change also moves us further in the direction of hiding the
structure of kernel objects from MAC policy modules, not to mention
dramatically reducing the number of '&' symbols appearing in both the
MAC Framework and MAC policy modules, and improving readability.

While this results in minimal performance change with MAC enabled, it
will observably shrink the size of a number of critical kernel data
structures for the !MAC case, and should have a small (but measurable)
performance benefit (i.e., struct vnode, struct socket) do to memory
conservation and reduced cost of zeroing memory.

NOTE: Users of MAC must recompile their kernel and all MAC modules as a
result of this change.  Because this is an API change, third party
MAC modules will also need to be updated to make less use of the '&'
symbol.

Suggestions from:	bmilekic
Obtained from:		TrustedBSD Project
Sponsored by:		DARPA, Network Associates Laboratories
2003-11-12 03:14:31 +00:00
Alexander Kabaev
5c957adbf1 1. Consolidate mount struct allocation/destruction into a common code in
vfs_mount_alloc/vfs_mount_destroy functions and take care to completely
destroy the mount point along with its locks. Mount struct has grown in
coplexity recently and depending on each failure path to destroy it
completely isn't working anymore.

2. Eliminate largely identical vfs_mount and vfs_unmount question by
moving the code to handle both cases into a newly introduced vfs_domount
function.

3. Simplify nfs_mount_diskless to always expect an allocated mount
struct and never attempt an allocation/destruction itself. The
vfs_allocroot allocation was there to support 'magic' swap space
configuration for diskless clients that was already removed by PHK some
time ago.

4. Include a vfs_buildopts cleanups by Peter Edwards to validate the
sanity of nmount parameters passed from userland.

Submitted by:  (4) Peter Edwards <peter.edwards@openet-telecom.com>
Reviewed by:    rwatson
2003-11-12 02:54:47 +00:00
John Baldwin
961a7b244d Add an implementation of turnstiles and change the sleep mutex code to use
turnstiles to implement blocking isntead of implementing a thread queue
directly.  These turnstiles are somewhat similar to those used in Solaris 7
as described in Solaris Internals but are also different.

Turnstiles do not come out of a fixed-sized pool.  Rather, each thread is
assigned a turnstile when it is created that it frees when it is destroyed.
When a thread blocks on a lock, it donates its turnstile to that lock to
serve as queue of blocked threads.  The queue associated with a given lock
is found by a lookup in a simple hash table.  The turnstile itself is
protected by a lock associated with its entry in the hash table.  This
means that sched_lock is no longer needed to contest on a mutex.  Instead,
sched_lock is only used when manipulating run queues or thread priorities.
Turnstiles also implement priority propagation inherently.

Currently turnstiles only support mutexes.  Eventually, however, turnstiles
may grow two queue's to support a non-sleepable reader/writer lock
implementation.  For more details, see the comments in sys/turnstile.h and
kern/subr_turnstile.c.

The two primary advantages from the turnstile code include: 1) the size
of struct mutex shrinks by four pointers as it no longer stores the
thread queue linkages directly, and 2) less contention on sched_lock in
SMP systems including the ability for multiple CPUs to contend on different
locks simultaneously (not that this last detail is necessarily that much of
a big win).  Note that 1) means that this commit is a kernel ABI breaker,
so don't mix old modules with a new kernel and vice versa.

Tested on:	i386 SMP, sparc64 SMP, alpha SMP
2003-11-11 22:07:29 +00:00
Joseph Koshy
a5896914f0 Bound the number of iterations a thread can perform inside
ktr_resize_pool(); this eliminates a potential livelock.

Return ENOSPC only if we encountered an out-of-memory condition when
trying to increase the pool size.

Reviewed by:	jhb, bde (style)
2003-11-11 09:09:26 +00:00
Joseph Koshy
b10221ffd9 Have utrace(2) return ENOMEM if malloc() fails. Document this error
return in its manual page.

Reviewed by:	jhb
2003-11-11 04:54:11 +00:00
Alan Cox
e35e0182c3 - Revision 1.469 of vfs_subr.c resulted in the buf's b_object field being
consistency initialized.  Consequently, a number of conditionals that
   checked the validity of b_object before passing it to VM_OBJECT_LOCK()
   and VM_OBJECT_UNLOCK() are no longer needed.
2003-11-11 04:45:37 +00:00
Robert Watson
c8e7bf92ad Whitespace sync to MAC branch, expand comment at the head of the file. 2003-11-11 03:40:04 +00:00
Alfred Perlstein
cd3c61b93d Fix a bug where the taskqueue kproc was being parented by init
because RFNOWAIT was being passed to kproc_create.

The result was that shutdown took quite a bit longer because this
errant "child" would not respond to termination signals from init
at system shutdown.

RFNOWAIT dissassociates itself from the caller by attaching to init
as a parent proc.  We could have had the taskqueue proc listen for
SIGKILL, but being able to SIGKILL a potentially critical system
process doesn't seem like a good idea.
2003-11-10 20:39:44 +00:00
Tim J. Robbins
541c3b66b5 When there are no free sem_undo structs available in semu_alloc(), only
free one sem_undo with un_cnt == 0 instead of all of them. This is a
temporary workaround until the SLIST_FOREACH_PREVPTR loop gets fixed so
that it doesn't cause cycles in semu_list when removing multiple adjacent
items. It might be easier to just use (doubly-linked) LISTs here instead
of complicated SLIST code to achieve O(1) removals.

This bug manifested itself as a complete lockup under heavy semaphore use
by multiple processes with the SEM_UNDO flag set.

PR:		58984
2003-11-10 07:22:41 +00:00
Marcel Moolenaar
fcaa2925a9 Change the clear_ret argument of get_mcontext() to be a flags argument.
Since all callers either passed 0 or 1 for clear_ret, define bit 0 in
the flags for use as clear_ret. Reserve bits 1, 2 and 3 for use by MI
code for possible (but unlikely) future use. The remaining bits are for
use by MD code.

This change is triggered by a need on ia64 to have another knob for
get_mcontext().
2003-11-09 20:31:04 +00:00
Bruce Evans
b698380f33 Quick fix for scaling of statclock ticks in the SMP case. As explained
in the log message for kern_sched.c 1.83 (which should have been
repo-copied to preserve history for this file), the (4BSD) scheduler
algorithm only works right if stathz is nearly 128 Hz.  The old
commit lock said 64 Hz; the scheduler actually wants nearly 16 Hz
but there was a scale factor of 4 to give the requirement of 64 Hz,
and rev.1.83 changed the scale factor so that the requirement became
128 Hz.  The change of the scale factor was incomplete in the SMP
case.  Then scheduling ticks are provided by smp_ncpu CPUs, and the
scheduler cannot tell the difference between this and 1 CPU providing
scheduling ticks smp_ncpu times faster, so we need another scale
factor of smp_ncp or an algorithm change.

This quick fix uses the scale factor without even trying to optimize
the runtime divisions required for this as is done for the other
scale factor.

The main algorithmic problem is the clamp on the scheduling tick counts.
This was 295; it is now approximately 295 * smp_ncpu.  When the limit
is reached, threads get free timeslices and scheduling becomes very
unfair to the threads that don't hit the limit.  The limit can be
reached and maintained in the worst case if the load average is larger
than (limit / effective_stathz - 1) / 2 = 0.65 now (was just 0.08 with
2 CPUs before this change), so there are algorithmic problems even for
a load average of 1.  Fortunately, the worst case isn't common enough
for the problem to be very noticeable (it is mainly for niced CPU hogs
competing with less nice CPU hogs).
2003-11-09 13:45:54 +00:00
Seigo Tanimura
512824f8f7 - Implement selwakeuppri() which allows raising the priority of a
thread being waken up.  The thread waken up can run at a priority as
  high as after tsleep().

- Replace selwakeup()s with selwakeuppri()s and pass appropriate
  priorities.

- Add cv_broadcastpri() which raises the priority of the broadcast
  threads.  Used by selwakeuppri() if collision occurs.

Not objected in:	-arch, -current
2003-11-09 09:17:26 +00:00
Sam Leffler
7902224c6b o add a flags parameter to netisr_register that is used to specify
whether or not the isr needs to hold Giant when running; Giant-less
  operation is also controlled by the setting of debug_mpsafenet
o mark all netisr's except NETISR_IP as needing Giant
o add a GIANT_REQUIRED assertion to the top of netisr's that need Giant
o pickup Giant (when debug_mpsafenet is 1) inside ip_input before
  calling up with a packet
o change netisr handling so swi_net runs w/o Giant; instead we grab
  Giant before invoking handlers based on whether the handler needs Giant
o change netisr handling so that netisr's that are marked MPSAFE may
  have multiple instances active at a time
o add netisr statistics for packets dropped because the isr is inactive

Supported by:	FreeBSD Foundation
2003-11-08 22:28:40 +00:00
David Xu
685a6c448a Return a reasonable number for top or ps to display for M:N thread,
since there is no direct association between M:N thread and kse,
sometimes, a thread does not have a kse, in that case, return a pctcpu
from its last kse, it is not perfect, but gives a good number to be
displayed.
2003-11-08 03:03:17 +00:00
John Baldwin
dac33f12cc Regen. 2003-11-07 20:30:30 +00:00
John Baldwin
c055e5d412 Mark ptrace(), ktrace(), utrace(), sysarch(), and issetugid() as MP safe.
The parts of these calls that are not yet MP safe acquire Giant explicitly.
2003-11-07 20:23:23 +00:00
Robert Watson
a2f88a8b7c Slight whitespace consistency improvement:
Trim trailing whitespace.
  Remove unmatched " " before ")".
2003-11-07 04:47:14 +00:00
Jeff Roberson
f28b3340c1 - Somehow I botched my last commit. Add an extra ( to fix things up. I'm
still not sure how this happened.

Reported by:	ps
2003-11-06 07:56:01 +00:00
Alan Cox
3b2c54e7bc - Delay the allocation of memory for the pipe mutex until we need it.
This avoids the need to free said memory in various error cases along
   the way.
2003-11-06 05:58:26 +00:00
Alan Cox
fc17df5264 - Simplify pipespace() by eliminating the explicit creation of vm objects.
Instead, let the vm objects be lazily instantiated at fault time.  This
   results in the allocation of fewer vm objects and vm map entries due to
   aggregation in the vm system.
2003-11-06 05:08:12 +00:00
Robert Watson
83b7b0edca Remove the flags argument from mac_externalize_*_label(), as it's not
passed into policies or used internally to the MAC Framework.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-11-06 03:42:43 +00:00
Jeff Roberson
a70d729bff - Remove the local definition of sched_pin and unpin. They are provided in
sched.h now.
 - Respect the td pin count.
2003-11-06 03:09:51 +00:00
Sam Leffler
d3be1471c7 o make debug_mpsafenet globally visible
o move it from subr_bus.c to netisr.c where it more properly belongs
o add NET_PICKUP_GIANT and NET_DROP_GIANT macros that will be used to
  grab Giant as needed when MPSAFE operation is enabled

Supported by:	FreeBSD Foundation
2003-11-05 23:42:51 +00:00
Warner Losh
252af39a96 Minor style(9) nit 2003-11-05 06:14:48 +00:00
Jeff Roberson
46f8b26550 - It's ok if sched_runnable() has races in it, we don't need the sched_lock
here unless we have something on the assigned queue.
2003-11-05 05:30:12 +00:00
Alexander Kabaev
ca430f2e92 Remove mntvnode_mtx and replace it with per-mountpoint mutex.
Introduce two new macros MNT_ILOCK(mp)/MNT_IUNLOCK(mp) to
operate on this mutex transparently.

Eventually new mutex will be protecting more fields in
struct mount, not only vnode list.

Discussed with: jeff
2003-11-05 04:30:08 +00:00
Max Khon
2332251c6a Back out the following revisions:
1.36      +73 -60    src/sys/compat/linux/linux_ipc.c
1.83      +102 -48   src/sys/kern/sysv_shm.c
1.8       +4 -0      src/sys/sys/syscallsubr.h

That change was intended to support vmware3, but
wantrem parameter is useless because vmware3 uses SYSV shared memory
to talk with X server and X server is native application.
The patch worked because check for wantrem was not valid
(wantrem and SHMSEG_REMOVED was never checked for SHMSEG_ALLOCATED segments).

Add kern.ipc.shm_allow_removed (integer, rw) sysctl (default 0) which when set
to 1 allows to return removed segments in
shm_find_segment_by_shmid() and shm_find_segment_by_shmidx().

MFC after:	1 week
2003-11-05 01:53:10 +00:00
Kirk McKusick
b932dd9b28 Get rid of DIAGNOSTIC that gives false positives on slow CPUs. 2003-11-04 08:03:11 +00:00
Jeff Roberson
9bacd788a1 - Add initial support for pinning and binding. 2003-11-04 07:45:41 +00:00
Kirk McKusick
15a93fcc31 Allow the bufdaemon and update daemon processes to skip the
waitrunningbufspace() calls so that they are always able to
proceed and clean up buffer space.

Submitted by:	Brian Fundakowski Feldman <green@freebsd.org>
2003-11-04 06:30:00 +00:00
Sam Leffler
3465702f13 disable MPSAFE network drivers; we aren't ready yet` 2003-11-04 02:01:42 +00:00
Olivier Houchard
7922cdc855 I believe kbyanc@ really meant this in rev 1.58.
Use zpfind() to see if the process became a zombie if pfind() doesn't find it
and if the caller wants to know about process death, so that the caller knows
the process died even if it happened before the kevent was actually registered.

MFC after:	1 week
2003-11-04 01:41:47 +00:00
Olivier Houchard
f44004690c Do not attempt to report proc event if NOTE_EXIT has already been received.
This fixes a race condition (specifically with signal events) that could
lead to the kn being re-inserted into the list after it has been destroyed,
which is not something we want to happen.

PR:		kern/58258
2003-11-04 01:14:58 +00:00
John Baldwin
8bc0846476 Don't require INTR_FAST handlers to be exclusive in the MI layer. Instead,
let the MD code choose whether or not to implement such a policy.  The new
i386 interrupt code allows multiple FAST handlers for a given source for
example.  However, the code does not allow FAST and non-FAST handlers to be
mixed.
2003-11-03 22:42:58 +00:00
John Baldwin
b95bb3e62b Update spin lock order list for new i386 interrupt and SMP code. 2003-11-03 22:38:30 +00:00
Robert Watson
730ecf8254 Unlock pipe mutex when failing MAC pipe ioctl access control check.
Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-11-03 17:58:23 +00:00
Jeff Roberson
112b6d3aa9 - Remove kseq_find(), we no longer scan other cpu's run queues when we go
idle.  They figure out that we're idle fast enough that the cache pollution
   introduces by scanning their run queue is more expensive than waiting
   a little longer.
 - Add kseq_setidle() to mark us as being idle.  Use this in place of
   kseq_find().
 - Remove kseq_load_highest(), kseq_find() was the only consumer of this
   interface.  kseq_balance() has it's own customized version that finds the
   lowest and highest loads simultaneously.

Continuously told that this would be faster by:	terry
2003-11-03 03:27:22 +00:00
Jeff Roberson
ef1134c9ad - Remove the ksq_loads[] array. We are only interested in three counts,
the total load, the timeshare load, and the number of threads that can
   be migrated to another cpu.  Account for these seperately.
 - Introduce a KSE_CAN_MIGRATE() macro which determines whether or not a KSE
   can be migrated to another CPU.  Currently, this only checks to see if
   we're an interrupt handler.  Eventually this will also be used to support
   CPU binding.
2003-11-02 10:56:48 +00:00
Alexander Kabaev
cb9ddc80ae Take care not to call vput if thread used in corresponding vget
wasn't curthread, i.e. when we receive a thread pointer to use
as a function argument. Use VOP_UNLOCK/vrele in these cases.

The only case there td != curthread known at the moment is
boot() calling sync with thread0 pointer.

This fixes the panic on shutdown people have reported.
2003-11-02 04:52:53 +00:00
Jeff Roberson
769a363537 - In sched_prio() only force us onto the current queue if our priority is
being elevated (numerically smaller).
2003-11-02 04:25:59 +00:00
Jeff Roberson
7d1a81b4dc - Rename SCHED_PRI_NTHRESH to SCHED_SLICE_NTHRESH since it is only used in
slice assignment.  Add a comment describing what it does.
 - Remove a stale XXX comment, the nice should not impact the interactivity,
   nice adjustments only effect non-interactive tasks in ULE.
 - Don't allow nice -20 tasks to totally starve nice 0 tasks.  Give them at
   least SCHED_SLICE_MIN ticks.  We still allow nice 0 tasks to starve nice
   +20 tasks as intended.
2003-11-02 04:10:15 +00:00
Jeff Roberson
a0a931cec7 - Remove uses of PRIO_TOTAL and replace them with SCHED_PRI_NRESV
- SCHED_PRI_NRESV does not have the off by one error in PRIO_TOTAL so we
   do not have to account for it in the few places that we use it.

Requested by:	bde
2003-11-02 03:49:32 +00:00
Jeff Roberson
d322132c62 - Change sched_interact_update() to only accept slp+runtime values between
0 and SCHED_SLP_RUN_MAX * 2.  This allows us to simplify the algorithm
   quite a bit.  Before, it dealt with arbitrary values which required us
   to do nasty integer division tricks that didn't quite work out correctly.
 - Chnage sched_wakeup() to detect conditions where the slp+runtime could
   exceed SCHED_SLP_RUN_MAX * 2.  This can happen if we go to sleep for
   longer than 6 seconds.  In this case, we'll just clear the runtime and
   set the sleep time to the max.
 - Define a new function, sched_interact_fork() which updates the slp+runtime
   of a newly forked thread.  We want to limit the amount of history retained
   from the parent so that we learn the child's behavior quickly.  We don't,
   however want to decay it to nothing.  Previously, we would simply divide
   each parameter by 100 whenever we forked.  After a few forks the values
   would reach 0 and tasks would not be considered interactive.
 - Add another KTR entry, cleanup some existing entries.
 - Remove a useless sched_interact_update() from sched_priority().  This is
   already done by the callers that require it.
2003-11-02 03:36:33 +00:00
Alexander Kabaev
492c1e68fb Temporarily undo parts of the stuct mount locking commit by jeff.
It is unsafe to hold a mutex across vput/vrele calls.

This will be redone when a better locking strategy is agreed upon.

Discussed with: jeff
2003-11-01 05:51:54 +00:00
Jeff Roberson
22bf7d9a0e - Add static to local functions and data where it was missing.
- Add an IPI based mechanism for migrating kses.  This mechanism is
   broken down into several components.  This is intended to reduce cache
   thrashing by eliminating most cases where one cpu touches another's
   run queues.
 - kseq_notify() appends a kse to a lockless singly linked list and
   conditionally sends an IPI to the target processor.  Right now this is
   protected by sched_lock but at some point I'd like to get rid of the
   global lock.  This is why I used something more complicated than a
   standard queue.
 - kseq_assign() processes our list of kses that have been assigned to us
   by other processors.  This simply calls sched_add() for each item on the
   list after clearing the new KEF_ASSIGNED flag.  This flag is used to
   indicate that we have been appeneded to the assigned queue but not
   added to the run queue yet.
 - In sched_add(), instead of adding a KSE to another processor's queue we
   use kse_notify() so that we don't touch their queue.  Also in sched_add(),
   if KEF_ASSIGNED is already set return immediately.  This can happen if
   a thread is removed and readded so that the priority is recorded properly.
 - In sched_rem() return immediately if KEF_ASSIGNED is set.  All callers
   immediately readd simply to adjust priorites etc.
 - In sched_choose(), if we're running an IDLE task or the per cpu idle thread
   set our cpumask bit in 'kseq_idle' so that other processors may know that
   we are idle.  Before this, make a single pass through the run queues of
   other processors so that we may find work more immediately if it is
   available.
 - In sched_runnable(), don't scan each processor's run queue, they will IPI
   us if they have work for us to do.
 - In sched_add(), if we're adding a thread that can be migrated and we have
   plenty of work to do, try to migrate the thread to an idle kseq.
 - Simplify the logic in sched_prio() and take the KEF_ASSIGNED flag into
   consideration.
 - No longer use kseq_choose() to steal threads, it can lose it's last
   argument.
 - Create a new function runq_steal() which operates like runq_choose() but
   skips threads based on some criteria.  Currently it will not steal
   PRI_ITHD threads.  In the future this will be used for CPU binding.
 - Create a kseq_steal() that checks each run queue with runq_steal(), use
   kseq_steal() in the places where we used kseq_choose() to steal with
   before.
2003-10-31 11:16:04 +00:00
John Baldwin
e57ea233d9 Ensure that mp_ncpus is set to 1 if mp_cpu_probe() fails. 2003-10-30 21:44:01 +00:00