function back to near the beginning of the file. Rev.1.194 moved it into
the middle of auxiliary functions following kern_execve(). Moved the
__mac_execve() syscall function up together with execve(). It was new in
rev1.1.196 and perfectly misplaced after execve().
sched_cpu() locks an sx lock (allproc_lock) which can sleep if it fails to
acquire the lock, it is not safe to execute this in a callout handler from
softclock().
use it, if we ever did. They have been been VERY poorly maintained for
some time, possibly because they were a NOP. FWIW, This brings our table
formats back closer to the other *BSD's.
cpu could have been bogged down with non-transferable load and still not
migrated a new thread to an idle cpu. This required some benchmarking and
tuning to get right as the comment above it suggests.
- In sched_add(), do the idle check prior to the transfer check so that we
don't try to transfer load from an idle cpu. This fixes panics caused by
IPIs on UP machines running SMP kernels.
Reported/Debugged by: seanc
reassigning their v_ops field to specfs, detaching from the mountpoint, etc.
However, this is not sufficient. If we vclean() the vnode the pages owned
by the vnode are lost, potentially while buffers reference them. Implement
parts of vclean() seperately in vgonechrl() so that the pages and bufs
associated with a device vnode are not destroyed while in use.
- The new sched_balance_groups() function does intra-group balancing while
sched_balance() balances the available groups.
- Pick a random time between 0 ticks and hz * 2 ticks to restart each
balancing process. Each balancer has its own timeout.
- Pick a random place in the list of groups to start the search for lowest
and highest group loads. This prevents us from prefering a group based on
numeric position.
- Use a nasty hack to stop us from preferring cpu 0. The problem is that
softclock always runs on cpu 0, so it always has a little extra load. We
ignore this load in the balancer for now. In the future softclock should
run on a random cpu and these hacks can go away.
cpu are added to a group.
- Don't place a cpu into the kseq_idle bitmask until all cpus in that group
have idled.
- Prefer idle groups over idle group members in the new kseq_transfer()
function. In this way we will prefer to balance load across full cores
rather than add further load a partial core.
- Before a cpu goes idle, check the other group members for threads. Since
SMT cpus may freely share threads, this is cheap.
- SMT cores may be individually pinned and bound to now. This contrasts the
old mechanism where binding or pinning would have allowed a thread to run
on any available cpu.
- Remove some unnecessary logic from sched_switch(). Priority propagation
should be properly taken care of in sched_prio() now.
turnstile_unpend(). A racing thread that does not have TDI_LOCK set may
either be running on another CPU or it may be sitting on a run queue if it
was preempted during the very small window in turnstile_wait() between
unlocking the turnstile chain lock and locking sched_lock.
case of a turnstile having no threads is just one instance of the more
general case where the thread we are examining has been partially awakened
already in that it has been removed from the turnstile's blocked list but
still has TDI_LOCK set. We detect that case by checking to see if the
thread has already had a turnstile reassigned to it.
functions less noisy: We printf if a new function took longer than
the previous record holder, or of the previous record holder took
more than twice as long as the current record.
to have the kernel switch to a new thread, instead of doing it in
userland. It is in fact needed on ia64 where syscall restarts do not
return to userland first. It's completely handled inside the kernel.
As such, any context created by the kernel as part of an upcall and
caused by some syscall needs to be restored by the kernel.
Be sure to shift (long)1 << 33 and higher, not (int)1. Otherwise bad
things happen(TM). This is why beast.freebsd.org paniced with ULE.
Reviewed by: jeff
and the mpo_create_cred() MAC policy entry point to
mpo_copy_cred_label(). This is more consistent with similar entry
points for creation and label copying, as mac_create_cred() was
called from crdup() as opposed to during process creation. For
a number of policies, this removes the requirement for special
handling when copying credential labels, and improves consistency.
Approved by: re (scottl)
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
1) mp_maxid is a valid FreeBSD CPU ID in the range 0 .. MAXCPU - 1.
2) For all active CPUs in the system, PCPU_GET(cpuid) <= mp_maxid.
Approved by: re (scottl)
Tested on: i386, amd64, alpha
aid other kernel code, especially code which can be in a module such as
the acpi_cpu(4) driver, to work properly with both SMP and UP kernels.
The exported symbols include mp_ncpus, all_cpus, mp_maxid, smp_started, and
the smp_rendezvous() function. This also means that CPU_ABSENT() is now
always implemented the same on all kernels.
Approved by: re (scottl)
to sendfile(2) being erroneously automatically restarted after a signal
is delivered. Fixed by converting ERESTART to EINTR prior to exiting.
Updated manual page to indicate the potential EINTR error, its cause
and consequences.
Approved by: re@freebsd.org
forced unmount case. Otherwise, a file system that is referenced
only by process fd_cdir/fd_rdir references to the file system root
vnode will be successfully unmounted without the MNT_FORCE flag.
The previous behaviour was not compatible with the unmount semantics
required by amd(8), so file systems could be unexpectedly unmounted
while there were still references to the file system root directory.
Reported by: Erez Zadok <ezk@cs.sunysb.edu>
Approved by: re (scottl)
very early (SI_SUB_TUNABLES - 1) and is responsible for setting mp_maxid.
cpu_mp_probe() is now called at SI_SUB_CPU and determines if SMP is
actually present and sets mp_ncpus and all_cpus. Splitting these up
allows an architecture to probe CPUs later than SI_SUB_TUNABLES by just
setting mp_maxid to MAXCPU in cpu_mp_setmaxid(). This could allow the
CPU probing code to live in a module, for example, since modules
sysinit's in modules cannot be invoked prior to SI_SUB_KLD. This is
needed to re-enable the ACPI module on i386.
- For the alpha SMP probing code, use LOCATE_PCS() instead of duplicating
its contents in a few places. Also, add a smp_cpu_enabled() function
to avoid duplicating some code. There is room for further code
reduction later since much of this code is also present in cpu_mp_start().
- All archs besides i386 still set mp_maxid to the same values they set it
to before this change. i386 now sets mp_maxid to MAXCPU.
Tested on: alpha, amd64, i386, ia64, sparc64
Approved by: re (scottl)
happen in interrupt context; 1) sleep locks, and 2) malloc/free
calls.
1) is fixed by using spin locks instead.
2) is fixed by preallocating a FIFO (implemented with a STAILQ)
and using elements from this FIFO instead. This turns out
to be rather fast.
OK'ed by: re (scottl)
Thanks to: peter, jhb, rwatson, jake
Apologies to: *
the MAC label referenced from 'struct socket' in the IPv4 and
IPv6-based protocols. This permits MAC labels to be checked during
network delivery operations without dereferencing inp->inp_socket
to get to so->so_label, which will eventually avoid our having to
grab the socket lock during delivery at the network layer.
This change introduces 'struct inpcb' as a labeled object to the
MAC Framework, along with the normal circus of entry points:
initialization, creation from socket, destruction, as well as a
delivery access control check.
For most policies, the inpcb label will simply be a cache of the
socket label, so a new protocol switch method is introduced,
pr_sosetlabel() to notify protocols that the socket layer label
has been updated so that the cache can be updated while holding
appropriate locks. Most protocols implement this using
pru_sosetlabel_null(), but IPv4/IPv6 protocols using inpcbs use
the the worker function in_pcbsosetlabel(), which calls into the
MAC Framework to perform a cache update.
Biba, LOMAC, and MLS implement these entry points, as do the stub
policy, and test policy.
Reviewed by: sam, bms
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
to see_other_uids but with the logical conversion. This is based
on (but not identical to) the patch submitted by Samy Al Bahra.
Submitted by: Samy Al Bahra <samy@kerneled.com>
- This is heavily derived from John Baldwin's apic/pci cleanup on i386.
- I have completely rewritten or drastically cleaned up some other parts.
(in particular, bootstrap)
- This is still a WIP. It seems that there are some highly bogus bioses
on nVidia nForce3-150 boards. I can't stress how broken these boards
are. I have a workaround in mind, but right now the Asus SK8N is broken.
The Gigabyte K8NPro (nVidia based) is also mind-numbingly hosed.
- Most of my testing has been with SCHED_ULE. SCHED_4BSD works.
- the apic and acpi components are 'standard'.
- If you have an nVidia nForce3-150 board, you are stuck with 'device
atpic' in addition, because they somehow managed to forget to connect the
8254 timer to the apic, even though its in the same silicon! ARGH!
This directly violates the ACPI spec.
system calls, and prefer these calls over getsockopt()/setsockopt()
for ABI reasons. When addressing UNIX domain sockets, these calls
retrieve and modify the socket label, not the label of the
rendezvous vnode.
- Create mac_copy_socket_label() entry point based on
mac_copy_pipe_label() entry point, intended to copy the socket
label into temporary storage that doesn't require a socket lock
to be held (currently Giant).
- Implement mac_copy_socket_label() for various policies.
- Expose socket label allocation, free, internalize, externalize
entry points as non-static from mac_net.c.
- Use mac_socket_label_set() in __mac_set_fd().
MAC-aware applications may now use mac_get_fd(), mac_set_fd(), and
mac_get_peer() to retrieve and set various socket labels without
directly invoking the getsockopt() interface.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
physical mapping.
- Move the sf_buf API to its own header file; make struct sf_buf's
definition machine dependent. In this commit, we remove an
unnecessary field from struct sf_buf on the alpha, amd64, and ia64.
Ultimately, we may eliminate struct sf_buf on those architecures
except as an opaque pointer that references a vm page.
sure to sooptcopyin() the (struct mac) so that the MAC Framework
knows which label types are being requested. This fixes process
queries of socket labels.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
kses from the run queues. Also, on SMP, we track the transferable
count here. Threads are transferable only as long as they are on the
run queue.
- Previously, we adjusted our load balancing based on the transferable count
minus the number of actual cpus. This was done to account for the threads
which were likely to be running. All of this logic is simpler now that
transferable accounts for only those threads which can actually be taken.
Updated various places in sched_add() and kseq_balance() to account for
this.
- Rename kseq_{add,rem} to kseq_load_{add,rem} to reflect what they're
really doing. The load is accounted for seperately from the runq because
the load is accounted for even as the thread is running.
- Fix a bug in sched_class() where we weren't properly using the PRI_BASE()
version of the kg_pri_class.
- Add a large comment that describes the impact of a seemingly simple
conditional in sched_add().
- Also in sched_add() check the transferable count and KSE_CAN_MIGRATE()
prior to checking kseq_idle. This reduces the frequency of access for
kseq_idle which is a shared resource.
in exit1(), make sure the p_klist is empty after sending NOTE_EXIT.
The process won't report fork() or execve() and won't be able to handle
NOTE_SIGNAL knotes anyway.
This fixes some race conditions with do_tdsignal() calling knote() while
the process is exiting.
Reported by: Stefan Farfeleder <stefan@fafoe.narf.at>
MFC after: 1 week
parts of ptrace using proc_rwmem(). proc_rwmem() requires giant, and
giant must be acquired prior to the proc lock, so ptrace must require giant
still.
Give the HZ/overflow check a 10% margin.
Eliminate bogus newline.
If timecounters have equal quality, prefer higher frequency.
Some inspiration from: bde
and empty its turnstile while the blocking threads still pointed to the
turnstile. If the thread on the first CPU blocked on a lock owned by
one of the threads blocked on the turnstile just woken up, then the
first CPU could try to manipulate a bogus thread queue in the turnstile
during priority propagation.
- Update locking notes for ts_owner and always clear ts_owner, not just
under INVARIANTS.
Tested by: sam (1)
deleted in 1.81. Increase the initial timeout limit to 2ms to
eliminate spurious messages of excessive timeouts in the NFS
client code.
Requested by: Poul-Henning Kamp <phk@phk.freebsd.dk>
Requested by: Mike Silbersack <silby@silby.com>
Requested by: Sam Leffler <sam@errno.com>
Giant and is also MPSAFE.
Push Giant further down into __mac_get_fd() and __mac_set_fd(),
grabbing it only for constrained regions dealing with VFS, and
dropping it entirely for operations related to labeling of pipes.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
after the additions made for the new statfs structure (version
1.157). These must be updated in a separate checkin after
syscalls.master has been checked in so that they reflect its
new CVS identity. As these are purely derived files, it is not
clear to me why they are under CVS at all. I presume that it has
something to do with having `make world' operate properly.
accurate reporting of multi-terabyte filesystem sizes.
You should build and boot a new kernel BEFORE doing a `make world'
as the new kernel will know about binaries using the old statfs
structure, but an old kernel will not know about the new system
calls that support the new statfs structure. Running an old kernel
after a `make world' will cause programs such as `df' that do a
statfs system call to fail with a bad system call.
Reviewed by: Bruce Evans <bde@zeta.org.au>
Reviewed by: Tim Robbins <tjr@freebsd.org>
Reviewed by: Julian Elischer <julian@elischer.org>
Reviewed by: the hoards of <arch@freebsd.org>
Sponsored by: DARPA & NAI Labs.
in various kernel objects to represent security data, we embed a
(struct label *) pointer, which now references labels allocated using
a UMA zone (mac_label.c). This allows the size and shape of struct
label to be varied without changing the size and shape of these kernel
objects, which become part of the frozen ABI with 5-STABLE. This opens
the door for boot-time selection of the number of label slots, and hence
changes to the bound on the number of simultaneous labeled policies
at boot-time instead of compile-time. This also makes it easier to
embed label references in new objects as required for locking/caching
with fine-grained network stack locking, such as inpcb structures.
This change also moves us further in the direction of hiding the
structure of kernel objects from MAC policy modules, not to mention
dramatically reducing the number of '&' symbols appearing in both the
MAC Framework and MAC policy modules, and improving readability.
While this results in minimal performance change with MAC enabled, it
will observably shrink the size of a number of critical kernel data
structures for the !MAC case, and should have a small (but measurable)
performance benefit (i.e., struct vnode, struct socket) do to memory
conservation and reduced cost of zeroing memory.
NOTE: Users of MAC must recompile their kernel and all MAC modules as a
result of this change. Because this is an API change, third party
MAC modules will also need to be updated to make less use of the '&'
symbol.
Suggestions from: bmilekic
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
vfs_mount_alloc/vfs_mount_destroy functions and take care to completely
destroy the mount point along with its locks. Mount struct has grown in
coplexity recently and depending on each failure path to destroy it
completely isn't working anymore.
2. Eliminate largely identical vfs_mount and vfs_unmount question by
moving the code to handle both cases into a newly introduced vfs_domount
function.
3. Simplify nfs_mount_diskless to always expect an allocated mount
struct and never attempt an allocation/destruction itself. The
vfs_allocroot allocation was there to support 'magic' swap space
configuration for diskless clients that was already removed by PHK some
time ago.
4. Include a vfs_buildopts cleanups by Peter Edwards to validate the
sanity of nmount parameters passed from userland.
Submitted by: (4) Peter Edwards <peter.edwards@openet-telecom.com>
Reviewed by: rwatson
turnstiles to implement blocking isntead of implementing a thread queue
directly. These turnstiles are somewhat similar to those used in Solaris 7
as described in Solaris Internals but are also different.
Turnstiles do not come out of a fixed-sized pool. Rather, each thread is
assigned a turnstile when it is created that it frees when it is destroyed.
When a thread blocks on a lock, it donates its turnstile to that lock to
serve as queue of blocked threads. The queue associated with a given lock
is found by a lookup in a simple hash table. The turnstile itself is
protected by a lock associated with its entry in the hash table. This
means that sched_lock is no longer needed to contest on a mutex. Instead,
sched_lock is only used when manipulating run queues or thread priorities.
Turnstiles also implement priority propagation inherently.
Currently turnstiles only support mutexes. Eventually, however, turnstiles
may grow two queue's to support a non-sleepable reader/writer lock
implementation. For more details, see the comments in sys/turnstile.h and
kern/subr_turnstile.c.
The two primary advantages from the turnstile code include: 1) the size
of struct mutex shrinks by four pointers as it no longer stores the
thread queue linkages directly, and 2) less contention on sched_lock in
SMP systems including the ability for multiple CPUs to contend on different
locks simultaneously (not that this last detail is necessarily that much of
a big win). Note that 1) means that this commit is a kernel ABI breaker,
so don't mix old modules with a new kernel and vice versa.
Tested on: i386 SMP, sparc64 SMP, alpha SMP
ktr_resize_pool(); this eliminates a potential livelock.
Return ENOSPC only if we encountered an out-of-memory condition when
trying to increase the pool size.
Reviewed by: jhb, bde (style)
consistency initialized. Consequently, a number of conditionals that
checked the validity of b_object before passing it to VM_OBJECT_LOCK()
and VM_OBJECT_UNLOCK() are no longer needed.
because RFNOWAIT was being passed to kproc_create.
The result was that shutdown took quite a bit longer because this
errant "child" would not respond to termination signals from init
at system shutdown.
RFNOWAIT dissassociates itself from the caller by attaching to init
as a parent proc. We could have had the taskqueue proc listen for
SIGKILL, but being able to SIGKILL a potentially critical system
process doesn't seem like a good idea.
free one sem_undo with un_cnt == 0 instead of all of them. This is a
temporary workaround until the SLIST_FOREACH_PREVPTR loop gets fixed so
that it doesn't cause cycles in semu_list when removing multiple adjacent
items. It might be easier to just use (doubly-linked) LISTs here instead
of complicated SLIST code to achieve O(1) removals.
This bug manifested itself as a complete lockup under heavy semaphore use
by multiple processes with the SEM_UNDO flag set.
PR: 58984
Since all callers either passed 0 or 1 for clear_ret, define bit 0 in
the flags for use as clear_ret. Reserve bits 1, 2 and 3 for use by MI
code for possible (but unlikely) future use. The remaining bits are for
use by MD code.
This change is triggered by a need on ia64 to have another knob for
get_mcontext().
in the log message for kern_sched.c 1.83 (which should have been
repo-copied to preserve history for this file), the (4BSD) scheduler
algorithm only works right if stathz is nearly 128 Hz. The old
commit lock said 64 Hz; the scheduler actually wants nearly 16 Hz
but there was a scale factor of 4 to give the requirement of 64 Hz,
and rev.1.83 changed the scale factor so that the requirement became
128 Hz. The change of the scale factor was incomplete in the SMP
case. Then scheduling ticks are provided by smp_ncpu CPUs, and the
scheduler cannot tell the difference between this and 1 CPU providing
scheduling ticks smp_ncpu times faster, so we need another scale
factor of smp_ncp or an algorithm change.
This quick fix uses the scale factor without even trying to optimize
the runtime divisions required for this as is done for the other
scale factor.
The main algorithmic problem is the clamp on the scheduling tick counts.
This was 295; it is now approximately 295 * smp_ncpu. When the limit
is reached, threads get free timeslices and scheduling becomes very
unfair to the threads that don't hit the limit. The limit can be
reached and maintained in the worst case if the load average is larger
than (limit / effective_stathz - 1) / 2 = 0.65 now (was just 0.08 with
2 CPUs before this change), so there are algorithmic problems even for
a load average of 1. Fortunately, the worst case isn't common enough
for the problem to be very noticeable (it is mainly for niced CPU hogs
competing with less nice CPU hogs).
thread being waken up. The thread waken up can run at a priority as
high as after tsleep().
- Replace selwakeup()s with selwakeuppri()s and pass appropriate
priorities.
- Add cv_broadcastpri() which raises the priority of the broadcast
threads. Used by selwakeuppri() if collision occurs.
Not objected in: -arch, -current
whether or not the isr needs to hold Giant when running; Giant-less
operation is also controlled by the setting of debug_mpsafenet
o mark all netisr's except NETISR_IP as needing Giant
o add a GIANT_REQUIRED assertion to the top of netisr's that need Giant
o pickup Giant (when debug_mpsafenet is 1) inside ip_input before
calling up with a packet
o change netisr handling so swi_net runs w/o Giant; instead we grab
Giant before invoking handlers based on whether the handler needs Giant
o change netisr handling so that netisr's that are marked MPSAFE may
have multiple instances active at a time
o add netisr statistics for packets dropped because the isr is inactive
Supported by: FreeBSD Foundation
since there is no direct association between M:N thread and kse,
sometimes, a thread does not have a kse, in that case, return a pctcpu
from its last kse, it is not perfect, but gives a good number to be
displayed.
Instead, let the vm objects be lazily instantiated at fault time. This
results in the allocation of fewer vm objects and vm map entries due to
aggregation in the vm system.
o move it from subr_bus.c to netisr.c where it more properly belongs
o add NET_PICKUP_GIANT and NET_DROP_GIANT macros that will be used to
grab Giant as needed when MPSAFE operation is enabled
Supported by: FreeBSD Foundation
Introduce two new macros MNT_ILOCK(mp)/MNT_IUNLOCK(mp) to
operate on this mutex transparently.
Eventually new mutex will be protecting more fields in
struct mount, not only vnode list.
Discussed with: jeff
1.36 +73 -60 src/sys/compat/linux/linux_ipc.c
1.83 +102 -48 src/sys/kern/sysv_shm.c
1.8 +4 -0 src/sys/sys/syscallsubr.h
That change was intended to support vmware3, but
wantrem parameter is useless because vmware3 uses SYSV shared memory
to talk with X server and X server is native application.
The patch worked because check for wantrem was not valid
(wantrem and SHMSEG_REMOVED was never checked for SHMSEG_ALLOCATED segments).
Add kern.ipc.shm_allow_removed (integer, rw) sysctl (default 0) which when set
to 1 allows to return removed segments in
shm_find_segment_by_shmid() and shm_find_segment_by_shmidx().
MFC after: 1 week
waitrunningbufspace() calls so that they are always able to
proceed and clean up buffer space.
Submitted by: Brian Fundakowski Feldman <green@freebsd.org>
Use zpfind() to see if the process became a zombie if pfind() doesn't find it
and if the caller wants to know about process death, so that the caller knows
the process died even if it happened before the kevent was actually registered.
MFC after: 1 week
This fixes a race condition (specifically with signal events) that could
lead to the kn being re-inserted into the list after it has been destroyed,
which is not something we want to happen.
PR: kern/58258
let the MD code choose whether or not to implement such a policy. The new
i386 interrupt code allows multiple FAST handlers for a given source for
example. However, the code does not allow FAST and non-FAST handlers to be
mixed.
idle. They figure out that we're idle fast enough that the cache pollution
introduces by scanning their run queue is more expensive than waiting
a little longer.
- Add kseq_setidle() to mark us as being idle. Use this in place of
kseq_find().
- Remove kseq_load_highest(), kseq_find() was the only consumer of this
interface. kseq_balance() has it's own customized version that finds the
lowest and highest loads simultaneously.
Continuously told that this would be faster by: terry
the total load, the timeshare load, and the number of threads that can
be migrated to another cpu. Account for these seperately.
- Introduce a KSE_CAN_MIGRATE() macro which determines whether or not a KSE
can be migrated to another CPU. Currently, this only checks to see if
we're an interrupt handler. Eventually this will also be used to support
CPU binding.
wasn't curthread, i.e. when we receive a thread pointer to use
as a function argument. Use VOP_UNLOCK/vrele in these cases.
The only case there td != curthread known at the moment is
boot() calling sync with thread0 pointer.
This fixes the panic on shutdown people have reported.
slice assignment. Add a comment describing what it does.
- Remove a stale XXX comment, the nice should not impact the interactivity,
nice adjustments only effect non-interactive tasks in ULE.
- Don't allow nice -20 tasks to totally starve nice 0 tasks. Give them at
least SCHED_SLICE_MIN ticks. We still allow nice 0 tasks to starve nice
+20 tasks as intended.
- SCHED_PRI_NRESV does not have the off by one error in PRIO_TOTAL so we
do not have to account for it in the few places that we use it.
Requested by: bde
0 and SCHED_SLP_RUN_MAX * 2. This allows us to simplify the algorithm
quite a bit. Before, it dealt with arbitrary values which required us
to do nasty integer division tricks that didn't quite work out correctly.
- Chnage sched_wakeup() to detect conditions where the slp+runtime could
exceed SCHED_SLP_RUN_MAX * 2. This can happen if we go to sleep for
longer than 6 seconds. In this case, we'll just clear the runtime and
set the sleep time to the max.
- Define a new function, sched_interact_fork() which updates the slp+runtime
of a newly forked thread. We want to limit the amount of history retained
from the parent so that we learn the child's behavior quickly. We don't,
however want to decay it to nothing. Previously, we would simply divide
each parameter by 100 whenever we forked. After a few forks the values
would reach 0 and tasks would not be considered interactive.
- Add another KTR entry, cleanup some existing entries.
- Remove a useless sched_interact_update() from sched_priority(). This is
already done by the callers that require it.
- Add an IPI based mechanism for migrating kses. This mechanism is
broken down into several components. This is intended to reduce cache
thrashing by eliminating most cases where one cpu touches another's
run queues.
- kseq_notify() appends a kse to a lockless singly linked list and
conditionally sends an IPI to the target processor. Right now this is
protected by sched_lock but at some point I'd like to get rid of the
global lock. This is why I used something more complicated than a
standard queue.
- kseq_assign() processes our list of kses that have been assigned to us
by other processors. This simply calls sched_add() for each item on the
list after clearing the new KEF_ASSIGNED flag. This flag is used to
indicate that we have been appeneded to the assigned queue but not
added to the run queue yet.
- In sched_add(), instead of adding a KSE to another processor's queue we
use kse_notify() so that we don't touch their queue. Also in sched_add(),
if KEF_ASSIGNED is already set return immediately. This can happen if
a thread is removed and readded so that the priority is recorded properly.
- In sched_rem() return immediately if KEF_ASSIGNED is set. All callers
immediately readd simply to adjust priorites etc.
- In sched_choose(), if we're running an IDLE task or the per cpu idle thread
set our cpumask bit in 'kseq_idle' so that other processors may know that
we are idle. Before this, make a single pass through the run queues of
other processors so that we may find work more immediately if it is
available.
- In sched_runnable(), don't scan each processor's run queue, they will IPI
us if they have work for us to do.
- In sched_add(), if we're adding a thread that can be migrated and we have
plenty of work to do, try to migrate the thread to an idle kseq.
- Simplify the logic in sched_prio() and take the KEF_ASSIGNED flag into
consideration.
- No longer use kseq_choose() to steal threads, it can lose it's last
argument.
- Create a new function runq_steal() which operates like runq_choose() but
skips threads based on some criteria. Currently it will not steal
PRI_ITHD threads. In the future this will be used for CPU binding.
- Create a kseq_steal() that checks each run queue with runq_steal(), use
kseq_steal() in the places where we used kseq_choose() to steal with
before.
clobbers this variable. Long ago, when the idle loop wasn't in a
process, it set switchtime.tv_sec to zero to indicate that the time
needs to be read after the idle loop finishes. The special case for
this isn't needed now that there is an idle process (for each CPU).
The time is read in the normal way when the idle process is switched
away from. The seconds component of the time is only zero for the
first second after the uptime is set, and the mostly-dead code was only
executed during this time. (This was slightly broken by using uptimes
instead of times relative to the Epoch -- in the original version the
seconds component of the time was only 0 for the first second after
the Epoch.)
In mi_switch(), moved the setting of switchticks to just after the
first (and now only) setting of switchtime. This setting used to be
delayed since a late setting was needed for the idle case and an early
setting was not needed. Now the early setting is needed so that
fork_exit() doesn't need to set either switchtime or switchticks.
Removed now-completely-rotted comment attached to this. Most of the
code described by the comment had already moved to sched_switch().
begin with sched_lock held but not recursed, so this variable was
always 0.
Removed fixup of sched_lock.mtx_recurse after context switches in
sched_switch(). Context switches always end with this variable in the
same state that it began in, so there is no need to fix it up. Only
sched_lock.mtx_lock really needs a fixup.
Replaced fixup of sched_lock.mtx_recurse in fork_exit() by an assertion
that sched_lock is owned and not recursed after it is fixed up. This
assertion much match the one in mi_switch(), and if sched_lock were
recursed then a non-null fixup of sched_lock.mtx_recurse would probably
be needed again, unlike in sched_switch(), since fork_exit() doesn't
return to its caller in the normal way.
with an mbuf until it is reclaimed. This is in contrast to tags that
vanish when an mbuf chain passes through an interface. Persistent tags
are used, for example, by MAC labels.
Add an m_tag_delete_nonpersistent function to strip non-persistent tags
from mbufs and use it to strip such tags from packets as they pass through
the loopback interface and when turned around by icmp. This fixes problems
with "tag leakage".
Pointed out by: Jonathan Stone
Reviewed by: Robert Watson
Contributed by: Thomaswuerfl@gmx.de
- In sched_prio(), adjust the run queue for threads which may need to move
to the current queue due to priority propagation .
- In sched_switch(), fix style bug introduced when the KSE support went in.
Columns are 80 chars wide, not 90.
- In sched_switch(), Fix the comparison in the idle case and explicitly
re-initialize the runq in the not propagated case.
- Remove dead code in sched_clock().
- In sched_clock(), If we're an IDLE class td set NEEDRESCHED so that threads
that have become runnable will get a chance to.
- In sched_runnable(), if we're not the IDLETD, we should not consider
curthread when examining the load. This mimics the 4BSD behavior of
returning 0 when the only runnable thread is running.
- In sched_userret(), remove the code for setting NEEDRESCHED entirely.
This is not necessary and is not implemented in 4BSD.
- Use the correct comparison in sched_add() when checking to see if an idle
prio task has had it's priority temporarily elevated.
if we do acquire an advisory lock, great! We'll release it later.
However, if we fail to acquire a lock, we perform the coredump
anyway. This problem became particularly visible with NFS after
the introduction of rpc.lockd: if the lock manager isn't running,
then locking calls will fail, aborting the core dump (resulting in
a zero-byte dump file).
Reported by: Yogeshwar Shenoy <ynshenoy@alumni.cs.ucsb.edu>
- Add a DDB function to dump the contents of an ithread and optionally
details about each handler in that ithread. This function can be used
by MD code to implement DDB commands that display information about
interrupt sources and their registered handlers.
Include src/sys/security/mac/mac_internal.h in kern_mac.c.
Remove redundant defines from the include: SYSCTL_DECL(), debug macros,
composition macros.
Unstaticize various bits now exposed to the remainder of the kernel:
mac_init_label(), mac_destroy_label().
Remove all the functions now implemented in mac_process/mac_vfs/mac_net/
mac_pipe. Also remove debug counters, sysctls exporting debug
counters, enforcement flags, sysctls exporting enforcement flags.
Leave module declaration, sysctl nodes, mactemp malloc type, system
calls.
This should conclude MAC/LINT/NOTES breakage from the break-out process,
but I'm running builds now to make sure I caught everything.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
Unstaticize mac_late.
Remove ea_warn_once, now in mac_vfs.c.
Unstaticisize mac_policy_list, mac_static_policy_list, use
struct mac_policy_list_head instead of LIST_HEAD() directly.
Unstaticize and un-inline MAC policy locking functions so they can
be referenced from mac_*.c.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
by libguile that needs to know the base of the RSE backing store. We
currently do not export the fixed address to userland by means of a
sysctl so user code needs to hardcode it for now. This will be revisited
later.
The RSE backing store is now at the bottom of region 4. The memory stack
is at the top of region 4. This means that the whole region is usable
for the stacks, giving a 61-bit stack space.
Port: lang/guile (depended of x11/gnome2)
table, acquiring the necessary locks as it works. It usually returns
two references to the new descriptor: one in the descriptor table
and one via a pointer argument.
As falloc releases the FILEDESC lock before returning, there is a
potential for a process to close the reference in the file descriptor
table before falloc's caller gets to use the file. I don't think this
can happen in practice at the moment, because Giant indirectly protects
closes.
To stop the file being completly closed in this situation, this change
makes falloc set the refcount to two when both references are returned.
This makes life easier for several of falloc's callers, because the
first thing they previously did was grab an extra reference on the
file.
Reviewed by: iedowse
Idea run past: jhb
the point where it being a macro is no longer sensible, and it will
only be more so in days to come.
BIO_STRATEGY() is now only used from DEV_STRATEGY() and should not
be used directly anymore.
Put the contents of both in the new function dev_strategy() and
make DEV_STRATEGY() call that function.
In addition, this allows us to make the rather magic bufdonebio()
helper function static.
This alse saves hunderedandsome bytes of code in a typical kernel.
definition structure. Define one flag, CN_FLAG_NODEBUG, which
indicates the console driver cannot be used in the context of the
debugger. This may be used, for example, if the console device
interacts with kernel services that cannot be used from the
debugger context, such as the network stack. These drivers are
skipped over for calls to cn_checkc() and cn_putc(), and the
calling function simply moves on to the next available console.
a fair bit of difference to the power consumption and lets my cpu cool
down enough for the temperature sensitive fan controller to completely
stop the cpu fan at times.
halt state that minimizes power consumption while still preserving
cache and TLB coherency. Halting the processor is not conditional at
this time. Tested with UP and SMP kernels.
classes and if a method is not found in a given class, its base classes
are searched (in the order they were declared). This search is recursive,
i.e. a method may be define in a base class of a base class.
* Change the kobj method lookup algorithm to one which is SMP-safe. This
relies only on the constraint that an observer of a sequence of writes
of pointer-sized values will see exactly one of those values, not a
mixture of two or more values. This assumption holds for all processors
which FreeBSD supports.
* Add locking to kobj class initialisation.
* Add a simpler form of 'inheritance' for devclasses. Each devclass can
have a parent devclass. Searches for drivers continue up the chain of
devclasses until either a matching driver is found or a devclass is
reached which has no parent. This can allow, for instance, pci drivers
to match cardbus devices (assuming that cardbus declares pci as its
parent devclass).
* Increment __FreeBSD_version.
This preserves the driver API entirely except for one minor feature used
by the ISA compatibility shims. A workaround for ISA compatibility will
be committed separately. The kobj and newbus ABI has changed - all modules
must be recompiled.
rounding errors. This was the source of the majority of the
interactivity problems. Reintroduce the old algorithm and its XXX.
- Up the interactivity threshold to 30. It really could stand to be even
a tiny bit higher.
- Let the sleep and run time accumulate up to 5 seconds of history rather
than two. This helps stop XFree86 from becoming non-interactive during
bursts of activity.
elevated either due to priority propagation or because we're in the
kernel in either case, put us on the current queue so that we dont
stop others from using important resources. At some point the priority
elevations from sleeping in the kernel should go away.
- Remove an optimization in sched_userret(). Before we would only set
NEEDRESCHED if there was something of a higher priority available. This
is a trivial optimization and it breaks priority propagation because it
doesn't take threads which we may be blocking into account. Notice that
the thread which is blocking others gets up to one tick of cpu time before
we honor this NEEDRESCHED in sched_clock().
accesses softc after it is freed. Use a different malloc type for
softc than the rest of the bus code to make it more clear when these
things happen that it is the driver that's at fault, not the bus code.
Suggested by: sam and/or phk (I think)
you on the current queue. In the future, it would be nice if priority
propagation could deterministicly pluck a thread off of the next queue
and put it on the current queue. Until then this hack stops us from
holding up our entire current queue, including interrupt handlers, while
a thread on the next queue is blocked while holding Giant.
- Inherit our pctcpu information from our parent.
kqueue write events on a socket and you regularly create tons of pipes
which overwrites the structure causing a panic when removing the knote
from the list. If the peer has gone away (and it's a write knote), then
don't bother trying to remove the knote from the list.
Submitted by: Brian Buchanan and myself
Obtained from: nCircle
caused snapshot related problems.
- The vp can not be NULL here or we would panic in vfs_bio_awrite(). Stop
confusing the logic by checking for it in several places.
Submitted by: kirk and then rototilled by me to remove vp == NULL checks.
Use pre-emption detection to avoid the need for wiring a userland buffer
when copying opaque data structures.
sysctl_wire_old_buffer() is now a no-op. Other consumers of this
API should use pre-emption detection to notice update collisions.
vslock() and vsunlock() should no longer be called by any code
and should be retired in subsequent commits.
Discussed with: pete, phk
MFC after: 1 week
go away in due course. Involuntary pre-emption means that we can't count
on wiring of pages alone for consistency when performing a SYSCTL_OUT()
bigger than PAGE_SIZE.
Discussed with: pete, phk
LK_RETRY either, we don't want this vnode if it turns into another.
- Remove the code that checks the mount point after acquiring the lock
we are guaranteed to either fail or get the vnode that we wanted.
- In vtryrecycle() try to vgonel the vnode if all of the previous checks
passed. We won't vgonel if someone has either acquired a hold or usecount
or started the vgone process elsewhere. This is because we may have been
removed from the free list while we were inspecting the vnode for
recycling.
- The VI_TRYLOCK stops two threads from entering getnewvnode() and recycling
the same vnode. To further reduce the likelyhood of this event, requeue
the vnode on the tail of the list prior to calling vtryrecycle(). We can
not actually remove the vnode from the list until we know that it's
going to be recycled because other interlock holders may see the VI_FREE
flag and try to remove it from the free list.
- Kill a bogus XXX comment. If XLOCK is set we shouldn't wait for it
regardless of MNT_WAIT because the vnode does not actually belong to
this filesystem.
purge, the purge in vclean, and the filesystems purge, we had 3 purges
per vnode.
- Move the insmntque(vp, 0) to vclean() so that we may remove it from the
two vgone() functions and reduce the number of lock operations required.
whether or not the sync failed. This could potentially get set between
the time that we VOP_UNLOCK and VI_LOCK() but the race would harmelssly
lead to the sync being delayed by an extra 30 seconds. If we do not move
the vnode it could cause an endless loop if it continues to fail to sync.
- Use vhold and vdrop to stop the vnode from changing identities while we
have it unlocked. Other internal vfs lists are likely to follow this
scheme.
- Create a new function, vgonechrl(), which performs vgone for an in-use
character device. Move the code from vflush() that did this into
vgonechrl().
- Hold the xlock across the entirety of vgonel() and vgonechrl() so that
at no point will an invalid vnode exist on any list without XLOCK set.
- Move the xlock code out of vclean() now that it is in the vgone*()
functions.
This is so that we may grab the interlock while still holding the
sync_mtx. We have to VI_TRYLOCK() because in all other cases the lock
order runs the other way.
- If we don't meet any of the preconditions, reinsert the vp into the
list for the next second.
- We don't need to panic if we fail to sync here because each FSYNC
function handles this case. Removing this redundant code also
simplifies locking.
fail. Remove the panic from that case and document why it might fail.
- Document the reason for calling cache_purge() on a newly created vnode.
- In insmntque() order the operations so that we can call mtx_unlock()
one fewer times. This makes the code somewhat clearer as well.
- Add XXX comments in sched_sync() and vflush().
- In vget(), do not sleep while waiting for XLOCK to clear if LK_NOWAIT is
set.
- In vclean() we don't need to acquire a lock around a single TAILQ_FIRST
call. It's ok if we race here, the vinvalbuf will just do nothing.
- Increase the scope of the lock in vgonel() to reduce the number of lock
operations that are performed.
validating the offset within a given memory buffer before handing the
real work off to uiomove(9).
Use uiomove_frombuf in procfs to correct several issues with
integer arithmetic that could result in underflows/overflows. As a
side-effect, the code is significantly simplified.
Add additional sanity checks when computing a memory allocation size
in pfs_read.
Submitted by: rwatson (original uiomove_frombuf -- bugs are mine :-)
Reported by: Joost Pol <joost@pine.nl> (integer underflows/overflows)
fd_cmask field in the file descriptor structure for the first process
indirectly from CMASK, and when an fd structure is initialized before
being filled in, and instead just use CMASK. This appears to be an
artifact left over from the initial integration of quotas into BSD.
Suggested by: peter
the TLB and ~1600 if it is not. Therefore, it is more effecient to
invalidate the TLB after operations that use CMAP rather than before.
- So that the tlb is invalidated prior to switching off of a processor, we
must change the switchin functions to switchout functions.
- Remove td_switchout from the thread and move it to the x86 pcb.
- Move the code that calls switchout into swtch.s. These changes make this
optimization truely x86 specific.
use the ability on ia64 to map the register stack. The orientation of
the stack (i.e. its grow direction) is passed to vm_map_stack() in the
overloaded cow argument. Since the grow direction is represented by
bits, it is possible and allowed to create bi-directional stacks.
This is not an advertised feature, more of a side-effect.
Fix a bug in vm_map_growstack() that's specific to rstacks and which
we could only find by having the ability to create rstacks: when
the mapped stack ends at the faulting address, we have not actually
mapped the faulting address. we need to include or cover the faulting
address.
Note that at this time mmap(2) has not been extended to allow the
creation of rstacks by processes. If such a need arises, this can
be done.
Tested on: alpha, i386, ia64, sparc64
do exactly the same as vop_nopoll() for consistency and put a
comment in the two pointing at each other.
Retire seltrue() in favour of no_poll().
Create private default functions in kern_conf.c instead of public
ones.
Change default strategy to return the bio with ENODEV instead of
doing nothing which would lead the bio stranded.
Retire public nullopen() and nullclose() as well as the entire band
of public no{read,write,ioctl,mmap,kqfilter,strategy,poll,dump}
funtions, they are the default actions now.
Move the final two trivial functions from subr_xxx.c to kern_conf.c
and retire the now empty subr_xxx.c
provide no methods does not make any sense, and is not used by any
driver.
It is a pretty hard to come up with even a theoretical concept of
a device driver which would always fail open and close with ENODEV.
Change the defaults to be nullopen() and nullclose() which simply
does nothing.
Remove explicit initializations to these from the drivers which
already used them.
consdev structure.
If the consdev name is not set and we have a cn_dev, set the name
from there. Try to issue a printf about this, even though it may
not have a place to go.
Modify the sysctl related code to pick up the name from the consdev
instead.
systems where the data/stack/etc limits are too big for a 32 bit process.
Move the 5 or so identical instances of ELF_RTLD_ADDR() into imgact_elf.c.
Supply an ia32_fixlimits function. Export the clip/default values to
sysctl under the compat.ia32 heirarchy.
Have mmap(0, ...) respect the current p->p_limits[RLIMIT_DATA].rlim_max
value rather than the sysctl tweakable variable. This allows mmap to
place mappings at sensible locations when limits have been reduced.
Have the imgact_elf.c ld-elf.so.1 placement algorithm use the same
method as mmap(0, ...) now does.
Note that we cannot remove all references to the sysctl tweakable
maxdsiz etc variables because /etc/login.conf specifies a datasize
of 'unlimited'. And that causes exec etc to fail since it can no
longer find space to mmap things.
freed belong to the kernel object.)
- Increase the granularity of the vm object locking in vm_hold_load_pages()
in order to reduce the number of times that we acquire and release the
same lock.