of the special handling for ".." and perform an ISDOTDOT VOP_LOOKUP()
for a filesystem root vnode. Handle this case inside lookup().
Submitted by: tegge
PR: 92785
MFC after: 1 week
sonewconn() in unp_connect(). This avoids a race that occurs due to
v_socket being an uncounted reference: the lock was previously released
in order to call sonewconn(), because sonewconn() otherwise recurses into
the UNIX domain socket code via pru_attach and would hold the lock over a
sleeping memory allocation in uipc_attach(). Switch to a non-sleeping
memory allocation during UNIX domain socket attach.
This fix is non-ideal in that it requires enabling recursion, but it is a much
smaller change than moving to using true references for v_socket. The
reported panic occurs in unp_connect() following the return of
sonewconn().
Update copyright year.
Panic reported by: jhb
doing a CLEARFILE option. Do a vrele instead. This prevents
a panic later due to v_writecount being negative when the vnode
is taken off the freelist.
Submitted by: jhb
to become negative. This will detect the underflow when it
happens, instead of having it discovered when the vnode is
taken off the freelist, long after the offending process is gone.
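A minimal sketch of the kind of assertion meant here (placement and message
are illustrative, not the verbatim change):

    /* Trip at the moment of underflow rather than much later when the
     * vnode comes off the freelist. */
    KASSERT(vp->v_writecount > 0,
        ("v_writecount %d would go negative", vp->v_writecount));
    vp->v_writecount--;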
- Fix these types in ULE as well. This fixes bugs in priority index
calculations in certain edge cases. (int)-1 % 64 != (uint)-1 % 64.
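For illustration, a standalone demonstration of the signed/unsigned
difference (not kernel code):

    #include <stdio.h>

    int
    main(void)
    {
            int s = -1;
            unsigned int u = (unsigned int)-1;

            /* The signed remainder is -1, an invalid (negative) queue
             * index; the unsigned value wraps around and yields 63. */
            printf("(int)-1 %% 64  = %d\n", s % 64);
            printf("(uint)-1 %% 64 = %u\n", u % 64);
            return (0);
    }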
Reported by: kkenn using pho's stress2.
avoid holding the UNIX domain socket subsystem lock over sooptcopyin()
and sooptcopyout(). This problem was introduced when LOCAL_CREDS and
LOCAL_CONNWAIT support were added.
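The pattern being restored, roughly (a hedged fragment; UNP_LOCK()/
UNP_UNLOCK() and UNP_WANTCRED stand in for the era's locking macros and
socket flag):

    int optval, error;

    /* Copy the option in from userspace before taking the lock... */
    error = sooptcopyin(sopt, &optval, sizeof(optval), sizeof(optval));
    if (error != 0)
            return (error);

    /* ...and hold the lock only while updating socket state. */
    UNP_LOCK();
    if (optval)
            unp->unp_flags |= UNP_WANTCRED;
    else
            unp->unp_flags &= ~UNP_WANTCRED;
    UNP_UNLOCK();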
Reviewed by: mdodd
sleep lock missed the witness code, and the system will panic
immediately on boot if WITNESS is enabled.
Changed the witness definition to the new type.
firmware in that module (even though this is a programming error) - drop the
reference to the module again.
Submitted by: Benjamin Close
MFC after: 3 days
minimize IPIs and rescheduling when scheduling like tasks while keeping
latency low for important threads.
1) An idle thread is running.
2) The current thread is worse than realtime and the new thread is
better than realtime. Realtime to realtime doesn't preempt.
3) The new thread's priority is less than the threshold.
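Roughly, conditions 1)-3) amount to a check of the following shape
(illustrative names and boundary values only, not the kernel's actual
symbols; lower numeric priority is better):

    static int pri_min_idle = 224;          /* assumed idle-class floor */
    static int pri_max_realtime = 159;      /* assumed realtime ceiling */
    static int preempt_thresh = 64;         /* the tunable threshold */

    static int
    should_preempt(int cpri, int pri)       /* cpri: current, pri: new */
    {
            if (pri >= cpri)                /* new thread is not better */
                    return (0);
            if (cpri >= pri_min_idle)       /* 1) idle thread running */
                    return (1);
            if (pri <= pri_max_realtime &&  /* 2) new is realtime, */
                cpri > pri_max_realtime)    /*    current is worse */
                    return (1);
            if (pri < preempt_thresh)       /* 3) below the threshold */
                    return (1);
            return (0);
    }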
support sched_4bsd.
- Rename the KTR level for events that schedgraph does not parse. They take event
space from things we'd like to graph.
- Reset our slice value after we sleep. The slice is simply there to
prevent starvation among equal priorities. A thread which had almost
exhausted its slice and then slept doesn't need to be rescheduled a
tick after it wakes up.
- Set the maximum slice value to a more conservative 100ms now that it is
more accurately enforced.
zone. Cluster allocations fail when this happens. Also processes that may have
blocked on cluster allocations will never be woken up. Thanks to rwatson for
an overview of the issue and pointers to the mbuma paper and his tool to dump
out UMA zones.
Reviewed by: andre@
negative. Use unsigned integers for sleep and run time so this doesn't
disturb sched_interact_score(). This should fix the invalid interactive
priority panics reported by several users.
file are after snaplock, while other ffs device buffers are before
snaplock in the global lock order. By itself, this could cause a deadlock
when bdwrite() tries to flush dirty buffers on a snapshotted ffs. If,
during the flush, COW activity for the snapshot needs to allocate a block
and ffs_alloccg() selects the cylinder group that is being written
by bdwrite(), then the kernel would panic due to recursive buffer lock
acquisition.
Avoid dealing with buffers in bdwrite() that are on the other side of the
snaplock divisor in the lock order than the buffer being written. Add a
new BOP, bop_bdwrite(), to do dirty buffer flushing for the same vnode in
bdwrite(). The default implementation, bufbdflush(), refactors the code
from bdwrite(). For ffs device buffers, a specialized implementation is
used.
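The hook's shape, roughly (illustrative declarations, not the exact
in-tree ones):

    struct buf;

    struct buf_ops {
            /* ... existing strategy/write/sync operations ... */
            void    (*bop_bdwrite)(struct buf *);   /* flush dirty buffers
                                                       for the same vnode */
    };

    /*
     * bdwrite() calls through this hook.  The default, bufbdflush(), is
     * the flushing code refactored out of bdwrite(); the ffs device
     * vnode installs a snapshot-aware implementation instead.
     */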
Reviewed by: tegge, jeff, Russell Cattelan (cattelan xfs org, xfs changes)
Tested by: Peter Holm
X-MFC after: 3 weeks (if ever: it changes ABI)
- Define our own maybe_preempt() as sched_preempt(). We want to be able
to preempt idlethread in all cases.
- Define our idlethread to require preemption to exit.
- Get the cpu estimation tick from sched_tick() so we don't have to worry
about errors from a sampling interval that differs from the time
domain. This was the source of sched_priority prints/panics and
inaccurate pctcpu display in top.
setrunqueue() was mostly empty. The few asserts and thread state
setting were moved to the individual schedulers. sched_add() was
chosen to displace it for naming consistency reasons.
- Remove adjustrunqueue, it was 4 lines of code that was ifdef'd to be
different on all three schedulers where it was only called in one place
each.
- Remove the long ifdef'd out remrunqueue code.
- Remove the now redundant ts_state. Inspect the thread state directly.
- Don't set TSF_* flags from kern_switch.c, we were only doing this to
support a feature in one scheduler.
- Change sched_choose() to return a thread rather than a td_sched. Also,
rely on the schedulers to return the idlethread. This simplifies the
logic in choosethread(). Aside from the run queue links kern_switch.c
mostly does not care about the contents of td_sched.
Discussed with: julian
- Move the idle thread loop into the per scheduler area. ULE wants to
do something different from the other schedulers.
Suggested by: jhb
Tested on: x86/amd64 sched_{4BSD, ULE, CORE}.
the mount options list with vfs_deleteopt(). At this point, the export
information is saved in mp->mnt_export, so we can delete
the "export" mount option from mp->mnt_optnew and mp->mnt_opt.
This fixes read-write/read-only update mounts (mount -u -o rw, mount -u -o ro)
of NFS exported directories.
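In outline (a sketch matching the description above, not the verbatim
diff):

    /* The export data has already been captured in mp->mnt_export, so
     * the option can be dropped from both lists. */
    vfs_deleteopt(mp->mnt_optnew, "export");
    vfs_deleteopt(mp->mnt_opt, "export");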
For some reason, I could only reproduce the problem with a configuration
supplied by Andre:
- "options QUOTA" enabled in kernel config
- "/ -maproot=root 10.0.1.105" in /etc/exports
Reported by: kris, Andre Guibert de Bruet <andy siliconlandmark com>,
Andrzej Tobola <ato iem pw edu pl>
Tested by: Andre Guibert de Bruet
control data but no payload data is passed.
Change m_uiotombuf() to return at least one empty mbuf if the requested
length was zero. Add comment to sosend_dgram and sosend_generic().
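The intended zero-length behaviour, sketched (illustrative, not the
verbatim change; "total" and "how" loosely follow m_uiotombuf()'s
internal naming):

    if (total == 0) {
            /* Still hand back one empty mbuf, so callers that send
             * control data with no payload get a valid chain. */
            m = m_get(how, MT_DATA);
            if (m == NULL)
                    return (NULL);
            m->m_len = 0;
            return (m);
    }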
Diagnosed by: jhb
Regression test by: rwatson
Pointy hat to: andre
--------------------------
[Deadlock] is caused by a lock order reversal in vfs_lookup(), where
[some] process is trying to lock a directory vnode (that is, the parent
directory of the covered vnode) while holding an exclusive vnode lock on
the covering vnode.
A simplified scenario:
                root fs                     var fs
    /            A           / (/var)            D
    /var         B           /log (/var/log)     E
    vfs lock     C           vfs lock            F
Within each file system, the lock order is clear: C->A->B and F->D->E
When traversing across mounts, the system can choose between two lock orders,
but everything must then follow that lock order:
L1: C->A->B
          |
          +->F->D->E
L2: F->D->E
          |
          +->C->A->B
The lookup() process for namei("/var") mixes those two lock orders:
VOP_LOOKUP() obtains B while A is held
vfs_busy() obtains a shared lock on F while A and B are held (follows L1,
violates L2)
vput() releases lock on B
VOP_UNLOCK() releases lock on A
VFS_ROOT() obtains lock on D while shared lock on F is held
vfs_unbusy() releases shared lock on F
vn_lock() obtains lock on A while D is held (violates L1, follows L2)
dounmount() follows L1 (B is locked while F is drained).
Without unmount activity, vfs_busy() will always succeed without blocking
and the deadlock isn't triggered (the system behaves as if L2 is followed).
With unmount, you can get 4 processes in a deadlock:
p1: holds D, want A (in lookup())
p2: holds shared lock on F, want D (in VFS_ROOT())
p3: holds B, want drain lock on F (in dounmount())
p4: holds A, want B (in VOP_LOOKUP())
You can have more than one instance of p2.
The reversal was introduced in revision 1.81 of src/sys/kern/vfs_lookup.c and
MFCed to revision 1.80.2.1, probably to avoid a cascade of vnode locks when nfs
servers are dead (VFS_ROOT() just hangs) spreading to the root fs root vnode.
- Tor Egge
To fix the LOR, ups@ noted that when crossing the mount point, ni_dvp
is actually not used by the callers of namei. Thus, a placeholder deadfs
vnode, vp_crossmp, is introduced and filled into ni_dvp.
Idea by: ups
Reviewed by: tegge, ups, jeff, rwatson (mac interaction)
Tested by: Peter Holm
MFC after: 2 weeks
a power saving mode otherwise.
- If the thread is already bound in sched_bind(), unbind it before
re-binding it to a new cpu. I don't like these semantics but they are
expected by some code in the tree. Patch by jkoshy.
the ipi settings. If NEEDRESCHED is set and an ipi is later delivered,
it will clear it rather than cause extra context switches. However, if
we miss setting it we can have terrible latency.
- In sched_bind() correctly implement bind. Also be slightly more
tolerant of code which calls bind multiple times. However, we don't
change binding if another call is made with a different cpu. This
does not presently work with hwpmc which I believe should be changed.
- Switch back to direct modification of remote CPU run queues. This added
a lot of complexity with questionable gain. It's easy enough to
reimplement if it's shown to help on huge machines.
- Re-implement the old tdq_transfer() call as tdq_pickidle(). Change
sched_add() so we have selectable cpu choosers and simplify the logic
a bit here.
- Implement tdq_pickpri() as the new default cpu chooser. This algorithm
is similar to Solaris in that it tries to always run the threads with
the best priorities. It is actually slightly more complex than
Solaris's algorithm because we also tend to favor the local cpu over
other cpus, which has a boost in latency but also potentially enables
cache sharing between the waking thread and the woken thread.
- Add a bunch of tunables that can be used to measure effects of different
load balancing strategies. Most of these will go away once the
algorithm is more definite.
- Add a new mechanism to steal threads from busy cpus when we idle. This
is enabled with kern.sched.steal_busy and kern.sched.busy_thresh. The
threshold is the required length of a tdq's run queue before another
cpu will be able to steal runnable threads. This prevents most queue
imbalances that contribute to long latencies.
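In sketch form (hypothetical field and variable names; tdq stands for a
per-cpu run queue structure):

    /*
     * Before halting, an idling cpu may raid another cpu's queue, but
     * only if that queue is long enough to justify the disruption
     * (kern.sched.busy_thresh).
     */
    if (steal_busy && busiest_tdq->tdq_load > busy_thresh)
            td = tdq_steal(busiest_tdq);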
Approved by: gnn
Add a new function hashinit_flags() which allows the caller to choose
whether or not to wait for memory. The old hashinit() function now
calls hashinit_flags(..., HASH_WAITOK).
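In sketch form (close to what the description implies, but not
necessarily the exact in-tree code; tbl, nelements, and mask are a
hypothetical caller's variables):

    void *
    hashinit(int elements, struct malloc_type *type, u_long *hashmask)
    {
            /* The historical interface keeps its blocking behaviour. */
            return (hashinit_flags(elements, type, hashmask, HASH_WAITOK));
    }

    /* A caller that must not sleep passes HASH_NOWAIT and checks for
     * failure: */
    tbl = hashinit_flags(nelements, M_TEMP, &mask, HASH_NOWAIT);
    if (tbl == NULL)
            return (ENOMEM);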
call, its semantics were unintentionally changed. It went from
returning the time state to returning 0 or -1. Since 0 means time
normal, and non-zero effectively only shows up around leap seconds,
this went unnoticed until now. At least unnoticed until someone was
trying to run a binary they didn't have source for and it was
misbehaving...
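For illustration, the kind of consumer that depends on the state return
value (a minimal standalone userland sketch):

    #include <sys/timex.h>
    #include <stdio.h>

    int
    main(void)
    {
            struct ntptimeval ntv;
            int state;

            /* Callers expect the time state (TIME_OK, ..., TIME_ERROR)
             * back, not merely 0 or -1. */
            state = ntp_gettime(&ntv);
            if (state == TIME_ERROR)
                    printf("clock unsynchronized\n");
            else
                    printf("time state %d (TIME_OK is %d)\n", state, TIME_OK);
            return (0);
    }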
Submitted by: Judah Levine
MFC after: 2 weeks
preemptions when adjusting the priority of a thread that is on a run
queue. This was only observed when FULL_PREEMPTION was enabled.
Reported by: kris
Diagnosed by: ups
MFC after: 1 week
the UCB license now excludes the advertising clause. I'm not
interested in it either, so move my copyright. This leaves
only a CGD copyright with the advertising clause.
MFC after: 3 days
- Sort by date in license blocks, oldest copyright first.
- All rights reserved after all copyrights, not just the first.
- Use (c) to be consistent with other entries.
MFC after: 3 days
of max() when computing the divisor in SCHED_TICK_PRI(). This prevents
cases where rounding down would allow the quotient to exceed
SCHED_PRI_RANGE (see the worked example after this list).
- Garbage collect some unused flags and fields.
- Replace TDF_HOLD with sched_pin_td()/sched_unpin_td() since it simply
duplicated this functionality.
- Re-enable the rebalancer by default and fix the sysctl so it can be
modified.
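Worked example for the roundup()-vs-max() change above (standalone, with
illustrative numbers rather than the real SCHED_* macro values):

    #include <stdio.h>

    /* roundup() as defined in <sys/param.h> */
    #define roundup(x, y)   ((((x) + ((y) - 1)) / (y)) * (y))

    int
    main(void)
    {
            int range = 64;         /* stand-in for SCHED_PRI_RANGE */
            int total = 1000;       /* ticks in the sampling window */
            int hits = 1000;        /* ticks charged to the thread (worst case) */
            int div_down = total / range;                   /* 15: divisor rounded down */
            int div_up = roundup(total, range) / range;     /* 16: divisor rounded up */

            printf("rounded down: %d -> index %d (exceeds %d)\n",
                div_down, hits / div_down, range - 1);
            printf("rounded up:   %d -> index %d (within range)\n",
                div_up, hits / div_up, range - 1);
            return (0);
    }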