explicitly select write locking for all use of the inpcb mutex.
Update some pcbinfo lock assertions to assert locked rather than
write-locked, although in practice almost all uses of the pcbinfo
rwlock main exclusive, and all instances of inpcb lock acquisition
are exclusive.
This change should introduce (ideally) little functional change.
However, it lays the groundwork for significantly increased
parallelism in the TCP/IP code.
MFC after: 3 months
Tested by: kris (superset of committered patch)
two ticks by counting the number of switches and the load when
sched_clock() is called.
- If the busy metric exceeds a threshold allow the idle thread to spin
waiting for new work for a brief period to avoid using IPIs. This
reduces the cost on the sender and receiver as well as reducing wakeup
latency considerably when it works.
Sponsored by: Nokia
variables and sysctl nodes.
- In reset walk the children of kern_sched_stats and reset the counters
via the oid_arg1 pointer. This allows us to add arbitrary counters to
the tree and still reset them properly.
- Define a set of switch types to be passed with flags to mi_switch().
These types are named SWT_*. These types correspond to SCHED_STATS
counters and are automatically handled in this way.
- Make the new SWT_ types more specific than the older switch stats.
There are now stats for idle switches, remote idle wakeups, remote
preemption ithreads idling, etc.
- Add switch statistics for ULE's pickcpu algorithm. These stats include
how much migration there is, how often affinity was successful, how
often threads were migrated to the local cpu on wakeup, etc.
Sponsored by: Nokia
filesystem-specific vnode data to the struct vnode. Provide the
default implementation for the vop_advlock and vop_advlockasync.
Purge the locks on the vnode reclaim by using the lf_purgelocks().
The default implementation is augmented for the nfs and smbfs.
In the nfs_advlock, push the Giant inside the nfs_dolock.
Before the change, the vop_advlock and vop_advlockasync have taken the
unlocked vnode and dereferenced the fs-private inode data, racing with
with the vnode reclamation due to forced unmount. Now, the vop_getattr
under the shared vnode lock is used to obtain the inode size, and
later, in the lf_advlockasync, after locking the vnode interlock, the
VI_DOOMED flag is checked to prevent an operation on the doomed vnode.
The implementation of the lf_purgelocks() is submitted by dfr.
Reported by: kris
Tested by: kris, pho
Discussed with: jeff, dfr
MFC after: 2 weeks
o Implement IPI_PREEMPT,
o Set td_lock for the thread being switched out,
o For ULE & SMP, loop while td_lock points to blocked_lock for
the thread being switched in,
o Enable ULE by default in GENERIC and SKI,
public namespace for WITNESS as they are only used internally so just
move them in the private namespace for the subsystem (with all related
supporting definitions).
under bootverbose.
Struct ct is used for setting/reading real time clocks and I'm about
to Do Things to some of those, so a bit of preemptive debugging is
in order.
Remove a pointless __inline.
the only one difference is that lockmgr*() functions now accept
LK_NOWITNESS flag which skips ordering for the instanced calling.
- Remove an unuseful stub in witness_checkorder() (because the above check
doesn't allow ever happening) and allow witness_upgrade() to accept
non-try operation too.
lookup hard interrupt events by number. Ignore the irq# for soft intrs.
- Add support to cpuset for binding hardware interrupts. This has the
side effect of binding any ithread associated with the hard interrupt.
As per restrictions imposed by MD code we can only bind interrupts to
a single cpu presently. Interrupts can be 'unbound' by binding them
to all cpus.
Reviewed by: jhb
Sponsored by: Nokia
no longer needed, but for now we still want to be consistent with other
similar checks in the tree.
- Call ASSERT_VOP_ELOCKED() only when vget() returns 0.
Reviewed by: jeff
o create a private task queue thread that sets up root and current
directories (hooking mountroot event as needed); this is necessary
because task queue threads are parented from proc0 and it does not
have a reference to rootvnode (lost when / mounting moved to init)
o bounce image load + unload requests through the private task q so
we can load images even when the request is made from a thread that
does not have sufficient context (e.g. task q thread)
o add a check in the task q thread to fail requests before root is
mounted (just in case)
Reviewed by: jhb, mlaier, luigi (glance)
MFC after: 1 month
bit in order to allow per-bit checks on the options flag, in particular
in the consumers code [1]
- Re-enable the check against TDP_DEADLKTREAT as the anti-waiters
starvation patch allows exclusive waiters to override new shared
requests.
[1] Requested by: pjd, jeff
state transitioning flags and of msleep(9) callings.
Use, instead, an algorithm very similar to what sx(9) and rwlock(9)
alredy do and direct accesses to the sleepqueue(9) primitive.
In order to avoid writer starvation a mechanism very similar to what
rwlock(9) uses now is implemented, with the correspective per-thread
shared lockmgrs counter.
This patch also adds 2 new functions to lockmgr KPI: lockmgr_rw() and
lockmgr_args_rw(). These two are like the 2 "normal" versions, but they
both accept a rwlock as interlock. In order to realize this, the general
lockmgr manager function "__lockmgr_args()" has been implemented through
the generic lock layer. It supports all the blocking primitives, but
currently only these 2 mappers live.
The patch drops the support for WITNESS atm, but it will be probabilly
added soon. Also, there is a little race in the draining code which is
also present in the current CVS stock implementation: if some sharers,
once they wakeup, are in the runqueue they can contend the lock with
the exclusive drainer. This is hard to be fixed but the now committed
code mitigate this issue a lot better than the (past) CVS version.
In addition assertive KA_HELD and KA_UNHELD have been made mute
assertions because they are dangerous and they will be nomore supported
soon.
In order to avoid namespace pollution, stack.h is splitted into two
parts: one which includes only the "struct stack" definition (_stack.h)
and one defining the KPI. In this way, newly added _lockmgr.h can
just include _stack.h.
Kernel ABI results heavilly changed by this commit (the now committed
version of "struct lock" is a lot smaller than the previous one) and
KPI results broken by lockmgr_rw() / lockmgr_args_rw() introduction,
so manpages and __FreeBSD_version will be updated accordingly.
Tested by: kris, pho, jeff, danger
Reviewed by: jeff
Sponsored by: Google, Summer of Code program 2007
allows all the INTR_FILTER #ifdef's to be removed from the MD interrupt
code.
- Rename the intr_event 'eoi', 'disable', and 'enable' hooks to
'post_filter', 'pre_ithread', and 'post_ithread' to be less x86-centric.
Also, add a comment describe what the MI code expects them to do.
- On amd64, i386, and powerpc this is effectively a NOP.
- On arm, don't bother masking the interrupt unless the ithread is
scheduled in the non-INTR_FILTER case to match what INTR_FILTER did.
Also, don't bother unmasking the interrupt in the post_filter case if
we never masked it. The INTR_FILTER case had been doing this by having
arm_unmask_irq for the post_filter (formerly 'eoi') hook.
- On ia64, stray interrupts are now masked for the non-INTR_FILTER case.
They were already masked in the INTR_FILTER case.
- On sparc64, use the a NULL pre_ithread hook and use intr_enable_eoi() for
both the 'post_filter' and 'post_ithread' hooks to match what the
non-INTR_FILTER code did.
- On sun4v, retire the ithread wrapper hack by using an appropriate
'post_ithread' hook instead (it's what 'post_ithread'/'enable' was
designed to do even in 5.x).
Glanced at by: piso
Reviewed by: marius
Requested by: marius [1], [5]
Tested on: amd64, i386, arm, sparc64
UMA_SLAB_KERNEL for consistency with its sibling UMA_SLAB_KMEM.
(UMA_SLAB_KMAP met its original demise in revision 1.30 of
vm/uma_core.c.) UMA_SLAB_KERNEL is now required by the jumbo frame
allocators. Without it, UMA cannot correctly return pages from the
jumbo frame zones to the VM system because it resets the pages' object
field to NULL instead of the kernel object. In more detail, the jumbo
frame zones are created with the option UMA_ZONE_REFCNT. This causes
UMA to overwrite the pages' object field with the address of the slab.
However, when UMA wants to release these pages, it doesn't know how to
restore the object field, so it sets it to NULL. This change teaches
UMA how to reset the object field to the kernel object.
Crashes reported by: kris
Fix tested by: kris
Fix discussed with: jeff
MFC after: 6 weeks
spinning when readers hold a lock. This spinning is speculative because,
unlike the write case, we can not test whether the owners are running.
- Add speculative read spinning for readers who are blocked by pending
writers while a read lock is still held. This allows the thread to
spin until the write lock succeeds after which it may spin until the
writer has released the lock. This prevents excessive context switches
when readers and writers both hold the lock for brief periods.
Sponsored by: Nokia
fixed pri boost with '1' or any priority less than the current thread's
priority with a value greater than two. Default the boost to
PRI_MIN_TIMESHARE to prevent regular user-space threads from starving
threads in the kernel. This prevents these user-threads from also
being scheduled as if they are high fixed-priority kernel threads.
- Restore the setting of lowpri in tdq_choose(). It has to be either here
or in sched_switch(). I accidentally removed it from both places.
Tested by: kris
do this either. Simply check P_NOLOAD. It'd be nice if this was
in a thread flag so we didn't have an extra cache miss every time we
add and remove a thread from the run-queue.
- Move callout thread creation from kern_intr.c to kern_timeout.c
- Call callout_tick() on every processor via hardclock_cpu() rather than
inspecting callout internal details in kern_clock.c.
- Remove callout implementation details from callout.h
- Package up all of the global variables into a per-cpu callout structure.
- Start one thread per-cpu. Threads are not strictly bound. They prefer
to execute on the native cpu but may migrate temporarily if interrupts
are starving callout processing.
- Run all callouts by default in the thread for cpu0 to maintain current
ordering and concurrency guarantees. Many consumers may not properly
handle concurrent execution.
- The new callout_reset_on() api allows specifying a particular cpu to
execute the callout on. This may migrate a callout to a new cpu.
callout_reset() schedules on the last assigned cpu while
callout_reset_curcpu() schedules on the current cpu.
Reviewed by: phk
Sponsored by: Nokia
These functions try the specified operation (rlocking and wlocking) and
true is returned if the operation completes, false otherwise.
The KPI is enriched by this commit, so __FreeBSD_version bumping and
manpage updating will happen soon.
Requested by: jeff, kris