list lock, as there has been a report that an alternative lock order
is getting introduced. This should help ferret it out.
Reported by: Ed Maste <emaste at phaedrus dot sandvine dot ca>
lists, as well as accessor macros. For now, this is a recursive mutex
due code sequences where IPv4 multicast calls into IGMP calls into
ip_output(), which then tests for a multicast forwarding case.
For support macros in in_var.h to check multicast address lists, assert
that in_multi_mtx is held.
Acquire in_multi_mtx around iteration over the IPv4 multicast address
lists, such as in ip_input() and ip_output().
Acquire in_multi_mtx when manipulating the IPv4 layer multicast addresses,
as well as over the manipulation of ifnet multicast address lists in order
to keep the two layers in sync.
Lock down accesses to IPv4 multicast addresses in IGMP, or assert the
lock when performing IGMP join/leave events.
Eliminate spl's associated with IPv4 multicast addresses, portions of
IGMP that weren't previously expunged by IGMP locking.
Add in_multi_mtx, igmp_mtx, and if_addr_mtx lock order to hard-coded
lock order in WITNESS, in that order.
Problem reported by: Ed Maste <emaste at phaedrus dot sandvine dot ca>
MFC after: 10 days
mutex instead of a MTX_DEF one in order to defer preemption while
reading the date and time registers. If we don't manage to read them
within the time slot where we are guaranteed that no updates occur we
might actually read them during an update in which case the output is
undefined.
3ware's 9xxx series controllers. This corresponds to
the 9.2 release (for FreeBSD 5.2.1) on the 3ware website.
Highlights of this release are:
1. The driver has been re-architected to use a "Common Layer"
(all tw_cl* files), which is a consolidation of all OS-independent
parts of the driver. The FreeBSD OS specific portions of the
driver go into an "OS Layer" (all tw_osl* files).
This re-architecture is to achieve better maintainability, consistency
of behavior across OS's, and better portability to new OS's (drivers
for new OS's can be written by just adding an OS Layer that's specific
to the OS, by complying to a "Common Layer Programming Interface" API.
2. The driver takes advantage of multiple processors.
3. The driver has a new firmware image bundled, the new features of which
include Online Capacity Expansion and multi-lun support, among others.
More details about 3ware's 9.2 release can be found here:
http://www.3ware.com/download/Escalade9000Series/9.2/9.2_Release_Notes_Web.pdf
Since the Common Layer is used across OS's, the FreeBSD specific include
path for header files (/sys/dev/twa) is not part of the #include pre-processor
directive in any of the source files. For being able to integrate twa into
the kernel despite this, Makefile.<arch> has been changed to add the include
path to CFLAGS.
Reviewed by: scottl
at some point result in a status event being triggered (it should
be a link down event: the Microsoft driver design guide says you
should generate one when the NIC is initialized). Some drivers
generate the event during MiniportInitialize(), such that by the
time MiniportInitialize() completes, the NIC is ready to go. But
some drivers, in particular the ones for Atheros wireless NICs,
don't generate the event until after a device interrupt occurs
at some point after MiniportInitialize() has completed.
The gotcha is that you have to wait until the link status event
occurs one way or the other before you try to fiddle with any
settings (ssid, channel, etc...). For the drivers that set the
event sycnhronously this isn't a problem, but for the others
we have to pause after calling ndis_init_nic() and wait for the event
to arrive before continuing. Failing to wait can cause big trouble:
on my SMP system, calling ndis_setstate_80211() after ndis_init_nic()
completes, but _before_ the link event arrives, will lock up or
reset the system.
What we do now is check to see if a link event arrived while
ndis_init_nic() was running, and if it didn't we msleep() until
it does.
Along the way, I discovered a few other problems:
- Defered procedure calls run at PASSIVE_LEVEL, not DISPATCH_LEVEL.
ntoskrnl_run_dpc() has been fixed accordingly. (I read the documentation
wrong.)
- Similarly, the NDIS interrupt handler, which is essentially a
DPC, also doesn't need to run at DISPATCH_LEVEL. ndis_intrtask()
has been fixed accordingly.
- MiniportQueryInformation() and MiniportSetInformation() run at
DISPATCH_LEVEL, and each request must complete before another
can be submitted. ndis_get_info() and ndis_set_info() have been
fixed accordingly.
- Turned the sleep lock that guards the NDIS thread job list into
a spin lock. We never do anything with this lock held except manage
the job list (no other locks are held), so it's safe to do this,
and it's possible that ndis_sched() and ndis_unsched() can be
called from DISPATCH_LEVEL, so using a sleep lock here is
semantically incorrect. Also updated subr_witness.c to add the
lock to the order list.
witness_proc_has_locks(), as they are unused, which results in a compiler
error. This problem was introduced with the implementation of "show
alllocks".
Spotted by: Artem Kuchin <matrix at itlegion dot ru>
of lock types in the kernel. This results in an increase of witness
data usage from ~145k to ~280k on i386 for kernels with
'options WITNESS'.
- Remove the unused witness malloc bucket.
Submitted by: Michal Mertl mime at traveller dot cz (1)
and threads currently holding sleep mutexes (and spin mutexes for
curthread). This can be quite useful in looking for a lock condition
summary for a system, as it avoids manually iterating through threads
and processes to find all the interesting locks.
NB: "alllocks" is up there with "lockedvnods" for a bad argument for
show.
MFC after: 2 weeks
remove previous entropy harvesting mutex names as they are no longer
present. Commit to this file was ommitted when randomdev_soft.c:1.5
was made.
Feet shot: Robert Huff <roberthuff at rcn dot com>
when the spin lock in question isn't -- it's the critical_enter() that
KDB set. No more panic in DDB for console -> syscons -> tty -> knote
operations.
spin-wait code to use the same spin mutex (smp_tlb_mtx) as the TLB ipi
and spin-wait code snippets so that you can't get into the situation of
one CPU doing a TLB shootdown to another CPU that is doing a lazy pmap
shootdown each of which are waiting on each other. With this change, only
one of the CPUs would do an IPI and spin-wait at a time.
o Make debugging code conditional upon KDB instead of DDB.
o s/WITNESS_DDB/WITNESS_KDB/g
o s/witness_ddb/witness_kdb/g
o Rename the debug.witness_ddb sysctl to debug.witness_kdb.
o Call kdb_backtrace() instead of backtrace().
o Call kdb_enter() instead Debugger().
o Assert kdb_active instead of db_active.
assigning a pointer to the list and then dereferencing the pointer as a
second step. When the first spin lock is acquired, curthread is not in
a critical section so it may be preempted and would end up using another
CPUs lock list instead of its own.
When this code was in witness_lock() this sequence was safe as curthread
was in a critical section already since witness_lock() is called after the
lock is acquired.
Tested by: Daniel Lang dl at leo.org
order definition for witness. Send lock before receive lock, and
socket locks after accept but before select:
filedesc -> accept -> so_snd -> so_rcv -> sellck
All routing locks after send lock:
so_rcv -> radix node head
All protocol locks before socket locks:
unp -> so_snd
udp -> udpinp -> so_snd
tcp -> tcpinp -> so_snd
double NULL entries signal Witness to stop processing the array of
order entries meaning none of the spin locks are added resulting in
panics on boot.
- Add a missing NULL, NULL terminator to the Slip locks list to keep them
separate from the spin locks.
relationships:
Sockets: filedesc->accept->sellck
Routing: radix node head->rtentry->ifaddr
UDP: udp->udpinp
TCP: tcp->tcpinp
SLIP: slip_mtx->slip sc_mtx
Drop in a place holder section for UNIX domain sockets. Various
sections to be expanded over the next few days.
to queue threads sleeping on a wait channel similar to how turnstiles are
used to queue threads waiting for a lock. This subsystem will be used as
the backend for sleep/wakeup and condition variables initially. Eventually
it will also be used to replace the ithread-specific iwait thread
inhibitor.
Sleep queues are also not locked by sched_lock, so this splits sched_lock
up a bit further increasing concurrency within the scheduler. Sleep queues
also natively support timeouts on sleeps and interruptible sleeps allowing
for the reduction of a lot of duplicated code between the sleep/wakeup and
condition variable implementations. For more details on the sleep queue
implementation, check the comments in sys/sleepqueue.h and
kern/subr_sleepqueue.c.
- witness_lock() is split into two pieces: witness_checkorder() and
witness_lock(). Witness_checkorder() determines if acquiring a specified
lock at the time it is called would result in a lock order. It
optionally adds a new lock order relationship as well. witness_lock()
updates witness's data structures to assume that a lock has been acquired
by stick a new lock instance in the appropriate lock instance list.
- The mutex and sx lock functions now call checkorder() prior to trying to
acquire a lock and continue to call witness_lock() after the acquire is
completed. This will let witness catch a deadlock before it happens
rather than trying to do so after the threads have deadlocked (i.e. never
actually report it).
- A new function witness_defineorder() has been added that adds a lock
order between two locks at runtime without having to acquire the locks.
If the lock order cannot be added it will return an error. This function
is available to programmers via the WITNESS_DEFINEORDER() macro which
accepts either two mutexes or two sx locks as its arguments.
- A few simple wrapper macros were added to allow developers to call
witness_checkorder() anywhere as a way of enforcing locking assertions
in code that might acquire a certain lock in some situations. The
macros are: witness_check_{mutex,shared_sx,exclusive_sx} and take an
appropriate lock as the sole argument.
- The code to remove a lock instance from a lock list in witness_unlock()
was unnested by using a goto to vastly improve the readability of this
function.
happen in interrupt context; 1) sleep locks, and 2) malloc/free
calls.
1) is fixed by using spin locks instead.
2) is fixed by preallocating a FIFO (implemented with a STAILQ)
and using elements from this FIFO instead. This turns out
to be rather fast.
OK'ed by: re (scottl)
Thanks to: peter, jhb, rwatson, jake
Apologies to: *
- This is heavily derived from John Baldwin's apic/pci cleanup on i386.
- I have completely rewritten or drastically cleaned up some other parts.
(in particular, bootstrap)
- This is still a WIP. It seems that there are some highly bogus bioses
on nVidia nForce3-150 boards. I can't stress how broken these boards
are. I have a workaround in mind, but right now the Asus SK8N is broken.
The Gigabyte K8NPro (nVidia based) is also mind-numbingly hosed.
- Most of my testing has been with SCHED_ULE. SCHED_4BSD works.
- the apic and acpi components are 'standard'.
- If you have an nVidia nForce3-150 board, you are stuck with 'device
atpic' in addition, because they somehow managed to forget to connect the
8254 timer to the apic, even though its in the same silicon! ARGH!
This directly violates the ACPI spec.
turnstiles to implement blocking isntead of implementing a thread queue
directly. These turnstiles are somewhat similar to those used in Solaris 7
as described in Solaris Internals but are also different.
Turnstiles do not come out of a fixed-sized pool. Rather, each thread is
assigned a turnstile when it is created that it frees when it is destroyed.
When a thread blocks on a lock, it donates its turnstile to that lock to
serve as queue of blocked threads. The queue associated with a given lock
is found by a lookup in a simple hash table. The turnstile itself is
protected by a lock associated with its entry in the hash table. This
means that sched_lock is no longer needed to contest on a mutex. Instead,
sched_lock is only used when manipulating run queues or thread priorities.
Turnstiles also implement priority propagation inherently.
Currently turnstiles only support mutexes. Eventually, however, turnstiles
may grow two queue's to support a non-sleepable reader/writer lock
implementation. For more details, see the comments in sys/turnstile.h and
kern/subr_turnstile.c.
The two primary advantages from the turnstile code include: 1) the size
of struct mutex shrinks by four pointers as it no longer stores the
thread queue linkages directly, and 2) less contention on sched_lock in
SMP systems including the ability for multiple CPUs to contend on different
locks simultaneously (not that this last detail is necessarily that much of
a big win). Note that 1) means that this commit is a kernel ABI breaker,
so don't mix old modules with a new kernel and vice versa.
Tested on: i386 SMP, sparc64 SMP, alpha SMP
another thread. We use the td_oncpu member of the other field to locate
it's associated CPU and then search the that CPU's list of spin locks
contained in its per-CPU data. This is not always safe and may in fact
panic or just not work, but it is useful in at least one case.
by allprison_mtx), a unique prison/jail identifier field, two path
fields (pr_path for reporting and pr_root vnode instance) to store
the chroot() point of each jail.
o Add jail_attach(2) to allow a process to bind to an existing jail.
o Add change_root() to perform the chroot operation on a specified
vnode.
o Generalize change_dir() to accept a vnode, and move namei() calls
to callers of change_dir().
o Add a new sysctl (security.jail.list) which is a group of
struct xprison instances that represent a snapshot of active jails.
Reviewed by: rwatson, tjr
as it could be and can do with some more cleanup. Currently its under
options LAZY_SWITCH. What this does is avoid %cr3 reloads for short
context switches that do not involve another user process. ie: we can
take an interrupt, switch to a kthread and return to the user without
explicitly flushing the tlb. However, this isn't as exciting as it could
be, the interrupt overhead is still high and too much blocks on Giant
still. There are some debug sysctls, for stats and for an on/off switch.
The main problem with doing this has been "what if the process that you're
running on exits while we're borrowing its address space?" - in this case
we use an IPI to give it a kick when we're about to reclaim the pmap.
Its not compiled in unless you add the LAZY_SWITCH option. I want to fix a
few more things and get some more feedback before turning it on by default.
This is NOT a replacement for Bosko's lazy interrupt stuff. This was more
meant for the kthread case, while his was for interrupts. Mine helps a
little for interrupts, but his helps a lot more.
The stats are enabled with options SWTCH_OPTIM_STATS - this has been a
pseudo-option for years, I just added a bunch of stuff to it.
One non-trivial change was to select a new thread before calling
cpu_switch() in the first place. This allows us to catch the silly
case of doing a cpu_switch() to the current process. This happens
uncomfortably often. This simplifies a bit of the asm code in cpu_switch
(no longer have to call choosethread() in the middle). This has been
implemented on i386 and (thanks to jake) sparc64. The others will come
soon. This is actually seperate to the lazy switch stuff.
Glanced at by: jake, jhb
is set to 0, it now has the same affect as setting witness_dead used to
have.
- Added a sysctl handler that allows root to change witness_watch from a
non-zero value to zero to disable witness at runtime. Note that you
can't turn witness back on once it is off. You can only turn it off as
a one-way switch.
- Added a comment describing the possible values of witness_watch.