10822 Commits

Author SHA1 Message Date
Jeff Roberson
804e60d4cf - Don't acquire the vnode interlock in _vn_lock() unless no lock type
is requested.  Handle this case specially before the while loop.
 - Use the held vnode lock to check for VI_DOOMED.  The vnode lock and
   interlock must both be held to set VI_DOOMED so either one held, even
   shared, is sufficient to check it.

No objection by:	kib
2008-03-24 04:17:35 +00:00
Konstantin Belousov
1be222e9df Yield the cpu in the kernel while iterating the list of the
vnodes belonging to the mountpoint. Also, yield when in the
softdep_process_worklist() even when we are not going to sleep due to
buffer drain.

It is believed that the ULE fixed the problem [1], but the yielding
seems to be needed at least for the 4BSD case.

Discussed:	on stable@, with bde
Reviewed by:	tegge, jeff [1]
MFC after:	2 weeks
2008-03-23 13:45:24 +00:00
David Xu
34d05d83f6 Remove commented out code, thread suspension is done in thread library. 2008-03-23 02:03:06 +00:00
Jeff Roberson
e6b2545b3b - Only return 1 from sync_vnode() in cases where the vnode is still
at the head of the sync list.  This prevents sched_sync() from
   re-queueing a vnode which may have been freed already.

Discussed with:	kib
2008-03-23 01:44:28 +00:00
Jeff Roberson
f6a8cecfc6 - Pass BO_MTX(bo) to lockmgr in vtruncbuf, we don't own the vnode
interlock here anymore.

Reported by:	kris
2008-03-23 01:42:19 +00:00
Poul-Henning Kamp
4218a7310b In abort2(2): Accept a NULL arg pointer if nargs == 0 2008-03-22 16:32:52 +00:00
Jeff Roberson
698b1a6643 - Complete part of the unfinished bufobj work by consistently using
BO_LOCK/UNLOCK/MTX when manipulating the bufobj.
 - Create a new lock in the bufobj to lock bufobj fields independently.
   This leaves the vnode interlock as an 'identity' lock while the bufobj
   is an io lock.  The bufobj lock is ordered before the vnode interlock
   and also before the mnt ilock.
 - Exploit this new lock order to simplify softdep_check_suspend().
 - A few sync related functions are marked with a new XXX to note that
   we may not properly interlock against a non-zero bv_cnt when
   attempting to sync all vnodes on a mountlist.  I do not believe this
   race is important.  If I'm wrong this will make these locations easier
   to find.

Reviewed by:	kib (earlier diff)
Tested by:	kris, pho (earlier diff)
2008-03-22 09:15:16 +00:00
Alfred Perlstein
435cdf88ea Fix a race where timeout/untimeout could cause crashes for Giant locked
code.

The bug:

There exists a race condition for timeout/untimeout(9) due to the
way that the softclock thread dequeues timeouts.

The softclock thread sets the c_func and c_arg of the callout to
NULL while holding the callout lock but not Giant.  It then drops
the callout lock and acquires Giant.

It is at this point where untimeout(9) on another cpu/thread could
be called.

Since c_arg and c_func are cleared, untimeout(9) does not touch the
callout and returns as if the callout is canceled.

The softclock then tries to acquire Giant and likely blocks due to
the other cpu/thread holding it.

The other cpu/thread then likely deallocates the backing store that
c_arg points to and finishes working and hence drops Giant.

Softclock resumes and acquires giant and calls the function with
the now free'd c_arg and we have corruption/crash.

The fix:

We need to track curr_callout even for timeout(9) (LOCAL_ALLOC)
callouts.  We need to free the callout after the softclock processes
it to deal with the race here.

Obtained from: Juniper Networks, iedowse
Reviewed by: jhb, iedowse
MFC After: 2 weeks.
2008-03-22 07:29:45 +00:00
Konstantin Belousov
e7ffdf423a Reduce contention on the vnode interlock by not acquiring the BO_LOCK
around the check for the BV_BKGRDINPROG in the brelse() and bqrelse().
See the comment for the explanation why it is safe.

Tested by:	pho
Submitted by:	jeff
2008-03-21 12:38:44 +00:00
Jeff Roberson
0169d126a6 - Reduce contention on the global bdonelock and bpinlock by using
a pool mutex to protect these sleep/wakeup/counter races.  This
   still is preferable to bloating each bio with a mtx.
2008-03-21 10:00:05 +00:00
Jeff Roberson
b7edba7704 - Add a new td flag TDF_NEEDSUSPCHK that is set whenever a thread needs
to enter thread_suspend_check().
 - Set TDF_ASTPENDING along with TDF_NEEDSUSPCHK so we can move the
   thread_suspend_check() to ast() rather than userret().
 - Check TDF_NEEDSUSPCHK in the sleepq_catch_signals() optimization so
   that we don't miss a suspend request.  If this is set use the
   expensive signal path.
 - Set NEEDSUSPCHK when creating a new thread in thr in case the
   creating thread is due to be suspended as well but has not yet.

Reviewed by:	davidxu (Authored original patch)
2008-03-21 08:23:25 +00:00
John Baldwin
dcc8106854 Implement a BUS_BIND_INTR() method in the bus interface to bind an IRQ
resource to a CPU.  The default method is to pass the request up to the
parent similar to BUS_CONFIG_INTR() so that all busses don't have to
explicitly implement bus_bind_intr.  A bus_bind_intr(9) wrapper routine
similar to bus_setup/teardown_intr() is added for device drivers to use.
Unbinding an interrupt is done by binding it to NOCPU.  The IRQ resource
must be allocated, but it can happen in any order with respect to
bus_setup_intr().  Currently it is only supported on amd64 and i386 via
nexus(4) methods that simply call the intr_bind() routine.

Tested by:	gallatin
2008-03-20 21:24:32 +00:00
Konstantin Belousov
69aa768aef Fix the leak of the vmspace on the fork when the process limits
are exceeded.

Pointy hat to:	me
MFC after:	3 days
2008-03-20 15:24:49 +00:00
Jeff Roberson
9727e63745 - Restore runq to manipulating threads directly by putting runq links and
rqindex back in struct thread.
 - Compile kern_switch.c independently again and stop #include'ing it from
   schedulers.
 - Remove the ts_thread backpointers and convert most code to go from
   struct thread to struct td_sched.
 - Cleanup the ts_flags #define garbage that was causing us to sometimes
   do things that expanded to td->td_sched->ts_thread->td_flags in 4BSD.
 - Export the kern.sched sysctl node in sysctl.h
2008-03-20 05:51:16 +00:00
Jeff Roberson
52e95411f8 - Remove the unused and redundant sched_newproc() function.
- Remove the unused and redundant sched_newthread() which peaks into scheduler
   private structures.
2008-03-20 03:09:15 +00:00
Jeff Roberson
79813875ab - There is no sense in calling sched_newthread() at thread_init() and
thread_fini().  The schedulers initialize themselves properly during
   sched_fork_thread() anyhow.  fini is only called when we're returning
   the memory to the allocator which surely doesn't care what state the
   memory is in.
2008-03-20 03:07:57 +00:00
Jeff Roberson
8b16c208e6 - ULE and 4BSD share only one line of code from sched_newthread() so implement
the required pieces in sched_fork_thread().  The td_sched pointer is already
   setup by thread_init anyway.
2008-03-20 03:06:33 +00:00
Jeff Roberson
0ac213ef80 - Don't call the empty sched_newproc() function. sched_newproc() already
existed as sched_fork() which is a non empty function in both schedulers.
2008-03-20 03:05:17 +00:00
Jeff Roberson
a90f3f2547 - Move maybe_preempt() from kern_switch.c to sched_4bsd.c. This is function
is only used by 4bsd.
 - Create a new runq_choose_fuzz() function rather than polluting runq_choose()
   with 4BSD specific code.
 - Move the fuzz sysctl into sched_4bsd.c
 - Remove some dead code from kern_switch.c
2008-03-20 02:14:02 +00:00
Jeff Roberson
a564bfc7fa - Directly include opt_sched.h in sched_4bsd. 2008-03-20 01:32:48 +00:00
Maxim Sobolev
073d8ba485 Revert previous change - it appears that the limit I was hitting was a
maxsockets limit, not maxfiles limit. The question remains why those
limits are handled differently (with error code for maxfiles but with
sleep for maxsokets), but those would be addressed in a separate commit
if necessary.

Requested by:   rwhatson, jeff
2008-03-19 09:58:25 +00:00
Pawel Jakub Dawidek
4682cd0b7d Remove extra uihold() call that accidentally sneak in during perforce
change @125544.
2008-03-19 07:52:07 +00:00
Jeff Roberson
6d55b3ec9c - Remove some dead code and comments related to KSE.
- Don't set tdq_lowpri on every switch, it should be precisely maintained now.
 - Add some comments to sched_thread_priority().
2008-03-19 07:36:37 +00:00
Jeff Roberson
241fbd3d13 - At the top of sleepq_catch_signals() lock the thread and check TDF_NEEDSIGCHK
before doing the very expensive cursig() and related locking.  NEEDSIGCHK
   is updated whenever our signal mask change or when a signal is delivered and
   should be sufficient to avoid the more expensive tests.  This eliminates
   another source of PROC_LOCK contention in multithreaded programs.
2008-03-19 07:35:14 +00:00
Jeff Roberson
bd4e153568 - Remove stale comment.
- In the last revision the code was changed to use maxfilesperproc rather than
   the per-process file limit to restrict the size of the poll array.  This
   eliminates a significant source of process lock contention in multithreaded
   programs and is cheaper.  This had been committed with the wrong batch of
   changes.
2008-03-19 07:33:16 +00:00
Jeff Roberson
afc5854dbc - Add a facility similar to LOCK_PROFILING under SLEEPQUEUE_PROFILING. Keep
a simple (wmesg, count) tuple in a hash to keep track of how many times
   we sleep at each wait message.  We hash on message and not channel.  No
   line number information is given as typically wait messages are not used in
   more than one place.  Identical strings defined at different addresses will
   show up with seperate counters.
 - Use debug.sleepq.enable to enable, .reset to reset, and .stats dumps stats.
 - Do an unsynchronized check in sleepq_switch() prior to switching before
   calling sleepq_profile() which uses a global lock to synchronize the hash.
   Only sleeps which actually cause a context switch are counted.
2008-03-19 07:22:07 +00:00
Jeff Roberson
fbd762f197 - Fix the last of the threading bugs that were introduced as far back as
1.38 in 2001.  Break out of the FOREACH_THREAD_IN_PROC loop when we've
   discovered a new proc in the chain.
 - Increment i and check for maxlockdepth once per matching process not
   once per thread.  This didn't properly terminate the loop before.
 - Fix a bug which has existed potentially since rev 1.1.  waitblock->lf_next
   can be NULL when a thread has been woken-up but not yet scheduled.  Check
   for this condition rather than blindly dereferencing.

Found by:	libMicro
2008-03-19 07:13:24 +00:00
Jeff Roberson
45aea8de6e - Restore the NULL check for td_cpuset. This can happen if a partially
constructed thread was torn down as is the case when we fail to allocate
   a kernel stack.
2008-03-19 06:20:21 +00:00
Jeff Roberson
374ae2a393 - Relax requirements for p_numthreads, p_threads, p_swtick, and p_nice from
requiring the per-process spinlock to only requiring the process lock.
 - Reflect these changes in the proc.h documentation and consumers throughout
   the kernel.  This is a substantial reduction in locking cost for these
   fields and was made possible by recent changes to threading support.
2008-03-19 06:19:01 +00:00
John Baldwin
6d2d1c044f Simplify the interrupt code a bit:
- Always include the ie_disable and ie_eoi methods in 'struct intr_event'
  and collapse down to one intr_event_create() routine.  The disable and
  eoi hooks simply aren't used currently in the !INTR_FILTER case.
- Expand 'disab' to 'disable' in a few places.
- Use function casts for arm and i386:intr_eoi_src() instead of wrapper
  routines since to trim one extra indirection.

Compiled on:	{arm,amd64,i386,ia64,ppc,sparc64} x {FILTER, !FILTER}
Tested on:	{amd64,i386} x {FILTER, !FILTER}
2008-03-17 22:42:01 +00:00
Konstantin Belousov
aeeb4202df Fix two races in the handling of the d_gianttrick for the D_NEEDGIANT
drivers.

In the giant_XXX wrappers for the device methods of the D_NEEDGIANT
drivers, do not dereference the cdev->si_devsw. It is racing with
the destroy_devl() clearing of the si_devsw. Instead, use the
dev_refthread() and return ENXIO for the destroyed device. [1]

The check for the D_INIT in the prep_cdevsw() was not synchronized with
the call of the fini_cdevsw() in destroy_devl(), that under rapid device
creation/destruction may result in the use of uninitialized cdevsw [2].
Change the protocol for the prep_cdevsw(), requiring it to be called
under dev_mtx, where the check for D_INIT is done.

Do not free the memory allocated for the gianttrick cdevsw while holding
the dev_mtx, put it into the free list to be freed later. Reuse the
d_gianttrick pointer to keep the size and layout of the struct cdevsw
(requested by phk). Free the memory in the dev_unlock_and_free(), and do
all the free after the dev_mtx is dropped (suggested by jhb).

Reported by:	bsdimp + many [1], pho [2]
Reviewed by:	phk, jhb
Tested by:	pho
MFC after:	1 week
2008-03-17 13:17:10 +00:00
Pawel Jakub Dawidek
4582cb68b1 - There is no more "uidinfo struct" mutex.
- The "uidinfo hash" lock is now a rwlock.

Reminded by:	kib
2008-03-17 11:48:40 +00:00
Pawel Jakub Dawidek
709446e782 Whitespace cleanups. 2008-03-16 21:32:20 +00:00
Pawel Jakub Dawidek
1b072fbcab - Use wait-free method to manage ui_sbsize and ui_proccnt fields in the
uidinfo structure. This entirely removes contention observed on the
  ui_mtxp mutex (as it is now gone).
- Convert the uihashtbl_mtx mutex to a rwlock, as most of the time we just
  need to read-lock it.

Reviewed by:	jhb, jeff, kris & others
Tested by:	kris
2008-03-16 21:29:02 +00:00
Robert Watson
45fa2c8a87 Consistently use ANSI C declarationsfor all functions in kern_synch.c. 2008-03-16 18:59:21 +00:00
Pawel Jakub Dawidek
e056770745 Style fixes. 2008-03-16 18:26:59 +00:00
Pawel Jakub Dawidek
67e83b07c6 Fix information leak. We can find PIDs of running processes from within
a jail, etc. by simply calling setpriority(PRIO_PROCESS, <PID>, 0) and
checking the return value: 0 means that the process exists and -1 that
it doesn't exist.

Reviewed by:	rwatson
MFC after:	1 week
2008-03-16 17:55:06 +00:00
Robert Watson
237fdd787b In keeping with style(9)'s recommendations on macros, use a ';'
after each SYSINIT() macro invocation.  This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.

MFC after:	1 month
Discussed with:	imp, rink
2008-03-16 10:58:09 +00:00
Maxim Sobolev
c9370ff4d0 Properly set size of the file_zone to match kern.maxfiles parameter.
Otherwise the parameter is no-op, since zone by default limits number
of descriptors to some 12K entries. Attempt to allocate more ends up
sleeping on zonelimit.

MFC after:	2 weeks
2008-03-16 06:21:30 +00:00
Ruslan Ermilov
1f49b573e1 Fix panic on e.g. "kldload /dev/null".
PR:		kern/121427
Reviewed by:	sem
MFC after:	3 days
2008-03-15 17:40:18 +00:00
John Baldwin
eaf86d1678 Add preliminary support for binding interrupts to CPUs:
- Add a new intr_event method ie_assign_cpu() that is invoked when the MI
  code wishes to bind an interrupt source to an individual CPU.  The MD
  code may reject the binding with an error.  If an assign_cpu function
  is not provided, then the kernel assumes the platform does not support
  binding interrupts to CPUs and fails all requests to do so.
- Bind ithreads to CPUs on their next execution loop once an interrupt
  event is bound to a CPU.  Only shared ithreads are bound.  We currently
  leave private ithreads for drivers using filters + ithreads in the
  INTR_FILTER case unbound.
- A new intr_event_bind() routine is used to bind an interrupt event to
  a CPU.
- Implement binding on amd64 and i386 by way of the existing pic_assign_cpu
  PIC method.
- For x86, provide a 'intr_bind(IRQ, cpu)' wrapper routine that looks up
  an interrupt source and binds its interrupt event to the specified CPU.
  MI code can currently (ab)use this by doing:

	intr_bind(rman_get_start(irq_res), cpu);

  however, I plan to add a truly MI interface (probably a bus_bind_intr(9))
  where the implementation in the x86 nexus(4) driver would end up calling
  intr_bind() internally.

Requested by:	kmacy, gallatin, jeff
Tested on:	{amd64, i386} x {regular, INTR_FILTER}
2008-03-14 19:41:48 +00:00
John Baldwin
d628fbfa98 Make the function prototype for cpu_search() match the declaration so that
this still compiles with gcc3.
2008-03-14 15:22:38 +00:00
Jeff Roberson
f4d77e9e54 PR 117603
- Close a sleepqueue signal race by interlocking with the per-process
   spinlock.  This was mistakenly omitted from the thread_lock patch and
   has been a race since.

MFC After:	1 week
PR:		bin/117603
Reported by:	Danny Braniss <danny@cs.huji.ac.il>
2008-03-13 00:46:12 +00:00
Jeff Roberson
6617724c5f Remove kernel support for M:N threading.
While the KSE project was quite successful in bringing threading to
FreeBSD, the M:N approach taken by the kse library was never developed
to its full potential.  Backwards compatibility will be provided via
libmap.conf for dynamically linked binaries and static binaries will
be broken.
2008-03-12 10:12:01 +00:00
Jeff Roberson
c5aa6b581d - Pass the priority argument from *sleep() into sleepq and down into
sched_sleep().  This removes extra thread_lock() acquisition and
   allows the scheduler to decide what to do with the static boost.
 - Change the priority arguments to cv_* to match sleepq/msleep/etc.
   where 0 means no priority change.  Catch -1 in cv_broadcastpri() and
   convert it to 0 for now.
 - Set a flag when sleeping in a way that is compatible with swapping
   since direct priority comparisons are meaningless now.
 - Add a sysctl to ule, kern.sched.static_boost, that defaults to on which
   controls the boost behavior.  Turning it off gives better performance
   in some workloads but needs more investigation.
 - While we're modifying sleepq, change signal and broadcast to both
   return with the lock held as the lock was held on enter.

Reviewed by:	jhb, peter
2008-03-12 06:31:06 +00:00
Jeff Roberson
bdb5bdf0b7 - KSE may free a thread that was never actually forked. This will leave
td_cpuset NULL.  Check for this condition before dereferencing the
   cpuset.

Reported by:	david@catwhisker.org, miwi@freebsd.org
Sponsored by:	Nokia
2008-03-12 05:01:14 +00:00
Jeff Roberson
c143ac21af - Fix the invalid priority panics people are seeing by forcing
tdq_runq_add to select the runq rather than hoping we set it properly
   when we adjusted the priority.  This involves the same number of
   branches as before so should perform identically without the extra
   fragility.

Tested by:	bz
Reviewed by:	bz
2008-03-10 22:48:27 +00:00
Jeff Roberson
7217d8d1ee - Don't rely on a side effect of sched_prio() to set the initial ts_runq
for thread0.  Set it directly in sched_setup().  This fixes traps on boot
   seen on some machines.

Reported by:	phk
2008-03-10 09:50:29 +00:00
Jeff Roberson
8f93d79d05 - Handle kdb switch panics outside of mi_switch() to remove some instructions
from the common path and make the code more clear.  Whether this has any
   impact on performance may depend on optimization levels.

Sponsored by:	Nokia
2008-03-10 03:16:51 +00:00
Jeff Roberson
73daf66f41 Reduce ULE context switch time by over 25%.
- Only calculate timeshare priorities once per tick or when a thread is woken
   from sleeping.
 - Keep the ts_runq pointer valid after all priority changes.
 - Call tdq_runq_add() directly from sched_switch() without passing in via
   tdq_add().  We don't need to adjust loads or runqs anymore.
 - Sort tdq and ts_sched according to utilization to improve cache behavior.

Sponsored by:	Nokia
2008-03-10 03:15:19 +00:00