Commit Graph

10374 Commits

Author SHA1 Message Date
kib
5ddf5664cc Yield the cpu in the kernel while iterating the list of the
vnodes belonging to the mountpoint. Also, yield when in the
softdep_process_worklist() even when we are not going to sleep due to
buffer drain.

It is believed that the ULE fixed the problem [1], but the yielding
seems to be needed at least for the 4BSD case.

Discussed:	on stable@, with bde
Reviewed by:	tegge, jeff [1]
MFC after:	2 weeks
2008-03-23 13:45:24 +00:00
davidxu
c32a483ae9 Remove commented out code, thread suspension is done in thread library. 2008-03-23 02:03:06 +00:00
jeff
8103d042fb - Only return 1 from sync_vnode() in cases where the vnode is still
at the head of the sync list.  This prevents sched_sync() from
   re-queueing a vnode which may have been freed already.

Discussed with:	kib
2008-03-23 01:44:28 +00:00
jeff
73b6a5597c - Pass BO_MTX(bo) to lockmgr in vtruncbuf, we don't own the vnode
interlock here anymore.

Reported by:	kris
2008-03-23 01:42:19 +00:00
phk
5a1f4173f5 In abort2(2): Accept a NULL arg pointer if nargs == 0 2008-03-22 16:32:52 +00:00
jeff
a9d123c3ab - Complete part of the unfinished bufobj work by consistently using
BO_LOCK/UNLOCK/MTX when manipulating the bufobj.
 - Create a new lock in the bufobj to lock bufobj fields independently.
   This leaves the vnode interlock as an 'identity' lock while the bufobj
   is an io lock.  The bufobj lock is ordered before the vnode interlock
   and also before the mnt ilock.
 - Exploit this new lock order to simplify softdep_check_suspend().
 - A few sync related functions are marked with a new XXX to note that
   we may not properly interlock against a non-zero bv_cnt when
   attempting to sync all vnodes on a mountlist.  I do not believe this
   race is important.  If I'm wrong this will make these locations easier
   to find.

Reviewed by:	kib (earlier diff)
Tested by:	kris, pho (earlier diff)
2008-03-22 09:15:16 +00:00
alfred
b283b3e59a Fix a race where timeout/untimeout could cause crashes for Giant locked
code.

The bug:

There exists a race condition for timeout/untimeout(9) due to the
way that the softclock thread dequeues timeouts.

The softclock thread sets the c_func and c_arg of the callout to
NULL while holding the callout lock but not Giant.  It then drops
the callout lock and acquires Giant.

It is at this point where untimeout(9) on another cpu/thread could
be called.

Since c_arg and c_func are cleared, untimeout(9) does not touch the
callout and returns as if the callout is canceled.

The softclock then tries to acquire Giant and likely blocks due to
the other cpu/thread holding it.

The other cpu/thread then likely deallocates the backing store that
c_arg points to and finishes working and hence drops Giant.

Softclock resumes and acquires giant and calls the function with
the now free'd c_arg and we have corruption/crash.

The fix:

We need to track curr_callout even for timeout(9) (LOCAL_ALLOC)
callouts.  We need to free the callout after the softclock processes
it to deal with the race here.

Obtained from: Juniper Networks, iedowse
Reviewed by: jhb, iedowse
MFC After: 2 weeks.
2008-03-22 07:29:45 +00:00
kib
bc4bc893dd Reduce contention on the vnode interlock by not acquiring the BO_LOCK
around the check for the BV_BKGRDINPROG in the brelse() and bqrelse().
See the comment for the explanation why it is safe.

Tested by:	pho
Submitted by:	jeff
2008-03-21 12:38:44 +00:00
jeff
72142b2fae - Reduce contention on the global bdonelock and bpinlock by using
a pool mutex to protect these sleep/wakeup/counter races.  This
   still is preferable to bloating each bio with a mtx.
2008-03-21 10:00:05 +00:00
jeff
ba540b27d6 - Add a new td flag TDF_NEEDSUSPCHK that is set whenever a thread needs
to enter thread_suspend_check().
 - Set TDF_ASTPENDING along with TDF_NEEDSUSPCHK so we can move the
   thread_suspend_check() to ast() rather than userret().
 - Check TDF_NEEDSUSPCHK in the sleepq_catch_signals() optimization so
   that we don't miss a suspend request.  If this is set use the
   expensive signal path.
 - Set NEEDSUSPCHK when creating a new thread in thr in case the
   creating thread is due to be suspended as well but has not yet.

Reviewed by:	davidxu (Authored original patch)
2008-03-21 08:23:25 +00:00
jhb
6cf6d7b22b Implement a BUS_BIND_INTR() method in the bus interface to bind an IRQ
resource to a CPU.  The default method is to pass the request up to the
parent similar to BUS_CONFIG_INTR() so that all busses don't have to
explicitly implement bus_bind_intr.  A bus_bind_intr(9) wrapper routine
similar to bus_setup/teardown_intr() is added for device drivers to use.
Unbinding an interrupt is done by binding it to NOCPU.  The IRQ resource
must be allocated, but it can happen in any order with respect to
bus_setup_intr().  Currently it is only supported on amd64 and i386 via
nexus(4) methods that simply call the intr_bind() routine.

Tested by:	gallatin
2008-03-20 21:24:32 +00:00
kib
28174d9ffb Fix the leak of the vmspace on the fork when the process limits
are exceeded.

Pointy hat to:	me
MFC after:	3 days
2008-03-20 15:24:49 +00:00
jeff
a3f8e0c20d - Restore runq to manipulating threads directly by putting runq links and
rqindex back in struct thread.
 - Compile kern_switch.c independently again and stop #include'ing it from
   schedulers.
 - Remove the ts_thread backpointers and convert most code to go from
   struct thread to struct td_sched.
 - Cleanup the ts_flags #define garbage that was causing us to sometimes
   do things that expanded to td->td_sched->ts_thread->td_flags in 4BSD.
 - Export the kern.sched sysctl node in sysctl.h
2008-03-20 05:51:16 +00:00
jeff
4274384df8 - Remove the unused and redundant sched_newproc() function.
- Remove the unused and redundant sched_newthread() which peaks into scheduler
   private structures.
2008-03-20 03:09:15 +00:00
jeff
898428987b - There is no sense in calling sched_newthread() at thread_init() and
thread_fini().  The schedulers initialize themselves properly during
   sched_fork_thread() anyhow.  fini is only called when we're returning
   the memory to the allocator which surely doesn't care what state the
   memory is in.
2008-03-20 03:07:57 +00:00
jeff
19aab7bccf - ULE and 4BSD share only one line of code from sched_newthread() so implement
the required pieces in sched_fork_thread().  The td_sched pointer is already
   setup by thread_init anyway.
2008-03-20 03:06:33 +00:00
jeff
055d72e2b9 - Don't call the empty sched_newproc() function. sched_newproc() already
existed as sched_fork() which is a non empty function in both schedulers.
2008-03-20 03:05:17 +00:00
jeff
ded0975003 - Move maybe_preempt() from kern_switch.c to sched_4bsd.c. This is function
is only used by 4bsd.
 - Create a new runq_choose_fuzz() function rather than polluting runq_choose()
   with 4BSD specific code.
 - Move the fuzz sysctl into sched_4bsd.c
 - Remove some dead code from kern_switch.c
2008-03-20 02:14:02 +00:00
jeff
a2c51b3fe4 - Directly include opt_sched.h in sched_4bsd. 2008-03-20 01:32:48 +00:00
sobomax
d818a8db68 Revert previous change - it appears that the limit I was hitting was a
maxsockets limit, not maxfiles limit. The question remains why those
limits are handled differently (with error code for maxfiles but with
sleep for maxsokets), but those would be addressed in a separate commit
if necessary.

Requested by:   rwhatson, jeff
2008-03-19 09:58:25 +00:00
pjd
da476acd46 Remove extra uihold() call that accidentally sneak in during perforce
change @125544.
2008-03-19 07:52:07 +00:00
jeff
d6d07d2730 - Remove some dead code and comments related to KSE.
- Don't set tdq_lowpri on every switch, it should be precisely maintained now.
 - Add some comments to sched_thread_priority().
2008-03-19 07:36:37 +00:00
jeff
d43ad8d37e - At the top of sleepq_catch_signals() lock the thread and check TDF_NEEDSIGCHK
before doing the very expensive cursig() and related locking.  NEEDSIGCHK
   is updated whenever our signal mask change or when a signal is delivered and
   should be sufficient to avoid the more expensive tests.  This eliminates
   another source of PROC_LOCK contention in multithreaded programs.
2008-03-19 07:35:14 +00:00
jeff
d4862a02d2 - Remove stale comment.
- In the last revision the code was changed to use maxfilesperproc rather than
   the per-process file limit to restrict the size of the poll array.  This
   eliminates a significant source of process lock contention in multithreaded
   programs and is cheaper.  This had been committed with the wrong batch of
   changes.
2008-03-19 07:33:16 +00:00
jeff
ea2b75bd30 - Add a facility similar to LOCK_PROFILING under SLEEPQUEUE_PROFILING. Keep
a simple (wmesg, count) tuple in a hash to keep track of how many times
   we sleep at each wait message.  We hash on message and not channel.  No
   line number information is given as typically wait messages are not used in
   more than one place.  Identical strings defined at different addresses will
   show up with seperate counters.
 - Use debug.sleepq.enable to enable, .reset to reset, and .stats dumps stats.
 - Do an unsynchronized check in sleepq_switch() prior to switching before
   calling sleepq_profile() which uses a global lock to synchronize the hash.
   Only sleeps which actually cause a context switch are counted.
2008-03-19 07:22:07 +00:00
jeff
4cd4553bb5 - Fix the last of the threading bugs that were introduced as far back as
1.38 in 2001.  Break out of the FOREACH_THREAD_IN_PROC loop when we've
   discovered a new proc in the chain.
 - Increment i and check for maxlockdepth once per matching process not
   once per thread.  This didn't properly terminate the loop before.
 - Fix a bug which has existed potentially since rev 1.1.  waitblock->lf_next
   can be NULL when a thread has been woken-up but not yet scheduled.  Check
   for this condition rather than blindly dereferencing.

Found by:	libMicro
2008-03-19 07:13:24 +00:00
jeff
4350e599a3 - Restore the NULL check for td_cpuset. This can happen if a partially
constructed thread was torn down as is the case when we fail to allocate
   a kernel stack.
2008-03-19 06:20:21 +00:00
jeff
46f09d5bc3 - Relax requirements for p_numthreads, p_threads, p_swtick, and p_nice from
requiring the per-process spinlock to only requiring the process lock.
 - Reflect these changes in the proc.h documentation and consumers throughout
   the kernel.  This is a substantial reduction in locking cost for these
   fields and was made possible by recent changes to threading support.
2008-03-19 06:19:01 +00:00
jhb
c04bb048f6 Simplify the interrupt code a bit:
- Always include the ie_disable and ie_eoi methods in 'struct intr_event'
  and collapse down to one intr_event_create() routine.  The disable and
  eoi hooks simply aren't used currently in the !INTR_FILTER case.
- Expand 'disab' to 'disable' in a few places.
- Use function casts for arm and i386:intr_eoi_src() instead of wrapper
  routines since to trim one extra indirection.

Compiled on:	{arm,amd64,i386,ia64,ppc,sparc64} x {FILTER, !FILTER}
Tested on:	{amd64,i386} x {FILTER, !FILTER}
2008-03-17 22:42:01 +00:00
kib
d5211e24af Fix two races in the handling of the d_gianttrick for the D_NEEDGIANT
drivers.

In the giant_XXX wrappers for the device methods of the D_NEEDGIANT
drivers, do not dereference the cdev->si_devsw. It is racing with
the destroy_devl() clearing of the si_devsw. Instead, use the
dev_refthread() and return ENXIO for the destroyed device. [1]

The check for the D_INIT in the prep_cdevsw() was not synchronized with
the call of the fini_cdevsw() in destroy_devl(), that under rapid device
creation/destruction may result in the use of uninitialized cdevsw [2].
Change the protocol for the prep_cdevsw(), requiring it to be called
under dev_mtx, where the check for D_INIT is done.

Do not free the memory allocated for the gianttrick cdevsw while holding
the dev_mtx, put it into the free list to be freed later. Reuse the
d_gianttrick pointer to keep the size and layout of the struct cdevsw
(requested by phk). Free the memory in the dev_unlock_and_free(), and do
all the free after the dev_mtx is dropped (suggested by jhb).

Reported by:	bsdimp + many [1], pho [2]
Reviewed by:	phk, jhb
Tested by:	pho
MFC after:	1 week
2008-03-17 13:17:10 +00:00
pjd
cf9cd1298d - There is no more "uidinfo struct" mutex.
- The "uidinfo hash" lock is now a rwlock.

Reminded by:	kib
2008-03-17 11:48:40 +00:00
pjd
4ef010fa26 Whitespace cleanups. 2008-03-16 21:32:20 +00:00
pjd
9123873999 - Use wait-free method to manage ui_sbsize and ui_proccnt fields in the
uidinfo structure. This entirely removes contention observed on the
  ui_mtxp mutex (as it is now gone).
- Convert the uihashtbl_mtx mutex to a rwlock, as most of the time we just
  need to read-lock it.

Reviewed by:	jhb, jeff, kris & others
Tested by:	kris
2008-03-16 21:29:02 +00:00
rwatson
e7b290ea3d Consistently use ANSI C declarationsfor all functions in kern_synch.c. 2008-03-16 18:59:21 +00:00
pjd
52f3c4136e Style fixes. 2008-03-16 18:26:59 +00:00
pjd
d61d590ad7 Fix information leak. We can find PIDs of running processes from within
a jail, etc. by simply calling setpriority(PRIO_PROCESS, <PID>, 0) and
checking the return value: 0 means that the process exists and -1 that
it doesn't exist.

Reviewed by:	rwatson
MFC after:	1 week
2008-03-16 17:55:06 +00:00
rwatson
877d7c65ba In keeping with style(9)'s recommendations on macros, use a ';'
after each SYSINIT() macro invocation.  This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.

MFC after:	1 month
Discussed with:	imp, rink
2008-03-16 10:58:09 +00:00
sobomax
1560402d31 Properly set size of the file_zone to match kern.maxfiles parameter.
Otherwise the parameter is no-op, since zone by default limits number
of descriptors to some 12K entries. Attempt to allocate more ends up
sleeping on zonelimit.

MFC after:	2 weeks
2008-03-16 06:21:30 +00:00
ru
a786aa30ca Fix panic on e.g. "kldload /dev/null".
PR:		kern/121427
Reviewed by:	sem
MFC after:	3 days
2008-03-15 17:40:18 +00:00
jhb
9c113163fb Add preliminary support for binding interrupts to CPUs:
- Add a new intr_event method ie_assign_cpu() that is invoked when the MI
  code wishes to bind an interrupt source to an individual CPU.  The MD
  code may reject the binding with an error.  If an assign_cpu function
  is not provided, then the kernel assumes the platform does not support
  binding interrupts to CPUs and fails all requests to do so.
- Bind ithreads to CPUs on their next execution loop once an interrupt
  event is bound to a CPU.  Only shared ithreads are bound.  We currently
  leave private ithreads for drivers using filters + ithreads in the
  INTR_FILTER case unbound.
- A new intr_event_bind() routine is used to bind an interrupt event to
  a CPU.
- Implement binding on amd64 and i386 by way of the existing pic_assign_cpu
  PIC method.
- For x86, provide a 'intr_bind(IRQ, cpu)' wrapper routine that looks up
  an interrupt source and binds its interrupt event to the specified CPU.
  MI code can currently (ab)use this by doing:

	intr_bind(rman_get_start(irq_res), cpu);

  however, I plan to add a truly MI interface (probably a bus_bind_intr(9))
  where the implementation in the x86 nexus(4) driver would end up calling
  intr_bind() internally.

Requested by:	kmacy, gallatin, jeff
Tested on:	{amd64, i386} x {regular, INTR_FILTER}
2008-03-14 19:41:48 +00:00
jhb
c162b59cc2 Make the function prototype for cpu_search() match the declaration so that
this still compiles with gcc3.
2008-03-14 15:22:38 +00:00
jeff
a469063987 PR 117603
- Close a sleepqueue signal race by interlocking with the per-process
   spinlock.  This was mistakenly omitted from the thread_lock patch and
   has been a race since.

MFC After:	1 week
PR:		bin/117603
Reported by:	Danny Braniss <danny@cs.huji.ac.il>
2008-03-13 00:46:12 +00:00
jeff
acb93d599c Remove kernel support for M:N threading.
While the KSE project was quite successful in bringing threading to
FreeBSD, the M:N approach taken by the kse library was never developed
to its full potential.  Backwards compatibility will be provided via
libmap.conf for dynamically linked binaries and static binaries will
be broken.
2008-03-12 10:12:01 +00:00
jeff
3b1acbdce2 - Pass the priority argument from *sleep() into sleepq and down into
sched_sleep().  This removes extra thread_lock() acquisition and
   allows the scheduler to decide what to do with the static boost.
 - Change the priority arguments to cv_* to match sleepq/msleep/etc.
   where 0 means no priority change.  Catch -1 in cv_broadcastpri() and
   convert it to 0 for now.
 - Set a flag when sleeping in a way that is compatible with swapping
   since direct priority comparisons are meaningless now.
 - Add a sysctl to ule, kern.sched.static_boost, that defaults to on which
   controls the boost behavior.  Turning it off gives better performance
   in some workloads but needs more investigation.
 - While we're modifying sleepq, change signal and broadcast to both
   return with the lock held as the lock was held on enter.

Reviewed by:	jhb, peter
2008-03-12 06:31:06 +00:00
jeff
e30139dff5 - KSE may free a thread that was never actually forked. This will leave
td_cpuset NULL.  Check for this condition before dereferencing the
   cpuset.

Reported by:	david@catwhisker.org, miwi@freebsd.org
Sponsored by:	Nokia
2008-03-12 05:01:14 +00:00
jeff
540fa064d9 - Fix the invalid priority panics people are seeing by forcing
tdq_runq_add to select the runq rather than hoping we set it properly
   when we adjusted the priority.  This involves the same number of
   branches as before so should perform identically without the extra
   fragility.

Tested by:	bz
Reviewed by:	bz
2008-03-10 22:48:27 +00:00
jeff
e53ae3b798 - Don't rely on a side effect of sched_prio() to set the initial ts_runq
for thread0.  Set it directly in sched_setup().  This fixes traps on boot
   seen on some machines.

Reported by:	phk
2008-03-10 09:50:29 +00:00
jeff
aa3cc14d3d - Handle kdb switch panics outside of mi_switch() to remove some instructions
from the common path and make the code more clear.  Whether this has any
   impact on performance may depend on optimization levels.

Sponsored by:	Nokia
2008-03-10 03:16:51 +00:00
jeff
14a6f96adb Reduce ULE context switch time by over 25%.
- Only calculate timeshare priorities once per tick or when a thread is woken
   from sleeping.
 - Keep the ts_runq pointer valid after all priority changes.
 - Call tdq_runq_add() directly from sched_switch() without passing in via
   tdq_add().  We don't need to adjust loads or runqs anymore.
 - Sort tdq and ts_sched according to utilization to improve cache behavior.

Sponsored by:	Nokia
2008-03-10 03:15:19 +00:00
imp
be1fee2a1a Tiny bit of KNF to make bus_setup_intr() look like the rest of this
function.
2008-03-10 01:48:25 +00:00