Commit Graph

10418 Commits

attilio
d5dbd84790 - Use a different encoding for lockmgr options: encode them one
per bit in order to allow per-bit checks on the options flag, in
  particular in consumer code [1]
- Re-enable the check against TDP_DEADLKTREAT as the anti-waiters
  starvation patch allows exclusive waiters to override new shared
  requests.

[1] Requested by:	pjd, jeff
2008-04-07 14:46:38 +00:00
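A hedged illustration of the encoding change (flag names invented, not the
real lockmgr options): with one option per bit, consumers can test each
option with a simple mask.

    /* Hypothetical per-bit option flags; the real lockmgr values differ. */
    #define OPT_SLEEPFAIL   0x01
    #define OPT_NOWAIT      0x02
    #define OPT_RETRY       0x04

    static int
    consumer_wants_nowait(unsigned int flags)
    {
            /* Per-bit check: test exactly one option in the flags word. */
            return ((flags & OPT_NOWAIT) != 0);
    }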
truckman
3ab6955dbc vfs_syscalls.c 1.452 mistakenly swapped the behavior of chown() and lchown(). 2008-04-07 00:29:32 +00:00
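For reference, a userland sketch of the POSIX semantics the fix restores:
chown(2) follows a symbolic link and changes the target's owner, while
lchown(2) changes the owner of the link itself (the path is illustrative).

    #include <sys/types.h>
    #include <unistd.h>

    static int
    change_owners(const char *linkpath, uid_t uid, gid_t gid)
    {
            if (chown(linkpath, uid, gid) == -1)    /* affects the target */
                    return (-1);
            if (lchown(linkpath, uid, gid) == -1)   /* affects the link */
                    return (-1);
            return (0);
    }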
attilio
07441f19e1 Optimize lockmgr in order to get rid of the pool mutex interlock,
the state-transition flags and the msleep(9) calls.
Instead, use an algorithm very similar to what sx(9) and rwlock(9)
already do, with direct access to the sleepqueue(9) primitive.

In order to avoid writer starvation, a mechanism very similar to what
rwlock(9) now uses is implemented, with a corresponding per-thread
counter of shared lockmgr locks held.

This patch also adds 2 new functions to the lockmgr KPI: lockmgr_rw() and
lockmgr_args_rw().  These two are like the 2 "normal" versions, but they
both accept a rwlock as interlock.  In order to realize this, the general
lockmgr manager function "__lockmgr_args()" has been implemented through
the generic lock layer.  It supports all the blocking primitives, but
currently only these 2 mappings exist.

The patch drops WITNESS support for the moment, but it will probably be
added back soon.  Also, there is a small race in the draining code which
is also present in the current CVS stock implementation: if some sharers
are already on the runqueue when they wake up, they can contend for the
lock with the exclusive drainer.  This is hard to fix, but the now
committed code mitigates this issue a lot better than the (past) CVS
version.  In addition, the KA_HELD and KA_UNHELD assertions have been
made no-ops because they are dangerous and support for them will be
dropped soon.

In order to avoid namespace pollution, stack.h is split into two parts:
one which includes only the "struct stack" definition (_stack.h) and
one defining the KPI.  In this way, the newly added _lockmgr.h can
just include _stack.h.

The kernel ABI is heavily changed by this commit (the now committed
version of "struct lock" is a lot smaller than the previous one) and
the KPI is broken by the lockmgr_rw() / lockmgr_args_rw() introduction,
so manpages and __FreeBSD_version will be updated accordingly.

Tested by:      kris, pho, jeff, danger
Reviewed by:    jeff
Sponsored by:   Google, Summer of Code program 2007
2008-04-06 20:08:51 +00:00
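A minimal sketch of the new interface, assuming the usual lockmgr flags
(LK_EXCLUSIVE, LK_INTERLOCK, LK_RELEASE); the exact calling convention of
lockmgr_rw() may differ in detail.

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/lockmgr.h>
    #include <sys/rwlock.h>

    static void
    lock_with_rw_interlock(struct lock *lk, struct rwlock *ilk)
    {
            rw_wlock(ilk);
            /* With LK_INTERLOCK, the interlock is dropped for us. */
            lockmgr_rw(lk, LK_EXCLUSIVE | LK_INTERLOCK, ilk);
            /* ... critical section ... */
            lockmgr_rw(lk, LK_RELEASE, NULL);
    }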
jeff
d1c199f415 - Correct a major error introduced in the per-cpu timeout commit. Sleep
and wakeup require the same wait channel to function properly.

Found by:	kris
Pointy hat:	me
2008-04-06 11:08:49 +00:00
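A sketch of the invariant being restored: the sleeper and the waker must
agree on the wait channel (here, the address of 'state'), or the wakeup
is lost.

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>

    static struct mtx m;
    static int state;

    static void
    sleeper(void)
    {
            mtx_lock(&m);
            while (state == 0)
                    msleep(&state, &m, PWAIT, "wchan", 0);
            mtx_unlock(&m);
    }

    static void
    waker(void)
    {
            mtx_lock(&m);
            state = 1;
            wakeup(&state);         /* same channel as the msleep() above */
            mtx_unlock(&m);
    }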
jhb
68917b32fc Move INTR_FILTER from opt_global.h to its own header. 2008-04-05 20:13:15 +00:00
jhb
79918c45a6 Add an MI intr_event_handle() routine for the non-INTR_FILTER case. This
allows all the INTR_FILTER #ifdef's to be removed from the MD interrupt
code.
- Rename the intr_event 'eoi', 'disable', and 'enable' hooks to
  'post_filter', 'pre_ithread', and 'post_ithread' to be less x86-centric.
  Also, add a comment describing what the MI code expects them to do.
- On amd64, i386, and powerpc this is effectively a NOP.
- On arm, don't bother masking the interrupt unless the ithread is
  scheduled in the non-INTR_FILTER case to match what INTR_FILTER did.
  Also, don't bother unmasking the interrupt in the post_filter case if
  we never masked it.  The INTR_FILTER case had been doing this by having
  arm_unmask_irq for the post_filter (formerly 'eoi') hook.
- On ia64, stray interrupts are now masked for the non-INTR_FILTER case.
  They were already masked in the INTR_FILTER case.
- On sparc64, use a NULL pre_ithread hook and use intr_enable_eoi() for
  both the 'post_filter' and 'post_ithread' hooks to match what the
  non-INTR_FILTER code did.
- On sun4v, retire the ithread wrapper hack by using an appropriate
  'post_ithread' hook instead (it's what 'post_ithread'/'enable' was
  designed to do even in 5.x).

Glanced at by:	piso
Reviewed by:	marius
Requested by:	marius [1], [5]
Tested on:	amd64, i386, arm, sparc64
2008-04-05 19:58:30 +00:00
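A hedged structural sketch of the renamed hooks (type, field and parameter
names invented): post_filter runs after a filter handled the interrupt,
pre_ithread runs before an ithread is scheduled, and post_ithread runs
after the ithread finishes.

    /* Hypothetical hook table; the real intr_event fields may differ. */
    struct intr_event_hooks {
            void    (*post_filter)(void *source);   /* was 'eoi' */
            void    (*pre_ithread)(void *source);   /* was 'disable' */
            void    (*post_ithread)(void *source);  /* was 'enable' */
    };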
alc
067dba5f97 Reintroduce UMA_SLAB_KMAP; however, change its spelling to
UMA_SLAB_KERNEL for consistency with its sibling UMA_SLAB_KMEM.
(UMA_SLAB_KMAP met its original demise in revision 1.30 of
vm/uma_core.c.)  UMA_SLAB_KERNEL is now required by the jumbo frame
allocators.  Without it, UMA cannot correctly return pages from the
jumbo frame zones to the VM system because it resets the pages' object
field to NULL instead of the kernel object.  In more detail, the jumbo
frame zones are created with the option UMA_ZONE_REFCNT.  This causes
UMA to overwrite the pages' object field with the address of the slab.
However, when UMA wants to release these pages, it doesn't know how to
restore the object field, so it sets it to NULL.  This change teaches
UMA how to reset the object field to the kernel object.

Crashes reported by: kris
Fix tested by: kris
Fix discussed with: jeff
MFC after: 6 weeks
2008-04-04 18:41:12 +00:00
jeff
d6dabc3153 - Add sysctls at debug.rwlock to control the behavior of the speculative
spinning when readers hold a lock.  This spinning is speculative because,
   unlike the write case, we cannot test whether the owners are running.
 - Add speculative read spinning for readers who are blocked by pending
   writers while a read lock is still held.  This allows the thread to
   spin until the write lock succeeds after which it may spin until the
   writer has released the lock.  This prevents excessive context switches
   when readers and writers both hold the lock for brief periods.

Sponsored by:	Nokia
2008-04-04 10:00:46 +00:00
jeff
73796e923c - Add a Nokia copyright to cpuset to reflect their generous
contribution to this work.
2008-04-04 01:22:04 +00:00
jeff
85d3ffe23c - Allow static_boost to specify no boost with '0', the traditional
fixed kernel priority boost with '1', or any priority less than the
   current thread's priority with a value greater than two.  Default the
   boost to PRI_MIN_TIMESHARE to prevent regular user-space threads from
   starving threads in the kernel.  This prevents these user threads from
   also being scheduled as if they were high fixed-priority kernel threads.
 - Restore the setting of lowpri in tdq_choose().  It has to be either here
   or in sched_switch().  I accidentally removed it from both places.

Tested by:	kris
2008-04-04 01:16:18 +00:00
jeff
c50de590cc - Don't check for the ITHD pri class in tdq_load_add and rem. 4BSD doesn't
do this either.  Simply check P_NOLOAD.  It'd be nice if this were
   in a thread flag so we didn't have an extra cache miss every time we
   add or remove a thread from the run-queue.
2008-04-04 01:04:43 +00:00
jeff
7d635b683d - Fix a mis-merge that crept in during the softclock changes.
Spotted by:	jhb
2008-04-04 01:03:23 +00:00
davidxu
6e5250730e Let umtxq_busy() spin only on MP machines. Rename the function to
do_rwlock_unlock to be consistent with the others.
2008-04-03 11:49:20 +00:00
jeff
ca28ca664f - Convert two timeout users to the new callout_reset_curcpu() api.
Sponsored by:	Nokia
2008-04-02 11:21:42 +00:00
jeff
b065517935 Implement per-cpu callout threads, wheels, and locks.
- Move callout thread creation from kern_intr.c to kern_timeout.c
 - Call callout_tick() on every processor via hardclock_cpu() rather than
   inspecting callout internal details in kern_clock.c.
 - Remove callout implementation details from callout.h
 - Package up all of the global variables into a per-cpu callout structure.
 - Start one thread per-cpu.  Threads are not strictly bound.  They prefer
   to execute on the native cpu but may migrate temporarily if interrupts
   are starving callout processing.
 - Run all callouts by default in the thread for cpu0 to maintain current
   ordering and concurrency guarantees.  Many consumers may not properly
   handle concurrent execution.
 - The new callout_reset_on() api allows specifying a particular cpu to
   execute the callout on.  This may migrate a callout to a new cpu.
   callout_reset() schedules on the last assigned cpu while
   callout_reset_curcpu() schedules on the current cpu.

Reviewed by:	phk
Sponsored by:	Nokia
2008-04-02 11:20:30 +00:00
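A usage sketch of the new KPI under the stated semantics: callout_reset_on()
names an explicit cpu, callout_reset_curcpu() uses the current cpu, and
plain callout_reset() keeps the last assigned cpu.

    #include <sys/param.h>
    #include <sys/kernel.h>
    #include <sys/callout.h>

    static struct callout tick_callout;

    static void
    tick_fn(void *arg)
    {
            /* Periodic work; reschedule on whichever cpu we run on now. */
            callout_reset_curcpu(&tick_callout, hz, tick_fn, arg);
    }

    static void
    start_tick(int cpu)
    {
            callout_init(&tick_callout, CALLOUT_MPSAFE);
            /* First shot on an explicit cpu; may migrate the callout. */
            callout_reset_on(&tick_callout, hz, tick_fn, NULL, cpu);
    }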
kib
c951adca24 Add two chunks missed from rev. 1.210, for giant_read() and
giant_ioctl().

PR:	kern/122287
MFC after:	3 days
2008-04-02 11:11:58 +00:00
jeff
639ca8f21b - Destroy the bo mtx when the vnode is destroyed. 2008-04-02 10:40:03 +00:00
davidxu
aefa44f0cc Fix compiling problem for amd64. 2008-04-02 05:54:41 +00:00
davidxu
f4f495d3ed Er, don't restart a timeout version. 2008-04-02 04:26:59 +00:00
davidxu
ebdf401288 Introduce a kernel-based userland rwlock. Each umtx chain now has two lists,
one for readers and one for writers; other types of synchronization
objects just use the first list.

Asked by: jeff
2008-04-02 04:08:37 +00:00
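A hedged structural sketch (type and field names invented) of the two-list
arrangement: rwlocks queue readers on one list and writers on the other,
while every other umtx type only uses the first.

    #include <sys/queue.h>

    struct umtx_q;                          /* a waiting thread */
    TAILQ_HEAD(umtxq_head, umtx_q);

    /* Hypothetical chain layout; the real field names may differ. */
    struct umtxq_chain_sketch {
            struct umtxq_head uc_queue[2];  /* [0] readers/default, [1] writers */
    };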
attilio
672b78c87b Add rw_try_rlock() and rw_try_wlock() to rwlocks.
These functions attempt the specified operation (rlocking or wlocking);
true is returned if the operation succeeds, false otherwise.

The KPI is enriched by this commit, so a __FreeBSD_version bump and a
manpage update will happen soon.

Requested by:	jeff, kris
2008-04-01 20:31:55 +00:00
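A usage sketch of the new try operations: attempt the lock without
blocking and fall back to other work when it is unavailable.

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/rwlock.h>

    static int
    try_update(struct rwlock *rw, int *counter)
    {
            if (!rw_try_wlock(rw))
                    return (0);     /* lock busy; caller retries later */
            (*counter)++;
            rw_wunlock(rw);
            return (1);
    }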
dfr
60db59bdb1 Don't try to use an SX lock while holding the vnode interlock.
Sponsored by:	Isilon Systems
2008-04-01 16:07:01 +00:00
kib
5c017b360f Regen 2008-03-31 12:12:27 +00:00
kib
6687cc3940 Add the openat(), fexecve() and other *at() syscalls to the table.
Based on the submission by rdivacky,
	sponsored by Google Summer of Code 2007
Reviewed by:	rwatson, rdivacky
Tested by:	pho
2008-03-31 12:06:55 +00:00
kib
78facfe99b Implement the fexecve(2) syscall.
Based on the submission by rdivacky,
	sponsored by Google Summer of Code 2007
Reviewed by:	rwatson, rdivacky
Tested by:	pho
2008-03-31 12:05:52 +00:00
kib
e0eb2de7e9 Implement the
openat(2), faccessat(2), fchmodat(2), fchownat(2), fstatat(2),
	futimesat(2), linkat(2), mkdirat(2), mkfifoat(2), mknodat(2),
	readlinkat(2), renameat(2), symlinkat(2)
syscalls.

Based on the submission by rdivacky,
	sponsored by Google Summer of Code 2007
Reviewed by:	rwatson, rdivacky
Tested by:	pho
2008-03-31 12:04:20 +00:00
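A userland sketch of the fd-relative pattern these syscalls enable: open a
directory once, then resolve names relative to it rather than to the
working directory.

    #include <fcntl.h>
    #include <unistd.h>

    static int
    open_in_dir(const char *dir, const char *name)
    {
            int dfd, fd;

            dfd = open(dir, O_RDONLY);
            if (dfd == -1)
                    return (-1);
            fd = openat(dfd, name, O_RDONLY);  /* resolved relative to dfd */
            close(dfd);
            return (fd);
    }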
kib
eff8c6d35e Add the support for the AT_FDCWD and fd-relative name lookups to the
namei(9).

Based on the submission by rdivacky,
	sponsored by Google Summer of Code 2007
Reviewed by:	rwatson, rdivacky
Tested by:	pho
2008-03-31 12:01:21 +00:00
kib
fb67926ebb Add the support for the O_EXEC open(2) mode, as specified by the
POSIX Extended API Set Part 2 extension specification.

Reviewed by:	rwatson, rdivacky
Tested by:	pho
2008-03-31 11:57:18 +00:00
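A sketch combining this with the fexecve(2) commit above: open an
executable with O_EXEC (execute permission only) and run it by descriptor.

    #include <fcntl.h>
    #include <unistd.h>

    static int
    exec_by_fd(const char *path, char *const argv[], char *const envp[])
    {
            int fd;

            fd = open(path, O_EXEC);
            if (fd == -1)
                    return (-1);
            fexecve(fd, argv, envp);        /* returns only on error */
            close(fd);
            return (-1);
    }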
kib
fe816f911f Add the utility function vn_commname() to retrieve the command name
from the vfs namecache, when available.

Reviewed by:	rwatson, rdivacky
Tested by:	pho
2008-03-31 11:53:03 +00:00
jeff
c2c4476b86 - Consistently return EDEADLK when presented with a new set that is
incompatible with existing bindings.
 - Try to copyout the setid in cpuset() before migrating the proc to the
   setid in case the user has supplied a bad buffer.
 - Rename cpuset_root() and cpuset_base() to cpuset_ref{root,base} to
   be more descriptive and free cpuset_root to be used as a different
   type of symbol.
 - Make cpuset_root the cpuset_t set of all cpus in the system.  This
   should contain the same bitmask as all_cpus presently.
 - Add a CPU_CMP() macro to compare two sets.
2008-03-30 11:31:14 +00:00
jeff
ef1b5135a9 - Don't allow calls to vn_lock() with no lock type requested. Callers
which simply want a reference should use vref().  Callers which want
   to check validity need to hold a lock while performing any action
   based on that validity.  vn_lock() would always release the interlock
   before returning making any action synchronous with the validity check
   impossible.
2008-03-29 23:36:26 +00:00
jeff
26629f5add - Use vget() to lock the vnode rather than refing without a lock and
locking in separate steps.
2008-03-29 23:30:40 +00:00
attilio
7e107a0c8c b_waiters cannot be adequately protected by the interlock because it is
dropped after the call to lockmgr(), so just revert this approach, using
something similar to the previous one:
BUF_LOCKWAITERS() just checks whether there are waiters (not the actual
number of them) and is based on the newly introduced lockmgr_waiters(),
which returns whether the lockmgr has waiters or not. The name was chosen
to differ from the old lockwaiters() in order not to confuse the two.

The KPI is enriched by this commit, so a __FreeBSD_version bump and a
manpage update will happen soon.
'struct buf' also changes, so the kernel ABI is disturbed.

Bug found by:	jeff
Approved by:	jeff, kib
2008-03-28 12:30:12 +00:00
jb
735643909e Regen after makesyscalls.sh change. 2008-03-27 01:55:06 +00:00
jb
4a6deb0614 Generate another function for the DTrace syscall provider to specify
the syscall argument types.

This code is only compiled into the systrace kernel module and has no
effect otherwise.
2008-03-27 01:53:44 +00:00
phk
fa71439e44 The "free-lance" timer in the i8254 is only used for the speaker
these days, so de-generalize the acquire_timer/release_timer api
to just deal with speakers.

The new (optional) MD functions are:
	timer_spkr_acquire()
	timer_spkr_release()
and
	timer_spkr_setfreq()

the last of which configures the timer to generate a tone of a given
frequency, in Hz instead of 1/1193182ths of a second.

Drop timer2 entirely on pc98; it is not used anywhere at all.

Move sysbeep() to kern/tty_cons.c and use the timer_spkr*() if
they exist, and do nothing otherwise.

Remove prototypes and empty acquire-/release-timer() and sysbeep()
functions from the non-beeping archs.

This eliminates the need for the speaker driver to know about the
i8254 frequency at all.  In theory this makes the speaker driver MI,
contingent on the timer_spkr_*() functions existing, but the driver
does not know this yet and still attaches to the ISA bus.

Syscons is more tricky: in one function, sc_tone(), it knows the Hz
and things are just fine.

In the other function, sc_bell(), it seems to get the period from
the KDMKTONE ioctl in terms of 1/1193182ths of a second, so we hardcode
the 1193182 and leave it at that.  It's probably not important.

Change a few other sysbeep() uses which obviously knew that the
argument was in terms of i8254 frequency, and leave alone those
that look like people thought sysbeep() took frequency in hertz.

This eliminates the knowledge of i8254_freq from all but the actual
clock.c code and the prof_machdep.c on amd64 and i386, where I think
it would be smart to ask for help from the timecounters anyway [TBD].
2008-03-26 20:09:21 +00:00
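The unit conversion in question, as a small sketch: a KDMKTONE period
expressed in 1/1193182ths of a second corresponds to a frequency of
1193182 divided by the period.

    #define I8254_BASE_HZ   1193182

    static int
    kdmktone_period_to_hz(int period)
    {
            if (period <= 0)
                    return (0);
            return (I8254_BASE_HZ / period);    /* tone frequency in Hz */
    }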
dfr
1c5a20ad66 Regen. 2008-03-26 15:24:02 +00:00
dfr
79d2dfdaa6 Add the new kernel-mode NFS Lock Manager. To use it instead of the
user-mode lock manager, build a kernel with the NFSLOCKD option and
add '-k' to 'rpc_lockd_flags' in rc.conf.

Highlights include:

* Thread-safe kernel RPC client - many threads can use the same RPC
  client handle safely with replies being de-multiplexed at the socket
  upcall (typically driven directly by the NIC interrupt) and handed
  off to whichever thread matches the reply. For UDP sockets, many RPC
  clients can share the same socket. This allows the use of a single
  privileged UDP port number to talk to an arbitrary number of remote
  hosts.

* Single-threaded kernel RPC server. Adding support for multi-threaded
  server would be relatively straightforward and would follow
  approximately the Solaris KPI. A single thread should be sufficient
  for the NLM since it should rarely block in normal operation.

* Kernel mode NLM server supporting cancel requests and granted
  callbacks. I've tested the NLM server reasonably extensively - it
  passes both my own tests and the NFS Connectathon locking tests
  running on Solaris, Mac OS X and Ubuntu Linux.

* Userland NLM client supported. While the NLM server doesn't have
  support for the local NFS client's locking needs, it does have to
  field async replies and granted callbacks from remote NLMs that the
  local client has contacted. We relay these replies to the userland
  rpc.lockd over a local domain RPC socket.

* Robust deadlock detection for the local lock manager. In particular
  it will detect deadlocks caused by a lock request that covers more
  than one blocking request. As required by the NLM protocol, all
  deadlock detection happens synchronously - a user is guaranteed that
  if a lock request isn't rejected immediately, the lock will
  eventually be granted. The old system allowed for a 'deferred
  deadlock' condition where a blocked lock request could wake up and
  find that some other deadlock-causing lock owner had beaten them to
  the lock.

* Since both local and remote locks are managed by the same kernel
  locking code, local and remote processes can safely use file locks
  for mutual exclusion. Local processes have no fairness advantage
  compared to remote processes when contending to lock a region that
  has just been unlocked - the local lock manager enforces a strict
  first-come first-served model for both local and remote lockers.

Sponsored by:	Isilon Systems
PR:		95247 107555 115524 116679
MFC after:	2 weeks
2008-03-26 15:23:12 +00:00
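Per the first paragraph, enabling the kernel NLM takes one kernel option
and one rc.conf flag; a minimal sketch (the rpc_lockd_enable line is an
assumption, since the commit only names the flag):

    # kernel configuration file
    options NFSLOCKD

    # /etc/rc.conf
    rpc_lockd_enable="YES"  # assumed to already be set
    rpc_lockd_flags="-k"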
scottl
a3b7a4bce8 Implement taskqueue_block() and taskqueue_unblock(). These functions allow
the owner of a queue to block and unblock execution of the tasks in the
queue while allowing tasks to continue to be added to the queue.  Combining
this with taskqueue_drain() allows a queue to be safely disabled.  The
unblock function may run (or schedule to run) the queue when it is called,
just as calling taskqueue_enqueue() would.

Reviewed by: jhb, sam
2008-03-25 22:38:45 +00:00
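A sketch of the safe-disable pattern described above (assuming the drained
task is not left pending in the blocked queue): block the queue so nothing
new runs, drain a task that may already be executing, and unblock later,
which may immediately run queued tasks.

    #include <sys/param.h>
    #include <sys/taskqueue.h>

    static void
    quiesce_queue(struct taskqueue *tq, struct task *t)
    {
            taskqueue_block(tq);
            taskqueue_drain(tq, t); /* wait out any running instance */
            /* ... the queue is quiescent; new tasks still accumulate ... */
            taskqueue_unblock(tq);  /* may run the queue immediately */
    }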
ru
3b1bf8c2e9 Replaced the misleading uses of the historical artefact M_TRYWAIT with
M_WAIT. Removed dead code that assumed that M_TRYWAIT can return NULL;
that has not been true since the advent of MBUMA.

Reviewed by:	arch

There are ongoing disputes as to whether we want to switch to directly using
UMA flags M_WAITOK/M_NOWAIT for mbuf(9) allocation.
2008-03-25 09:39:02 +00:00
ru
0655a583e2 Regen after changing prototypes of cpuset_{get,set}affinity(). 2008-03-25 09:14:17 +00:00
ru
4feaeed265 Fixed type of the fourth argument of cpuset_{get,set}affinity(2) to be size_t.
Prodded by:	davidxu
2008-03-25 09:11:53 +00:00
jeff
3ad75daf19 - Greatly simplify vget() by removing the guarantee that any new
references to a vnode with VI_OWEINACT set will force the vinactive()
   call.  The kernel makes no guarantees about which reference was the
   last to close a file or when the actual inactive processing will
   happen.  The previous code was designed to preserve existing semantics
   in the face of shared locks, however, this was unnecessary.

Discussed with:	mckusick
2008-03-24 04:22:58 +00:00
jeff
1bf44343e2 - Don't acquire the vnode interlock in _vn_lock() unless no lock type
is requested.  Handle this case specially before the while loop.
 - Use the held vnode lock to check for VI_DOOMED.  The vnode lock and
   interlock must both be held to set VI_DOOMED so either one held, even
   shared, is sufficient to check it.

No objection by:	kib
2008-03-24 04:17:35 +00:00
kib
5ddf5664cc Yield the cpu in the kernel while iterating the list of the
vnodes belonging to the mountpoint. Also, yield when in the
softdep_process_worklist() even when we are not going to sleep due to
buffer drain.

It is believed that the ULE fixed the problem [1], but the yielding
seems to be needed at least for the 4BSD case.

Discussed:	on stable@, with bde
Reviewed by:	tegge, jeff [1]
MFC after:	2 weeks
2008-03-23 13:45:24 +00:00
davidxu
c32a483ae9 Remove commented out code, thread suspension is done in thread library. 2008-03-23 02:03:06 +00:00
jeff
8103d042fb - Only return 1 from sync_vnode() in cases where the vnode is still
at the head of the sync list.  This prevents sched_sync() from
   re-queueing a vnode which may have been freed already.

Discussed with:	kib
2008-03-23 01:44:28 +00:00
jeff
73b6a5597c - Pass BO_MTX(bo) to lockmgr in vtruncbuf; we don't own the vnode
interlock here anymore.

Reported by:	kris
2008-03-23 01:42:19 +00:00
phk
5a1f4173f5 In abort2(2): Accept a NULL arg pointer if nargs == 0 2008-03-22 16:32:52 +00:00
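A sketch of the newly accepted form: with nargs == 0 the args pointer may
now be NULL.

    #include <stdlib.h>

    static void
    die(void)
    {
            /* nargs == 0, so a NULL args pointer is now accepted */
            abort2("fatal invariant failed", 0, NULL);
    }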
jeff
a9d123c3ab - Complete part of the unfinished bufobj work by consistently using
BO_LOCK/UNLOCK/MTX when manipulating the bufobj.
 - Create a new lock in the bufobj to lock bufobj fields independently.
   This leaves the vnode interlock as an 'identity' lock while the bufobj
   is an io lock.  The bufobj lock is ordered before the vnode interlock
   and also before the mnt ilock.
 - Exploit this new lock order to simplify softdep_check_suspend().
 - A few sync related functions are marked with a new XXX to note that
   we may not properly interlock against a non-zero bv_cnt when
   attempting to sync all vnodes on a mountlist.  I do not believe this
   race is important.  If I'm wrong this will make these locations easier
   to find.

Reviewed by:	kib (earlier diff)
Tested by:	kris, pho (earlier diff)
2008-03-22 09:15:16 +00:00