Commit Graph

10521 Commits

Author SHA1 Message Date
Attilio Rao
13ddf72de7 Really, no explicit checks against against lock_class_* object should be
done in consumers code: using locks properties is much more appropriate.
Fix current code doing these bogus checks.

Note: Really, callout are not usable by all !(LC_SPINLOCK | LC_SLEEPABLE)
primitives like rmlocks doesn't implement the generic lock layer
functions, but they can be equipped for this, so the check is still
valid.

Tested by: matteo, kris (earlier version)
Reviewed by: jhb
2008-02-06 00:04:09 +00:00
Robert Watson
3f0bfcccfd Further clean up sorflush:
- Expose sbrelease_internal(), a variant of sbrelease() with no
  expectations about the validity of locks in the socket buffer.
- Use sbrelease_internel() in sorflush(), and as a result avoid intializing
  and destroying a socket buffer lock for the temporary stack copy of the
  actual buffer, asb.
- Add a comment indicating why we do what we do, and remove an XXX since
  things have gotten less ugly in sorflush() lately.

This makes socket close cleaner, and possibly also marginally faster.

MFC after:	3 weeks
2008-02-04 12:25:13 +00:00
Poul-Henning Kamp
b75a1171d8 Give sendfile(2) a SF_SYNC flag which makes it wait until all mbufs
referencing the files VM pages are returned from the network stack,
making changes to the file safe.

This flag does not guarantee that the data has been transmitted to the
other end.
2008-02-03 15:54:41 +00:00
Poul-Henning Kamp
cf827063a9 Give MEXTADD() another argument to make both void pointers to the
free function controlable, instead of passing the KVA of the buffer
storage as the first argument.

Fix all conventional users of the API to pass the KVA of the buffer
as the first argument, to make this a no-op commit.

Likely break the only non-convetional user of the API, after informing
the relevant committer.

Update the mbuf(9) manual page, which was already out of sync on
this point.

Bump __FreeBSD_version to 800016 as there is no way to tell how
many arguments a CPP macro needs any other way.

This paves the way for giving sendfile(9) a way to wait for the
passed storage to have been accessed before returning.

This does not affect the memory layout or size of mbufs.

Parental oversight by:	sam and rwatson.

No MFC is anticipated.
2008-02-01 19:36:27 +00:00
Robert Watson
e603be7ada Use FEATURE() macro to advertise aio availability. 2008-02-01 11:59:14 +00:00
Robert Watson
265de5bb62 Correct two problems relating to sorflush(), which is called to flush
read socket buffers in shutdown() and close():

- Call socantrcvmore() before sblock() to dislodge any threads that
  might be sleeping (potentially indefinitely) while holding sblock(),
  such as a thread blocked in recv().

- Flag the sblock() call as non-interruptible so that a signal
  delivered to the thread calling sorflush() doesn't cause sblock() to
  fail.  The sblock() is required to ensure that all other socket
  consumer threads have, in fact, left, and do not enter, the socket
  buffer until we're done flushin it.

To implement the latter, change the 'flags' argument to sblock() to
accept two flags, SBL_WAIT and SBL_NOINTR, rather than one M_WAITOK
flag.  When SBL_NOINTR is set, it forces a non-interruptible sx
acquisition, regardless of the setting of the disposition of SB_NOINTR
on the socket buffer; without this change it would be possible for
another thread to clear SB_NOINTR between when the socket buffer mutex
is released and sblock() is invoked.

Reviewed by:	bz, kmacy
Reported by:	Jos Backus <jos at catnook dot com>
2008-01-31 08:22:24 +00:00
Ruslan Ermilov
007b1b7bae Add a wrapper function that bound checks writes to the dump device. 2008-01-28 19:04:07 +00:00
Mitsuru IWASAKI
4f7f6238af Add devctl_process_running() so that power management system driver
can check whether devd(8) is running.

MFC after:	1 week
2008-01-27 16:06:37 +00:00
Konstantin Belousov
58145c6aa2 In rev. 1.156, the convertion of the minor number to the unit number
resulted in the argument to the make_dev() to be a unit number.

Correct this by supplying a minor number to make_dev(), and using
the unit number for the calculation of the slave tty name.

Reported and tested by:	Peter Holm
Reviewed by:	jhb
Yet another pointy hat to:	kib
MFC after:	1 day
2008-01-26 06:09:23 +00:00
John Baldwin
02d23fdd74 Fix a bug where a thread that hit the race where the sleep timeout fires
while the thread does not hold the thread lock would stop blocking for
subsequent interruptible sleeps and would always immediately fail the
sleep with EWOULDBLOCK instead (even sleeps that didn't have a timeout).

Some background:
- KSE has a facility for allowing one thread to interrupt another thread.
  During this process, the target thread aborts any interruptible sleeps
  much as if the target thread had a pending signal.  Once the target
  thread acknowledges the interrupt, normal sleep handling resumes.  KSE
  manages this via the TDF_INTERRUPTED flag.  Specifically, it sets the
  flag when it sends an interrupt to another thread and clears it when
  the interrupt is acknowledged.  (Note that this is purely a software
  interrupt sort of thing and has no relation to hardware interrupts
  or kernel interrupt threads.)
- The old code for handling the sleep timeout race handled the race
  by setting the TDF_INTERRUPT flag and faking a KSE-style thread
  interrupt to the thread in the process of going to sleep.  It probably
  should have just checked the TDF_TIMEOUT flag in sleepq_catch_signals()
  instead.
- The bug was that the sleepq code would set TDF_INTERRUPT but it was
  never cleared.  The sleepq code couldn't safely clear it in case there
  actually was a real KSE thread interrupt pending for the target thread
  (in fact, the sleepq timeout actually stomped on said pending interrupt).
  Thus, any future interruptible sleeps (*sleep(.. PCATCH ..) or
  cv_*wait_sig()) would see the TDF_INTERRUPT flag set and immediately
  fail with EWOULDBLOCK.  The flag could be cleared if the thread belonged
  to a KSE process and another thread posted an interrupt to the original
  thread.  However, in the more common case of a non-KSE process, the
  thread would pretty much stop sleeping.
- Fix the bug by just setting TDF_TIMEOUT in the sleepq timeout code and
  not messing with TDF_INTERRUPT and td_intrval.  With yesterday's fix to
  fix sleepq_switch() to check TDF_TIMEOUT, this is now sufficient.

MFC after:	3 days
2008-01-25 19:44:46 +00:00
John Baldwin
515594a06f Fix a race in the sleepqueue timeout code that resulted in sleeps not
being properly cancelled by a timeout.  In general there is a race
between a the sleepq timeout handler firing while the thread is still
in the process of going to sleep.  In 6.x with sched_lock, the race was
largely protected by sched_lock.  The only place it was "exposed" and had
to be handled was while checking for any pending signals in
sleepq_catch_signals().

With the thread lock changes, the thread lock is dropped in between
sleepq_add() and sleepq_*wait*() opening up a new window for this race.
Thus, if the timeout fired while the sleeping thread was in between
sleepq_add() and sleepq_*wait*(), the thread would be marked as timed
out, but the thread would not be dequeued and sleepq_switch() would
still block the thread until it was awakened via some other means.  In
the case of pause(9) where there is no other wakeup, the thread would
never be awakened.

Fix this by teaching sleepq_switch() to check if the thread has had its
sleep canceled before blocking by checking the TDF_TIMEOUT flag and
aborting the sleep and dequeueing the thread if it is set.

MFC after:	3 days
Reported by:	dwhite, peter
2008-01-25 02:09:38 +00:00
Jean-Sébastien Pédron
a8afa221cc When asked to use kqueue, AIO stores its internal state in the
`kn_sdata' member of the newly registered knote. The problem is that
this member is overwritten by a call to kevent(2) with the EV_ADD flag,
targetted at the same kevent/knote. For instance, a userland application
may set the pointer to NULL, leading to a panic.

A testcase was provided by the submitter.

PR:	kern/118911
Submitted by:	MOROHOSHI Akihiko <moro@remus.dti.ne.jp>
MFC after:	1 day
2008-01-24 17:10:19 +00:00
Attilio Rao
0e9eb108f0 Cleanup lockmgr interface and exported KPI:
- Remove the "thread" argument from the lockmgr() function as it is
  always curthread now
- Axe lockcount() function as it is no longer used
- Axe LOCKMGR_ASSERT() as it is bogus really and no currently used.
  Hopefully this will be soonly replaced by something suitable for it.
- Remove the prototype for dumplockinfo() as the function is no longer
  present

Addictionally:
- Introduce a KASSERT() in lockstatus() in order to let it accept only
  curthread or NULL as they should only be passed
- Do a little bit of style(9) cleanup on lockmgr.h

KPI results heavilly broken by this change, so manpages and
FreeBSD_version will be modified accordingly by further commits.

Tested by: matteo
2008-01-24 12:34:30 +00:00
Bjoern A. Zeeb
79ba395267 Replace the last susers calls in netinet6/ with privilege checks.
Introduce a new privilege allowing to set certain IP header options
(hop-by-hop, routing headers).

Leave a few comments to be addressed later.

Reviewed by:	rwatson (older version, before addressing his comments)
2008-01-24 08:25:59 +00:00
Jeff Roberson
317da70593 - sched_prio() should only adjust tdq_lowpri if the thread is running or on
a run-queue.  If the priority is numerically raised only change lowpri
   if we're certain it will be correct.  Some slop is allowed however
   previously we could erroneously raise lowpri for an idle cpu that a
   thread had recently run on which lead to errors in load balancing
   decisions.
2008-01-23 03:10:18 +00:00
Robert Watson
20c6fe828a Regenerate. 2008-01-20 23:44:24 +00:00
Robert Watson
6c902059f2 Use audit events AUE_SHMOPEN and AUE_SHMUNLINK with new system calls
shm_open() and shm_unlink().  More auditing will need to be done for
these calls to capture arguments properly.
2008-01-20 23:43:06 +00:00
Robert Watson
07dd4a31b5 Export a type for POSIX SHM file descriptors via kern.proc.filedesc as
used by procstat, or SHM descriptors will show up as type unknown in
userspace.
2008-01-20 19:55:52 +00:00
Attilio Rao
d638e093d6 - Introduce the function lockmgr_recursed() which returns true if the
lockmgr lkp, when held in exclusive mode, is recursed
- Introduce the function BUF_RECURSED() which does the same for bufobj
  locks based on the top of lockmgr_recursed()
- Introduce the function BUF_ISLOCKED() which works like the counterpart
  VOP_ISLOCKED(9), showing the state of lockmgr linked with the bufobj

BUF_RECURSED() and BUF_ISLOCKED() entirely replace the usage of bogus
BUF_REFCNT() in a more explicative and SMP-compliant way.
This allows us to axe out BUF_REFCNT() and leaving the function
lockcount() totally unused in our stock kernel. Further commits will
axe lockcount() as well as part of lockmgr() cleanup.

KPI results, obviously, broken so further commits will update manpages
and freebsd version.

Tested by: kris (on UFS and NFS)
2008-01-19 17:36:23 +00:00
Robert Watson
8c96f9c193 Move unlock of global UNIX domain socket lock slightly lower in
unp_connect(): it is expected to return with the lock held, and two
possible error paths otherwise returned with it unlocked.

The fix committed here is slightly different from the patch in the
PR, but along an alternative line suggested in the PR.

PR:		119778
MFC after:	3 days
Submitted by:	James Juran <james dot juran at baesystems dot com>
2008-01-18 19:16:03 +00:00
Konstantin Belousov
81aa963bc7 In the rev. 1.153, the one place for converting minor number to unit
was missed. As result, pty_create_slave() may index out of the names[]
bounds, creating wrong slave tty names.

Tested by:	kensmith
Reviewed by:	jhb
MFC after:	3 days
2008-01-18 18:07:04 +00:00
Julian Elischer
ce3b9e3aea refactor code so it can run in a chroot without having to have /dev/mounted
MFC After: 1 week
2008-01-18 17:02:14 +00:00
David Xu
0e17ccbe36 Make sure reading td_runtime in critical section since thread may be
preempted and td_runtime will be modified.
2008-01-18 13:00:28 +00:00
David Xu
00d6ac63cd Add POSIX clock id CLOCK_THREAD_CPUTIME_ID, this can be used to measure
per-thread runtime in user code.
2008-01-18 07:04:42 +00:00
John Baldwin
2c17901060 Add 'compat_freebsd[4567]' features corresponding to the kernel options
COMPAT_FREEBSD[4567].

MFC after:	1 week
Requested by:	kris
2008-01-17 22:46:32 +00:00
Sam Leffler
eeb76a1889 promote ath_defrag to m_collapse (and retire private+unused
m_collapse from cxgb)

Reviewed by:	pyun, jhb, kmacy
MFC after:	2 weeks
2008-01-17 21:25:09 +00:00
John Baldwin
cff3c4fdc5 Remove a conditional that is always true.
MFC after:	2 weeks
2008-01-17 20:15:15 +00:00
John Baldwin
8ffbe1559e Add a set of regression tests for the POSIX shm API (shm_open(2) and
shm_unlink(2)).
2008-01-16 15:51:24 +00:00
Nate Lawson
e1f13773ec Remove duplicate cpufreq levels, i.e. ones that are within 25 Mhz of each
other.  The first one survives, the rest are removed.  So far, it appears
only some acpi_perf(4) BIOS tables have these invalid states, but address
this in the core to be sure to handle other potential driver data.

PR:		kern/114722
Tested by:	stefan.lambrev / moneybookers.com
MFC after:	3 days
2008-01-16 01:05:21 +00:00
Jeff Roberson
a755f21484 - When executing the 'tryself' branch in sched_pickcpu() look at the
lowest priority on the queue for the current cpu vs curthread's
   priority.  In the case that curthread is waking up many threads of a
   lower priority as would happen with a turnstile_broadcast() or wakeup()
   of many threads this prevents them from all ending up on the current cpu.
 - In sched_add() make the relationship between a scheduled ithread and
   the current cpu advisory rather than strict.  Only give the ithread
   affinity for the current cpu if it's actually being scheduled from
   a hardware interrupt.  This prevents it from migrating when it simply
   blocks on a lock.

Sponsored by:	Nokia
2008-01-15 09:03:09 +00:00
Attilio Rao
22db15c06f VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in
conjuction with 'thread' argument passing which is always curthread.
Remove the unuseful extra-argument and pass explicitly curthread to lower
layer functions, when necessary.

KPI results broken by this change, which should affect several ports, so
version bumping and manpage update will be further committed.

Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>
2008-01-13 14:44:15 +00:00
Attilio Rao
d1127e669c lockmgr() function will return successfully when trying to work under
panic but it won't actually lock anything.
This can lead some paths to reach lockmgr_disown() with inconsistent
lock which will let trigger the relative assertions.

Fix those in order to recognize panic situation and to not trigger.

Reported by: pho
Submitted by: kib
2008-01-11 16:38:12 +00:00
Robert Watson
d92909c1d4 Don't zero td_runtime when billing thread CPU usage to the process;
maintain a separate td_incruntime to hold unbilled CPU usage for
the thread that has the previous properties of td_runtime.

When thread information is requested using the thread monitoring
sysctls, export thread td_runtime instead of process rusage runtime
in kinfo_proc.

This restores the display of individual ithread and other kernel
thread CPU usage since inception in ps -H and top -SH, as well for
libthr user threads, valuable debugging information lost with the
move to try kthreads since they are no longer independent processes.

There is universal agreement that we should rewrite the process and
thread export sysctls, but this commit gets things going a bit
better in the mean time.  Likewise, there are resevations about the
continued validity of statclock given the speed of modern processors.

Reviewed by:		attilio, emaste, jhb, julian
2008-01-10 22:11:20 +00:00
Robert Watson
8a69e5fa71 Remove "lock pushdown" todo item in comment -- I did that for 7.0.
MFC after:	3 weeks
2008-01-10 12:38:17 +00:00
Robert Watson
a635784569 Correct typos in comments.
MFC after:	3 weeks
2008-01-10 12:29:12 +00:00
Attilio Rao
cb05b60a89 vn_lock() is currently only used with the 'curthread' passed as argument.
Remove this argument and pass curthread directly to underlying
VOP_LOCK1() VFS method. This modify makes the code cleaner and in
particular remove an annoying dependence helping next lockmgr() cleanup.
KPI results, obviously, changed.

Manpage and FreeBSD_version will be updated through further commits.

As a side note, would be valuable to say that next commits will address
a similar cleanup about VFS methods, in particular vop_lock1 and
vop_unlock.

Tested by:	Diego Sardina <siarodx at gmail dot com>,
		Andrea Di Pasquale <whyx dot it at gmail dot com>
2008-01-10 01:10:58 +00:00
Attilio Rao
6edbb3ee9e Fix a last second typo about recent lockmgr_disown() introduction. 2008-01-09 00:02:43 +00:00
Attilio Rao
d7a7e17968 Remove explicit calling of lockmgr() with the NULL argument.
Now, lockmgr() function can only be called passing curthread and the
KASSERT() is upgraded according with this.

In order to support on-the-fly owner switching, the new function
lockmgr_disown() has been introduced and gets used in BUF_KERNPROC().
KPI, so, results changed and FreeBSD version will be bumped soon.
Differently from previous code, we assume idle thread cannot try to
acquire the lockmgr as it cannot sleep, so loose the relative check[1]
in BUF_KERNPROC().

Tested by: kris

[1] kib asked for a KASSERT in the lockmgr_disown() about this
condition, but after thinking at it, as this is a well known general
rule, I found it not really necessary.
2008-01-08 23:48:31 +00:00
John Baldwin
4ad6d200d6 Regen for shm_open(2) and shm_unlink(2). 2008-01-08 22:01:26 +00:00
John Baldwin
8e38aeff17 Add a new file descriptor type for IPC shared memory objects and use it to
implement shm_open(2) and shm_unlink(2) in the kernel:
- Each shared memory file descriptor is associated with a swap-backed vm
  object which provides the backing store.  Each descriptor starts off with
  a size of zero, but the size can be altered via ftruncate(2).  The shared
  memory file descriptors also support fstat(2).  read(2), write(2),
  ioctl(2), select(2), poll(2), and kevent(2) are not supported on shared
  memory file descriptors.
- shm_open(2) and shm_unlink(2) are now implemented as system calls that
  manage shared memory file descriptors.  The virtual namespace that maps
  pathnames to shared memory file descriptors is implemented as a hash
  table where the hash key is generated via the 32-bit Fowler/Noll/Vo hash
  of the pathname.
- As an extension, the constant 'SHM_ANON' may be specified in place of the
  path argument to shm_open(2).  In this case, an unnamed shared memory
  file descriptor will be created similar to the IPC_PRIVATE key for
  shmget(2).  Note that the shared memory object can still be shared among
  processes by sharing the file descriptor via fork(2) or sendmsg(2), but
  it is unnamed.  This effectively serves to implement the getmemfd() idea
  bandied about the lists several times over the years.
- The backing store for shared memory file descriptors are garbage
  collected when they are not referenced by any open file descriptors or
  the shm_open(2) virtual namespace.

Submitted by:	dillon, peter (previous versions)
Submitted by:	rwatson (I based this on his version)
Reviewed by:	alc (suggested converting getmemfd() to shm_open())
2008-01-08 21:58:16 +00:00
John Baldwin
39033470fe Close a race in the kern.ttys sysctl handler that resulted in panics in
dev2udev() when a tty was being detached concurrently with the sysctl
handler:
- Hold the 'tty_list_mutex' lock while we read all the fields out of the
  struct tty for copying out later.  Previously the pty(4) and pts(4)
  destroy routines could set t_dev to NULL, drop their reference on the
  tty and destroy the cdev while the sysctl handler was attempting to
  invoke dev2udev() on the cdev being destroyed.  This happened when the
  sysctl handler read the value of t_dev prior to it being set to NULL
  either due to it being stale or due to timing races.  By holding the
  list lock we guarantee that the destroy routines will block in ttyrel()
  in that case and not destroy the cdev until after we've copied all of our
  data.  We may see a NULL cdev pointer or we may see the previous value,
  but the previous value will no longer point to a destroyed cdev if we
  see it.
- Fix the ttyfree() routine used by tty device drivers in their detach
  methods to use ttyrel() on the tty so we don't leak them.  Also, fix it
  to use the same order of operations as pty/pts destruction (set t_dev
  NULL, ttyrel(), destroy_dev()) so it cooperates with the sysctl handler.

MFC after:	3 days
Tested by:	avatar
2008-01-08 04:53:28 +00:00
Kris Kennaway
357911ce77 Fix logic in skipcount handling (used to sample every 1/N lock operations
to reduce profiling overhead)
2008-01-08 01:11:40 +00:00
Robert Watson
57d7e86b65 Free MAC label on a POSIX semaphore when the semaphore is freed.
MFC after:	3 days
Submitted by:	jhb
2008-01-07 22:03:19 +00:00
John Baldwin
e46502943a Make ftruncate a 'struct file' operation rather than a vnode operation.
This makes it possible to support ftruncate() on non-vnode file types in
the future.
- 'struct fileops' grows a 'fo_truncate' method to handle an ftruncate() on
  a given file descriptor.
- ftruncate() moves to kern/sys_generic.c and now just fetches a file
  object and invokes fo_truncate().
- The vnode-specific portions of ftruncate() move to vn_truncate() in
  vfs_vnops.c which implements fo_truncate() for vnode file types.
- Non-vnode file types return EINVAL in their fo_truncate() method.

Submitted by:	rwatson
2008-01-07 20:05:19 +00:00
Bruce Evans
9283848511 In sequential_heuristic():
- spell 16384 as 16384 and not as BKVASIZE.  16384 is (not quite) just a
  magic size that works well in practice.  BKVASIZE should be MAXBSIZE
  (65536), but is 16384 because i386's don't have enough kva for it to
  be MAXBSIZE; 16384 works (not so well) for it for much the same reasons
  that it works well in the heuristic.
- expand and/or add comments about this and other details.
- don't explicitly inline this function.
- fix some other style bugs.
2008-01-05 08:54:51 +00:00
Peter Wemm
4113f8d741 Fall back to the binary-specified interpreter (ld-elf.so.1) if the
ABI override binary isn't found.  This could probably be smoother, but
it is what I did in p4 change #126891 on 2007/09/27.  It should solve
the "ld-elf32.so.1"-in-chroot problem.
2008-01-05 08:35:56 +00:00
Jeff Roberson
fd0b8c783d - Restore timeslicing code for all bit SCHED_FIFO priority classes.
Reported by:	Peter Jeremy <peterjeremy@optushome.com.au>
2008-01-05 04:47:31 +00:00
Bjoern A. Zeeb
a82be55d42 Add missing sb_sndptr* fields to db_print_sockbuf().
While here change %d to %u for u_ints.

Discussed with:	rwatson, kmacy
2008-01-03 15:19:31 +00:00
Jeff Roberson
a57decdf32 - In sysctl_kern_file skip fdps with negative lastfiles. This can
happen if there are no files open.  Accounting for these can
   eventually return a negative value for olenp causing sysctl to
   crash with a bad malloc.

Reported by:	Pawel Worach <pawel.worach@gmail.com>
2008-01-03 01:26:59 +00:00
David E. O'Brien
bedff79a00 Note what is too {short,long}. 2008-01-02 18:48:27 +00:00
John Baldwin
c0cfd9d113 A few whitespace fixes. 2008-01-02 17:09:15 +00:00
Jeff Roberson
41e0f66d41 - Place the fhold() in unp_internalize_fp to be more consistent with refs.
- Clear all of the gc flags before doing a run.  Stale flags were causing
   us to skip some descriptors.
 - If a unp socket has been marked REF in a gc pass it can't be dead.

Found by:	rwatson's test tool.
2008-01-01 01:46:42 +00:00
Craig Rodrigues
450ea867c5 In vfs_scanopt(), make sure that the mount option value is not NULL
before calling vsscanf().

PR:		118531
Submitted by:	Jaakko Heinonen <jh saunalahti fi>
MFC after:	3 days
2007-12-31 23:44:53 +00:00
John Baldwin
0deabe7e53 Actually declare the kern.features sysctl node.
Pointy hat to:	jhb
2007-12-31 22:03:57 +00:00
Jeff Roberson
0c66dc6758 - Pause a while after disabling lock profiling and before resetting it
to be sure that all participating CPUs have stopped updating it.
 - Restore the behavior of printing the name of the lock type in the output.
2007-12-31 03:45:51 +00:00
Jeff Roberson
6f552cb098 - Check the correct variable against NULL in two places.
- If the unp_file is NULL that means it has never been internalized and it
   must be reachable.
2007-12-31 03:44:54 +00:00
Warner Losh
c94a7cac1f Rather than not redirting the bp when we get ENXIO, only redirty it
when the error is EIO.  This catches a much larger class of errors
that are unlikely to succeed if retried.

Submitted by: bde
2007-12-30 05:53:45 +00:00
Jeff Roberson
397c19d175 Remove explicit locking of struct file.
- Introduce a finit() which is used to initailize the fields of struct file
   in such a way that the ops vector is only valid after the data, type,
   and flags are valid.
 - Protect f_flag and f_count with atomic operations.
 - Remove the global list of all files and associated accounting.
 - Rewrite the unp garbage collection such that it no longer requires
   the global list of all files and instead uses a list of all unp sockets.
 - Mark sockets in the accept queue so we don't incorrectly gc them.

Tested by:	kris, pho
2007-12-30 01:42:15 +00:00
Alan Cox
f8a47341fe Add the superpage reservation system. This is "part 2 of 2" of the
machine-independent support for superpages.  (The earlier part was
the rewrite of the physical memory allocator.)  The remainder of the
code required for superpages support is machine-dependent and will
be added to the various pmap implementations at a later date.

Initially, I am only supporting one large page size per architecture.
Moreover, I am only enabling the reservation system on amd64.  (In
an emergency, it can be disabled by setting VM_NRESERVLEVELS to 0
in amd64/include/vmparam.h or your kernel configuration file.)
2007-12-29 19:53:04 +00:00
Robert Watson
c5f1beb02a In "show lockedvnods" DDB command, use db_printf() rather than printf()
so that the results end up in the DDB output stream rather than the
console output stream.

This should likely also be done for the vprint() function it calls.

MFC after:	3 months
2007-12-28 00:47:31 +00:00
Attilio Rao
100f241571 Trimm out now unused option LK_EXCLUPGRADE from the lockmgr namespace.
This option just adds complexity and the new implementation no longer
will support it, so axing it now that it is unused is probabilly the
better idea.

FreeBSD version is bumped in order to reflect the KPI breakage introduced
by this patch.

In the ports tree, kris found that only old OSKit code uses it, but as
it is thought to work only on 2.x kernels serie, version bumping will
solve any problem.
2007-12-28 00:38:13 +00:00
Attilio Rao
7a1d78fa3f In order to avoid a huge class of deadlocks (in particular in interactions
with the interlock), owner of the lock should be only curthread or at
least, for its limited usage, NULL which identifies LK_KERNPROC.

The thread "extra argument" for the lockmgr interface is going to be
removed in the near future, but for the moment, just let kernel run for
some days with this check on in order to find potential deadlocking
places around the kernel and fix them.
2007-12-27 22:56:57 +00:00
Robert Watson
0417fe5421 Return ESRCH when a kernel stack is queried on a process in execve() --
p_candebug() will return EAGAIN which, if the other process never
leaves execve(), will result in the sysctl spinning and never returning
to userspace.  Processes should always eventually leave execve(), but
spinning in kernel while we wait is bad for countless reasons, and
particularly harmful if execve() itself is deadlocked.

Possibly we should return another error, or return a marker indicating
the thread is in execve() so it can be reported that way in userspace.

Reported by:	kris
2007-12-27 22:44:01 +00:00
Attilio Rao
98e4f2e2bf As LK_EXCLUPGRADE is used in conjuction with LK_NOWAIT, LK_UPGRADE becames
equivalent with this and so operate the switch.

That call is the only one remaining LK_EXCLUPGRADE consumer and removing
it will prepare the ground for LK_EXCLUPGRADE axing and further
lockmgr improvements.

Discussed with: jeff, ups
2007-12-27 20:52:05 +00:00
Warner Losh
b27aa20e8d A partial solution to some of the 'pull the umass device with a
mounted FS' problems.  These are more along the lines of 'avoiding an
avoidable panic' than a complete solution to removable devices.  We
now close the barn door after the horse has gotten lose and has been
hit by a truck, as it were.  The barn no longer catches fire in this
case, but the horse is still dead :-).

The vfs_bio.c fix causes us not to put a failed write back into the
dirty pool if the error returned was ENXIO.  In that case, the buffer
is treated like any other clean buffer that's being retured.  ENXIO
means the device isn't there anymore and will never be there again in
the future, so retrying is futile.

The vfs_mount.c fix treats 'ENXIO' as success for unmounting a file
system.  If the device is gone, retrying later won't help and we'll
never be able to unmount the device.

These two are part of a larger patch set submitted by the author.  The
other patches will be forth coming.  I added comments to these two
patches.

Submitted by: Henrik Gulbrandsen
Reviewed by: phk@
PR: usb/46176 (partial)
2007-12-27 16:38:28 +00:00
Robert Watson
618c7db30a Add textdump(4) facility, which provides an alternative form of kernel
dump using mechanically generated/extracted debugging output rather than
a simple memory dump.  Current sources of debugging output are:

- DDB output capture buffer, if there is captured output to save
- Kernel message buffer
- Kernel configuration, if included in kernel
- Kernel version string
- Panic message

Textdumps are stored in swap/dump partitions as with regular dumps, but
are laid out as ustar files in order to allow multiple parts to be stored
as a stream of sequentially written blocks.  Blocks are written out in
reverse order, as the size of a textdump isn't known a priori.  As with
regular dumps, they will be extracted using savecore(8).

One new DDB(4) command is added, "textdump", which accepts "set",
"unset", and "status" arguments.  By default, normal kernel dumps are
generated unless "textdump set" is run in order to schedule a textdump.
It can be canceled using "textdump unset" to restore generation of a
normal kernel dump.

Several sysctls exist to configure aspects of textdumps;
debug.ddb.textdump.pending can be set to check whether a textdump is
pending, or set/unset in order to control whether the next kernel dump
will be a textdump from userspace.

While textdumps don't have to be generated as a result of a DDB script
run automatically as part of a kernel panic, this is a particular useful
way to use them, as instead of generating a complete memory dump, a
simple transcript of an automated DDB session can be captured using the
DDB output capture and textdump facilities.  This can be used to
generate quite brief kernel bug reports rich in debugging information
but not dependent on kernel symbol tables or precisely synchronized
source code.  Most textdumps I generate are less than 100k including
the full message buffer.  Using textdumps with an interactive debugging
session is also useful, with capture being enabled/disabled in order to
record some but not all of the DDB session.

MFC after:	3 months
2007-12-26 11:32:33 +00:00
Wojciech A. Koszek
4ffcc89aa6 Rewrite kern.console handling in sbuf(9). My intention is to leave
kern.console format as is. Thus, no difference in output format should
appear after this commit.

Reviewed by:	cognet@ (mentor)
Approved by:	cognet@ (mentor)
2007-12-25 21:17:34 +00:00
Robert Watson
3de213cc00 Add a new 'why' argument to kdb_enter(), and a set of constants to use
for that argument.  This will allow DDB to detect the broad category of
reason why the debugger has been entered, which it can use for the
purposes of deciding which DDB script to run.

Assign approximate why values to all current consumers of the
kdb_enter() interface.
2007-12-25 17:52:02 +00:00
Julian Elischer
6829a5c59e give thread0 the tid 100000 and bumpt the others to start at 100001
MFC after:	1 week
2007-12-22 04:56:48 +00:00
Wojciech A. Koszek
731016fe36 Make SCHED_ULE buildable with gcc3.
Reviewed by:	cognet (mentor), jeffr
Approved by:	cognet (mentor), jeffr
2007-12-21 23:30:18 +00:00
Warner Losh
d4277fef7b When devclass_get_maxunit is passed a NULL, return -1 to indicate that
there's nothing allocated at all yet.
2007-12-19 22:05:07 +00:00
David E. O'Brien
10c2b8e128 Be more exact with sigaction SA_SIGINFO handling.
Reviewed by:	marcel
2007-12-18 20:39:13 +00:00
Kip Macy
5e0f5cfaed Add SB_NOCOALESCE flag to disable socket buffer update in place 2007-12-17 10:02:01 +00:00
David Xu
7fab871d8c Check NULL pointer. 2007-12-17 08:09:37 +00:00
David Xu
9514dcc041 Add missing changes for fixing LOR of umtx lock and thread lock, follow
the committing of files:
	kern_resource.c revision 1.181
	sched_4bsd.c	revision 1.111
	sched_ule.c	revision 1.218
2007-12-17 05:55:07 +00:00
Jeff Roberson
ace8398da0 Refactor select to reduce contention and hide internal implementation
details from consumers.

 - Track individual selecters on a per-descriptor basis such that there
   are no longer collisions and after sleeping for events only those
   descriptors which triggered events must be rescaned.
 - Protect the selinfo (per descriptor) structure with a mtx pool mutex.
   mtx pool mutexes were chosen to preserve api compatibility with
   existing code which does nothing but bzero() to setup selinfo
   structures.
 - Use a per-thread wait channel rather than a global wait channel.
 - Hide select implementation details in a seltd structure which is
   opaque to the rest of the kernel.
 - Provide a 'selsocket' interface for those kernel consumers who wish to
   select on a socket when they have no fd so they no longer have to
   be aware of select implementation details.

Tested by:	kris
Reviewed on:	arch
2007-12-16 06:21:20 +00:00
Randall Stewart
6b4959bf2f - fix tab to space issue, hmm maybe I should use vi. 2007-12-15 23:14:53 +00:00
Jeff Roberson
eea4f254fe - Re-implement lock profiling in such a way that it no longer breaks
the ABI when enabled.  There is no longer an embedded lock_profile_object
   in each lock.  Instead a list of lock_profile_objects is kept per-thread
   for each lock it may own.  The cnt_hold statistic is now always 0 to
   facilitate this.
 - Support shared locking by tracking individual lock instances and
   statistics in the per-thread per-instance lock_profile_object.
 - Make the lock profiling hash table a per-cpu singly linked list with a
   per-cpu static lock_prof allocator.  This removes the need for an array
   of spinlocks and reduces cache contention between cores.
 - Use a seperate hash for spinlocks and other locks so that only a
   critical_enter() is required and not a spinlock_enter() to modify the
   per-cpu tables.
 - Count time spent spinning in the lock statistics.
 - Remove the LOCK_PROFILE_SHARED option as it is always supported now.
 - Specifically drop and release the scheduler locks in both schedulers
   since we track owners now.

In collaboration with:	Kip Macy
Sponsored by:	Nokia
2007-12-15 23:13:31 +00:00
David E. O'Brien
d08aed068a style.Makefile(5) 2007-12-14 21:30:51 +00:00
David Xu
435806d31b Fix LOR of thread lock and umtx's priority propagation mutex due
to the reworking of scheduler lock.

MFC: after 3 days
2007-12-11 08:25:36 +00:00
Robert Watson
63d79c4fd6 Check for P_WEXIT before PHOLD() on a process in kstack and vm query
sysctls, as PHOLD() asserts !P_WEXIT.

Reported by:	Michael Plass <mfp49_freebsd at plass-family dot net>
2007-12-09 17:22:27 +00:00
Joseph Koshy
d07f36b075 Kernel and hwpmc(4) support for callchain capture.
Sponsored by:	FreeBSD Foundation and Google Inc.
2007-12-07 08:20:17 +00:00
John Baldwin
74427aa423 Move several data structure definitions out of freebsd32_misc.c and into
freebsd32.h instead.

MFC after:	1 week
2007-12-06 23:11:27 +00:00
Randall Stewart
cf70a46b47 - Puts default limits on 4k/9k and 16k zones for mbufs all based
on 1/2 of each of the successive limits tied to the limit for
  2k clusters.
- Adds real functionality in so that doing a sysctl to change these
  actually changes them :-)

MFC after:	1 week
2007-12-05 15:29:44 +00:00
Konstantin Belousov
973bdaa06f Use curthread instead of the FIRST_THREAD_IN_PROC for vnlru and syncer,
when applicable.

Aquire Giant slightly later for vnlru.

In the syncer, aquire the Giant only when a vnode belongs to the
non-MPsafe fs.

In both speedup_syncer() and syncer_shutdown(), remove the syncer thread from
the lbolt sleep queue after the syncer state is modified, not before.

Herded by:	attilio
Tested by:	Peter Holm
Reviewed by:	ups
MFC after:	1 week
2007-12-05 09:34:04 +00:00
Craig Rodrigues
62bdb328bb In nmount(), internally convert the mount option: "rdonly" to "ro".
This makes updates mounts such as:
 "mount -u -o rdonly" work more like, "mount -u -o ro".

References to "-o rdonly" were changed to "-o ro" in revision 1.60 of
the mount(8) man page,
but some people still like to use "-o rdonly" since it was documented
in earlier versions of FreeBSD.

Requested by:	rwatson
MFC after:	1 week
2007-12-05 03:26:14 +00:00
Andrew Thompson
22cf347586 Apply a workaround for the unkillable jail problem where some devices created
within the jail are never freed. si_cred is only used by the MAC framework so
make the cred reference conditional on it being compiled in, this is not a fix
and will need to be reviewed for any new consumers of si_cred.

This will quell some user complaint when using jails with a default kernel.

Reviewed by:	rwatson
MFC after:	3 days
2007-12-05 01:22:03 +00:00
Konstantin Belousov
f231de478e Implement fetching of the __FreeBSD_version from the ELF ABI-tag note.
The value is read into the p_osrel member of the struct proc. p_osrel
is set to 0 for the binaries without the note.

MFC after:	3 days
2007-12-04 12:28:07 +00:00
Konstantin Belousov
93d1c72883 Check for the program headers alignment of the ELF images before
dereferencing. Unaligned access could cause panic on strict alignment
architectures.

Reviewed by:	marcel, marius (also tested on sparc64, thanks !)
MFC after:	3 days
2007-12-04 12:21:27 +00:00
Alan Cox
ba63339a0a Introduce an UMA backend page allocator for the jumbo frame zones that
allocates physically contiguous memory.

MFC after: 3 months
Requested and reviewed by: Kip Macy
Tested by: Andrew Gallatin and Pyun YongHyeon
2007-12-04 07:06:08 +00:00
Robert Watson
56905239ae When a symbol name can't be resolved, return "??" as the name, rather
than "Unknown func", in order to avoid putting spaces in what ideally
is a string separated by white space.
2007-12-03 14:44:35 +00:00
Robert Watson
1cc8c45c54 Add another new sysctl in support of the forthcoming procstat(1) to
support its -k argument:

kern.proc.kstack - dump the kernel stack of a process, if debugging
  is permitted.

This sysctl is present if either "options DDB" or "options STACK" is
compiled into the kernel.  Having support for tracing the kernel
stacks of processes from user space makes it much easier to debug
(or understand) specific wmesg's while avoiding the need to enter
DDB in order to determine the path by which a process came to be
blocked on a particular wait channel or lock.
2007-12-02 21:52:18 +00:00
Robert Watson
3c90d1ea74 Break out stack(9) from ddb(4):
- Introduce per-architecture stack_machdep.c to hold stack_save(9).
- Introduce per-architecture machine/stack.h to capture any common
  definitions required between db_trace.c and stack_machdep.c.
- Add new kernel option "options STACK"; we will build in stack(9) if it is
  defined, or also if "options DDB" is defined to provide compatibility
  with existing users of stack(9).

Add new stack_save_td(9) function, which allows the capture of a stacktrace
of another thread rather than the current thread, which the existing
stack_save(9) was limited to.  It requires that the thread be neither
swapped out nor running, which is the responsibility of the consumer to
enforce.

Update stack(9) man page.

Build tested:	amd64, arm, i386, ia64, powerpc, sparc64, sun4v
Runtime tested:	amd64 (rwatson), arm (cognet), i386 (rwatson)
2007-12-02 20:40:35 +00:00
Robert Watson
cc43c38c87 Add two new sysctls in support of the forthcoming procstat(1) to support
its -f and -v arguments:

kern.proc.filedesc - dump file descriptor information for a process, if
  debugging is permitted, including socket addresses, open flags, file
  offsets, file paths, etc.

kern.proc.vmmap - dump virtual memory mapping information for a process,
  if debugging is permitted, including layout and information on
  underlying objects, such as the type of object and path.

These provide a superset of the information historically available
through the now-deprecated procfs(4), and are intended to be exported
in an ABI-robust form.
2007-12-02 10:10:27 +00:00
Alan Cox
30418ed31c Eliminate vfs_page_set_valid()'s unused argument. 2007-12-02 01:28:35 +00:00
Robert Watson
9ccca7d1b1 Modify stack(9) stack_print() and stack_sbuf_print() routines to use new
linker interfaces for looking up function names and offsets from
instruction pointers.  Create two variants of each call: one that is
"DDB-safe" and avoids locking in the linker, and one that is safe for
use in live kernels, by virtue of observing locking, and in particular
safe when kernel modules are being loaded and unloaded simultaneous to
their use.  This will allow them to be used outside of debugging
contexts.

Modify two of three current stack(9) consumers to use the DDB-safe
interfaces, as they run in low-level debugging contexts, such as inside
lockmgr(9) and the kernel memory allocator.

Update man page.
2007-12-01 22:04:16 +00:00
Robert Watson
cdd475b347 The kernel linker includes a number of utility functions to look up symbol
information in support of DDB(4); these functions bypass normal linker
locking as they may run in contexts where locking is unsafe (such as the
kernel debugger).

Add a new interface linker_ddb_search_symbol_name(), which looks up a
symbol name and offset given an address, and also
linker_search_symbol_name() which does the same but *does* follow the
locking conventions of the linker.

Unlike existing functions, these functions place the name in a
caller-provided buffer, which is stable even after linker locks have been
released.  These functions will be used in upcoming revisions to stack(9)
to support kernel stack trace generation in contexts as part of a live,
rather than suspended, kernel.
2007-12-01 19:24:28 +00:00
Peter Wemm
e16aed66ee Deal with the possibility of device_set_unit() being called when attaching
the associated devinfo sysctl tree.
2007-11-30 21:30:14 +00:00
Peter Wemm
cd17ceaab8 Add sysctl_rename_oid() to support device_set_unit() usage. Otherwise,
when unit numbers are changed, the sysctl devinfo tree gets out of sync
and duplicate trees are attempted to be attached with the original name.
2007-11-30 21:29:08 +00:00
Robert Watson
ef54068b54 Move use of 'i' in cp_time sysctl under SCTL_MASK32 so that it compiles
without warnings on systems that don't define it.
2007-11-29 08:38:22 +00:00
Peter Wemm
7628402b07 Move the shared cp_time array (counts %sys, %user, %idle etc) to the
per-cpu area.  cp_time[] goes away and a new function creates a merged
cp_time-like array for things like linprocfs, sysctl etc.  The
atomic ops for updating cp_time[] in statclock go away, and the scope
of the thread lock is reduced.

sysctl kern.cp_time returns a backwards compatible cp_time[] array.
A new kern.cp_times sysctl returns the individual per-cpu stats.

I have pending changes to make top and vmstat optionally show per-cpu
stats.

I'm very aware that there are something like 5 or 6 other versions "out
there" for doing this - but none were handy when I needed them.

I did merge my changes with John Baldwin's, and ended up replacing a
few chunks of my stuff with his, and stealing some other code.

Reviewed by:  jhb
Partly obtained from:  jhb
2007-11-29 06:34:30 +00:00
Attilio Rao
573c6b82df Make ADAPTIVE_GIANT as the default in the kernel and remove the option.
Currently, Giant is not too much contented so that it is ok to treact it
like any other mutexes.

Please don't forget to update your own custom config kernel files.

Approved by:	cognet, marcel (maintainers of arches where option is
		not enabled at the moment)
2007-11-28 05:50:45 +00:00
Attilio Rao
49aead8a10 Simplify the adaptive spinning algorithm in rwlock and mutex:
currently, before to spin the turnstile spinlock is acquired and the
waiters flag is set.
This is not strictly necessary, so just spin before to acquire the
spinlock and to set the flags.
This will simplify a lot other functions too, as now we have the waiters
flag set only if there are actually waiters.
This should make wakeup/sleeping couplet faster under intensive mutex
workload.
This also fixes a bug in rw_try_upgrade() in the adaptive case, where
turnstile_lookup() will recurse on the ts_lock lock that will never be
really released [1].

[1] Reported by: jeff with Nokia help
Tested by: pho, kris (earlier, bugged version of rwlock part)
Discussed with: jhb [2], jeff
MFC after: 1 week

[2] John had a similar patch about 6.x and/or 7.x about mutexes probabilly
2007-11-26 22:37:35 +00:00
Attilio Rao
4a32616a77 Fix the spinlock static table adding missing spinlocks.
- rm_spinlock has turnstile chain as child
- srclock has callout and clk as child, found by witness "emulation".
  Just move it very high in our ranking
2007-11-24 04:32:32 +00:00
Attilio Rao
2c2bebfcb3 transferlockers() is a very dangerous and hack-ish function as waiters
should never be moved by one lock to another.
As, luckily, nothing in our tree is using it, axe the function.

This breaks lockmgr KPI, so interested, third-party modules should update
their source code with appropriate replacement.

Ok'ed by: ups, rwatson
MFC after: 3 days
2007-11-24 04:22:28 +00:00
Kris Kennaway
e6d64a0f15 Remove remaining Giant acquisition around vn_fullpath1. This was missed
in r1.106 and has not been required for some years now.

Reviewed by:  jeff
MFC After:    1 week
2007-11-22 21:26:25 +00:00
Attilio Rao
557f5e51e9 Cache the value of c_lock as it can change, in the struct,
while the global callout spinlock is not held, and can lead to PF#.

Reported by: dougb, Mark Atkinson <atkin901 at yahoo dot com>
Tested by: dougb
Diagnosed by: jhb
2007-11-22 12:15:54 +00:00
David Xu
110de0cf17 Add function UMTX_OP_WAIT_UINT, the function causes thread to wait for
an integer to be changed.
2007-11-21 04:21:02 +00:00
Robert Watson
965b55e2b4 Test that p_textvp is non-NULL be dereferencing, as no executable vnode is
set for kernel processes.

Reported by:	Skip Ford <skip at menantico dot com>
MFC after:	3 days
2007-11-20 18:03:09 +00:00
Attilio Rao
64b9ee201a Add the function callout_init_rw() to callout facility in order to use
rwlocks in conjuction with callouts.  The function does basically what
callout_init_mtx() alredy does with the difference of using a rwlock
as extra argument.
CALLOUT_SHAREDLOCK flag can be used, now, in order to acquire the lock only
in read mode when running the callout handler.  It has no effects when used
in conjuction with mtx.

In order to implement this, underlying callout functions have been made
completely lock type-unaware, so accordingly with this, sysctl
debug.to_avg_mtxcalls is now changed in the generic
debug.to_avg_lockcalls.

Note: currently the allowed lock classes are mutexes and rwlocks because
callout handlers run in softclock swi, so they cannot sleep and they
cannot acquire sleepable locks like sx or lockmgr.

Requested by: kmacy, pjd, rwatson
Reviewed by: jhb
2007-11-20 00:37:45 +00:00
John Baldwin
790c2471b9 Bump up the number of ttys supported by pty(4) to 512 by making use of
[pt]ty[lmnoLMNO][0-9a-v].

MFC after:	3 days
Reviewed by:	rwatson
2007-11-19 20:49:42 +00:00
Jean-Sébastien Pédron
4b5b09e744 The kernel uses two ways to write data on a pipe:
o  buffered write, for chunks smaller than PIPE_MINDIRECT bytes
    o  direct write, for everything else

A call to writev(2) may receive struct iov of various size and the
kernel may have to switch from one solution to the other. Before doing
this, it must wake reader processes and any select/poll/kqueue up.

This commit fixes a bug where select/poll/kqueue are not triggered
when switching from buffered write to direct write. It adds calls to
pipeselwakeup().

I give more details on freebsd-arch@:
http://lists.freebsd.org/pipermail/freebsd-arch/2007-September/006790.html

This should fix issues with Erlang (lang/erlang) and kqueue.

Reported by:	Rickard Green (Erlang)
2007-11-19 15:05:20 +00:00
Attilio Rao
f9721b43ed Expand lock class with the "virtual" function lc_assert which will offer
an unified way for all the lock primitives to express lock assertions.
Currenty, lockmgrs and rmlocks don't have assertions, so just panic in
that case.
This will be a base for more callout improvements.

Ok'ed by: jhb, jeff
2007-11-18 14:43:53 +00:00
Randall Stewart
7c7454fe95 - Add in missing event handler invokes for initial proc and thread. 2007-11-18 13:56:51 +00:00
John Birrell
f6c1530162 Add a function to list symbols in a file and their values at the
same time rather than having to list the symbols and then go back
and look each one up by name.
2007-11-18 00:23:31 +00:00
John Baldwin
cd808cec50 Acquire the process mutex and spin locks before calling thread_exit() in
kthread_exit() to fix panics when using INVARIANTS.
2007-11-15 21:45:17 +00:00
Randall Stewart
b209f88986 - Adds event handlers for process_ctor,process_dtor, process_init,
process_fini, thread_ctor, thread_dtor, thread_init, thread_fini. This
  will allow us to extend dynamically areas in proc/thread for dtrace ;-)
Reviewed by:    rwatson
2007-11-15 14:20:07 +00:00
Gleb Smirnoff
d8410b8edf Fix build. 2007-11-15 14:16:20 +00:00
Randall Stewart
4a62a3e556 Adds an event handler for:
- process_ctor,dtor, init and fini
  - thread_ctor,dtor, init and fini
This allows the ability to add on additional things
during construction/destruction of threads and processes.

Reviewed by:	rwatson
2007-11-15 13:28:54 +00:00
Julian Elischer
c67ddc21e7 This time REALLY copy the name from the proc to the thread as a default. 2007-11-15 06:35:26 +00:00
Julian Elischer
4b9322aee8 When forking, the new thread deserves a name too. Don't just use the
td_startcopy section as it is not the right thing to do
in other cases (e.g. if starting a new thread from one that is already named).
2007-11-15 02:13:44 +00:00
Attilio Rao
6f5c319c12 Remove a bogus KASSERT which will prevent rwlock to be acquired
recursively in exclusive mode with debugging kernels.

Submitted by: kmacy
Approved by: jeff
2007-11-14 21:21:48 +00:00
Marcel Moolenaar
0c3967e7fe o Rename cpu_thread_setup() to cpu_thread_alloc() to better
communicate that it relates to (is called by) thread_alloc()
o  Add cpu_thread_free() which is called from thread_free()
   to counter-act cpu_thread_alloc().

i386:	Have cpu_thread_free() call cpu_thread_clean() to
	preserve behaviour.
ia64:	Have cpu_thread_free() call mtx_destroy() for the
	mutex initialized in cpu_thread_alloc().

PR: ia64/118024
2007-11-14 20:21:54 +00:00
Julian Elischer
e01eafef2a A bunch more files that should probably print out a thread name
instead of a process name.
2007-11-14 06:51:33 +00:00
Julian Elischer
431f890614 generally we are interested in what thread did something as
opposed to what process. Since threads by default have teh name of the
process unless over-written with more useful information, just print the
thread name instead.
2007-11-14 06:21:24 +00:00
Julian Elischer
ca081fdbc5 Make sure there is a good default thread name for all threads. 2007-11-14 06:04:57 +00:00
Robert Watson
433ea89af4 Add rm_wowned(9) function to test whether the current thread owns an
exclusive lock on the passed rmlock.

Reviewed by:	ups
2007-11-10 15:06:30 +00:00
John Baldwin
87a194514b A couple of optimizations to the last commit.
Submitted by:	Christoph Mallon christoph mallon of gmx de
2007-11-08 21:45:56 +00:00
Stephan Uphoff
dda7aec745 Use VM_FAULT_DIRTY to fault in pages for write access in
proc_rwmen.
Otherwise copy on write may create an anonymous page that is
not marked as dirty. Since  writing data to these pages
in this function also does not dirty these pages they may be
later discarded by the pagedaemon.
2007-11-08 19:35:36 +00:00
John Baldwin
db27a9dac7 Make it easier to add more ptys to the pty(4) driver:
- Use unit2minor() and minor2unit() to generate minor numbers to support
  unit numbers higher than 255.
- Use simple string operations on the 'names' array rather than hard-coded
  constants and switch statements so that more ptys can be added by simply
  expanding the 'names' array.

MFC after:	1 week
2007-11-08 15:51:52 +00:00
Stephan Uphoff
f53d15fe1b Initial checkin for rmlock (read mostly lock) a multi reader single writer
lock optimized for almost exclusive reader access. (see also rmlock.9)

TODO:
    Convert to per cpu variables linkerset as soon as it is available.
    Optimize UP (single processor)  case.
2007-11-08 14:47:55 +00:00
Robert Watson
088f584961 Remove unused variable td from sched_idletd().
MFC after:	3 days
Found with:	Coverity Prevent(tm)
CID:		3561
2007-11-05 12:01:12 +00:00
Konstantin Belousov
89b57fcf01 Fix for the panic("vm_thread_new: kstack allocation failed") and
silent NULL pointer dereference in the i386 and sparc64 pmap_pinit()
when the kmem_alloc_nofault() failed to allocate address space. Both
functions now return error instead of panicing or dereferencing NULL.

As consequence, vmspace_exec() and vmspace_unshare() returns the errno
int. struct vmspace arg was added to vm_forkproc() to avoid dealing
with failed allocation when most of the fork1() job is already done.

The kernel stack for the thread is now set up in the thread_alloc(),
that itself may return NULL. Also, allocation of the first process
thread is performed in the fork1() to properly deal with stack
allocation failure. proc_linkup() is separated into proc_linkup()
called from fork1(), and proc_linkup0(), that is used to set up the
kernel process (was known as swapper).

In collaboration with:	Peter Holm
Reviewed by:	jhb
2007-11-05 11:36:16 +00:00
Julian Elischer
f4bb4fc8f3 Completely remove the code for single threading the mainline fork code.
Put in a little comment explaining why it went away.
Re-enable it in the case there an exisiting process is just splitting
off its address space and file descriptors.
(I donpt think anything uses that code but it needs some sort of locking
and this does the job.

Reviewed by:	Davidxu, alc, others
MFC after:	3 days
2007-11-02 19:40:36 +00:00
Nate Lawson
a15e947d54 If we're on an SMP kernel and there is more than 1 CPU, reject any attempts
to change the freq before the other CPUs are active.  The current code
always attempts to change all CPUs to match each other, and the requisite
sched_bind() call won't work before APs are launched.
2007-10-30 22:18:08 +00:00
Julian Elischer
539976ffdf fix typo in code normally not compiled in. 2007-10-29 20:45:31 +00:00
Julian Elischer
3c1ffc320f Fix typo in code obviously not being compiled on any of my machines.
found by: rdivacky@
2007-10-28 23:11:57 +00:00
John Baldwin
9dddab6fc1 Change the roundrobin implementation in the 4BSD scheduler to trigger a
userland preemption directly from hardclock() via sched_clock() when a
thread uses up a full quantum instead of using a periodic timeout to cause
a userland preemption every so often.  This fixes a potential deadlock
when IPI_PREEMPTION isn't enabled where softclock blocks on a lock held
by a thread pinned or bound to another CPU.  The current thread on that
CPU will never be preempted while softclock is blocked.

Note that ULE already drives its round-robin userland preemption from
sched_clock() as well and always enables IPI_PREEMPT.

MFC after:	1 week
2007-10-27 22:07:40 +00:00
Craig Rodrigues
b4b5bf359b In nmount(), if MNT_ROOT is in the mount flags, filter it
out instead of returning an error.
(1)  This makes the behavior consistent with mount(2).
(2)  This makes update mounts on the root file system work properly.
(3)  The explicit checks for MNT_ROOTFS in src/sbin/fsck_ffs/main.c
     and src/usr.sbin/mountd/mountd.c which were put in to
     eliminate errors during update mounts on the root file system
     can be removed.

The only place were MNT_ROOTFS can be validly set
is inside the kernel, i.e. with vfs_mountroot_try().

Reviewed by:	phk
MFC after:	3 days
2007-10-27 15:59:18 +00:00
Julian Elischer
6a564b46b6 Add support for the pre-exisiting module shutdoen handshake.
Fix some comments.
2007-10-27 00:54:16 +00:00
Julian Elischer
9ef95d0105 rename the process to 'idle' and 'intr' as per jhb. 2007-10-27 00:52:26 +00:00
Julian Elischer
fbf7046447 Initialise the initial process pointer to NULL so that we know we don't
have an idle process yet.
I'm guessing that on my system this was always 0 already.

found by: Ed Schouten
2007-10-27 00:42:40 +00:00
Julian Elischer
6bc3d1dc09 If kthread_exit() is called on the last kthread in a kproc, then
all the work in kproc_exit must be done.
We don't actually have a user of this yet but why leave it to chance.
2007-10-26 22:18:20 +00:00
Julian Elischer
ca9a0ddf31 if one changes a function's arguments, one must also change the callers. 2007-10-26 22:03:19 +00:00
Julian Elischer
5f66cfca51 oops, over optimised and broke non-SMP builds 2007-10-26 20:32:33 +00:00
Julian Elischer
dd1b3ff97e kthread_exit needs no stinkin argument. 2007-10-26 17:03:22 +00:00
David E. O'Brien
ef44c8d2a3 style(9) 2007-10-26 16:33:47 +00:00
Julian Elischer
7ab24ea3b9 Introduce a way to make pure kernal threads.
kthread_add() takes the same parameters as the old kthread_create()
plus a pointer to a process structure, and adds a kernel thread
to that process.

kproc_kthread_add() takes the parameters for kthread_add,
plus a process name and a pointer to a pointer to a process instead of just
a pointer, and if the proc * is NULL, it creates the process to the
specifications required, before adding the thread to it.

All other old kthread_xxx() calls return, but act on (struct thread *)
instead of (struct proc *). One reason to change the name is so that
any old kernel modules that are lying around and expect kthread_create()
to make a process will not just accidentally link.

fix top to show  kernel threads by their thread name in -SH mode
add a tdnam formatting option to ps to show thread names.

make all idle threads actual kthreads and put them into their own idled process.
make all interrupt threads kthreads and put them in an interd process
(mainly for aesthetic and accounting reasons)
rename proc 0 to be 'kernel' and it's swapper thread is now 'swapper'

man page fixes to follow.
2007-10-26 08:00:41 +00:00
Christian S.J. Peron
57274c513c Implement AUE_CORE, which adds process core dump support into the kernel.
This change introduces audit_proc_coredump() which is called by coredump(9)
to create an audit record for the coredump event.  When a process
dumps a core, it could be security relevant.  It could be an indicator that
a stack within the process has been overflowed with an incorrectly constructed
malicious payload or a number of other events.

The record that is generated looks like this:

header,111,10,process dumped core,0,Thu Oct 25 19:36:29 2007, + 179 msec
argument,0,0xb,signal
path,/usr/home/csjp/test.core
subject,csjp,csjp,staff,csjp,staff,1101,1095,50457,10.37.129.2
return,success,1
trailer,111

- We allocate a completely new record to make sure we arent clobbering
  the audit data associated with the syscall that produced the core
  (assuming the core is being generated in response to SIGABRT  and not
  an invalid memory access).
- Shuffle around expand_name() so we can use the coredump name at the very
  beginning of the coredump call.  Make sure we free the storage referenced
  by "name" if we need to bail out early.
- Audit both successful and failed coredump creation efforts

Obtained from:	TrustedBSD Project
Reviewed by:	rwatson
MFC after:	1 month
2007-10-26 01:23:07 +00:00
Robert Watson
30d239bc4c Merge first in a series of TrustedBSD MAC Framework KPI changes
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:

  mac_<object>_<method/action>
  mac_<object>_check_<method/action>

The previous naming scheme was inconsistent and mostly
reversed from the new scheme.  Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier.  Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods.  Also simplify, slightly,
some entry point names.

All MAC policy modules will need to be recompiled, and modules
not updates as part of this commit will need to be modified to
conform to the new KPI.

Sponsored by:	SPARTA (original patches against Mac OS X)
Obtained from:	TrustedBSD Project, Apple Computer
2007-10-24 19:04:04 +00:00
Christian S.J. Peron
5ff3816d82 Move where we audit the PID argument such that we unconditionally
audit it at the beginning of the syscall.  This fixes a problem
where the user supplies an invalid process ID which is > 0 which
results in the PID argument not being audited.

Obtained from:	TrustedBSD Project
MFC after:	1 week
2007-10-24 00:14:19 +00:00
Julian Elischer
e9271f5376 Take out the single-threading code in fork.
After discussions with jeff, alc, (various Ironport people), david Xu,
and mostly Alfred (who found the problem) it has been demonstrated that this
is not needed for our implementations of threads and represents a real
(as in we've seen it happen a lot) deadlock danger.

Several points:
 Since forking multiple threads is not allowed, and posix states that
 any mutexes owned by othre threads wilol be owned in the child by
 phantom threads, and therads shouldn't ba accessing shared structures without
 protection, It can be proved that if this leads to the child process accessing
 inconsistent data, it's a programming error.

 The mode of thread_single() being used in fork() is the wrong one.
 It is using SINGLE_NO_EXIT when it should be using SINGLE_BOUNDARY.

 Even if this we used, System processes have no need to do it as they have
 no userland to get inconsistent.

  This commmit first fixes the above bugs to get tehm correct in CVS.
  then removes them with #ifdef.
  This is so that history contains the corrected version should it
  be needed in the future.
  This code may be needed if we implement the forkall() syscall from
  Solaris. It may be needed for other non-posix thread libraries
  at some time in the future, so let the code sit for a short while
  while I do some work on it anyhow.

This removes a reproducible lockup in NFS.
It may be argued that maybe doing a fork while holding a vnode lock may
not be the best idea in th efirst place but it shouldn't cause a deadlock.
The removal has been running under soak test for several days now.

This removal should be seriously considered for 7.0 and RELENG_6.

Note. There is code in the core-dumping code that may have a similar problem
with coredumping threaded processes

MFC After: 4 days
2007-10-23 17:54:15 +00:00
Peter Grehan
cbdd62ad04 Cut over to ULE on PowerPC
kern/sched_ule.c - Add __powerpc__ to the list of supported architectures

powerpc/conf/GENERIC - Swap SCHED_4BSD with SCHED_ULE

powerpc/powerpc/genassym.c - Export TD_LOCK field of thread struct

powerpc/powerpc/swtch.S - Handle new 3rd parameter to cpu_switch() by
 updating the old thread's lock. Note: uniprocessor-only, will require
 modification for MP support.

powerpc/powerpc/vm_machdep.c - Set 3rd param of cpu_switch to mutex of
old thread's lock, making the call a no-op.

Reviewed by:	marcel, jeffr (slightly older version)
2007-10-23 00:52:25 +00:00
John Birrell
1676805c18 Add the full module path name to the kld_file_stat structure
for kldstat(2).

This allows libdtrace to determine the exact file from which
a kernel module was loaded without having to guess.

The kldstat(2) API is versioned with the size of the
kld_file_stat structure, so this change creates version 2.

Add the pathname to the verbose output of kldstat(8) too.

MFC: 3 days
2007-10-22 04:12:57 +00:00
Robert Watson
e41966dc35 Add PRIV_VFS_STAT privilege, which will allow overriding policy limits on
the right to stat() a file, such as in mac_bsdextended.

Obtained from:	TrustedBSD Project
MFC after:	3 months
2007-10-21 22:50:11 +00:00
Julian Elischer
3745c395ec Rename the kthread_xxx (e.g. kthread_create()) calls
to kproc_xxx as they actually make whole processes.
Thos makes way for us to add REAL kthread_create() and friends
that actually make theads. it turns out that most of these
calls actually end up being moved back to the thread version
when it's added. but we need to make this cosmetic change first.

I'd LOVE to do this rename in 7.0  so that we can eventually MFC the
new kthread_xxx() calls.
2007-10-20 23:23:23 +00:00
Ed Maste
7188e3c834 Put comments about syscalls by the correct ones, and use the correct syscall
number in the comment.
2007-10-19 19:17:53 +00:00
Sam Leffler
58590eb06b ULE works fine on arm; allow it to be used
Reviewed by:	jeff, cognet, imp
MFC after:	1 week
2007-10-16 19:25:26 +00:00
Alfred Perlstein
7c45a9c446 Export maxswzone, maxbcache, maxtsiz, dfldsiz, maxdsiz, dflssiz, maxssiz,
and sgrowsiz via sysctl.

MFC after: 1 week
2007-10-16 10:40:53 +00:00
Alexander Leidinger
9f05d312b3 Backout sensors framework.
Requested by:	phk
Discussed on:	cvs-all
2007-10-15 20:00:24 +00:00
Alexander Leidinger
99f6b270e3 Import OpenBSD's sysctl hardware sensors framework.
This commit includes the following core components:

 * sample configuration file for sensorsd
 * rc(8) script and glue code for sensorsd(8)
 * sysctl(3) doc fixes for CTL_HW tree
 * sysctl(3) documentation for hardware sensors
 * sysctl(8) documentation for hardware sensors
 * support for the sensor structure for sysctl(8)
 * rc.conf(5) documentation for starting sensorsd(8)
 * sensor_attach(9) et al documentation
 * /sys/kern/kern_sensors.c
   o sensor_attach(9) API for drivers to register ksensors
   o sensor_task_register(9) API for the update task
   o sysctl(3) glue code
   o hw.sensors shadow tree for sysctl(8) internal magic
 * <sys/sensors.h>
 * HW_SENSORS definition for <sys/sysctl.h>
 * sensors display for systat(1), including documentation
 * sensorsd(8) and all applicable documentation

The userland part of the framework is entirely source-code
compatible with OpenBSD 4.1, 4.2 and  -current as of today.

All sensor readings can be viewed with `sysctl hw.sensors`,
monitored in semi-realtime with `systat -sensors` and also
logged with `sensorsd`.

Submitted by:	Constantine A. Murenin <cnst@FreeBSD.org>
Sponsored by:	Google Summer of Code 2007 (GSoC2007/cnst-sensors)
Mentored by:	syrinx
Tested by:	many
OKed by:	kensmith
Obtained from:	OpenBSD (parts)
2007-10-14 10:45:31 +00:00
Dag-Erling Smørgrav
d302c56d9b I don't know what I was smoking when I wrote these three years ago; the
return value is an error code, hence always an int.

While I'm here, add getenv_uint() for completeness.
2007-10-13 11:30:19 +00:00
Mohan Srinivasan
58d14dae6d Set the NFS server sockbuf high watermarks to the system defaults
(up form 32KB). The low highwatermark setting caused UDP fullsock
request drops, throttling thruput greatly.
Reported by: Kris Kennaway
Approved by: re@ (Ken Smith)
2007-10-12 03:56:27 +00:00
Jeff Roberson
8753688f03 - Fix from pr kern/115469; Don't redeliver a signal once it has been
handled by the target process.

Contributed by:	Tijl Coosemans <tijl@ulyssis.org>
Approved by:	re
2007-10-09 00:03:39 +00:00
Jeff Roberson
88f530cc25 - Bail out of tdq_idled if !smp_started or idle stealing is disabled. This
fixes a bug on UP machines with SMP kernels where the idle thread
   constantly switches after trying to steal work from the local cpu.
 - Make the idle stealing code more robust against self selection.
 - Prefer to steal from the cpu with the highest load that has at least one
   transferable thread.  Before we selected the cpu with the highest
   transferable count which excludes bound threads.

Collaborated with:	csjp
Approved by:		re
2007-10-08 23:50:39 +00:00
Jeff Roberson
05dc0eb204 - Restore historical sched_yield() behavior by changing sched_relinquish()
to simply switch rather than lowering priority and switching.  This allows
   threads of equal priority to run but not lesser priority.

Discussed with:	davidxu
Reported by:	NIIMI Satoshi <sa2c@sa2c.net>
Approved by:	re
2007-10-08 23:45:24 +00:00
Jeff Roberson
40a940af86 - Restore historical yield() behavior by manually lowering priority and
switching.

Approved by:	re
2007-10-08 23:40:40 +00:00
Jeff Roberson
5bce4ae3be - Fix ULE in kernels without PREEMPTION compiled in by always enabling the
critical_exit() owepreempt check.  ULE will always use owepreempt to
   preempt the idle thread.  This change does not effect 4BSD since it will
   never set owepreempt without PREEMPTION enabled.
 - Remove some unused code from choosethread().

Discussed with:	jhb
Approved by:	re
2007-10-08 23:37:28 +00:00
Kip Macy
457869b973 This patch adds an M_NOFREE flag which allows one to mark an mbuf as
not being independently freeable. This allows one to embed an mbuf in
the cluster itself. This confers the benefits of the packet zone on
all cluster sizes. Embedded mbufs currently suffer from the same
limitation that packet zone mbufs do in that one cannot disconnect
them and pass them around independently of the cluster. It would
likely be possible to eliminate this limitation in the future by
adding a second reference for the mbuf itself.

Approved by: re(gnn)
2007-10-06 21:42:39 +00:00
Kip Macy
629b9e0853 Allow drivers to free an mbuf without having the mbuf be touched if
the driver has already freed any attached tags

Approved by: re(gnn)
2007-10-06 21:13:55 +00:00
Pawel Jakub Dawidek
764a938b11 Fix sx_try_slock(), so it only fails when there is an exclusive owner.
Before that fix, it was possible for the function to fail if number
of sharers changes between 'x = sx->sx_lock' step and atomic_cmpset_acq_ptr()
call.

This fixes ZFS problem when ZFS returns strange EIO errors under load.
In ZFS there is a code that depends on the fact that sx_try_slock() can
only fail if there is an exclusive owner.

Discussed with:	attilio
Reviewed by:	jhb
Approved by:	re (kensmith)
2007-10-02 14:48:48 +00:00
Jeff Roberson
59c6813475 - Reassign the thread queue lock to newtd prior to switching. Assigning
after the switch leads to a race where the outgoing thread still owns
   the local queue lock while another cpu may switch it in.  This race
   is only possible on machines where cpu_switch can take significantly
   longer on different cpus which in practice means HTT machines with
   unfair thread scheduling algorithms.

Found by:	kris (of course)
Approved by:	re
2007-10-02 01:30:18 +00:00
Jeff Roberson
7fcf154aef - Move the rebalancer back into hardclock to prevent potential softclock
starvation caused by unbalanced interrupt loads.
 - Change the rebalancer to work on stathz ticks but retain randomization.
 - Simplify locking in tdq_idled() to use the tdq_lock_pair() rather than
   complex sequences of locks to avoid deadlock.

Reported by:	kris
Approved by:	re
2007-10-02 00:36:06 +00:00
Jeff Roberson
02e2d6b445 - Honor the PREEMPTION and FULL_PREEMPTION flags by setting the default
value for kern.sched.preempt_thresh appropriately.  It can still by
   adjusted at runtime.  ULE will still use IPI_PREEMPT in certain
   migration situations.
 - Assert that we're not trying to compile ULE on an unsupported
   architecture.  To date, I believe only i386 and amd64 have implemented
   the third cpu switch argument required.

Approved by:	re
2007-09-27 16:39:27 +00:00
Ruslan Ermilov
718a600b20 Fix the description of the formula used to autosize the number of
buffers in the buffer cache.

Approved by:	re (kensmith)
2007-09-26 11:22:23 +00:00
Alan Cox
7bfda801a8 Change the management of cached pages (PQ_CACHE) in two fundamental
ways:

(1) Cached pages are no longer kept in the object's resident page
splay tree and memq.  Instead, they are kept in a separate per-object
splay tree of cached pages.  However, access to this new per-object
splay tree is synchronized by the _free_ page queues lock, not to be
confused with the heavily contended page queues lock.  Consequently, a
cached page can be reclaimed by vm_page_alloc(9) without acquiring the
object's lock or the page queues lock.

This solves a problem independently reported by tegge@ and Isilon.
Specifically, they observed the page daemon consuming a great deal of
CPU time because of pages bouncing back and forth between the cache
queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE).  The source of
this problem turned out to be a deadlock avoidance strategy employed
when selecting a cached page to reclaim in vm_page_select_cache().
However, the root cause was really that reclaiming a cached page
required the acquisition of an object lock while the page queues lock
was already held.  Thus, this change addresses the problem at its
root, by eliminating the need to acquire the object's lock.

Moreover, keeping cached pages in the object's primary splay tree and
memq was, in effect, optimizing for the uncommon case.  Cached pages
are reclaimed far, far more often than they are reactivated.  Instead,
this change makes reclamation cheaper, especially in terms of
synchronization overhead, and reactivation more expensive, because
reactivated pages will have to be reentered into the object's primary
splay tree and memq.

(2) Cached pages are now stored alongside free pages in the physical
memory allocator's buddy queues, increasing the likelihood that large
allocations of contiguous physical memory (i.e., superpages) will
succeed.

Finally, as a result of this change long-standing restrictions on when
and where a cached page can be reclaimed and returned by
vm_page_alloc(9) are eliminated.  Specifically, calls to
vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and
return a formerly cached page.  Consequently, a call to malloc(9)
specifying M_NOWAIT is less likely to fail.

Discussed with: many over the course of the summer, including jeff@,
   Justin Husted @ Isilon, peter@, tegge@
Tested by: an earlier version by kris@
Approved by: re (kensmith)
2007-09-25 06:25:06 +00:00
Jeff Roberson
e270652ba3 - Bound the interactivity score so that it cannot become negative.
Approved by:	re
2007-09-24 00:28:54 +00:00
Jeff Roberson
a5423ea313 - Improve grammar. s/it's/its/.
- Improve load long-term load balancer by always IPIing exactly once.
   Previously the delay after rebalancing could cause problems with
   uneven workloads.
 - Allow nice to have a linear effect on the interactivity score.  This
   allows negatively niced programs to stay interactive longer.  It may be
   useful with very expensive Xorg servers under high loads.  In general
   it should not be necessary to alter the nice level to improve interactive
   response.  We may also want to consider never allowing positively niced
   processes to become interactive at all.
 - Initialize ccpu to 0 rather than 0.0.  The decimal point was leftover
   from when the code was copied from 4bsd.  ccpu is 0 in ULE because ULE
   only exports weighted cpu values.

Reported by:	Steve Kargl (Load balancing problem)
Approved by:	re
2007-09-22 02:20:14 +00:00
Pawel Jakub Dawidek
b4d7e2983c Fix some locking cases where we ask for exclusively locked vnode, but we get
shared locked vnode in instead when vfs.lookup_shared is set to 1.

Discussed with:	kib, kris
Tested by:	kris
Approved by:	re (kensmith)
2007-09-21 10:16:56 +00:00
Jeff Roberson
54b0e65f84 - Redefine p_swtime and td_slptime as p_swtick and td_slptick. This
changes the units from seconds to the value of 'ticks' when swapped
   in/out.  ULE does not have a periodic timer that scans all threads in
   the system and as such maintaining a per-second counter is difficult.
 - Change computations requiring the unit in seconds to subtract ticks
   and divide by hz.  This does make the wraparound condition hz times
   more frequent but this is still in the range of several months to
   years and the adverse effects are minimal.

Approved by:	re
2007-09-21 04:10:23 +00:00
Jeff Roberson
f462501739 - Call sched_sleep() before we suspend threads. sched_wakeup() is already
called via setrunnable().  This allows time slept while suspended to
   be accounted for swap.

Approved by:	re
2007-09-21 04:04:22 +00:00
Attilio Rao
c8790f5d09 Fix some entries in the locks static table of witness.
In particular:
- smp_tlb_mtx is no longer used, so it is axed.
- smp rendezvous lock isn't really a leaf spin-mutex. Its bad placement in
  the table, however, has been the source of a false positive LOR reporting
  with the dt_lock.  However, smp rendezvous lock would have had sched_lock
  there for older lock, so it wasn't still a leaf lock.
- allpmaps is only used in ia32 architecture, so it is inserted in the
  appropriate stub.

Addictionally:
- kse_zombie_lock is no longer present, so its definition is axed out.
- zombie_lock doesn't need to have an exported symbol, so just let's it be
  declared as static.

Tested by: kris
Approved by: jeff (mentor)
Approved by: re
2007-09-20 20:38:43 +00:00
Jeff Roberson
b61ce5b0e6 - Move all of the PS_ flags into either p_flag or td_flags.
- p_sflag was mostly protected by PROC_LOCK rather than the PROC_SLOCK or
   previously the sched_lock.  These bugs have existed for some time.
 - Allow swapout to try each thread in a process individually and then
   swapin the whole process if any of these fail.  This allows us to move
   most scheduler related swap flags into td_flags.
 - Keep ki_sflag for backwards compat but change all in source tools to
   use the new and more correct location of P_INMEM.

Reported by:	pho
Reviewed by:	attilio, kib
Approved by:	re (kensmith)
2007-09-17 05:31:39 +00:00
Robert Watson
dce5df0dfc Remove the definition and implementation of 'CALLOUT_NETGIANT', a now- (and
possibly always-) unused define.

Reported by:	kmacy
Approved by:	re (kensmith)
2007-09-15 12:33:24 +00:00
Attilio Rao
4486adc51f Currently the LO_NOPROFILE flag (which is masked on upper level code by
per-primitive macros like MTX_NOPROFILE, SX_NOPROFILE or RW_NOPROFILE) is
not really honoured. In particular lock_profile_obtain_lock_failure() and
lock_profile_obtain_lock_success() are naked respect this flag.
The bug leads to locks marked with no-profiling to be profiled as well.
In the case of the clock_lock, used by the timer i8254 this leads to
unpredictable behaviour both on amd64 and ia32 (double faults panic,
sudden reboots, etc.). The amd64 clock_lock is also not marked as
not profilable as it should be.
Fix these bugs adding proper checks in the lock profiling code and at
clock_lock initialization time.

i8254 bug pointed out by: kris
Tested by: matteo, Giuseppe Cocomazzi <sbudella at libero dot it>
Approved by: jeff (mentor)
Approved by: re
2007-09-14 01:12:39 +00:00
Attilio Rao
c7fb7ce53a subr_sleepqueue.c presents a thread lock missing which leads to dangerous
races for some struct thread members.
More specifically, this bug seems responsible for some memory dumping
problems people were experiencing.

Fix this adding correct thread locking.

Tested by: rwatson
Submitted by: tegge
Approved by: jeff
Approved by: re
2007-09-13 09:12:36 +00:00
Konstantin Belousov
245b204491 When restoring the mount after umount failed, the MNTK_UNMOUNT flag
prevents insmntque() from placing reallocated syncer vnode on mount
list, that causes panic in vfs_allocate_syncvnode().

Introduce MNTK_NOINSMNTQ flag, that marks the period when instmntque is
not allowed to success, instead of MNTK_UNMOUNT. The MNTK_NOINSMNTQ is
set and cleared simultaneously with MNTK_UNMOUNT, except on umount error
path, where it is cleaned just before the syncer vnode is going to be
allocated.

Reported by:	Peter Jeremy <peterjeremy optushome com au>
Suggested by:	tegge
Approved by:	re (rwatson)
2007-09-12 16:31:32 +00:00
Attilio Rao
0b2e598c14 This is a follow-up, cleaning-up commit about recent changes involving
topology foo functions.
Working at the patch for topology problems in ia32/amd64 evicted some
problems regarding functions ordering in the SI_SUB_CPU family of
SYSINIT'ed subsystems.
In order to avoid problems with new modified to involved functions, a
correct ordering is not semantically specified for SI_SUB_CPU functions
(for a larger view of the issue please visit:
http://lists.freebsd.org/pipermail/freebsd-current/2007-July/075409.html )

Discussed with: peter
Tested by: kris, Rui Paulo <rpaulo@FreeBSD.org>
Approved by: jeff
Approved by: re
2007-09-11 22:54:09 +00:00
Robert Watson
45e0f3d63d Rename mac_check_vnode_delete() MAC Framework and MAC Policy entry
point to mac_check_vnode_unlink(), reflecting UNIX naming conventions.

This is the first of several commits to synchronize the MAC Framework
in FreeBSD 7.0 with the MAC Framework as it will appear in Mac OS X
Leopard.

Reveiwed by:    csjp, Samy Bahra <sbahra at gwu dot edu>
Submitted by:   Jacques Vidrine <nectar at apple dot com>
Obtained from:  Apple Computer, Inc.
Sponsored by:   SPARTA, SPAWAR
Approved by:    re (bmah)
2007-09-10 00:00:18 +00:00
Robert Watson
70ffc2fb53 In userland_sysctl(), call useracc() with the actual newlen value to be
used, rather than the one passed via 'req', which may not reflect a
rewrite.  This call to useracc() is redundant to validation performed by
later copyin()/copyout() calls, so there isn't a security issue here,
but this could technically lead to excessive validation of addresses if
the length in newlen is shorter than req.newlen.

Approved by:	re (kensmith)
Reviewed by:	jhb
Submitted by:	Constantine A. Murenin <cnst+freebsd@bugmail.mojo.ru>
Sponsored by:	Google Summer of Code 2007
2007-09-02 09:59:33 +00:00
John Baldwin
67b158d888 Close a race that snuck in with the recent changes to fix a LOR between
the callout_lock spin lock and the sleepqueue spin locks.  In the fix,
callout_drain() has to drop the callout_lock so it can acquire the
sleepqueue lock.  The state of the callout can change while the
callout_lock is held however (for example, it can be rescheduled via
callout_reset()).  The previous code assumed that the only state change
that could happen is that the callout could finish executing.  This change
alters callout_drain() to effectively restart and recheck everything
after it acquires the sleepqueue lock thus handling all the possible
states that the callout could be in after any changes while callout_lock
was dropped.

Approved by:	re (kensmith)
Tested by:	kris
2007-08-31 19:01:30 +00:00
Diomidis Spinellis
d5b6981e69 Add missing newline in the log message of the previous commit.
Approved by:	re (kensmith) - implied
2007-08-31 13:56:26 +00:00
Diomidis Spinellis
72de1b3709 Don't panic. When encountering a negative value call log(LOG_NOTICE, ...)
and record LONG_MAX, instead of calling KASSERT(...).

Reported by:	rwatson
Approved by:	re (kensmith)
2007-08-31 13:36:58 +00:00
John Baldwin
57b7fe337e Partially revert the previous change. I failed to notice that where
ktruserret() is invoked, an unlocked check of  the per-process queue
is performed inline, thus, we don't lock the ktrace_sx on every userret().

Pointy hat to:	jhb
Approved by:	re (kensmith)
Pointy hat recovered from:	rwatson
2007-08-29 21:17:11 +00:00
John Baldwin
cc479dda4a Rework the routines to convert a 5.x+ statfs structure (with fixed-size
64-bit counters) to a 4.x statfs structure (with long-sized counters).
- For block counters, we scale up the block size sufficiently large so
  that the resulting block counts fit into a the long-sized (long for the
  ABI, so 32-bit in freebsd32) counters.  In 4.x the NFS client's statfs
  VOP did this already.  This can lie about the block size to 4.x binaries,
  but it presents a more accurate picture of the ratios of free and
  available space.
- For non-block counters, fix the freebsd32 stats converter to cap the
  values at INT32_MAX rather than losing the upper 32-bits to match the
  behavior of the 4.x statfs conversion routine in vfs_syscalls.c

Approved by:	re (kensmith)
2007-08-28 20:28:12 +00:00
Randall Stewart
2afb3e849f - During shutdown pending, when the last sack came in and
the last message on the send stream was "null" but still
  there, a state we allow, we could get hung and not clean
  it up and wait for the shutdown guard timer to clear the
  association without a graceful close. Fix this so that
  that we properly clean up.
- Added support for Multiple ASCONF per new RFC. We only
  (so far) accept input of these and cannot yet generate
  a multi-asconf.
- Sysctl'd support for experimental Fast Handover feature. Always
  disabled unless sysctl or socket option changes to enable.
- Error case in add-ip where the peer supports AUTH and ADD-IP
  but does NOT require AUTH of ASCONF/ASCONF-ACK. We need to
  ABORT in this case.
- According to the Kyoto summit of socket api developers
  (Solaris, Linux, BSD). We need to have:
   o non-eeor mode messages be atomic - Fixed
   o Allow implicit setup of an assoc in 1-2-1 model if
     using the sctp_**() send calls - Fixed
   o Get rid of HAVE_XXX declarations - Done
   o add a sctp_pr_policy in hole in sndrcvinfo structure - Done
   o add a PR_SCTP_POLICY_VALID type flag - yet to-do in a future patch!
- Optimize sctp6 calls to reuse code in sctp_usrreq. Also optimize
  when we close sending out the data and disabling Nagle.
- Change key concatenation order to match the auth RFC
- When sending OOTB shutdown_complete always do csum.
- Don't send PKT-DROP to a PKT-DROP
- For abort chunks just always checksums same for
  shutdown-complete.
- inpcb_free front state had a bug where in queue
  data could wedge an assoc. We need to just abandon
  ones in front states (free_assoc).
- If a peer sends us a 64k abort, we would try to
  assemble a response packet which may be larger than
  64k. This then would be dropped by IP. Instead make
  a "minimum" size for us 64k-2k (we want at least
  2k for our initack). If we receive such an init
  discard it early without all the processing.
- When we peel off we must increment the tcb ref count
  to keep it from being freed from underneath us.
- handling fwd-tsn had bugs that caused memory overwrites
  when given faulty data, fixed so can't happen and we
  also stop at the first bad stream no.
- Fixed so comm-up generates the adaption indication.
- peeloff did not get the hmac params copied.
- fix it so we lock the addr list when doing src-addr selection
  (in future we need to use a multi-reader/one writer lock here)
- During lowlevel output, we could end up with a _l_addr set
  to null if the iterator is calling the output routine. This
  means we would possibly crash when we gather the MTU info.
  Fix so we only do the gather where we have a src address
  cached.
- we need to be sure to set abort flag on conn state when
  we receive an abort.
- peeloff could leak a socket. Moved code so the close will
  find the socket if the peeloff fails (uipc_syscalls.c)

Approved by:	re@freebsd.org(Ken Smith)
2007-08-27 05:19:48 +00:00
Konstantin Belousov
5114048b63 Destroy the kaio_mtx on the freeing the struct kaioinfo in the
aio_proc_rundown.

Do not allow for zero-length read to be passed to the fo_read file method
by aio.

Reported and tested by:	Peter Holm
Approved by:	re (kensmith)
2007-08-20 11:53:26 +00:00
Jeff Roberson
67e20930bd - Improve runq_findbit_from() which is used by ULE's circular queue. Mask
of the bits we want to ignore on the first pass rather than doing a
   linear scan.  This puts us within a few instructions of the cost of
   runq_findbit() and removes this function from the top of profiling output
   for context switch heavy workloads.

Approved by:	re
2007-08-20 06:36:12 +00:00
Jeff Roberson
9862717afe - Set steal_thresh to log2(ncpus). This improves idle-time load balancing
on 2cpu machines by reducing it to 1 by default.  This improves loaded
   operation on 8cpu machines by increasing it to 3 where the extra idle
   time is not as critical.

Approved by:	re
2007-08-20 06:34:20 +00:00
Nate Lawson
62db376af3 Always call sched_bind(), even if on the CPU in question. It is wrong to
check if we're already on that cpu and skip the bind since the thread could
be migrated off in the meantime.

Suggested by:	jeff
Approved by:	re
2007-08-20 06:28:26 +00:00
Nate Lawson
2145b9d207 Use a different loop variable for the inner loop. This previous reuse could
have caused a hang, but we got lucky with the available multi-CPU states
on actual hardware.

Submitted by:	Bjorn Koenig <bkoenig / alpha-tierchen.de>
Approved by:	re
MFC after:	3 days
2007-08-19 20:34:13 +00:00
David Xu
6ec46f7aa8 Regenerate.
Approved by: re(kensmith)
2007-08-16 05:32:26 +00:00
David Xu
0b1f0611b4 Add thr_kill2 syscall which sends a signal to a thread in another process.
Submitted by: Tijl Coosemans tijl at ulyssis dot org
Approved by: re (kensmith)
2007-08-16 05:26:42 +00:00
John Baldwin
1dc5b1cc56 On 6.x this works:
% mount | grep home
/dev/ad4s1e on /home (ufs, local, noatime, soft-updates)
% mount -u -o atime /home
% mount | grep home
/dev/ad4s1e on /home (ufs, local, soft-updates)

Restore this behavior for on 7.x for the following mount options:
noatime, noclusterr, noclusterw, noexec, nosuid, nosymfollow

In addition, on 7.x, the following are equivalent:
mount -u -o atime /home
mount -u -o nonoatime /home

Ideally, when we introduce new mount options, we should avoid
options starting with "no". :)

Requested by:	jhb
Reported by:	Karol Kwiat <karol.kwiat gmail com>, Scott Hetzel <swhetzel gmail com>
Approved by:	re (bmah)
Proxy commit for:	rodrigc
2007-08-15 17:40:09 +00:00
Pawel Jakub Dawidek
354eb80141 Improve vn_printf() by:
- adding missing vnode flags,
- printing unknown flags as numbers,
- using strlcat() instead of strcat().

Approved by:	re (bmah)
2007-08-13 21:23:30 +00:00
Konstantin Belousov
004e08be60 Do not call free() while holding vnode interlock.
Reported and tested by:	Peter Holm
Reviewed by:	jeff
Approved by:	re (kensmith)
2007-08-07 09:04:50 +00:00
Robert Watson
0bf686c125 Remove the now-unused NET_{LOCK,UNLOCK,ASSERT}_GIANT() macros, which
previously conditionally acquired Giant based on debug.mpsafenet.  As that
has now been removed, they are no longer required.  Removing them
significantly simplifies error-handling in the socket layer, eliminated
quite a bit of unwinding of locking in error cases.

While here clean up the now unneeded opt_net.h, which previously was used
for the NET_WITH_GIANT kernel option.  Clean up some related gotos for
consistency.

Reviewed by:	bz, csjp
Tested by:	kris
Approved by:	re (kensmith)
2007-08-06 14:26:03 +00:00
Jeff Roberson
3a78f9658b - Fix one line that erroneously crept in my last commit.
Approved by:	re
2007-08-04 01:21:28 +00:00
Jeff Roberson
c47f202b45 - Share scheduler locks between hyper-threaded cores to protect the
tdq_group structure.  Hyper-threaded cores won't really benefit from
   seperate locks anyway.
 - Seperate out the migration case from sched_switch to simplify the main
   switch code.  We only migrate here if called via sched_bind().
 - When preempted place the preempted thread back in the same queue at
   the head.
 - Improve the cpu group and topology infrastructure.

Tested by:	many on current@
Approved by:	re
2007-08-03 23:38:46 +00:00
Jeff Roberson
413ea6f543 - Set SW_PREEMPT when we preempt in critical_exit().
Approved by:	re
2007-08-03 23:35:35 +00:00
Robert Watson
33d2bb9ca3 First in a series of changes to remove the now-unused Giant compatibility
framework for non-MPSAFE network protocols:

- Remove debug_mpsafenet variable, sysctl, and tunable.
- Remove NET_NEEDS_GIANT() and associate SYSINITSs used by it to force
  debug.mpsafenet=0 if non-MPSAFE protocols are compiled into the kernel.
- Remove logic to automatically flag interrupt handlers as non-MPSAFE if
  debug.mpsafenet is set for an INTR_TYPE_NET handler.
- Remove logic to automatically flag netisr handlers as non-MPSAFE if
  debug.mpsafenet is set.
- Remove references in a few subsystems, including NFS and Cronyx drivers,
  which keyed off debug_mpsafenet to determine various aspects of their own
  locking behavior.
- Convert NET_LOCK_GIANT(), NET_UNLOCK_GIANT(), and NET_ASSERT_GIANT into
  no-op's, as their entire behavior was determined by the value in
  debug_mpsafenet.
- Alias NET_CALLOUT_MPSAFE to CALLOUT_MPSAFE.

Many remaining references to NET_.*_GIANT() and NET_CALLOUT_MPSAFE are still
present in subsystems, and will be removed in followup commits.

Reviewed by:	bz, jhb
Approved by:	re (kensmith)
2007-07-27 11:59:57 +00:00
Attilio Rao
34ed040030 Actually, upcalls cannot be freed while destroying the thread because we
should call uma_zfree() with various spinlock helds.  Rearranging the
code would not help here because we cannot break atomicity respect
prcess spinlock, so the only one choice we have is to defer the operation.
In order to do this use a global queue synchronized through the kse_lock
spinlock which is freed at any thread_alloc() / thread_wait() through a
call to thread_reap().
Note that this approach is not ideal as we should want a per-process
list of zombie upcalls, but it follows initial guidelines of KSE authors.

Tested by: jkim, pav
Approved by: jeff, julian
Approved by: re
2007-07-27 09:21:18 +00:00
Pawel Jakub Dawidek
57fd3d5572 When we do open, we should lock the vnode exclusively. This fixes few races:
- fifo race, where two threads assign v_fifoinfo,
- v_writecount modifications,
- v_object modifications,
- and probably more...

Discussed with:	kib, ups
Approved by:	re (rwatson)
2007-07-26 16:58:09 +00:00
Pawel Jakub Dawidek
68c1a246ae The v_mountedhere field is protected by the vnode lock, not vnode's internal
lock.

Approved by:	re (rwatson)
2007-07-26 16:52:57 +00:00
Attilio Rao
758b17a100 upcall_free() was only used in kse_GC() which has been removed so it now
results unused; this, with -Werror option of gcc, rise a warning for gcc
which let the buildkernel to be busted.
Fix this removing upcall_free().

Reported by: various
Approved by: jeff
Approved by: re
Pointy hat to: attilio
2007-07-23 23:16:53 +00:00
Attilio Rao
ac8094e4e3 Actually, KSE kernel bits locking is broken and can lead likely to
dangerous races.
Fix this problems adding correct locking for the members of 'struct
kse_upcall' and other struct proc/struct thread related members.
For the moment, just leave ku_mflag and ku_flags "lazy" locked.
While here, cleanup the code removing the function kse_GC() (unused),
and merging upcall_link(), upcall_unlink(), upcall_stash() in their
respective callers (static functions, very short and only called in one
place).

Reported by: pav
Tested by: pav (on some pointyhat cluster nodes)
Approved by: jeff
Approved by: re
Sponsorized by: NGX Italy (http://www.ngx.it)
2007-07-23 14:52:22 +00:00
David Malone
6d8617d42a If clock_ct_to_ts fails to convert time time from the real time clock,
print a one line error message. Add some comments on not being able to
trust the day of week field (I'll act on these comments in a follow up
commit).

Approved by:	re
MFC after:	3 weeks
2007-07-23 09:42:32 +00:00
Konstantin Belousov
e69aee3117 ttyfree() frees the cdev(). But if there are pending kevents,
filt_ttyrdetach() etc would later attempt to dereference cdev->si_tty,
causing a 0xdeadc0de dereference.  Change kn_hook value from cdev to
struct tty to avoid dereferencing freed cdev.

In ttygone(), wake up select(), sigio and kevent() users in addition
to the queue sleepers.

Return EV_EOF from kevent filters if TS_GONE is set.

Submitted by:	peter
Tested by:	Peter Holm
Approved by:	re (kensmith)
MFC after:	2 weeks
2007-07-20 09:41:54 +00:00
Attilio Rao
6aa294be2c Fix some problems with lock profiling in rw locks:
- Adjust lock_profiling stubs semantic in the hard functions in order to be
  more accurate and trustable
- As for sx locks, disable shared paths for lock_profiling.  Actually,
  lock_profiling has a subtle race which makes results caming from shared
  paths not completely trustable. A macro stub (LOCK_PROFILING_SHARED) can
  be actually used for re-enabling this paths, but is currently intended
  for developing use only.
- style(9) fixes

Approved by: jeff, kmacy, jhb[1]
Approved by: re

[1] Had initial reservations not shared by others, conceded
    in the end.
2007-07-20 08:43:42 +00:00
Jeff Roberson
28994a5852 - Refine the load balancer to improve buildkernel times on dual core
machines.
 - Leave the long-term load balancer running by default once per second.
 - Enable stealing load from the idle thread only when the remote processor
   has more than two transferable tasks.  Setting this to one further
   improves buildworld.  Setting it higher improves mysql.
 - Remove the bogus pick_zero option.  I had not intended to commit this.
 - Entirely disallow migration for threads with SRQ_YIELDING set.  This
   balances out the extra migration allowed for with the load balancers.
   It also makes pick_pri perform better as I had anticipated.

Tested by:	Dmitry Morozovsky <marck@rinet.ru>
Approved by:	re
2007-07-19 20:03:15 +00:00
Jeff Roberson
08c9a16c4f - When newtd is specified to sched_switch() it was not being initialized
properly.  We have to temporarily unlock the TDQ lock so we can lock
   the thread and add it to the run queue.  This is used only for KSE.
 - When we add a thread from the tdq_move() via sched_balance() we need to
   ipi the target if it's sitting in the idle thread or it'll never run.

Reported by:	Rene Landan
Approved by:	re
2007-07-19 19:51:45 +00:00
Jeff Roberson
56696bd1ab - Remove explicit references to sched_lock. A simpler assert will do.
Approved by:	re
2007-07-19 08:58:40 +00:00
Jeff Roberson
6eeb364b4c - Calling sched_nice() in tdsigwakeup() is no longer required by ULE and
actually causes LORs and other panics.

Reported by:	mlaier
Approved by:	re
2007-07-19 08:49:16 +00:00
Jeff Roberson
6ea38de8aa - Remove the global definition of sched_lock in mutex.h to break
new code and third party modules which try to depend on it.
 - Initialize sched_lock in sched_4bsd.c.
 - Declare sched_lock in sparc64 pmap.c and assert that we're compiling
   with SCHED_4BSD to prevent accidental crashes from running ULE.  This
   is the sole remaining file outside of the scheduler that uses the
   global sched_lock.

Approved by:	re
2007-07-18 20:46:06 +00:00
Jeff Roberson
773890b9a8 - Add the proper lock profiling calls to _thread_lock().
Obtained from:	kipmacy
Approved by:	re
2007-07-18 20:38:13 +00:00
Jeff Roberson
ae7a6b38d5 ULE 3.0: Fine grain scheduler locking and affinity improvements. This has
been in development for over 6 months as SCHED_SMP.
 - Implement one spin lock per thread-queue.  Threads assigned to a
   run-queue point to this lock via td_lock.
 - Improve the facility for assigning threads to CPUs now that sched_lock
   contention no longer dominates scheduling decisions on larger SMP
   machines.
 - Re-write idle time stealing in an attempt to make it less damaging to
   general performance.  This is still disabled by default. See
   kern.sched.steal_idle.
 - Call the long-term load balancer from a callout rather than sched_clock()
   so there are no locks held.  This is disabled by default.  See
   kern.sched.balance.
 - Parameterize many scheduling decisions via sysctls.  Try to document
   these via sysctl descriptions.
 - General structural and naming cleanups.
 - Document each function with comments.

Tested by:	current@ amd64, x86, UP, SMP.
Approved by:	re
2007-07-17 22:53:23 +00:00
Jeff Roberson
fb62eea266 - Use ruxagg() in calcru() to make sure we have current tick information
from all threads.

Discussed with:	bde, attilio
Approved by:	re
2007-07-17 01:08:09 +00:00
Craig Rodrigues
d7f81adbd4 Revert previous commits which I committed by mistake.
Approved by:	re (implicit)
Pointy hat to:	me
2007-07-14 21:23:31 +00:00
Craig Rodrigues
d678780e60 The last entry in the ext2_opts array must be NULL,
otherwise the kernel with crash in vfs_filteropt() if an invalid
mount option is passed to ext2fs.

Approved by:	re (kensmith)
2007-07-14 21:18:19 +00:00
John Baldwin
59d8f3ff08 Fix a couple of issues with the stack limit for 32-bit processes on 64-bit
kernels exposed by the recent fixes to resource limits for 32-bit processes
on 64-bit kernels:
- Let ABIs expose their maximum stack size via a new pointer in sysentvec
  and use that in preference to maxssiz during exec() rather than always
  using maxssiz for all processses.
- Apply the ABI's limit fixup to the previous stack size when adjusting
  RLIMIT_STACK to determine if the existing mapping for the stack needs to
  be grown or shrunk (as well as how much it should be grown or shrunk).

Approved by:	re (kensmith)
2007-07-12 18:01:31 +00:00
Attilio Rao
c1a6d9fa42 Fix some problems with lock_profiling in sx locks:
- Adjust lock_profiling stubs semantic in the hard functions in order to be
  more accurate and trustable
- Disable shared paths for lock_profiling.  Actually, lock_profiling has a
  subtle race which makes results caming from shared paths not completely
  trustable. A macro stub (LOCK_PROFILING_SHARED) can be actually used for
  re-enabling this paths, but is currently intended for developing use only.
- Use homogeneous names for automatic variables in hard functions regarding
  lock_profiling
- Style fixes
- Add a CTASSERT for some flags building

Discussed with: kmacy, kris
Approved by: jeff (mentor)
Approved by: re
2007-07-06 13:20:44 +00:00
Konstantin Belousov
196a7385ac Revert destroy_dev() to the state before destroy_dev_sched() was introduced.
Attempt to spawn destroy_dev_sched() from it causes inadmissible races.

Requested by:	tegge
Approved by:	re (kensmith)
2007-07-05 13:04:59 +00:00
Bjoern A. Zeeb
f43455fd89 Remove netkey directory from cscope/TAGs generation and replace
it with netipsec now that KAME IPsec is gone.
While here add missing netinet6 directories.

Add comments about the ports needed to be able to run those targets.

Reviewed by:	philip
Approved by:	re (rwatson)
2007-07-05 08:55:14 +00:00
Peter Wemm
22af4cab91 Fix bad function type passed to destroy_dev_sched_cb().
Approved by:  re (rwatson)
2007-07-05 05:54:47 +00:00
Peter Wemm
c2815ad564 Add freebsd6_ wrappers for mmap/lseek/pread/pwrite/truncate/ftruncate
Approved by: re (kensmith)
2007-07-04 22:57:21 +00:00
Peter Wemm
552fbe752f Regenerate after mmap/lseek/etc syscall changes.
Approved by:  re (kensmith)
2007-07-04 22:49:55 +00:00
Peter Wemm
51504d9ac4 Create new syscalls for mmap(), lseek(), pread(), pwrite(), truncate() and
ftruncate(), but without the pad arg.

There are several reasons for this.  Consider 'mmap()'.  On AMD64, the
function call (and syscall) ABI allow for 6 register arguments.  Additional
arguments go on the stack.  mmap(2) has 6 arguments.  However, the syscall
definition has an extra 'int pad' argument.  This pushes it to 7 arguments,
which means one must spill into the memory stack.  Since the kernel API
doesn't match userland API, we have a hack in libc - libc/sys/mmap.c.
This implements the userland API by calling __syscall() with an extra
argument and the pad argument, for a total of 8 args.  This is all
unnecessary and inconvenient for several things, including the kernel's
syscall handler code which now has to handle merging stack arguments with
register arguments.  It is a big deal for certain 3rd party code.

I'm adding libc glue to make the transition totally painless.  I had
intended to mark the old syscalls as COMPAT6, but the potential to shoot
your feet by building a new kernel without COMPAT_FREEBSD6 but with a
slighly older userland was too great.  For now, they have manual
"freebsd6_" prefixes rather than being COMPAT6.  They will go back to
being marked 'COMPAT6' after 7-stable starts.

Approved by: re (kensmith)
2007-07-04 22:47:37 +00:00
Peter Wemm
9f0482e515 Add support for COMPAT6 syscalls.
Also, change the visibility of compat syscalls a slightly.  Compat
syscalls were missing from 'syscalls.h' entirely.  This additionally adds
them with their compat prefix.  eg: SYS_freebsd6_mmap.

Also, the syscalls.c names strings have different prefixes to differentiate
syscalls. Instead of several "old.mmap" strings, there will now be a
"compat.mmap" and "compat6.mmap" etc.  Before, both would have had the
same "old.mmap" label.

Approved by:  re
2007-07-04 22:38:28 +00:00
Konstantin Belousov
09828ba947 Since cdev mutex is after system map mutex in global lock order, free()
shall not be called while holding cdev mutex. devfs_inos unrhdr has cdev as
mutex, thus creating this LOR situation.

Postpone calling free() in kern/subr_unit.c:alloc_unr() and nested functions
until the unrhdr mutex is dropped. Save the freed items on the ppfree list
instead, and provide the clean_unrhdrl() and clean_unrhdr() functions to
clean the list.
Call clean_unrhdrl() after devfs_create() calls immediately before
dropping cdev mutex. devfs_create() is the only user of the alloc_unrl()
in the tree.

Reviewed by:	phk
Tested by:	Peter Holm
LOR:	80
Approved by:	re (kensmith)
2007-07-04 06:56:58 +00:00
Jeff Roberson
f6c1ecca50 - Use explicit locking in the various fcntl case statements so that we
can acquire shared filedescriptor locks in the appropriate cases.
 - Remove Giant from calls that issue ioctls.  The ioctl path has been
   mpsafe for some time now.
 - Only acquire giant for VOP_ADVLOCK when the filesystem requires giant.
   advlock is now mpsafe.

Reviewed by:	rwatson
Approved by:	re
2007-07-03 21:26:06 +00:00
Jeff Roberson
bc02f1d98d - Remove explicit Giant protection from lockf. Use the vnode interlock
to protect this datastructure instead.
 - Preallocate an extra lockf structure in case we want to split a lock
   on insert or delete.
 - msleep() on the vnode interlock when blocking on a lock.

Reviewed by:	rwatson
Approved by:	re
2007-07-03 21:22:58 +00:00
John Baldwin
fb1faf2082 Tweak the low-level MI SMP code some:
- Use cpu_spinwait() in the spin loops in stop_cpus(), restart_cpus(), and
  smp_rendezvous_action().
- Remove unneeded acq memory barriers in stop_cpus(), restart_cpus(), and
  smp_rendezvous_action().
- Add an additional synch point in smp_rendezvous() to ensure that all the
  CPUs will always see an up-to-date value of smp_rv_setup_func.

Reviewed by:	attilio
Approved by:	re (kensmith)
Tested on:	alpha, amd64, i386, sparc64 SMP (for several years)
2007-07-03 18:37:06 +00:00
Konstantin Belousov
9d53363bc8 Rev. 1.204 and 1.205 got an erronous version of destroy_dev() that
calls destroy_dev_sched() with cdev mutex locked. Commit the code
that was actually tested.

Pointy hat to:	kib
Approved by:	re (implicit)
2007-07-03 18:18:30 +00:00
Konstantin Belousov
f5baf8d66b Lock Giant and proctree lock around dereferencing p_session->s_ttyvp->v_rdev.
Lock cdev mutex too to close the race with tty being freed.
Relock clone_drain_lock to prevent the LOR with proctree lock, thus
add #include <fs/devfs/devfs_int.h>.

Suggested by:	tegge
Debugging help and testing by:	Peter Holm
Approved by:	re (kensmith)
2007-07-03 17:46:37 +00:00
Konstantin Belousov
8a5d7ef25c Use make_dev_credf(MAKEDEV_REF) instead of make_dev() from pty clone handler.
Debugging help and testing by:	Peter Holm
Approved by:	re (kensmith)
2007-07-03 17:45:52 +00:00
Konstantin Belousov
0a9c2b6db8 Use make_dev_credf(MAKEDEV_REF) instead of make_dev() from the clone handler.
Lock Giant in the clone handler.
Use destroy_dev_sched() explicitely from pty_maybecleanup() and postpone
pty_release() until both master and slave cdevs are destroyed by setting
it as callback for destroy_dev_sched().

Debugging help and testing by:	Peter Holm
Approved by:	re (kensmith)
2007-07-03 17:44:59 +00:00
Konstantin Belousov
6f0281937b Automatically detect deadlock condition in destroy_dev(), that is, if
destroy_dev() is called from csw method, and no d_purge driver method is
provided. Transform the direct call to destroy_dev() into destroy_dev_sched().

Reviewed by:	njl (programming interface)
Debugging help and testing by:	Peter Holm
Approved by:	re (kensmith)
2007-07-03 17:43:20 +00:00
Konstantin Belousov
de10ffa527 Since rev. 1.199 of sys/kern/kern_conf.c, the thread that calls
destroy_dev() from d_close() cdev method would self-deadlock.
devfs_close() bump device thread reference counter, and destroy_dev()
sleeps, waiting for si_threadcount to reach zero for cdev without
d_purge method.

destroy_dev_sched() could be used instead from d_close(), to
schedule execution of destroy_dev() in another context. The
destroy_dev_sched_drain() function can be used to drain the scheduled
calls to destroy_dev_sched(). Similarly, drain_dev_clone_events() drains
the events clone to make sure no lingering devices are left after
dev_clone event handler deregistered.

make_dev_credf(MAKEDEV_REF) function should be used from dev_clone
event handlers instead of make_dev()/make_dev_cred() to ensure that created
device has reference counter bumped before cdev mutex is dropped inside
make_dev().

Reviewed by:	tegge (early versions), njl (programming interface)
Debugging help and testing by:	Peter Holm
Approved by:	re (kensmith)
2007-07-03 17:42:37 +00:00
Konstantin Belousov
7aee5992a5 Relock the sema_mtxp unconditionally after copyin() for SETALL case in
kern_semctl. Otherwise, later mtx_unlock() can operate on unlocked mutex.

Submitted by:	rdivacky
MFC after:	3 days
Approved by:	re (kensmith)
2007-07-03 15:58:47 +00:00
Robert Watson
bc6eca2432 Continue kernel privilege cleanup for 7.0: unstaticize suser_enabled and
stop declaring it in systm.h -- it's used only in kern_priv.c and is not
required elsewhere.

Approved by:	re (kensmith)
2007-07-02 14:03:29 +00:00
Randall Stewart
b8709d23c5 - Add some needed error checking on bad fd passing in the sctp
syscalls.
Approved by:	re@freebsd.org (Ken Smith)
Obtained from:	Weongyo Jeong (weongyo.jeong@gmail.com)
2007-07-02 12:50:53 +00:00
Jeff Roberson
03d03260b2 - Use rufetchcalc() rather than calcru() in ttyinfo so that we get
correct system and user time stats.

Approved by:	re
Reported by:	kris
Discussed with:	Attilio
2007-07-01 00:17:59 +00:00
Robert Watson
dc2e1e3fae Use vm_offset_t for kmembase and kmemlimit rather than char *, avoiding
unnecessary casts, and making it possible to compile kern_malloc.c with
strict aliasing.

Submitted by:	rdivacky
Approved by:	re (kensmith)
2007-06-27 13:39:38 +00:00
Attilio Rao
6a0ce57d10 Fix an old standing LOR between callout_lock and sleepqueues chain (which
could lead to a deadlock).
- sleepq_set_timeout acquires callout_lock (via callout_reset()) only
  with sleepq chain lock held
- msleep_spin in _callout_stop_safe lock the sleepqueue chain with
  callout_lock held

In order to solve this don't use msleep_spin in _callout_stop_safe() but
use directly sleepqueues as inline msleep_spin code. Rearrange the
wakeup path in order to have it consistent too.

Reported by: kris (via stress2 test suite)
Tested by: Timothy Redaelli <drizzt@gufi.org>
Reviewed by: jhb
Approved by: jeff (mentor)
Approved by: re
2007-06-26 21:42:01 +00:00
Attilio Rao
f08945a7d2 Introduce a new rwlocks initialization function: rw_init_flags.
This is very similar to sx_init_flags: it initializes the rwlock using
special flags passed as third argument (RW_DUPOK, RW_NOPROFILE,
RW_NOWITNESS, RW_QUIET, RW_RECURSE).
Among these, the most important new feature is probabilly that rwlocks
can be acquired recursively now (for both shared and exclusive paths).

Because of the recursion counter, the ABI is changed.

Tested by: Timothy Redaelli <drizzt@gufi.org>
Reviewed by: jhb
Approved by: jeff (mentor)
Approved by: re
2007-06-26 21:31:56 +00:00
Rong-En Fan
534046e301 - Remove UMAP filesystem. It was disconnected from build three years ago,
and it is seriously broken.

Discussed on:   freebsd-arch@
Approved by:	re (mux)
2007-06-25 05:06:57 +00:00
Konstantin Belousov
9bc911d4a2 devfs_free() calls free_unr(), that may sleep.
Postpone call to devfs_free() after cdev mutex is dropped. Reuse
cdp_list link for queuing devices awaiting deletion in the
cdevp_free_list.

Reported by:	Hans Petter Selasky <hselasky c2i net>
Tested by:	Peter Holm
Approved by:	re (kensmith)
MFC after:	2 weeks
2007-06-19 13:19:23 +00:00
Konstantin Belousov
7550e3eac4 Add the witness warning for free_unr. Function could sleep, thus callers
shall not have any non-sleepable locks held.

Submitted by:	Hans Petter Selasky <hselasky c2i net>
Approved by:	re (kensmith)
2007-06-19 13:13:17 +00:00
Pawel Jakub Dawidek
dfe97ff4a5 We only flush entries related to the given file system. Currently there are
no 'invalid' cache entires - file system is responsible for keeping it that
way. The comment should have been updated in rev.1.25.
2007-06-18 09:28:24 +00:00
Robert Watson
7251b7863c Rather than passing SUSER_RUID into priv_check_cred() to specify when
a privilege is checked against the real uid rather than the effective
uid, instead decide which uid to use in priv_check_cred() based on the
privilege passed in.  We use the real uid for PRIV_MAXFILES,
PRIV_MAXPROC, and PRIV_PROC_LIMIT.  Remove the definition of
SUSER_RUID; there are now no flags defined for priv_check_cred().

Obtained from:	TrustedBSD Project
2007-06-16 23:41:43 +00:00
Marius Strobl
79be8b5082 - Remove zstty spin lock for no longer existing zs(4).
- Move the rtc_mtx spin lock out from under #ifdef SMP as it's just
  not SMP-specific.
- Add a new spin lock pcib_mtx for locking "fast" interrupt handlers
  of host-to-PCI bridge drivers on sparc64.
2007-06-16 23:30:57 +00:00
Jeff Roberson
dda713dfb8 - Fix an off by one error in sched_pri_range.
- In tdq_choose() only assert that a thread does not have too high a
   priority (low value) for the queue we removed it from.  This will catch
   bugs in priority elevation.  It's not a serious error for the thread
   to have too low a priority as we don't change queues in this case as
   an optimization.

Reported by:	kris
2007-06-15 19:33:58 +00:00
Robert Watson
7e273744a6 Remove the restriction that rtprio(2) cannot be used to set the realtime
or idle priority of another process owned by the same user.  This means
that privilege in rtprio(2) (and rtprio_thread(2)) is required indirectly
via p_cansched(9) or directly to set realtime/idle privilege, rather than
directly affecting target process authorization.
2007-06-14 23:31:52 +00:00
Robert Watson
b4be6ef22f Only require privilege to set the current time adjustment, not in order to
query it.
2007-06-14 18:37:58 +00:00
Robert Watson
3805385e3d Spell statistics more correctly in comments. 2007-06-14 03:02:33 +00:00
John Baldwin
34a9edafbc Improve the ktrace locking somewhat to reduce overhead:
- Depessimize userret() in kernels where KTRACE is enabled by doing an
  unlocked check of the per-process queue of pending events before
  acquiring any locks.  Previously ktr_userret() unconditionally acquired
  the global ktrace_sx lock on every return to userland for every thread,
  even if ktrace wasn't enabled for the thread.
- Optimize the locking in exit() to first perform an unlocked read of
  p_traceflag to see if ktrace is enabled and only acquire locks and
  teardown ktrace if the test succeeds.  Also, explicitly disable tracing
  before draining any pending events so the pending events actually get
  written out.  The unlocked read is safe because proc lock is acquired
  earlier after single-threading so p_traceflag can't change between then
  and this check (well, it can currently due to a bug in ktrace I will fix
  next, but that race existed prior to this change as well).

Reviewed by:	rwatson
2007-06-13 20:01:42 +00:00
John Baldwin
ce0be64687 Conditionally acquire Giant when dropping a reference on the ktrace vnode
during execve() when turning off tracing due to executing a setuid binary
as non-root.  Previously this could fail to acquire Giant and fail an
assertion if the ktrace file was on a non-MPSAFE filesystem and the
executable was on an MPSAFE filesystem.

MFC after:	3 days
Reported by:	kris
2007-06-13 19:41:47 +00:00
Jeff Roberson
3036ab79e3 - Include opt_sched.h for SCHED_STATS. 2007-06-12 23:27:31 +00:00
Jeff Roberson
671f2709ae - Garbage collect unused concurrency functions. 2007-06-12 19:50:31 +00:00
Jeff Roberson
e7c8d2e9fe - Garbage collect unused concurrency functions.
- Remove unused kse fields from struct proc.
 - Group remaining fields and #ifdef KSE them.
 - Move some kern_kse.c only prototypes out of proc and into kern_kse.

Discussed with:	Julian
2007-06-12 19:49:39 +00:00
Jeff Roberson
fe54587ffa - Move some common code out of sched_fork_exit() and back into fork_exit(). 2007-06-12 07:47:09 +00:00
Jeff Roberson
ff8fbcffcb Solve a complex exit race introduced with thread_lock:
- Add a count of exiting threads, p_exitthreads, to struct proc.
 - Increment p_exithreads when we set the deadthread in thread_exit().
 - When we thread_stash() a deadthread use an atomic to drop the count.
 - Spin until the p_exithreads count reaches 0 in thread_wait().
 - Lock the last exiting thread momentarily to be certain that it has
   exited cpu_throw().
 - Restructure thread_wait().  It does not need a loop as there will only
   ever be one thread.

Tested by:	moose@opera.com
Reported by:	kris, moose@opera.com
2007-06-12 07:24:46 +00:00
Robert Watson
32f9753cfb Eliminate now-unused SUSER_ALLOWJAIL arguments to priv_check_cred(); in
some cases, move to priv_check() if it was an operation on a thread and
no other flags were present.

Eliminate caller-side jail exception checking (also now-unused); jail
privilege exception code now goes solely in kern_jail.c.

We can't yet eliminate suser() due to some cases in the KAME code where
a privilege check is performed and then used in many different deferred
paths.  Do, however, move those prototypes to priv.h.

Reviewed by:	csjp
Obtained from:	TrustedBSD Project
2007-06-12 00:12:01 +00:00
Jeff Roberson
efe641b939 - Add a missing PROC_SUNLOCK() in tdsignal() 2007-06-11 23:27:03 +00:00
Olivier Houchard
e411ce026a Re-acquire the PROC_SLOCK before calling calcru(), and release it after,
since calcru() expects it to be locked.

Reviewed by:	attilio
2007-06-11 21:05:41 +00:00
Sam Leffler
68e8e04e93 Update 802.11 wireless support:
o major overhaul of the way channels are handled: channels are now
  fully enumerated and uniquely identify the operating characteristics;
  these changes are visible to user applications which require changes
o make scanning support independent of the state machine to enable
  background scanning and roaming
o move scanning support into loadable modules based on the operating
  mode to enable different policies and reduce the memory footprint
  on systems w/ constrained resources
o add background scanning in station mode (no support for adhoc/ibss
  mode yet)
o significantly speedup sta mode scanning with a variety of techniques
o add roaming support when background scanning is supported; for now
  we use a simple algorithm to trigger a roam: we threshold the rssi
  and tx rate, if either drops too low we try to roam to a new ap
o add tx fragmentation support
o add first cut at 802.11n support: this code works with forthcoming
  drivers but is incomplete; it's included now to establish a baseline
  for other drivers to be developed and for user applications
o adjust max_linkhdr et. al. to reflect 802.11 requirements; this eliminates
  prepending mbufs for traffic generated locally
o add support for Atheros protocol extensions; mainly the fast frames
  encapsulation (note this can be used with any card that can tx+rx
  large frames correctly)
o add sta support for ap's that beacon both WPA1+2 support
o change all data types from bsd-style to posix-style
o propagate noise floor data from drivers to net80211 and on to user apps
o correct various issues in the sta mode state machine related to handling
  authentication and association failures
o enable the addition of sta mode power save support for drivers that need
  net80211 support (not in this commit)
o remove old WI compatibility ioctls (wicontrol is officially dead)
o change the data structures returned for get sta info and get scan
  results so future additions will not break user apps
o fixed tx rate is now maintained internally as an ieee rate and not an
  index into the rate set; this needs to be extended to deal with
  multi-mode operation
o add extended channel specifications to radiotap to enable 11n sniffing

Drivers:
o ath: add support for bg scanning, tx fragmentation, fast frames,
       dynamic turbo (lightly tested), 11n (sniffing only and needs
       new hal)
o awi: compile tested only
o ndis: lightly tested
o ipw: lightly tested
o iwi: add support for bg scanning (well tested but may have some
       rough edges)
o ral, ural, rum: add suppoort for bg scanning, calibrate rssi data
o wi: lightly tested

This work is based on contributions by Atheros, kmacy, sephe, thompsa,
mlaier, kevlo, and others.  Much of the scanning work was supported by
Atheros.  The 11n work was supported by Marvell.
2007-06-11 03:36:55 +00:00
Attilio Rao
393a081d42 Optimize vmmeter locking.
In particular:
- Add an explicative table for locking of struct vmmeter members
- Apply new rules for some of those members
- Remove some unuseful comments

Heavily reviewed by: alc, bde, jeff
Approved by: jeff (mentor)
2007-06-10 21:59:14 +00:00
Matt Jacob
a659386c7e Remove unused variable. 2007-06-10 01:50:05 +00:00
Matt Jacob
26756b7a58 The new compiler can't quite follow the logic of has_stime and
complains about using uninitialized tags in stime.
2007-06-10 01:49:17 +00:00
Matt Jacob
9b73d2396a Initialized ets to zero. This is arguably a gcc bug in that ets is always
set to rts when timeout is non-NULL and then timevalid is set and ets is
only checked later when timervalid is set.
2007-06-10 01:43:11 +00:00
Attilio Rao
bdf08be439 Fix a bug caming from the committing a pre-merge version of the patch
instead than a post-merge version (respect to another rusage fix).

Reported by: marcel
Approved by: jeff(mentor)
2007-06-10 00:28:41 +00:00
Marcel Moolenaar
55b5660de4 Work around an integer overflow in expression `3 * maxbufspace / 4',
when maxbufspace is larger than INT_MAX / 3. The overflow causes a
hard hang on ia64 when physical memory is sufficiently large (8GB).
2007-06-09 23:41:14 +00:00
Attilio Rao
a1fe14bc33 rufetch and calcru sometimes should be called atomically together.
This patch fixes places where they should be called atomically changing
their locking requirements (both assume per-proc spinlock held) and
introducing rufetchcalc which wrappers both calls to be performed in
atomic way.

Reviewed by: jeff
Approved by: jeff (mentor)
2007-06-09 21:48:44 +00:00
Attilio Rao
86a49dea5b Since locking in kern/subr_prof.c is changed a bit, we need nomore of
time_lock spinlock exported.

Approved by: jeff (mentor)
2007-06-09 19:41:14 +00:00
Attilio Rao
a140976eb4 The current rusage code show peculiar problems:
- Unsafeness on ruadd() in thread_exit()
- Unatomicity of thread_exiit() in the exit1() operations

This patch addresses these problems allocating p_fd as part of the
process and modifying the way it is accessed.

A small chunk of this patch, resolves a race about p_state in kern_wait(),
since we have to be sure about the zombif-ing process.

Submitted by: jeff
Approved by: jeff (mentor)
2007-06-09 18:56:11 +00:00
Matt Jacob
65d32cd8fb Propagate volatile qualifier to make gcc4.2 happy. 2007-06-09 18:09:37 +00:00
Attilio Rao
e682569165 Remove the MUTEX_WAKE_ALL option and make it the default behaviour for our
mutexes.
Currently we alredy force MUTEX_WAKE_ALL beacause of some problems with the
!MUTEX_WAKE_ALL case (unavioidable priority inversion).
2007-06-08 21:36:52 +00:00
Poul-Henning Kamp
7acfb0af82 Double the WITNESS and DIAGNOSTIC benchmark warnings right before we
go into userland to improve the chances of people noticing them.
2007-06-08 11:47:36 +00:00
Xin LI
7b8c8b858c In getblk(), before gbincore(), use BO_LOCK directly when locking
the bufobj, rather than using VI_LOCK, like what was done with
revision 1.453.
2007-06-08 07:05:08 +00:00
Robert Watson
faef53711b Move per-process audit state from a pointer in the proc structure to
embedded storage in struct ucred.  This allows audit state to be cached
with the thread, avoiding locking operations with each system call, and
makes it available in asynchronous execution contexts, such as deep in
the network stack or VFS.

Reviewed by:	csjp
Approved by:	re (kensmith)
Obtained from:	TrustedBSD Project
2007-06-07 22:27:15 +00:00
John Baldwin
a66fde8d35 - Remove unused variable from create_thread().
- Move kern_thr_*() prototype to <sys/syscallsubr.h> where all the other
  kern_*() prototypes live.
2007-06-07 19:45:19 +00:00
David Xu
42ce445fed Backout experimental adaptive-spin umtx code. 2007-06-06 07:35:08 +00:00
Jeff Roberson
710eacdc5f - Placing the 'volatile' on the right side of the * in the td_lock
declaration removes the need for __DEVOLATILE().

Pointed out by:	tegge
2007-06-06 03:40:47 +00:00
Attilio Rao
d301eb10c7 Fix a problem with not-preemptive kernels caming from mis-merging of
existing code with the new thread_lock patch.
This also cleans up a bit unlock operation for mutexes.

Approved by: jhb, jeff(mentor)
2007-06-05 18:57:09 +00:00
Konstantin Belousov
b95b98b0bd Restore non-SMP build.
Reviewed by:	attilio
2007-06-05 14:20:13 +00:00
Jeff Roberson
95e3a0bca3 - Better fix for previous error; use DEVOLATILE on the td_lock pointer
it can actually sometimes be something other than sched_lock even on
   schedulers which rely on a global scheduler lock.

Tested by:	kan
2007-06-05 04:12:46 +00:00
Jeff Roberson
c219b097af - Pass &sched_lock as the third argument to cpu_switch() as this will
always be the correct lock and we don't get volatile warnings this
   way.

Pointed out by:	kan
2007-06-05 03:46:54 +00:00
Jeff Roberson
36b369163b - Define TDQ_ID() for the !SMP case.
- Default pick_pri to off.  It is not faster in most cases.
2007-06-05 02:53:51 +00:00
Jeff Roberson
8e0185f604 - Remove sched_core.c. The maintainer has lost interest in pursuing this
and it has been neglected in the recent ksegrp removal as well as
   the thread_lock() changes.

Discussed with:	davidxu
2007-06-05 00:12:37 +00:00
Jeff Roberson
982d11f836 Commit 14/14 of sched_lock decomposition.
- Use thread_lock() rather than sched_lock for per-thread scheduling
   sychronization.
 - Use the per-process spinlock rather than the sched_lock for per-process
   scheduling synchronization.

Tested by:      kris, current@
Tested on:      i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
2007-06-05 00:00:57 +00:00