10140 Commits

Author SHA1 Message Date
Robert Watson
088f584961 Remove unused variable td from sched_idletd().
MFC after:	3 days
Found with:	Coverity Prevent(tm)
CID:		3561
2007-11-05 12:01:12 +00:00
Konstantin Belousov
89b57fcf01 Fix for the panic("vm_thread_new: kstack allocation failed") and
silent NULL pointer dereference in the i386 and sparc64 pmap_pinit()
when the kmem_alloc_nofault() failed to allocate address space. Both
functions now return an error instead of panicking or dereferencing NULL.

As a consequence, vmspace_exec() and vmspace_unshare() now return an
errno int. A struct vmspace arg was added to vm_forkproc() to avoid dealing
with failed allocation when most of the fork1() job is already done.

The kernel stack for the thread is now set up in thread_alloc(),
which itself may return NULL. Also, allocation of the first process
thread is performed in fork1() to properly deal with stack
allocation failure. proc_linkup() is separated into proc_linkup(),
called from fork1(), and proc_linkup0(), which is used to set up the
kernel process (formerly known as swapper).

In collaboration with:	Peter Holm
Reviewed by:	jhb
2007-11-05 11:36:16 +00:00
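
The shape of the change, as a hedged sketch (the real vm_thread_new() also
sets up guard pages and the thread's kstack fields on success):

	/* vm_thread_new(): allocate a kernel stack, failing softly. */
	ks = kmem_alloc_nofault(kernel_map,
	    (pages + KSTACK_GUARD_PAGES) * PAGE_SIZE);
	if (ks == 0)
		return (0);	/* thread_alloc() now returns NULL instead of panicking */
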
Julian Elischer
f4bb4fc8f3 Completely remove the code for single-threading the mainline fork path.
Put in a little comment explaining why it went away.
Re-enable it in the case where an existing process is just splitting
off its address space and file descriptors.
(I don't think anything uses that code, but it needs some sort of locking
and this does the job.)

Reviewed by:	Davidxu, alc, others
MFC after:	3 days
2007-11-02 19:40:36 +00:00
Nate Lawson
a15e947d54 If we're on an SMP kernel and there is more than 1 CPU, reject any attempts
to change the freq before the other CPUs are active.  The current code
always attempts to change all CPUs to match each other, and the requisite
sched_bind() call won't work before APs are launched.
2007-10-30 22:18:08 +00:00
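
A minimal sketch of the guard described above, assuming the standard
smp_started/mp_ncpus kernel globals (the wrapper function and error value
are illustrative, not the committed diff):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/smp.h>

static int
cf_check_aps_ready(void)
{
	/* sched_bind() cannot bind to an AP that is not running yet. */
	if (mp_ncpus > 1 && !smp_started)
		return (ENXIO);
	return (0);
}
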
Julian Elischer
539976ffdf fix typo in code normally not compiled in. 2007-10-29 20:45:31 +00:00
Julian Elischer
3c1ffc320f Fix typo in code obviously not being compiled on any of my machines.
found by: rdivacky@
2007-10-28 23:11:57 +00:00
John Baldwin
9dddab6fc1 Change the roundrobin implementation in the 4BSD scheduler to trigger a
userland preemption directly from hardclock() via sched_clock() when a
thread uses up a full quantum instead of using a periodic timeout to cause
a userland preemption every so often.  This fixes a potential deadlock
when IPI_PREEMPTION isn't enabled where softclock blocks on a lock held
by a thread pinned or bound to another CPU.  The current thread on that
CPU will never be preempted while softclock is blocked.

Note that ULE already drives its round-robin userland preemption from
sched_clock() as well and always enables IPI_PREEMPT.

MFC after:	1 week
2007-10-27 22:07:40 +00:00
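
In outline, the mechanism looks like this (a hedged sketch, not the
committed diff; ts_slice and sched_slice are assumed names for the
per-thread and default quantum counters):

	/* In sched_clock(), run once per tick on the current thread: */
	if (--ts->ts_slice <= 0) {
		ts->ts_slice = sched_slice;
		td->td_flags |= TDF_NEEDRESCHED;	/* preempt at the user boundary */
	}
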
Craig Rodrigues
b4b5bf359b In nmount(), if MNT_ROOT is in the mount flags, filter it
out instead of returning an error.
(1)  This makes the behavior consistent with mount(2).
(2)  This makes update mounts on the root file system work properly.
(3)  The explicit checks for MNT_ROOTFS in src/sbin/fsck_ffs/main.c
     and src/usr.sbin/mountd/mountd.c which were put in to
     eliminate errors during update mounts on the root file system
     can be removed.

The only place where MNT_ROOTFS can be validly set
is inside the kernel, i.e. with vfs_mountroot_try().

Reviewed by:	phk
MFC after:	3 days
2007-10-27 15:59:18 +00:00
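
The filtering amounts to one line early in the nmount() path (a hedged
sketch; the variable name is illustrative):

	/* Only vfs_mountroot_try() may set MNT_ROOTFS; strip it, don't fail. */
	fsflags &= ~MNT_ROOTFS;
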
Julian Elischer
6a564b46b6 Add support for the pre-existing module shutdown handshake.
Fix some comments.
2007-10-27 00:54:16 +00:00
Julian Elischer
9ef95d0105 Rename the processes to 'idle' and 'intr' as per jhb. 2007-10-27 00:52:26 +00:00
Julian Elischer
fbf7046447 Initialise the initial process pointer to NULL so that we know we don't
have an idle process yet.
I'm guessing that on my system this was always 0 already.

found by: Ed Schouten
2007-10-27 00:42:40 +00:00
Julian Elischer
6bc3d1dc09 If kthread_exit() is called on the last kthread in a kproc, then
all the work in kproc_exit() must be done.
We don't actually have a user of this yet, but why leave it to chance?
2007-10-26 22:18:20 +00:00
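
A hedged sketch of that rule (locking and teardown details elided):

void
kthread_exit(void)
{
	struct proc *p = curthread->td_proc;

	/* If this is the last thread in the kproc, do full process exit. */
	PROC_LOCK(p);
	if (p->p_numthreads == 1) {
		PROC_UNLOCK(p);
		kproc_exit(0);		/* performs all of the kproc_exit work */
	}
	/* ... otherwise detach and exit just this thread ... */
}
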
Julian Elischer
ca9a0ddf31 if one changes a function's arguments, one must also change the callers. 2007-10-26 22:03:19 +00:00
Julian Elischer
5f66cfca51 oops, over optimised and broke non-SMP builds 2007-10-26 20:32:33 +00:00
Julian Elischer
dd1b3ff97e kthread_exit needs no stinkin argument. 2007-10-26 17:03:22 +00:00
David E. O'Brien
ef44c8d2a3 style(9) 2007-10-26 16:33:47 +00:00
Julian Elischer
7ab24ea3b9 Introduce a way to make pure kernel threads.
kthread_add() takes the same parameters as the old kthread_create()
plus a pointer to a process structure, and adds a kernel thread
to that process.

kproc_kthread_add() takes the parameters for kthread_add,
plus a process name and a pointer to a pointer to a process instead of just
a pointer, and if the proc * is NULL, it creates the process to the
specifications required, before adding the thread to it.

All other old kthread_xxx() calls return (the names reappear), but act on (struct thread *)
instead of (struct proc *). One reason to change the name is so that
any old kernel modules that are lying around and expect kthread_create()
to make a process will not just accidentally link.

fix top to show kernel threads by their thread name in -SH mode
add a tdnam formatting option to ps to show thread names.

make all idle threads actual kthreads and put them into their own 'idled' process.
make all interrupt threads kthreads and put them in an 'interd' process
(mainly for aesthetic and accounting reasons)
rename proc 0 to be 'kernel' and its swapper thread is now 'swapper'

man page fixes to follow.
2007-10-26 08:00:41 +00:00
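
A hedged usage sketch of the new KPI, with argument order per kthread(9) of
this era (the worker function and names are illustrative):

#include <sys/param.h>
#include <sys/kthread.h>

static struct proc *workerproc;	/* NULL: kproc_kthread_add creates it */

static void
worker_main(void *arg)
{
	for (;;) {
		/* ... do periodic work ... */
	}
}

static void
worker_start(void)
{
	struct thread *td;

	/* Create the process on first use, then add the thread to it. */
	kproc_kthread_add(worker_main, NULL, &workerproc, &td, 0, 0,
	    "workerd", "worker0");

	/* Later threads can be added to the same process directly. */
	kthread_add(worker_main, NULL, workerproc, &td, 0, 0, "worker1");
}
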
Christian S.J. Peron
57274c513c Implement AUE_CORE, which adds process core dump support into the kernel.
This change introduces audit_proc_coredump() which is called by coredump(9)
to create an audit record for the coredump event.  When a process
dumps a core, it could be security relevant.  It could be an indicator that
a stack within the process has been overflowed with an incorrectly constructed
malicious payload or a number of other events.

The record that is generated looks like this:

header,111,10,process dumped core,0,Thu Oct 25 19:36:29 2007, + 179 msec
argument,0,0xb,signal
path,/usr/home/csjp/test.core
subject,csjp,csjp,staff,csjp,staff,1101,1095,50457,10.37.129.2
return,success,1
trailer,111

- We allocate a completely new record to make sure we aren't clobbering
  the audit data associated with the syscall that produced the core
  (assuming the core is being generated in response to SIGABRT and not
  an invalid memory access).
- Shuffle around expand_name() so we can use the coredump name at the very
  beginning of the coredump call.  Make sure we free the storage referenced
  by "name" if we need to bail out early.
- Audit both successful and failed coredump creation efforts

Obtained from:	TrustedBSD Project
Reviewed by:	rwatson
MFC after:	1 month
2007-10-26 01:23:07 +00:00
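
The hook point, roughly (a hedged sketch; the call and its placement follow
the description above, and the signature is an assumption):

	/* In coredump(), after the dump attempt, with `name` resolved up front: */
#ifdef AUDIT
	audit_proc_coredump(td, name, error);	/* records success or failure */
#endif
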
Robert Watson
30d239bc4c Merge first in a series of TrustedBSD MAC Framework KPI changes
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:

  mac_<object>_<method/action>
  mac_<object>_check_<method/action>

The previous naming scheme was inconsistent and mostly
reversed from the new scheme.  Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier.  Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods.  Also simplify some entry
point names slightly; for example, mac_check_vnode_open() becomes
mac_vnode_check_open() under the new scheme.

All MAC policy modules will need to be recompiled, and modules
not updated as part of this commit will need to be modified to
conform to the new KPI.

Sponsored by:	SPARTA (original patches against Mac OS X)
Obtained from:	TrustedBSD Project, Apple Computer
2007-10-24 19:04:04 +00:00
Christian S.J. Peron
5ff3816d82 Move where we audit the PID argument such that we unconditionally
audit it at the beginning of the syscall.  This fixes a problem
where the user supplies an invalid process ID which is > 0 which
results in the PID argument not being audited.

Obtained from:	TrustedBSD Project
MFC after:	1 week
2007-10-24 00:14:19 +00:00
Julian Elischer
e9271f5376 Take out the single-threading code in fork.
After discussions with jeff, alc, (various Ironport people), David Xu,
and mostly Alfred (who found the problem), it has been demonstrated that this
is not needed for our implementations of threads and represents a real
(as in we've seen it happen a lot) deadlock danger.

Several points:
 Since forking multiple threads is not allowed, and POSIX states that
 any mutexes owned by other threads will be owned in the child by
 phantom threads, and threads shouldn't be accessing shared structures without
 protection, it can be proved that if this leads to the child process accessing
 inconsistent data, it's a programming error.

 The mode of thread_single() being used in fork() is the wrong one.
 It is using SINGLE_NO_EXIT when it should be using SINGLE_BOUNDARY.

 Even if this were used, system processes have no need to do it, as they have
 no userland to get inconsistent.

  This commit first fixes the above bugs to get them correct in CVS,
  then removes the code with #ifdefs.
  This is so that history contains the corrected version should it
  be needed in the future.
  This code may be needed if we implement the forkall() syscall from
  Solaris. It may be needed for other non-posix thread libraries
  at some time in the future, so let the code sit for a short while
  while I do some work on it anyhow.

This removes a reproducible lockup in NFS.
It may be argued that doing a fork while holding a vnode lock may
not be the best idea in the first place, but it shouldn't cause a deadlock.
The removal has been running under soak test for several days now.

This removal should be seriously considered for 7.0 and RELENG_6.

Note: there is code in the core-dumping path that may have a similar problem
with coredumping threaded processes.

MFC After: 4 days
2007-10-23 17:54:15 +00:00
Peter Grehan
cbdd62ad04 Cut over to ULE on PowerPC
kern/sched_ule.c - Add __powerpc__ to the list of supported architectures

powerpc/conf/GENERIC - Swap SCHED_4BSD with SCHED_ULE

powerpc/powerpc/genassym.c - Export TD_LOCK field of thread struct

powerpc/powerpc/swtch.S - Handle new 3rd parameter to cpu_switch() by
 updating the old thread's lock. Note: uniprocessor-only, will require
 modification for MP support.

powerpc/powerpc/vm_machdep.c - Set 3rd param of cpu_switch to mutex of
old thread's lock, making the call a no-op.

Reviewed by:	marcel, jeffr (slightly older version)
2007-10-23 00:52:25 +00:00
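
The idea behind the no-op third argument, sketched (hedged; the exact call
site varies):

	/*
	 * cpu_switch() stores its third argument into the old thread's
	 * td_lock as the hand-off.  Passing the old thread's own lock makes
	 * that store a no-op, which suffices on uniprocessor PowerPC.
	 */
	cpu_switch(oldtd, newtd, oldtd->td_lock);
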
John Birrell
1676805c18 Add the full module path name to the kld_file_stat structure
for kldstat(2).

This allows libdtrace to determine the exact file from which
a kernel module was loaded without having to guess.

The kldstat(2) API is versioned with the size of the
kld_file_stat structure, so this change creates version 2.

Add the pathname to the verbose output of kldstat(8) too.

MFC: 3 days
2007-10-22 04:12:57 +00:00
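
The version-2 structure, sketched from the description above (field layout
per kldstat(2) of the era; treat as illustrative):

#include <sys/param.h>

struct kld_file_stat {
	int	version;	/* set to sizeof(struct kld_file_stat) */
	char	name[MAXPATHLEN];
	int	refs;
	int	id;
	caddr_t	address;	/* load address */
	size_t	size;		/* size in bytes */
	char	pathname[MAXPATHLEN];	/* new in version 2 */
};
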
Robert Watson
e41966dc35 Add PRIV_VFS_STAT privilege, which will allow overriding policy limits on
the right to stat() a file, such as in mac_bsdextended.

Obtained from:	TrustedBSD Project
MFC after:	3 months
2007-10-21 22:50:11 +00:00
Julian Elischer
3745c395ec Rename the kthread_xxx (e.g. kthread_create()) calls
to kproc_xxx as they actually make whole processes.
This makes way for us to add REAL kthread_create() and friends
that actually make threads. It turns out that most of these
calls actually end up being moved back to the thread version
when it's added, but we need to make this cosmetic change first.

I'd LOVE to do this rename in 7.0 so that we can eventually MFC the
new kthread_xxx() calls.
2007-10-20 23:23:23 +00:00
Ed Maste
7188e3c834 Put comments about syscalls by the correct ones, and use the correct syscall
number in the comment.
2007-10-19 19:17:53 +00:00
Sam Leffler
58590eb06b ULE works fine on arm; allow it to be used
Reviewed by:	jeff, cognet, imp
MFC after:	1 week
2007-10-16 19:25:26 +00:00
Alfred Perlstein
7c45a9c446 Export maxswzone, maxbcache, maxtsiz, dfldsiz, maxdsiz, dflssiz, maxssiz,
and sgrowsiz via sysctl.

MFC after: 1 week
2007-10-16 10:40:53 +00:00
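
Each export is a one-liner of this shape (a hedged sketch; the exact
SYSCTL_* macro and description strings in the commit may differ):

#include <sys/types.h>
#include <sys/sysctl.h>

SYSCTL_LONG(_kern, OID_AUTO, maxswzone, CTLFLAG_RD, &maxswzone, 0,
    "Maximum memory for swap metadata");
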
Alexander Leidinger
9f05d312b3 Backout sensors framework.
Requested by:	phk
Discussed on:	cvs-all
2007-10-15 20:00:24 +00:00
Alexander Leidinger
99f6b270e3 Import OpenBSD's sysctl hardware sensors framework.
This commit includes the following core components:

 * sample configuration file for sensorsd
 * rc(8) script and glue code for sensorsd(8)
 * sysctl(3) doc fixes for CTL_HW tree
 * sysctl(3) documentation for hardware sensors
 * sysctl(8) documentation for hardware sensors
 * support for the sensor structure for sysctl(8)
 * rc.conf(5) documentation for starting sensorsd(8)
 * sensor_attach(9) et al documentation
 * /sys/kern/kern_sensors.c
   o sensor_attach(9) API for drivers to register ksensors
   o sensor_task_register(9) API for the update task
   o sysctl(3) glue code
   o hw.sensors shadow tree for sysctl(8) internal magic
 * <sys/sensors.h>
 * HW_SENSORS definition for <sys/sysctl.h>
 * sensors display for systat(1), including documentation
 * sensorsd(8) and all applicable documentation

The userland part of the framework is entirely source-code
compatible with OpenBSD 4.1, 4.2 and -current as of today.

All sensor readings can be viewed with `sysctl hw.sensors`,
monitored in semi-realtime with `systat -sensors` and also
logged with `sensorsd`.

Submitted by:	Constantine A. Murenin <cnst@FreeBSD.org>
Sponsored by:	Google Summer of Code 2007 (GSoC2007/cnst-sensors)
Mentored by:	syrinx
Tested by:	many
OKed by:	kensmith
Obtained from:	OpenBSD (parts)
2007-10-14 10:45:31 +00:00
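
Driver-side registration, sketched against the API named above
(sensor_attach(9), sensor_task_register(9)); this is hedged, and the member
names and refresh period are assumptions:

static struct ksensordev sc_sensordev;
static struct ksensor sc_sensor;

static void
mydrv_refresh(void *arg)
{
	/* read_hw_microkelvin() is hypothetical hardware access. */
	sc_sensor.value = read_hw_microkelvin();
}

static void
mydrv_sensor_attach(void *sc)
{
	strlcpy(sc_sensordev.xname, "mydrv0", sizeof(sc_sensordev.xname));
	sc_sensor.type = SENSOR_TEMP;
	sensor_attach(&sc_sensordev, &sc_sensor);
	sensor_task_register(sc, mydrv_refresh, 5);	/* refresh every 5 s */
}
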
Dag-Erling Smørgrav
d302c56d9b I don't know what I was smoking when I wrote these three years ago; the
return value is an error code, hence always an int.

While I'm here, add getenv_uint() for completeness.
2007-10-13 11:30:19 +00:00
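
The resulting prototypes, roughly (hedged; the exact kenv(9) declarations
may differ):

int	getenv_int(const char *name, int *data);
int	getenv_uint(const char *name, unsigned int *data);
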
Mohan Srinivasan
58d14dae6d Set the NFS server sockbuf high watermarks to the system defaults
(up from 32KB). The low high watermark setting caused UDP fullsock
request drops, throttling throughput greatly.
Reported by: Kris Kennaway
Approved by: re@ (Ken Smith)
2007-10-12 03:56:27 +00:00
Jeff Roberson
8753688f03 - Fix from PR kern/115469; don't redeliver a signal once it has been
handled by the target process.

Contributed by:	Tijl Coosemans <tijl@ulyssis.org>
Approved by:	re
2007-10-09 00:03:39 +00:00
Jeff Roberson
88f530cc25 - Bail out of tdq_idled if !smp_started or idle stealing is disabled. This
fixes a bug on UP machines with SMP kernels where the idle thread
   constantly switches after trying to steal work from the local cpu.
 - Make the idle stealing code more robust against self selection.
 - Prefer to steal from the cpu with the highest load that has at least one
   transferable thread.  Before we selected the cpu with the highest
   transferable count which excludes bound threads.

Collaborated with:	csjp
Approved by:		re
2007-10-08 23:50:39 +00:00
Jeff Roberson
05dc0eb204 - Restore historical sched_yield() behavior by changing sched_relinquish()
to simply switch rather than lowering priority and switching.  This allows
   threads of equal priority to run, but not those of lesser priority.

Discussed with:	davidxu
Reported by:	NIIMI Satoshi <sa2c@sa2c.net>
Approved by:	re
2007-10-08 23:45:24 +00:00
Jeff Roberson
40a940af86 - Restore historical yield() behavior by manually lowering priority and
switching.

Approved by:	re
2007-10-08 23:40:40 +00:00
Jeff Roberson
5bce4ae3be - Fix ULE in kernels without PREEMPTION compiled in by always enabling the
critical_exit() owepreempt check.  ULE will always use owepreempt to
   preempt the idle thread.  This change does not affect 4BSD since it will
   never set owepreempt without PREEMPTION enabled.
 - Remove some unused code from choosethread().

Discussed with:	jhb
Approved by:	re
2007-10-08 23:37:28 +00:00
Kip Macy
457869b973 This patch adds an M_NOFREE flag which allows one to mark an mbuf as
not being independently freeable. This allows one to embed an mbuf in
the cluster itself. This confers the benefits of the packet zone on
all cluster sizes. Embedded mbufs currently suffer from the same
limitation that packet zone mbufs do in that one cannot disconnect
them and pass them around independently of the cluster. It would
likely be possible to eliminate this limitation in the future by
adding a second reference for the mbuf itself.

Approved by: re(gnn)
2007-10-06 21:42:39 +00:00
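
Marking such an mbuf is a flag set at allocation time (a hedged sketch;
the surrounding setup is illustrative):

	/* Header lives inside the cluster; don't free it independently. */
	m->m_flags |= M_NOFREE;
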
Kip Macy
629b9e0853 Allow drivers to free an mbuf without the mbuf itself being touched,
provided the driver has already freed any attached tags

Approved by: re(gnn)
2007-10-06 21:13:55 +00:00
Pawel Jakub Dawidek
764a938b11 Fix sx_try_slock(), so it only fails when there is an exclusive owner.
Before that fix, it was possible for the function to fail if the number
of sharers changed between the 'x = sx->sx_lock' step and the
atomic_cmpset_acq_ptr() call.

This fixes ZFS problem when ZFS returns strange EIO errors under load.
In ZFS there is a code that depends on the fact that sx_try_slock() can
only fail if there is an exclusive owner.

Discussed with:	attilio
Reviewed by:	jhb
Approved by:	re (kensmith)
2007-10-02 14:48:48 +00:00
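
The shape of the fix, sketched (hedged; constant names follow sys/sx.h):

int
sx_try_slock_sketch(struct sx *sx)
{
	uintptr_t x;

	for (;;) {
		x = sx->sx_lock;
		if ((x & SX_LOCK_SHARED) == 0)
			return (0);	/* exclusive owner: the only failure */
		if (atomic_cmpset_acq_ptr(&sx->sx_lock, x,
		    x + SX_ONE_SHARER))
			return (1);	/* acquired a shared hold */
		/* Sharer count moved under us; retry instead of failing. */
	}
}
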
Jeff Roberson
59c6813475 - Reassign the thread queue lock to newtd prior to switching. Assigning
after the switch leads to a race where the outgoing thread still owns
   the local queue lock while another cpu may switch it in.  This race
   is only possible on machines where cpu_switch can take significantly
   longer on different cpus which in practice means HTT machines with
   unfair thread scheduling algorithms.

Found by:	kris (of course)
Approved by:	re
2007-10-02 01:30:18 +00:00
Jeff Roberson
7fcf154aef - Move the rebalancer back into hardclock to prevent potential softclock
starvation caused by unbalanced interrupt loads.
 - Change the rebalancer to work on stathz ticks but retain randomization.
 - Simplify locking in tdq_idled() to use the tdq_lock_pair() rather than
   complex sequences of locks to avoid deadlock.

Reported by:	kris
Approved by:	re
2007-10-02 00:36:06 +00:00
Jeff Roberson
02e2d6b445 - Honor the PREEMPTION and FULL_PREEMPTION flags by setting the default
value for kern.sched.preempt_thresh appropriately.  It can still be
   adjusted at runtime.  ULE will still use IPI_PREEMPT in certain
   migration situations.
 - Assert that we're not trying to compile ULE on an unsupported
   architecture.  To date, I believe only i386 and amd64 have implemented
   the third cpu switch argument required.

Approved by:	re
2007-09-27 16:39:27 +00:00
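
The defaults, roughly as set (a hedged sketch following the
PREEMPTION/FULL_PREEMPTION description above):

#ifdef PREEMPTION
#ifdef FULL_PREEMPTION
static int preempt_thresh = PRI_MAX_IDLE;	/* preempt for almost anything */
#else
static int preempt_thresh = PRI_MIN_KERN;	/* kernel/interrupt threads */
#endif
#else
static int preempt_thresh = 0;			/* no preemption */
#endif
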
Ruslan Ermilov
718a600b20 Fix the description of the formula used to autosize the number of
buffers in the buffer cache.

Approved by:	re (kensmith)
2007-09-26 11:22:23 +00:00
Alan Cox
7bfda801a8 Change the management of cached pages (PQ_CACHE) in two fundamental
ways:

(1) Cached pages are no longer kept in the object's resident page
splay tree and memq.  Instead, they are kept in a separate per-object
splay tree of cached pages.  However, access to this new per-object
splay tree is synchronized by the _free_ page queues lock, not to be
confused with the heavily contended page queues lock.  Consequently, a
cached page can be reclaimed by vm_page_alloc(9) without acquiring the
object's lock or the page queues lock.

This solves a problem independently reported by tegge@ and Isilon.
Specifically, they observed the page daemon consuming a great deal of
CPU time because of pages bouncing back and forth between the cache
queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE).  The source of
this problem turned out to be a deadlock avoidance strategy employed
when selecting a cached page to reclaim in vm_page_select_cache().
However, the root cause was really that reclaiming a cached page
required the acquisition of an object lock while the page queues lock
was already held.  Thus, this change addresses the problem at its
root, by eliminating the need to acquire the object's lock.

Moreover, keeping cached pages in the object's primary splay tree and
memq was, in effect, optimizing for the uncommon case.  Cached pages
are reclaimed far, far more often than they are reactivated.  Instead,
this change makes reclamation cheaper, especially in terms of
synchronization overhead, and reactivation more expensive, because
reactivated pages will have to be reentered into the object's primary
splay tree and memq.

(2) Cached pages are now stored alongside free pages in the physical
memory allocator's buddy queues, increasing the likelihood that large
allocations of contiguous physical memory (i.e., superpages) will
succeed.

Finally, as a result of this change long-standing restrictions on when
and where a cached page can be reclaimed and returned by
vm_page_alloc(9) are eliminated.  Specifically, calls to
vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and
return a formerly cached page.  Consequently, a call to malloc(9)
specifying M_NOWAIT is less likely to fail.

Discussed with: many over the course of the summer, including jeff@,
   Justin Husted @ Isilon, peter@, tegge@
Tested by: an earlier version by kris@
Approved by: re (kensmith)
2007-09-25 06:25:06 +00:00
Jeff Roberson
e270652ba3 - Bound the interactivity score so that it cannot become negative.
Approved by:	re
2007-09-24 00:28:54 +00:00
Jeff Roberson
a5423ea313 - Improve grammar. s/it's/its/.
- Improve the long-term load balancer by always IPIing exactly once.
   Previously the delay after rebalancing could cause problems with
   uneven workloads.
 - Allow nice to have a linear effect on the interactivity score.  This
   allows negatively niced programs to stay interactive longer.  It may be
   useful with very expensive Xorg servers under high loads.  In general
   it should not be necessary to alter the nice level to improve interactive
   response.  We may also want to consider never allowing positively niced
   processes to become interactive at all.
 - Initialize ccpu to 0 rather than 0.0.  The decimal point was leftover
   from when the code was copied from 4bsd.  ccpu is 0 in ULE because ULE
   only exports weighted cpu values.

Reported by:	Steve Kargl (Load balancing problem)
Approved by:	re
2007-09-22 02:20:14 +00:00
Pawel Jakub Dawidek
b4d7e2983c Fix some locking cases where we ask for an exclusively locked vnode but
get a shared-locked vnode instead when vfs.lookup_shared is set to 1.

Discussed with:	kib, kris
Tested by:	kris
Approved by:	re (kensmith)
2007-09-21 10:16:56 +00:00
Jeff Roberson
54b0e65f84 - Redefine p_swtime and td_slptime as p_swtick and td_slptick. This
changes the units from seconds to the value of 'ticks' when swapped
   in/out.  ULE does not have a periodic timer that scans all threads in
   the system and as such maintaining a per-second counter is difficult.
 - Change computations requiring the unit in seconds to subtract ticks
   and divide by hz.  This does make the wraparound condition hz times
   more frequent but this is still in the range of several months to
   years and the adverse effects are minimal.

Approved by:	re
2007-09-21 04:10:23 +00:00
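
The seconds-based computations become tick arithmetic (a hedged sketch):

	/* td_slptick was recorded (from `ticks`) when the thread slept. */
	int slept_secs = (ticks - td->td_slptick) / hz;
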
Jeff Roberson
f462501739 - Call sched_sleep() before we suspend threads. sched_wakeup() is already
called via setrunnable().  This allows time slept while suspended to
   be accounted for in swapping decisions.

Approved by:	re
2007-09-21 04:04:22 +00:00