167 Commits

Author SHA1 Message Date
rwatson
300d4098cf Remove 'MPSAFE' annotations from the comments above most system calls: all
system calls now enter without Giant held, and then in some cases, acquire
Giant explicitly.

Remove a number of other MPSAFE annotations in the credential code and
tweak one or two other adjacent comments.
2007-03-04 22:36:48 +00:00
delphij
8cc8bccf58 Close race conditions between fork() and [sg]etpriority()'s
PRIO_USER case, possibly also other places that deferences
p_ucred.

In the past, we insert a new process into the allproc list right
after PID allocation, and release the allproc_lock sx.  Because
most content in new proc's structure is not yet initialized,
this could lead to undefined result if we do not handle PRS_NEW
with care.

The problem with PRS_NEW state is that it does not provide fine
grained information about how much initialization is done for a
new process.  By defination, after PRIO_USER setpriority(), all
processes that belongs to given user should have their nice value
set to the specified value.  Therefore, if p_{start,end}copy
section was done for a PRS_NEW process, we can not safely ignore
it because p_nice is in this area.  On the other hand, we should
be careful on PRS_NEW processes because we do not allow non-root
users to lower their nice values, and without a successful copy
of the copy section, we can get stale values that is inherted
from the uninitialized area of the process structure.

This commit tries to close the race condition by grabbing proc
mutex *before* we release allproc_lock xlock, and do copy as
well as zero immediately after the allproc_lock xunlock.  This
guarantees that the new process would have its p_copy and p_zero
sections, as well as user credential informaion initialized.  In
getpriority() case, instead of grabbing PROC_LOCK for a PRS_NEW
process, we just skip the process in question, because it does
not affect the final result of the call, as the p_nice value
would be copied from its parent, and we will see it during
allproc traverse.

Other potential solutions are still under evaluation.

Discussed with:	davidxu, jhb, rwatson
PR:		kern/108071
MFC after:	2 weeks
2007-02-26 03:38:09 +00:00
rwatson
af558a567b Use priv_check(9) instead of suser(9) for checking the privilege to
set real-time priority on a thread.  It looks like this suser(9)
call was introduced after my first pass through replacing superuser
checks with named privilege checks.
2007-02-19 13:22:36 +00:00
delphij
2e20bff54b Use FOREACH_PROC_IN_SYSTEM instead of using its unrolled form. 2007-01-17 14:58:53 +00:00
julian
396ed947f6 Threading cleanup.. part 2 of several.
Make part of John Birrell's KSE patch permanent..
Specifically, remove:
Any reference of the ksegrp structure. This feature was
never fully utilised and made things overly complicated.
All code in the scheduler that tried to make threaded programs
fair to unthreaded programs.  Libpthread processes will already
do this to some extent and libthr processes already disable it.

Also:
Since this makes such a big change to the scheduler(s), take the opportunity
to rename some structures and elements that had to be moved anyhow.
This makes the code a lot more readable.

The ULE scheduler compiles again but I have no idea if it works.

The 4bsd scheduler still reqires a little cleaning and some functions that now do
ALMOST nothing will go away, but I thought I'd do that as a separate commit.

Tested by David Xu, and Dan Eischen using libthr and libpthread.
2006-12-06 06:34:57 +00:00
davidxu
a25887447d Use scheduler API sched_user_prio() to adjust thread's userland priority,
use td_base_user_prio to get real userland priority since POSIX priority
mutex may adjust td_user_pri which is an effective priority.
2006-11-20 05:50:59 +00:00
rwatson
10d0d9cf47 Sweep kernel replacing suser(9) calls with priv(9) calls, assigning
specific privilege names to a broad range of privileges.  These may
require some future tweaking.

Sponsored by:           nCircle Network Security, Inc.
Obtained from:          TrustedBSD Project
Discussed on:           arch@
Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri,
                        Alex Lyashkov <umka at sevcity dot net>,
                        Skip Ford <skip dot ford at verizon dot net>,
                        Antoine Brodin <antoine dot brodin at laposte dot net>
2006-11-06 13:42:10 +00:00
jb
f82c799735 Make KSE a kernel option, turned on by default in all GENERIC
kernel configs except sun4v (which doesn't process signals properly
with KSE).

Reviewed by:	davidxu@
2006-10-26 21:42:22 +00:00
davidxu
bac7c2b79d Replace system call thr_getscheduler, thr_setscheduler, thr_setschedparam
with rtprio_thread, while rtprio system call is for process only, the new
system call rtprio_thread is responsible for LWP.
2006-09-21 04:18:46 +00:00
yar
209e4786e7 Commit the results of the typo hunt by Darren Pilgrim.
This change affects documentation and comments only,
no real code involved.

PR:		misc/101245
Submitted by:	Darren Pilgrim <darren pilgrim bitfreak org>
Tested by:	md5(1)
MFC after:	1 week
2006-08-04 07:56:35 +00:00
phk
9052347556 Go over calcru and friends once more.
Reintroduce the monotonicity for the normal case and make the two
special cases behave in what is belived to be the most sensible fasion.
2006-03-11 10:48:19 +00:00
phk
15ba3472e1 Add slop to "backwards" cpu accounting messages, 3 usec or 1% whichever
triggers.

This should eliminate all the trivial messages which result from minor
increases in cpu_tick frequency.

Machines which don't du cpu clock fiddling shouldn't issue "backwards"
messages now.

Laptops and other machines where the initial estimate of cputicks may be
waaaay off will still issue warnings.
2006-03-09 09:33:17 +00:00
jhb
ddd6459377 Various style and comment fixes.
Submitted by:	bde
2006-02-22 16:58:48 +00:00
jhb
e9c312bcac Split calcru() back into a calcru1() function shared with calccru() and
a calcru() wrapper that passes a local rusage_ext on the stack that is
a snapshot to do the calculations on.  Now we can pass p->p_crux to
calcru1() in calccru() again which fixes the issues with runtime going
backwards messages when dead processes are harvested by init.

Reviewed by:	phk
Tested by:	Stefan Ehmann shoesoft at gmx dot net
2006-02-21 21:47:46 +00:00
phk
79081baaf0 CPU time accounting speedup (step 2)
Keep accounting time (in per-cpu) cputicks and the statistics counts
in the thread and summarize into struct proc when at context switch.

Don't reach across CPUs in calcru().

Add code to calibrate the top speed of cpu_tickrate() for variable
cpu_tick hardware (like TSC on power managed machines).

Don't enforce monotonicity (at least for now) in calcru.  While the
calibrated cpu_tickrate ramps up it may not be true.

Use 27MHz counter on i386/Geode.

Use TSC on amd64 & i386 if present.

Use tick counter on sparc64
2006-02-11 09:33:07 +00:00
phk
bb2f62f536 Modify the way we account for CPU time spent (step 1)
Keep track of time spent by the cpu in various contexts in units of
"cputicks" and scale to real-world microsec^H^H^H^H^H^H^H^Hclock_t
only when somebody wants to inspect the numbers.

For now "cputicks" are still derived from the current timecounter
and therefore things should by definition remain sensible also on
SMP machines.  (The main reason for this first milestone commit is
to verify that hypothesis.)

On slower machines, the avoided multiplications to normalize timestams
at every context switch, comes out as a 5-7% better score on the
unixbench/context1 microbenchmark.  On more modern hardware no change
in performance is seen.
2006-02-07 21:22:02 +00:00
ups
181ac11e19 Back out changes made in rev. 1.151.
They were bogus.

Cluebat applied by: jhb@
2006-01-25 02:05:47 +00:00
ups
18ba9270dc Hopefully fix the "calcru: runtime went backwards from ..." problem by
keeping the resource values locked (where needed) while we use them
for calculations.

MFC after:	3 days
2006-01-23 19:15:13 +00:00
ps
e0951fe504 Calling setrlimit from 32bit apps could potentially increase certain
limits beyond what should be capiable in a 32bit process, so we
must fixup the limits.

Reviewed by:	jhb
2005-11-02 21:18:07 +00:00
jhb
7a9ff8857d Use the reference count API to manage the reference counts for process
limit structures rather than using pool mutexes to protect the reference
counts.

Tested on:	i386, alpha, sparc64
2005-09-27 18:07:05 +00:00
alc
47a9b57f58 Giant is no longer required in kern_setrlimit(); remove its acquisition and
release.

Reviewed by: jhb
2005-06-01 17:52:51 +00:00
jhb
7b611b0cb2 Stop explicitly touching td_base_pri outside of the scheduler and simply
set a thread's priority via sched_prio() when that is the desired action.
The schedulers will start managing td_base_pri internally shortly.
2004-12-30 20:29:58 +00:00
jhb
ce2d3f89af Rework how we store process times in the kernel such that we always store
the raw values including for child process statistics and only compute the
system and user timevals on demand.

- Fix the various kern_wait() syscall wrappers to only pass in a rusage
  pointer if they are going to use the result.
- Add a kern_getrusage() function for the ABI syscalls to use so that they
  don't have to play stackgap games to call getrusage().
- Fix the svr4_sys_times() syscall to just call calcru() to calculate the
  times it needs rather than calling getrusage() twice with associated
  stackgap, etc.
- Add a new rusage_ext structure to store raw time stats such as tick counts
  for user, system, and interrupt time as well as a bintime of the total
  runtime.  A new p_rux field in struct proc replaces the same inline fields
  from struct proc (i.e. p_[isu]ticks, p_[isu]u, and p_runtime).  A new p_crux
  field in struct proc contains the "raw" child time usage statistics.
  ruadd() has been changed to handle adding the associated rusage_ext
  structures as well as the values in rusage.  Effectively, the values in
  rusage_ext replace the ru_utime and ru_stime values in struct rusage.  These
  two fields in struct rusage are no longer used in the kernel.
- calcru() has been split into a static worker function calcru1() that
  calculates appropriate timevals for user and system time as well as updating
  the rux_[isu]u fields of a passed in rusage_ext structure.  calcru() uses a
  copy of the process' p_rux structure to compute the timevals after updating
  the runtime appropriately if any of the threads in that process are
  currently executing.  It also now only locks sched_lock internally while
  doing the rux_runtime fixup.  calcru() now only requires the caller to
  hold the proc lock and calcru1() only requires the proc lock internally.
  calcru() also no longer allows callers to ask for an interrupt timeval
  since none of them actually did.
- calcru() now correctly handles threads executing on other CPUs.
- A new calccru() function computes the child system and user timevals by
  calling calcru1() on p_crux.  Note that this means that any code that wants
  child times must now call this function rather than reading from p_cru
  directly.  This function also requires the proc lock.
- This finishes the locking for rusage and friends so some of the Giant locks
  in exit1() and kern_wait() are now gone.
- The locking in ttyinfo() has been tweaked so that a shared lock of the
  proctree lock is used to protect the process group rather than the process
  group lock.  By holding this lock until the end of the function we now
  ensure that the process/thread that we pick to dump info about will no
  longer vanish while we are trying to output its info to the console.

Submitted by:	bde (mostly)
MFC after:	1 month
2004-10-05 18:51:11 +00:00
jhb
7b666b4137 A modest collection of various and sundry style, spelling, and whitespace
fixes.

Submitted by:	bde (mostly)
2004-09-24 00:38:15 +00:00
jhb
3956303607 Various small style fixes. 2004-09-22 15:24:33 +00:00
rwatson
d6384e3daf Push UIDINFO_UNLOCK() slightly earlier in chgsbize(), as it's not
needed if we print the local variable version of the limit rather
than the shared version.
2004-08-06 22:04:33 +00:00
rwatson
76535adbaa Remove spl's from kern_resource.c. 2004-08-04 18:19:09 +00:00
cperciva
d9fecc83c8 Rename suser_cred()'s PRISON_ROOT flag to SUSER_ALLOWJAIL. This is
somewhat clearer, but more importantly allows for a consistent naming
scheme for suser_cred flags.

The old name is still defined, but will be removed in a few days (unless I
hear any complaints...)

Discussed with:	rwatson, scottl
Requested by:	jhb
2004-07-26 07:24:04 +00:00
bde
459a8e3950 Turned off the "calcru: negative time" warning for certain SMP cases
where it is known to detect a problem but the problem is not very easy
to fix.  The warning became very common recently after a call to calcru()
was added to fill_kinfo_thread().

Another (much older) cause of "negative times" (actually non-monotonic
times) was fixed in rev.1.237 of kern_exit.c.

Print separate messages for non-monotonic and negative times.
2004-06-21 17:46:27 +00:00
julian
6c9d81ae0d Nice, is a property of a process as a whole..
I mistakenly moved it to the ksegroup when breaking up the process
structure. Put it back in the proc structure.
2004-06-16 00:26:31 +00:00
phk
86602fc06c Deorbit COMPAT_SUNOS.
We inherited this from the sparc32 port of BSD4.4-Lite1.  We have neither
a sparc32 port nor a SunOS4.x compatibility desire these days.
2004-06-11 11:16:26 +00:00
julian
98d03d5d06 Fix rtprio() to do sensible things when called from threaded processes.
It's not quite correct from a posix Point Of view, but it is a lot better
than what was there before. This will be revisited later
when we decide what form our priority extensions will take. Posix doesn't
specify  how a system scope thread can change its priority so you need to
add non-standard extensions to be able to do it..
For now make this slightly non standard to allow it to be done.

Submitted by:	Dan Eischen originally, changed by myself.
2004-05-08 08:56:05 +00:00
mux
74cb325f5e Remove a comment that complains about the lack of %qd, to justify
truncating a rlim_t to a long.  We have %qd since some time now.
However, the correct format to use here is %jd and a cast to
intmax_t, so do this.
2004-04-10 11:08:16 +00:00
imp
74cf37bd00 Remove advertising clause from University of California Regent's license,
per letter dated July 22, 1999.

Approved by: core
2004-04-05 21:03:37 +00:00
jhb
920753994a Argh! Fix a bogon. lim_cur() was returning the hard (max) limit rather
than the soft (cur) limit.

Submitted by:	bde
2004-02-11 18:04:13 +00:00
jhb
ad9ddd0d33 - Convert the plimit lock to a pool mutex lock.
- Hide struct plimit from userland.

Submitted by:	bde (2)
2004-02-06 19:35:14 +00:00
jhb
3618d8780b - Correct the translation of old rlimit values to properly handle the old
RLIM_INFINITY case for ogetrlimit().
- Use %jd and intmax_t to output negative time in usec in calcru().
- Rework getrusage() to make a copy of the rusage struct into a local
  variable while holding Giant and then do the copyout from the local
  variable to avoid having to have the original process rusage struct
  locked while doing the copyout (which would not be safe).  This also
  includes a few style fixes from Bruce to getrusage().

Submitted by:	bde (1, parts of 3)
Suggested by:	bde (2)
2004-02-06 19:30:12 +00:00
jhb
ade2bf85bb A few more style fixes from Bruce including a few I missed last time.
Submitted by:	bde
2004-02-06 19:25:34 +00:00
jhb
4f63b6f6e0 - A lot of style and whitespace fixes.
- Update a few comments regarding locking notes.

Submitted by:	bde (1, mostly)
2004-02-05 20:53:25 +00:00
jhb
279b2b8278 Locking for the per-process resource limits structure.
- struct plimit includes a mutex to protect a reference count.  The plimit
  structure is treated similarly to struct ucred in that is is always copy
  on write, so having a reference to a structure is sufficient to read from
  it without needing a further lock.
- The proc lock protects the p_limit pointer and must be held while reading
  limits from a process to keep the limit structure from changing out from
  under you while reading from it.
- Various global limits that are ints are not protected by a lock since
  int writes are atomic on all the archs we support and thus a lock
  wouldn't buy us anything.
- All accesses to individual resource limits from a process are abstracted
  behind a simple lim_rlimit(), lim_max(), and lim_cur() API that return
  either an rlimit, or the current or max individual limit of the specified
  resource from a process.
- dosetrlimit() was renamed to kern_setrlimit() to match existing style of
  other similar syscall helper functions.
- The alpha OSF/1 compat layer no longer calls getrlimit() and setrlimit()
  (it didn't used the stackgap when it should have) but uses lim_rlimit()
  and kern_setrlimit() instead.
- The svr4 compat no longer uses the stackgap for resource limits calls,
  but uses lim_rlimit() and kern_setrlimit() instead.
- The ibcs2 compat no longer uses the stackgap for resource limits.  It
  also no longer uses the stackgap for accessing sysctl's for the
  ibcs2_sysconf() syscall but uses kernel_sysctl() instead.  As a result,
  ibcs2_sysconf() no longer needs Giant.
- The p_rlimit macro no longer exists.

Submitted by:	mtm (mostly, I only did a few cleanups and catchups)
Tested on:	i386
Compiled on:	alpha, amd64
2004-02-04 21:52:57 +00:00
jeff
d74e8d0250 - Don't set td_priority directly here, use sched_prio(). 2003-10-27 07:15:47 +00:00
truckman
f985965120 Extend the mutex pool implementation to permit the creation and use of
multiple mutex pools with different options and sizes.  Mutex pools can
be created with either the default sleep mutexes or with spin mutexes.
A dynamically created mutex pool can now be destroyed if it is no longer
needed.

Create two pools by default, one that matches the existing pool that
uses the MTX_NOWITNESS option that should be used for building higher
level locks, and a new pool with witness checking enabled.

Modify the users of the existing mutex pool to use the appropriate pool
in the new implementation.

Reviewed by:	jhb
2003-07-13 01:22:21 +00:00
obrien
3b8fff9e4c Use __FBSDID(). 2003-06-11 00:56:59 +00:00
jhb
2958cf621b Remove Giant from [gs]etpriority(). 2003-04-23 18:48:55 +00:00
jhb
ced60d737a Lock both the proc lock and sched_lock when calling sched_nice since
kg_nice is now protected by both.  Being protected by both means that
other places in the kernel that want to read kg_nice only need one of the
two locks.
2003-04-22 20:45:38 +00:00
jhb
de4c9711d0 Add a couple of sched_lock asserts. 2003-04-18 20:17:47 +00:00
jeff
a033a84006 - Adjust sched hooks for fork and exec to take processes as arguments instead
of ksegs since they primarily operation on processes.
 - KSEs take ticks so pass the kse through sched_clock().
 - Add a sched_class() routine that adjusts a ksegrp pri class.
 - Define a sched_fork_{kse,thread,ksegrp} and sched_exit_{kse,thread,ksegrp}
   that will be used to tell the scheduler about new instances of these
   structures within the same process.  These will be used by THR and KSE.
 - Change sched_4bsd to reflect this API update.
2003-04-11 03:39:07 +00:00
tjr
a9d877b4c2 Back out previous. The locking here needs a rethink. 2003-03-13 00:54:53 +00:00
tjr
679efe569a Acquire sched_lock around use of FOREACH_KSEGRP_IN_PROC, accesses
to kg_nice and calls to sched_nice() in getpriority() and setpriority()
(really donice()).
2003-03-12 11:24:41 +00:00
tjr
569e9d1a86 Remove the PL_SHAREMOD flag from struct plimit, which could have been
used to share resource limits between rfork threads, but never was.
Removing it makes resource limit locking much simpler -- only the current
process can change the contents of the structure that p_limit points to.
2003-02-20 04:18:42 +00:00