Instead introduce the [M] prefix to existing keywords. e.g.
MSTD is the MP SAFE version of STD. This is prepatory for a
massive Giant lock pushdown. The old MPSAFE keyword made
syscalls.master too messy.
Begin comments MP-Safe procedures with the comment:
/*
* MPSAFE
*/
This comments means that the procedure may be called without
Giant held (The procedure itself may still need to obtain
Giant temporarily to do its thing).
sv_prepsyscall() is now MP SAFE and assumed to be MP SAFE
sv_transtrap() is now MP SAFE and assumed to be MP SAFE
ktrsyscall() and ktrsysret() are now MP SAFE (Giant Pushdown)
trapsignal() is now MP SAFE (Giant Pushdown)
Places which used to do the if (mtx_owned(&Giant)) mtx_unlock(&Giant)
test in syscall[2]() in */*/trap.c now do not. Instead they
explicitly unlock Giant if they previously obtained it, and then
assert that it is no longer held to catch broken system calls.
Rebuild syscall tables.
This paniced my one of my machines one time too many :-( and there is
no sign of a solution in the pipeline. The deltas are still easily
available in cvs. The problem is that if the parent has been swapped
out, the child process cannot grope around in the parent's UPAGES to
see the sigact[] array or it will fault. This probably is a showstopper
for this implementation anyway.
an unexpected user-visible side effect with the sigaction flags. Also cleanup
a minor union issue.
Submitted by: Rudolf Cejka <cejkar@dcse.fee.vutbr.cz>
MFC addendum: MFC will be combined w/ original commit
MFC after: 3 days
rather than grabbing it and releasing it themselves. This allows callers
of these functions to get the lock to close race conditions.
- Grab Giant around ktrace in postsig.
- Count the switches performed on SIGSTOP's as involuntary context switches
in the resource usage stats.
Reported by: tegge (signal race), bde (missing csw stats)
real uid, saved uid, real gid, and saved gid to ucred, as well as the
pcred->pc_uidinfo, which was associated with the real uid, only rename
it to cr_ruidinfo so as not to conflict with cr_uidinfo, which
corresponds to the effective uid.
o Remove p_cred from struct proc; add p_ucred to struct proc, replacing
original macro that pointed.
p->p_ucred to p->p_cred->pc_ucred.
o Universally update code so that it makes use of ucred instead of pcred,
p->p_ucred instead of p->p_pcred, cr_ruidinfo instead of p_uidinfo,
cr_{r,sv}{u,g}id instead of p_*, etc.
o Remove pcred0 and its initialization from init_main.c; initialize
cr_ruidinfo there.
o Restruction many credential modification chunks to always crdup while
we figure out locking and optimizations; generally speaking, this
means moving to a structure like this:
newcred = crdup(oldcred);
...
p->p_ucred = newcred;
crfree(oldcred);
It's not race-free, but better than nothing. There are also races
in sys_process.c, all inter-process authorization, fork, exec, and
exit.
o Remove sigio->sio_ruid since sigio->sio_ucred now contains the ruid;
remove comments indicating that the old arrangement was a problem.
o Restructure exec1() a little to use newcred/oldcred arrangement, and
use improved uid management primitives.
o Clean up exit1() so as to do less work in credential cleanup due to
pcred removal.
o Clean up fork1() so as to do less work in credential cleanup and
allocation.
o Clean up ktrcanset() to take into account changes, and move to using
suser_xxx() instead of performing a direct uid==0 comparision.
o Improve commenting in various kern_prot.c credential modification
calls to better document current behavior. In a couple of places,
current behavior is a little questionable and we need to check
POSIX.1 to make sure it's "right". More commenting work still
remains to be done.
o Update credential management calls, such as crfree(), to take into
account new ruidinfo reference.
o Modify or add the following uid and gid helper routines:
change_euid()
change_egid()
change_ruid()
change_rgid()
change_svuid()
change_svgid()
In each case, the call now acts on a credential not a process, and as
such no longer requires more complicated process locking/etc. They
now assume the caller will do any necessary allocation of an
exclusive credential reference. Each is commented to document its
reference requirements.
o CANSIGIO() is simplified to require only credentials, not processes
and pcreds.
o Remove lots of (p_pcred==NULL) checks.
o Add an XXX to authorization code in nfs_lock.c, since it's
questionable, and needs to be considered carefully.
o Simplify posix4 authorization code to require only credentials, not
processes and pcreds. Note that this authorization, as well as
CANSIGIO(), needs to be updated to use the p_cansignal() and
p_cansched() centralized authorization routines, as they currently
do not take into account some desirable restrictions that are handled
by the centralized routines, as well as being inconsistent with other
similar authorization instances.
o Update libkvm to take these changes into account.
Obtained from: TrustedBSD Project
Reviewed by: green, bde, jhb, freebsd-arch, freebsd-audit
- Require the proc lock be held for killproc() to allow for the vmdaemon to
kill a process when memory is exhausted while holding the lock of the
process to kill.
process on fork(2).
It is the supposed behavior stated in the manpage of sigaction(2), and
Solaris, NetBSD and FreeBSD 3-STABLE correctly do so.
The previous fix against libc_r/uthread/uthread_fork.c fixed the
problem only for the programs linked with libc_r, so back it out and
fix fork(2) itself to help those not linked with libc_r as well.
PR: kern/26705
Submitted by: KUROSAWA Takahiro <fwkg7679@mb.infoweb.ne.jp>
Tested by: knu, GOTOU Yuuzou <gotoyuzo@notwork.org>,
and some other people
Not objected by: hackers
MFC in: 3 days
been made machine independent and various other adjustments have been made
to support Alpha SMP.
- It splits the per-process portions of hardclock() and statclock() off
into hardclock_process() and statclock_process() respectively. hardclock()
and statclock() call the *_process() functions for the current process so
that UP systems will run as before. For SMP systems, it is simply necessary
to ensure that all other processors execute the *_process() functions when the
main clock functions are triggered on one CPU by an interrupt. For the alpha
4100, clock interrupts are delievered in a staggered broadcast fashion, so
we simply call hardclock/statclock on the boot CPU and call the *_process()
functions on the secondaries. For x86, we call statclock and hardclock as
usual and then call forward_hardclock/statclock in the MD code to send an IPI
to cause the AP's to execute forwared_hardclock/statclock which then call the
*_process() functions.
- forward_signal() and forward_roundrobin() have been reworked to be MI and to
involve less hackery. Now the cpu doing the forward sets any flags, etc. and
sends a very simple IPI_AST to the other cpu(s). AST IPIs now just basically
return so that they can execute ast() and don't bother with setting the
astpending or needresched flags themselves. This also removes the loop in
forward_signal() as sched_lock closes the race condition that the loop worked
around.
- need_resched(), resched_wanted() and clear_resched() have been changed to take
a process to act on rather than assuming curproc so that they can be used to
implement forward_roundrobin() as described above.
- Various other SMP variables have been moved to a MI subr_smp.c and a new
header sys/smp.h declares MI SMP variables and API's. The IPI API's from
machine/ipl.h have moved to machine/smp.h which is included by sys/smp.h.
- The globaldata_register() and globaldata_find() functions as well as the
SLIST of globaldata structures has become MI and moved into subr_smp.c.
Also, the globaldata list is only available if SMP support is compiled in.
Reviewed by: jake, peter
Looked over by: eivind
from signal authorization checking.
o p_cansignal() takes three arguments: subject process, object process,
and signal number, unlike p_cankill(), which only took into account
the processes and not the signal number, improving the abstraction
such that CANSIGNAL() from kern_sig.c can now also be eliminated;
previously CANSIGNAL() special-cased the handling of SIGCONT based
on process session. privused is now deprecated.
o The new p_cansignal() further limits the set of signals that may
be delivered to processes with P_SUGID set, and restructures the
access control check to allow it to be extended more easily.
o These changes take into account work done by the OpenBSD Project,
as well as by Robert Watson and Thomas Moestl on the TrustedBSD
Project.
Obtained from: TrustedBSD Project
avoid silly lock contention on sched_lock since in 2 out of the 3 places
that we call stop(), we get sched_lock right after calling it and we were
locking sched_lock inside of stop() anyways.
SIGCHLD to our parent process. Otherwise, we could block while obtaining
the process lock for our parent process and switch out while we were
in SSTOP. Even worse, when we try to resume from the mutex being blocked
on our p_stat will be SRUN, not SSTOP.
- Fix a comment above stop() to indicate that it requires that the proc lock
be held, not a proctree lock.
Reported by: markm
Sleuthing by: jake
Giant. The only exception is the CANSIGNAL() macro. Unlocking the proc
lock around sendsig() in trapsignal() is also questionable. Note that
the functions sigexit(), psignal(), and issignal() must be called with
the proc lock of the process in question held. postsig() and
trapsignal() should not be called with the proc lock held, but they
also do not require Giant anymore either.
- Remove spl's that are now no longer needed as they are fully replaced.
is sent to a process, psignal() needs to schedule an AST for the
process if the process is runnable, not just if it is current, so that
pending signals get checked for on the next return of the process to
user mode. This wasn't practical until recently because the AST flag
was per-cpu so setting it for a non-current process would usually just
cause a bogus AST for the current process.
For non-current processes looping in user mode, it took accidental
(?) magic to deliver signals at all. Signals were usually delivered
late as a side effect of rescheduling (need_resched() sets astpending,
etc.). In pre-SMPng, delivery was delayed by at most 1 quantum (the
need_resched() call in roundrobin() is certain to occur within 1
quantum for looping processes). In -current, things are complicated
by normal interrupt handlers being threads. Missing handling of the
complications makes roundrobin() a bogus no-op, but preemptive
scheduling sort of works anyway due to even larger bogons elsewhere.
- All processes go into the same array of queues, with different
scheduling classes using different portions of the array. This
allows user processes to have their priorities propogated up into
interrupt thread range if need be.
- I chose 64 run queues as an arbitrary number that is greater than
32. We used to have 4 separate arrays of 32 queues each, so this
may not be optimal. The new run queue code was written with this
in mind; changing the number of run queues only requires changing
constants in runq.h and adjusting the priority levels.
- The new run queue code takes the run queue as a parameter. This
is intended to be used to create per-cpu run queues. Implement
wrappers for compatibility with the old interface which pass in
the global run queue structure.
- Group the priority level, user priority, native priority (before
propogation) and the scheduling class into a struct priority.
- Change any hard coded priority levels that I found to use
symbolic constants (TTIPRI and TTOPRI).
- Remove the curpriority global variable and use that of curproc.
This was used to detect when a process' priority had lowered and
it should yield. We now effectively yield on every interrupt.
- Activate propogate_priority(). It should now have the desired
effect without needing to also propogate the scheduling class.
- Temporarily comment out the call to vm_page_zero_idle() in the
idle loop. It interfered with propogate_priority() because
the idle process needed to do a non-blocking acquire of Giant
and then other processes would try to propogate their priority
onto it. The idle process should not do anything except idle.
vm_page_zero_idle() will return in the form of an idle priority
kernel thread which is woken up at apprioriate times by the vm
system.
- Update struct kinfo_proc to the new priority interface. Deliberately
change its size by adjusting the spare fields. It remained the same
size, but the layout has changed, so userland processes that use it
would parse the data incorrectly. The size constraint should really
be changed to an arbitrary version number. Also add a debug.sizeof
sysctl node for struct kinfo_proc.
attributes. This is needed for AST's to be properly posted in a preemptive
kernel. They are backed by two new flags in p_sflag: PS_ASTPENDING and
PS_NEEDRESCHED. They are still accesssed by their old macros:
aston(), astoff(), etc. For completeness, an astpending() macro has been
added to check for a pending AST, and clear_resched() has been added to
clear need_resched().
- Rename syscall2() on the x86 back to syscall() to be consistent with
other architectures.
mtx_enter(lock, type) becomes:
mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks)
mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized)
similarily, for releasing a lock, we now have:
mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN.
We change the caller interface for the two different types of locks
because the semantics are entirely different for each case, and this
makes it explicitly clear and, at the same time, it rids us of the
extra `type' argument.
The enter->lock and exit->unlock change has been made with the idea
that we're "locking data" and not "entering locked code" in mind.
Further, remove all additional "flags" previously passed to the
lock acquire/release routines with the exception of two:
MTX_QUIET and MTX_NOSWITCH
The functionality of these flags is preserved and they can be passed
to the lock/unlock routines by calling the corresponding wrappers:
mtx_{lock, unlock}_flags(lock, flag(s)) and
mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN
locks, respectively.
Re-inline some lock acq/rel code; in the sleep lock case, we only
inline the _obtain_lock()s in order to ensure that the inlined code
fits into a cache line. In the spin lock case, we inline recursion and
actually only perform a function call if we need to spin. This change
has been made with the idea that we generally tend to avoid spin locks
and that also the spin locks that we do have and are heavily used
(i.e. sched_lock) do recurse, and therefore in an effort to reduce
function call overhead for some architectures (such as alpha), we
inline recursion for this case.
Create a new malloc type for the witness code and retire from using
the M_DEV type. The new type is called M_WITNESS and is only declared
if WITNESS is enabled.
Begin cleaning up some machdep/mutex.h code - specifically updated the
"optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN
and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently
need those.
Finally, caught up to the interface changes in all sys code.
Contributors: jake, jhb, jasone (in no particular order)
out: label in psignal() did not grab sched_lock before trying to release
it. Also, the previous version had several cases where it grabbed
sched_lock before jumping to out: unneccessarily, so rework this a bit.
The runfast: and out: labels must be called with sched_lock released, and
the run: label must be called with it held. Appropriate mtx_assert()'s
have been added that should catch any bugs that may still be in this
code.
Noticed by: bde
of explicit calls to lockmgr. Also provides macros for the flags
pased to specify shared, exclusive or release which map to the
lockmgr flags. This is so that the use of lockmgr can be easily
replaced with optimized reader-writer locks.
- Add some locking that I missed the first time.
process is on the alternate stack or not. For compatibility
with sigstack(2) state is being updated if such is needed.
We now determine whether the process is on the alternate
stack by looking at its stack pointer. This allows a process
to siglongjmp from a signal handler on the alternate stack
to the place of the sigsetjmp on the normal stack. When
maintaining state, this would have invalidated the state
information and causing a subsequent signal to be delivered
on the normal stack instead of the alternate stack.
PR: 22286
may block on a mutex while on the sleep queue without corrupting
it.
- Move dropping of Giant to after the acquire of sched_lock.
Tested by: John Hay <jhay@icomtek.csir.co.za>
jhb
acquire Giant as needed in functions that call mi_switch(). The releases
need to be done outside of the sched_lock to avoid potential deadlocks
from trying to acquire Giant while interrupts are disabled.
Submitted by: witness
syscall compare against a variable sv_minsigstksz in struct
sysentvec as to properly take the size of the machine- and
ABI dependent struct sigframe into account.
The SVR4 and iBCS2 modules continue to have a minsigstksz of
8192 to preserve behavior. The real values (if different) are
not known at this time. Other ABI modules use the real
values.
The native MINSIGSTKSZ is now defined as follows:
Arch MINSIGSTKSZ
---- -----------
alpha 4096
i386 2048
ia64 12288
Reviewed by: mjacob
Suggested by: bde
the 128-bit sigset_t changes by moving conditionally (rarely) executed
code to the beginning where it is always executed, and since this code
now involves 3 128-bit operations, the pessimization was relatively
large. This change speeds up lmbench's pipe latency benchmark by
3.5%.
Fixed style bugs in CURSIG().
very bloated, first with 128-bit sigset_t's, then with locking in the
SMP case, then with locking in all cases. The space bloat was probably
also time bloat, partly because the fast path through CURSIG() was
pessimized by the sigset_t changes. This change speeds up lmbench's
pipe-based latency benchmark by 4% on a Celeron. <sys/signalvar.h>
had become very polluted to support the bloat.
include:
* Mutual exclusion is used instead of spl*(). See mutex(9). (Note: The
alpha port is still in transition and currently uses both.)
* Per-CPU idle processes.
* Interrupts are run in their own separate kernel threads and can be
preempted (i386 only).
Partially contributed by: BSDi (BSD/OS)
Submissions by (at least): cp, dfr, dillon, grog, jake, jhb, sheldonh
int p_can(p1, p2, operation, privused)
which allows specification of subject process, object process,
inter-process operation, and an optional call-by-reference privused
flag, allowing the caller to determine if privilege was required
for the call to succeed. This allows jail, kern.ps_showallprocs and
regular credential-based interaction checks to occur in one block of
code. Possible operations are P_CAN_SEE, P_CAN_SCHED, P_CAN_KILL,
and P_CAN_DEBUG. p_can currently breaks out as a wrapper to a
series of static function checks in kern_prot, which should not
be invoked directly.
o Commented out capabilities entries are included for some checks.
o Update most inter-process authorization to make use of p_can() instead
of manual checks, PRISON_CHECK(), P_TRESPASS(), and
kern.ps_showallprocs.
o Modify suser{,_xxx} to use const arguments, as it no longer modifies
process flags due to the disabling of ASU.
o Modify some checks/errors in procfs so that ENOENT is returned instead
of ESRCH, further improving concealment of processes that should not
be visible to other processes. Also introduce new access checks to
improve hiding of processes for procfs_lookup(), procfs_getattr(),
procfs_readdir(). Correct a bug reported by bp concerning not
handling the CREATE case in procfs_lookup(). Remove volatile flag in
procfs that caused apparently spurious qualifier warnigns (approved by
bde).
o Add comment noting that ktrace() has not been updated, as its access
control checks are different from ptrace(), whereas they should
probably be the same. Further discussion should happen on this topic.
Reviewed by: bde, green, phk, freebsd-security, others
Approved by: bde
Obtained from: TrustedBSD Project
defined. This boils down to conditionally compile the
old signal syscalls.
We might want to extend the types in syscalls.master to
make these syscalls conditionally on something more
appropriate than COMPAT_43.
the gating of system calls that cause modifications to the underlying
filesystem. The gating can be enabled by any filesystem that needs
to consistently suspend operations by adding the vop_stdgetwritemount
to their set of vnops. Once gating is enabled, the function
vfs_write_suspend stops all new write operations to a filesystem,
allows any filesystem modifying system calls already in progress
to complete, then sync's the filesystem to disk and returns. The
function vfs_write_resume allows the suspended write operations to
begin again. Gating is not added by default for all filesystems as
for SMP systems it adds two extra locks to such critical kernel
paths as the write system call. Thus, gating should only be added
as needed.
Details on the use and current status of snapshots in FFS can be
found in /sys/ufs/ffs/README.snapshot so for brevity and timelyness
is not included here. Unless and until you create a snapshot file,
these changes should have no effect on your system (famous last words).
after the acquisition of any advisory locks. This fix corrects a case
in which a process tries to open a file with a non-blocking exclusive
lock. Even if it fails to get the lock it would still truncate the
file even though its open failed. With this change, the truncation
is done only after the lock is successfully acquired.
Obtained from: BSD/OS