out: label in psignal() did not grab sched_lock before trying to release
it. Also, the previous version had several cases where it grabbed
sched_lock before jumping to out: unneccessarily, so rework this a bit.
The runfast: and out: labels must be called with sched_lock released, and
the run: label must be called with it held. Appropriate mtx_assert()'s
have been added that should catch any bugs that may still be in this
code.
Noticed by: bde
of explicit calls to lockmgr. Also provides macros for the flags
pased to specify shared, exclusive or release which map to the
lockmgr flags. This is so that the use of lockmgr can be easily
replaced with optimized reader-writer locks.
- Add some locking that I missed the first time.
process is on the alternate stack or not. For compatibility
with sigstack(2) state is being updated if such is needed.
We now determine whether the process is on the alternate
stack by looking at its stack pointer. This allows a process
to siglongjmp from a signal handler on the alternate stack
to the place of the sigsetjmp on the normal stack. When
maintaining state, this would have invalidated the state
information and causing a subsequent signal to be delivered
on the normal stack instead of the alternate stack.
PR: 22286
may block on a mutex while on the sleep queue without corrupting
it.
- Move dropping of Giant to after the acquire of sched_lock.
Tested by: John Hay <jhay@icomtek.csir.co.za>
jhb
acquire Giant as needed in functions that call mi_switch(). The releases
need to be done outside of the sched_lock to avoid potential deadlocks
from trying to acquire Giant while interrupts are disabled.
Submitted by: witness
syscall compare against a variable sv_minsigstksz in struct
sysentvec as to properly take the size of the machine- and
ABI dependent struct sigframe into account.
The SVR4 and iBCS2 modules continue to have a minsigstksz of
8192 to preserve behavior. The real values (if different) are
not known at this time. Other ABI modules use the real
values.
The native MINSIGSTKSZ is now defined as follows:
Arch MINSIGSTKSZ
---- -----------
alpha 4096
i386 2048
ia64 12288
Reviewed by: mjacob
Suggested by: bde
the 128-bit sigset_t changes by moving conditionally (rarely) executed
code to the beginning where it is always executed, and since this code
now involves 3 128-bit operations, the pessimization was relatively
large. This change speeds up lmbench's pipe latency benchmark by
3.5%.
Fixed style bugs in CURSIG().
very bloated, first with 128-bit sigset_t's, then with locking in the
SMP case, then with locking in all cases. The space bloat was probably
also time bloat, partly because the fast path through CURSIG() was
pessimized by the sigset_t changes. This change speeds up lmbench's
pipe-based latency benchmark by 4% on a Celeron. <sys/signalvar.h>
had become very polluted to support the bloat.
include:
* Mutual exclusion is used instead of spl*(). See mutex(9). (Note: The
alpha port is still in transition and currently uses both.)
* Per-CPU idle processes.
* Interrupts are run in their own separate kernel threads and can be
preempted (i386 only).
Partially contributed by: BSDi (BSD/OS)
Submissions by (at least): cp, dfr, dillon, grog, jake, jhb, sheldonh
int p_can(p1, p2, operation, privused)
which allows specification of subject process, object process,
inter-process operation, and an optional call-by-reference privused
flag, allowing the caller to determine if privilege was required
for the call to succeed. This allows jail, kern.ps_showallprocs and
regular credential-based interaction checks to occur in one block of
code. Possible operations are P_CAN_SEE, P_CAN_SCHED, P_CAN_KILL,
and P_CAN_DEBUG. p_can currently breaks out as a wrapper to a
series of static function checks in kern_prot, which should not
be invoked directly.
o Commented out capabilities entries are included for some checks.
o Update most inter-process authorization to make use of p_can() instead
of manual checks, PRISON_CHECK(), P_TRESPASS(), and
kern.ps_showallprocs.
o Modify suser{,_xxx} to use const arguments, as it no longer modifies
process flags due to the disabling of ASU.
o Modify some checks/errors in procfs so that ENOENT is returned instead
of ESRCH, further improving concealment of processes that should not
be visible to other processes. Also introduce new access checks to
improve hiding of processes for procfs_lookup(), procfs_getattr(),
procfs_readdir(). Correct a bug reported by bp concerning not
handling the CREATE case in procfs_lookup(). Remove volatile flag in
procfs that caused apparently spurious qualifier warnigns (approved by
bde).
o Add comment noting that ktrace() has not been updated, as its access
control checks are different from ptrace(), whereas they should
probably be the same. Further discussion should happen on this topic.
Reviewed by: bde, green, phk, freebsd-security, others
Approved by: bde
Obtained from: TrustedBSD Project
defined. This boils down to conditionally compile the
old signal syscalls.
We might want to extend the types in syscalls.master to
make these syscalls conditionally on something more
appropriate than COMPAT_43.
the gating of system calls that cause modifications to the underlying
filesystem. The gating can be enabled by any filesystem that needs
to consistently suspend operations by adding the vop_stdgetwritemount
to their set of vnops. Once gating is enabled, the function
vfs_write_suspend stops all new write operations to a filesystem,
allows any filesystem modifying system calls already in progress
to complete, then sync's the filesystem to disk and returns. The
function vfs_write_resume allows the suspended write operations to
begin again. Gating is not added by default for all filesystems as
for SMP systems it adds two extra locks to such critical kernel
paths as the write system call. Thus, gating should only be added
as needed.
Details on the use and current status of snapshots in FFS can be
found in /sys/ufs/ffs/README.snapshot so for brevity and timelyness
is not included here. Unless and until you create a snapshot file,
these changes should have no effect on your system (famous last words).
after the acquisition of any advisory locks. This fix corrects a case
in which a process tries to open a file with a non-blocking exclusive
lock. Even if it fails to get the lock it would still truncate the
file even though its open failed. With this change, the truncation
is done only after the lock is successfully acquired.
Obtained from: BSD/OS
syscall path inward. A system call may select whether it needs the MP
lock or not (the default being that it does need it).
A great deal of conditional SMP code for various deadended experiments
has been removed. 'cil' and 'cml' have been removed entirely, and the
locking around the cpl has been removed. The conditional
separately-locked fast-interrupt code has been removed, meaning that
interrupts must hold the CPL now (but they pretty much had to anyway).
Another reason for doing this is that the original separate-lock for
interrupts just doesn't apply to the interrupt thread mechanism being
contemplated.
Modifications to the cpl may now ONLY occur while holding the MP
lock. For example, if an otherwise MP safe syscall needs to mess with
the cpl, it must hold the MP lock for the duration and must (as usual)
save/restore the cpl in a nested fashion.
This is precursor work for the real meat coming later: avoiding having
to hold the MP lock for common syscalls and I/O's and interrupt threads.
It is expected that the spl mechanisms and new interrupt threading
mechanisms will be able to run in tandem, allowing a slow piecemeal
transition to occur.
This patch should result in a moderate performance improvement due to
the considerable amount of code that has been removed from the critical
path, especially the simplification of the spl*() calls. The real
performance gains will come later.
Approved by: jkh
Reviewed by: current, bde (exception.s)
Some work taken from: luoqi's patch
p_trespass(struct proc *p1, struct proc *p2)
which returns zero or an errno depending on the legality of p1 trespassing
on p2.
Replace kern_sig.c:CANSIGNAL() with call to p_trespass() and one
extra signal related check.
Replace procfs.h:CHECKIO() macros with calls to p_trespass().
Only show command lines to process which can trespass on the target
process.
Alot of the code in sys/kern directly accesses the *Q_HEAD and *Q_ENTRY
structures for list operations. This patch makes all list operations
in sys/kern use the queue(3) macros, rather than directly accessing the
*Q_{HEAD,ENTRY} structures.
This batch of changes compile to the same object files.
Reviewed by: phk
Submitted by: Jake Burkholder <jake@checker.org>
PR: 14914
can provide the correct context to each signal handler.
Fix broken sigsuspend(): don't use p_oldsigmask as a flag, use SAS_OLDMASK
as we did before the linuxthreads support merge (submitted by bde).
Move ps_sigstk from to p_sigacts to the main proc structure since signal
stack should not be shared among threads.
Move SAS_OLDMASK and SAS_ALTSTACK flags from sigacts::ps_flags to proc::p_flag.
Move PS_NOCLDSTOP and PS_NOCLDWAIT flags from proc::p_flag to procsig::ps_flag.
Reviewed by: marcel, jdp, bde
-----------------------------
The core of the signalling code has been rewritten to operate
on the new sigset_t. No methodological changes have been made.
Most references to a sigset_t object are through macros (see
signalvar.h) to create a level of abstraction and to provide
a basis for further improvements.
The NSIG constant has not been changed to reflect the maximum
number of signals possible. The reason is that it breaks
programs (especially shells) which assume that all signals
have a non-null name in sys_signame. See src/bin/sh/trap.c
for an example. Instead _SIG_MAXSIG has been introduced to
hold the maximum signal possible with the new sigset_t.
struct sigprop has been moved from signalvar.h to kern_sig.c
because a) it is only used there, and b) access must be done
though function sigprop(). The latter because the table doesn't
holds properties for all signals, but only for the first NSIG
signals.
signal.h has been reorganized to make reading easier and to
add the new and/or modified structures. The "old" structures
are moved to signalvar.h to prevent namespace polution.
Especially the coda filesystem suffers from the change, because
it contained lines like (p->p_sigmask == SIGIO), which is easy
to do for integral types, but not for compound types.
NOTE: kdump (and port linux_kdump) must be recompiled.
Thanks to Garrett Wollman and Daniel Eischen for pressing the
importance of changing sigreturn as well.
than a review, this was a nice puzzle.
This is supposed to be binary and source compatible with older
applications that access the old FreeBSD-style three arguments to a
signal handler.
Except those applications that access hidden signal handler arguments
bejond the documented third one. If you have applications that do,
please let me know so that we take the opportunity to provide the
functionality they need in a documented manner.
Also except application that use 'struct sigframe' directly. You need
to recompile gdb and doscmd. `make world` is recommended.
Example program that demonstrates how SA_SIGINFO and old-style FreeBSD
handlers (with their three args) may be used in the same process is at
http://www3.cons.org/tmp/fbsd-siginfo.c
Programs that use the old FreeBSD-style three arguments are easy to
change to SA_SIGINFO (although they don't need to, since the old style
will still work):
Old args to signal handler:
void handler_sn(int sig, int code, struct sigcontext *scp)
New args:
void handler_si(int sig, siginfo_t *si, void *third)
where:
old:code == new:second->si_code
old:scp == &(new:si->si_scp) /* Passed by value! */
The latter is also pointed to by new:third, but accessing via
si->si_scp is preferred because it is type-save.
FreeBSD implementation notes:
- This is just the framework to make the interface POSIX compatible.
For now, no additional functionality is provided. This is supposed
to happen now, starting with floating point values.
- We don't use 'sigcontext_t.si_value' for now (POSIX meant it for
realtime-related values).
- Documentation will be updated when new functionality is added and
the exact arguments passed are determined. The comments in
sys/signal.h are meant to be useful.
Reviewed by: BDE
This is a seriously beefed up chroot kind of thing. The process
is jailed along the same lines as a chroot does it, but with
additional tough restrictions imposed on what the superuser can do.
For all I know, it is safe to hand over the root bit inside a
prison to the customer living in that prison, this is what
it was developed for in fact: "real virtual servers".
Each prison has an ip number associated with it, which all IP
communications will be coerced to use and each prison has its own
hostname.
Needless to say, you need more RAM this way, but the advantage is
that each customer can run their own particular version of apache
and not stomp on the toes of their neighbors.
It generally does what one would expect, but setting up a jail
still takes a little knowledge.
A few notes:
I have no scripts for setting up a jail, don't ask me for them.
The IP number should be an alias on one of the interfaces.
mount a /proc in each jail, it will make ps more useable.
/proc/<pid>/status tells the hostname of the prison for
jailed processes.
Quotas are only sensible if you have a mountpoint per prison.
There are no privisions for stopping resource-hogging.
Some "#ifdef INET" and similar may be missing (send patches!)
If somebody wants to take it from here and develop it into
more of a "virtual machine" they should be most welcome!
Tools, comments, patches & documentation most welcome.
Have fun...
Sponsored by: http://www.rndassociates.com/
Run for almost a year by: http://www.servetheweb.com/