Fix a race condition between kern_wait() and thread_stopped().
Problem is in kern_wait(), parent process steps through children list,
once a child process is skipped, and later even if the child is stopped,
parent process still sleeps in msleep(), the race happens if parent
masked SIGCHLD.
Submitted by : Peter Edwards peadar.edwards at gmail dot com
MFC after : 4 days
The main reason for doing this is that the ELF dump handler expects
the thread list to be fixed while the dump header is generated, so an
upcall that occurs at the wrong time can lead to buffer overruns and
other Bad Things.
Another solution would be to grab sched_lock in the ELF dump handler,
but we might as well single-thread, since the process is about to die.
Furthermore, I think this should ensure that the register sets in the
core file are sequentially consistent.
for a signal, because kernel stack is swappable, this causes page fault
in kernel under heavy swapping case. Fix this bug by eliminating unneeded
code.
sleeping, so in do_tdsignal, we no longer need to test td_waitset.
now td_waitset is only used to give a thread higher priority when
delivering signal to multithreads process.
This also fixes a bug:
when a thread in sigwait states was suspended and later resumed
by SIGCONT, it can no longer receive signals belong to waitset.
former is callable from user space and the latter from the kernel one. Make
kernel version take additional argument which tells if the respective call
should check for additional restrictions for sending signals to suid/sugid
applications or not.
Make all emulation layers using non-checked version, since signal numbers in
emulation layers can have different meaning that in native mode and such
protection can cause misbehaviour.
As a result remove LIBTHR from the signals allowed to be delivered to a
suid/sugid application.
Requested (sorta) by: rwatson
MFC after: 2 weeks
nice value above 0, set it to 0 so that it may proceed with haste.
This is especially important on ULE, where adjusting the priority
does not guarantee that a thread will be granted a greater time slice.
so would cause kernel to produce an unkillable process in some cases,
especially, P_STOPPED_SINGLE has a singling thread, turning off the
bit would mess the state.
The removed argument could trivially be derived from the remaining one.
That in turn should be the same as curthread, but it is possible that curthread could be expensive to derive on some syste,s so leave it as an argument.
Having both proc and thread as an argumen tjust gives an opportunity for
them to get out sync.
MFC after: 3 days
a more complete subsystem, and removes the knowlege of how things are
implemented from the drivers. Include locking around filter ops, so a
module like aio will know when not to be unloaded if there are outstanding
knotes using it's filter ops.
Currently, it uses the MTX_DUPOK even though it is not always safe to
aquire duplicate locks. Witness currently doesn't support the ability
to discover if a dup lock is ok (in some cases).
Reviewed by: green, rwatson (both earlier versions)
1. Add tm_lwpid into kse_thr_mailbox to indicate which kernel
thread current user thread is running on. Add tm_dflags into
kse_thr_mailbox, the flags is written by debugger, it tells
UTS and kernel what should be done when the process is being
debugged, current, there two flags TMDF_SSTEP and TMDF_DONOTRUNUSER.
TMDF_SSTEP is used to tell kernel to turn on single stepping,
or turn off if it is not set.
TMDF_DONOTRUNUSER is used to tell kernel to schedule upcall
whenever possible, to UTS, it means do not run the user thread
until debugger clears it, this behaviour is necessary because
gdb wants to resume only one thread when the thread's pc is
at a breakpoint, and thread needs to go forward, in order to
avoid other threads sneak pass the breakpoints, it needs to remove
breakpoint, only wants one thread to go. Also, add km_lwp to
kse_mailbox, the lwp id is copied to kse_thr_mailbox at context
switch time when process is not being debugged, so when process
is attached, debugger can map kernel thread to user thread.
2. Add p_xthread to proc strcuture and td_xsig to thread structure.
p_xthread is used by a thread when it wants to report event
to debugger, every thread can set the pointer, especially, when
it is used in ptracestop, it is the last thread reporting event
will win the race. Every thread has a td_xsig to exchange signal
with debugger, thread uses TDF_XSIG flag to indicate it is reporting
signal to debugger, if the flag is not cleared, thread will keep
retrying until it is cleared by debugger, p_xthread may be
used by debugger to indicate CURRENT thread. The p_xstat is still
in proc structure to keep wait() to work, in future, we may
just use td_xsig.
3. Add TDF_DBSUSPEND flag, the flag is used by debugger to suspend
a thread. When process stops, debugger can set the flag for
thread, thread will check the flag in thread_suspend_check,
enters a loop, unless it is cleared by debugger, process is
detached or process is existing. The flag is also checked in
ptracestop, so debugger can temporarily suspend a thread even
if the thread wants to exchange signal.
4. Current, in ptrace, we always resume all threads, but if a thread
has already a TDF_DBSUSPEND flag set by debugger, it won't run.
Encouraged by: marcel, julian, deischen
tracing process to obtain information about the LWP that caused the
traced process to stop. Debuggers can use this information to select
the thread currently running on the LWP as the current thread.
The request has been made compatible with NetBSD for as much as
possible. This implementation differs from NetBSD in the following
ways:
1. The data argument is allowed to be smaller than the size of the
ptrace_lwpinfo structure known to the kernel, but not 0. This
is opposite to what NetBSD allows. The reason for this is that
we can extend the structure without affecting older binaries.
2. On NetBSD the tracing process is to set the pl_lwpid field to
the Id of the LWP it wants information of. We don't do that.
Our ptrace interface allows passing the LWP Id instead of the
PID. The tracing process is to set the PID to the LWP Id it
wants information of.
3. When the PID is actually the PID of the tracing process, this
request returns the information about the LWP that caused the
process to stop. This was the whole purpose of the request in
the first place.
When the traced process has exited, this request will return the
LWP Id 0, indicating that the process state is not the result of
an event specific to a LWP.
switch to. If a non-NULL thread pointer is passed in, then the CPU will
switch to that thread directly rather than calling choosethread() to pick
a thread to choose to.
- Make sched_switch() aware of idle threads and know to do
TD_SET_CAN_RUN() instead of sticking them on the run queue rather than
requiring all callers of mi_switch() to know to do this if they can be
called from an idlethread.
- Move constants for arguments to mi_switch() and thread_single() out of
the middle of the function prototypes and up above into their own
section.
a LOR against sleepq. Fix the comment, and fix ptracestop() to pick up
sched_lock after stop() rather than before.
Reported by: Scott Sipe <cscotts@mindspring.com>
Reviewed by: rwatson, jhb
- Push Giant down a bit in coredump() and call coredump() with the proc
lock already held rather than unlocking it only to turn around and
relock it.
Requested by: peter
sleep queue interface:
- Sleep queues attempt to merge some of the benefits of both sleep queues
and condition variables. Having sleep qeueus in a hash table avoids
having to allocate a queue head for each wait channel. Thus, struct cv
has shrunk down to just a single char * pointer now. However, the
hash table does not hold threads directly, but queue heads. This means
that once you have located a queue in the hash bucket, you no longer have
to walk the rest of the hash chain looking for threads. Instead, you have
a list of all the threads sleeping on that wait channel.
- Outside of the sleepq code and the sleep/cv code the kernel no longer
differentiates between cv's and sleep/wakeup. For example, calls to
abortsleep() and cv_abort() are replaced with a call to sleepq_abort().
Thus, the TDF_CVWAITQ flag is removed. Also, calls to unsleep() and
cv_waitq_remove() have been replaced with calls to sleepq_remove().
- The sched_sleep() function no longer accepts a priority argument as
sleep's no longer inherently bump the priority. Instead, this is soley
a propery of msleep() which explicitly calls sched_prio() before
blocking.
- The TDF_ONSLEEPQ flag has been dropped as it was never used. The
associated TDF_SET_ONSLEEPQ and TDF_CLR_ON_SLEEPQ macros have also been
dropped and replaced with a single explicit clearing of td_wchan.
TD_SET_ONSLEEPQ() would really have only made sense if it had taken
the wait channel and message as arguments anyway. Now that that only
happens in one place, a macro would be overkill.
- struct plimit includes a mutex to protect a reference count. The plimit
structure is treated similarly to struct ucred in that is is always copy
on write, so having a reference to a structure is sufficient to read from
it without needing a further lock.
- The proc lock protects the p_limit pointer and must be held while reading
limits from a process to keep the limit structure from changing out from
under you while reading from it.
- Various global limits that are ints are not protected by a lock since
int writes are atomic on all the archs we support and thus a lock
wouldn't buy us anything.
- All accesses to individual resource limits from a process are abstracted
behind a simple lim_rlimit(), lim_max(), and lim_cur() API that return
either an rlimit, or the current or max individual limit of the specified
resource from a process.
- dosetrlimit() was renamed to kern_setrlimit() to match existing style of
other similar syscall helper functions.
- The alpha OSF/1 compat layer no longer calls getrlimit() and setrlimit()
(it didn't used the stackgap when it should have) but uses lim_rlimit()
and kern_setrlimit() instead.
- The svr4 compat no longer uses the stackgap for resource limits calls,
but uses lim_rlimit() and kern_setrlimit() instead.
- The ibcs2 compat no longer uses the stackgap for resource limits. It
also no longer uses the stackgap for accessing sysctl's for the
ibcs2_sysconf() syscall but uses kernel_sysctl() instead. As a result,
ibcs2_sysconf() no longer needs Giant.
- The p_rlimit macro no longer exists.
Submitted by: mtm (mostly, I only did a few cleanups and catchups)
Tested on: i386
Compiled on: alpha, amd64
SW_INVOL. Assert that one of these is set in mi_switch() and propery
adjust the rusage statistics. This is to simplify the large number of
users of this interface which were previously all required to adjust the
proper counter prior to calling mi_switch(). This also facilitates more
switch and locking optimizations.
- Change all callers of mi_switch() to pass the appropriate paramter and
remove direct references to the process statistics.
a maximum dump size of 0, return a size-related error, rather
than returning success. Otherwise, waitpid() will incorrectly
return a status indicating that a core dump was created. Note
that the specific error doesn't actually matter, since it's lost.
MFC after: 2 weeks
PR: 60367
Submitted by: Valentin Nechayev <netch@netch.kiev.ua>
holding the mutex. Because the sigacts pointer can't change while
the process is "live" (proc locking (x)), we know our pointer is still
valid.
In communication with: truckman
Reviewed by: jhb
is useless for threaded programs, multiple threads can not share same
stack.
The alternative signal stack is private for thread, no lock is needed,
the orignal P_ALTSTACK is now moved into td_pflags and renamed to
TDP_ALTSTACK.
For single thread or Linux clone() based threaded program, there is no
semantic changed, because those programs only have one kernel thread
in every process.
Reviewed by: deischen, dfr
if we do acquire an advisory lock, great! We'll release it later.
However, if we fail to acquire a lock, we perform the coredump
anyway. This problem became particularly visible with NFS after
the introduction of rpc.lockd: if the lock manager isn't running,
then locking calls will fail, aborting the core dump (resulting in
a zero-byte dump file).
Reported by: Yogeshwar Shenoy <ynshenoy@alumni.cs.ucsb.edu>
psignal()/tdsignal(). The test was historically in psignal(). It was
changed into a KASSERT, and then later moved to tdsignal() when the
latter was introduced.
Reviewed by: iedowse, jhb