critical_enter() and critical_exit() are now solely a mechanism for
deferring kernel preemptions. They no longer have any affect on
interrupts. This means that standalone critical sections are now very
cheap as they are simply unlocked integer increments and decrements for the
common case.
Spin mutexes now use a separate KPI implemented in MD code: spinlock_enter()
and spinlock_exit(). This KPI is responsible for providing whatever MD
guarantees are needed to ensure that a thread holding a spin lock won't
be preempted by any other code that will try to lock the same lock. For
now all archs continue to block interrupts in a "spinlock section" as they
did formerly in all critical sections. Note that I've also taken this
opportunity to push a few things into MD code rather than MI. For example,
critical_fork_exit() no longer exists. Instead, MD code ensures that new
threads have the correct state when they are created. Also, we no longer
try to fixup the idlethreads for APs in MI code. Instead, each arch sets
the initial curthread and adjusts the state of the idle thread it borrows
in order to perform the initial context switch.
This change is largely a big NOP, but the cleaner separation it provides
will allow for more efficient alternative locking schemes in other parts
of the kernel (bare critical sections rather than per-CPU spin mutexes
for per-CPU data for example).
Reviewed by: grehan, cognet, arch@, others
Tested on: i386, alpha, sparc64, powerpc, arm, possibly more
session in tprintf(). SESSRELE() needs to properly dispose of the
sessions mutex.
Add sessrele() which does the proper cleanup and have SESSRELE() call it.
Use SESSRELE also in pgdelete().
Found by: Coverity (ID:526)
the raw values including for child process statistics and only compute the
system and user timevals on demand.
- Fix the various kern_wait() syscall wrappers to only pass in a rusage
pointer if they are going to use the result.
- Add a kern_getrusage() function for the ABI syscalls to use so that they
don't have to play stackgap games to call getrusage().
- Fix the svr4_sys_times() syscall to just call calcru() to calculate the
times it needs rather than calling getrusage() twice with associated
stackgap, etc.
- Add a new rusage_ext structure to store raw time stats such as tick counts
for user, system, and interrupt time as well as a bintime of the total
runtime. A new p_rux field in struct proc replaces the same inline fields
from struct proc (i.e. p_[isu]ticks, p_[isu]u, and p_runtime). A new p_crux
field in struct proc contains the "raw" child time usage statistics.
ruadd() has been changed to handle adding the associated rusage_ext
structures as well as the values in rusage. Effectively, the values in
rusage_ext replace the ru_utime and ru_stime values in struct rusage. These
two fields in struct rusage are no longer used in the kernel.
- calcru() has been split into a static worker function calcru1() that
calculates appropriate timevals for user and system time as well as updating
the rux_[isu]u fields of a passed in rusage_ext structure. calcru() uses a
copy of the process' p_rux structure to compute the timevals after updating
the runtime appropriately if any of the threads in that process are
currently executing. It also now only locks sched_lock internally while
doing the rux_runtime fixup. calcru() now only requires the caller to
hold the proc lock and calcru1() only requires the proc lock internally.
calcru() also no longer allows callers to ask for an interrupt timeval
since none of them actually did.
- calcru() now correctly handles threads executing on other CPUs.
- A new calccru() function computes the child system and user timevals by
calling calcru1() on p_crux. Note that this means that any code that wants
child times must now call this function rather than reading from p_cru
directly. This function also requires the proc lock.
- This finishes the locking for rusage and friends so some of the Giant locks
in exit1() and kern_wait() are now gone.
- The locking in ttyinfo() has been tweaked so that a shared lock of the
proctree lock is used to protect the process group rather than the process
group lock. By holding this lock until the end of the function we now
ensure that the process/thread that we pick to dump info about will no
longer vanish while we are trying to output its info to the console.
Submitted by: bde (mostly)
MFC after: 1 month
UMA_ZONE_NOFREE to guarantee type stability, so proc_fini() should
never be called. Move an assertion from proc_fini() to proc_dtor()
and garbage-collect the rest of the unreachable code. I have retained
vm_proc_dispose(), since I consider its disuse a bug.
but with slightly cleaned up interfaces.
The KSE structure has become the same as the "per thread scheduler
private data" structure. In order to not make the diffs too great
one is #defined as the other at this time.
The KSE (or td_sched) structure is now allocated per thread and has no
allocation code of its own.
Concurrency for a KSEGRP is now kept track of via a simple pair of counters
rather than using KSE structures as tokens.
Since the KSE structure is different in each scheduler, kern_switch.c
is now included at the end of each scheduler. Nothing outside the
scheduler knows the contents of the KSE (aka td_sched) structure.
The fields in the ksegrp structure that are to do with the scheduler's
queueing mechanisms are now moved to the kg_sched structure.
(per ksegrp scheduler private data structure). In other words how the
scheduler queues and keeps track of threads is no-one's business except
the scheduler's. This should allow people to write experimental
schedulers with completely different internal structuring.
A scheduler call sched_set_concurrency(kg, N) has been added that
notifies teh scheduler that no more than N threads from that ksegrp
should be allowed to be on concurrently scheduled. This is also
used to enforce 'fainess' at this time so that a ksegrp with
10000 threads can not swamp a the run queue and force out a process
with 1 thread, since the current code will not set the concurrency above
NCPU, and both schedulers will not allow more than that many
onto the system run queue at a time. Each scheduler should eventualy develop
their own methods to do this now that they are effectively separated.
Rejig libthr's kernel interface to follow the same code paths as
linkse for scope system threads. This has slightly hurt libthr's performance
but I will work to recover as much of it as I can.
Thread exit code has been cleaned up greatly.
exit and exec code now transitions a process back to
'standard non-threaded mode' before taking the next step.
Reviewed by: scottl, peter
MFC after: 1 week
threads consuming the result of pfind() will not need to check for a NULL
credential pointer or other signs of an incompletely created process.
However, this also means that pfind() cannot be used to test for the
existence or find such a process. Annotate pfind() to indicate that this
is the case. A review of curent consumers seems to indicate that this is
not a problem for any of them. This closes a number of race conditions
that could result in NULL pointer dereferences and related failure modes.
Other related races continue to exist, especially during iteration of the
allproc list without due caution.
Discussed with: tjr, green
so that they know whether the allocation is supposed to be able to sleep
or not.
* Allow uma_zone constructors and initialation functions to return either
success or error. Almost all of the ones in the tree currently return
success unconditionally, but mbuf is a notable exception: the packet
zone constructor wants to be able to fail if it cannot suballocate an
mbuf cluster, and the mbuf allocators want to be able to fail in general
in a MAC kernel if the MAC mbuf initializer fails. This fixes the
panics people are seeing when they run out of memory for mbuf clusters.
* Allow debug.nosleepwithlocks on WITNESS to be disabled, without changing
the default.
Both bmilekic and jeff have reviewed the changes made to make failable
zone allocations work.
ki_childutime, and ki_emul. Also uses the timevaladd() routine to
correct the calculation of ki_childtime. That will correct the value
returned when ki_childtime.tv_usec > 1,000,000.
This also implements a new KERN_PROC_GID option for kvm_getprocs().
(there will be a similar update to lib/libkvm/kvm_proc.c)
Submitted by: Cyrille Lefevre
The big lines are:
NODEV -> NULL
NOUDEV -> NODEV
udev_t -> dev_t
udev2dev() -> findcdev()
Various minor adjustments including handling of userland access to kernel
space struct cdev etc.
Add two new functions: ttyref() and ttyrel(). ttymalloc() creates a struct
tty with a reference count of one. when ttyrel sees the count go to zero,
struct tty is freed.
Hold references for open ttys and for ttys which are controlling terminal
for sessions.
Until drivers start using ttyrel(), this commit will make no difference.
KERN_PROC_SESSION option which had been previously defined but
never implemented.
PR: bin/65803 (a very tiny piece of the PR)`
Submitted by: Cyrille Lefevre
not quite well by me - if kern.ps_argsopen was set to 0, users weren't
permitted to see arguments of even own processes.
But kern.ps_argsopen is going away, so just remove this check and leave
security checks for p_cansee() function.
Without this fix it is possible to cheat policies like:
- sysctl security.bsd.see_other_[gu]ids=0,
- mac_seeotheruids(4),
- jail(2)
and get full processes list with their arguments.
This problem exists from revision 1.62 of kern_proc.c when it was
introduced.
Reviewed by: nectar, rwatson.
the syscall arguments and does the suser() permission check, and
kern_mlock(), which does the resource limit checking and calls
vm_map_wire(). Split munlock() in a similar way.
Enable the RLIMIT_MEMLOCK checking code in kern_mlock().
Replace calls to vslock() and vsunlock() in the sysctl code with
calls to kern_mlock() and kern_munlock() so that the sysctl code
will obey the wired memory limits.
Nuke the vslock() and vsunlock() implementations, which are no
longer used.
Add a member to struct sysctl_req to track the amount of memory
that is wired to handle the request.
Modify sysctl_wire_old_buffer() to return an error if its call to
kern_mlock() fails. Only wire the minimum of the length specified
in the sysctl request and the length specified in its argument list.
It is recommended that sysctl handlers that use sysctl_wire_old_buffer()
should specify reasonable estimates for the amount of data they
want to return so that only the minimum amount of memory is wired
no matter what length has been specified by the request.
Modify the callers of sysctl_wire_old_buffer() to look for the
error return.
Modify sysctl_old_user to obey the wired buffer length and clean up
its implementation.
Reviewed by: bms
an appropriate error number after a failure condition.
In particular, three of the changed statements return ESRCH for a
failed pfind(), and in also three places a non-zero return
from p_cansee() will be passed back,
Also noticed by: rwatson
- Move struct sigacts out of the u-area and malloc() it using the
M_SUBPROC malloc bucket.
- Add a small sigacts_*() API for managing sigacts structures: sigacts_alloc(),
sigacts_free(), sigacts_copy(), sigacts_share(), and sigacts_shared().
- Remove the p_sigignore, p_sigacts, and p_sigcatch macros.
- Add a mutex to struct sigacts that protects all the members of the struct.
- Add sigacts locking.
- Remove Giant from nosys(), kill(), killpg(), and kern_sigaction() now
that sigacts is locked.
- Several in-kernel functions such as psignal(), tdsignal(), trapsignal(),
and thread_stopped() are now MP safe.
Reviewed by: arch@
Approved by: re (rwatson)
fini routines instead of in fork() and wait(). This has the nice side
benefit that the proc lock of any process on the allproc list is always
valid and sched_lock doesn't have to be used to test against PRS_NEW
anymore.
uptime. Where necessary, convert it back to Unix time by adding boottime
to it. This fixes a potential problem in the accounting code, which would
compute the elapsed time incorrectly if the Unix time was stepped during
the lifetime of the process.
Instead of applying the adjustment to processes with a start time of 1,
apply it to all processes with a start time of less than 3600.
None of this would be necessary if the start times were recorded in ticks
instead of seconds and microseconds.
don't include the kernel stacks of swapped-out threads in the page count,
but do include the alternate kernel stack. jhb provided some helpful
comments on this.
PR: 49102
whose p_stats->p_start has the magic value 1, replace it with boottime.
Some users were apparently confused by the fact that ps(1) reported a
start time in early 1970 for system processes.
a process group.
- Call pgadjustjobc() twice in fixjobc() to avoid code duplication and
improve readability.
- Use the proc lock to protect P_SHOULDSTOP() instead of sched_lock.
- Check to see if a process is PRS_NEW with sched_lock before trying to
lock its proc lock since the lock may not be constructed yet.
a follow on commit to kern_sig.c
- signotify() now operates on a thread since unmasked pending signals are
stored in the thread.
- PS_NEEDSIGCHK moves to TDF_NEEDSIGCHK.
struct proc as p_tracecred alongside the current cache of the vnode in
p_tracep. This credential is then used for all later ktrace operations on
this file rather than using the credential of the current thread at the
time of each ktrace event.
- Now that we have multiple ktrace-related items in struct proc that are
pointers, rename p_tracep to p_tracevp to make it less ambiguous.
Requested by: rwatson (1)
- If SYSCTL_OUT() fails in sysctl_kern_proc_args(), return the error
instead of ignoring it if we have new arguments for the process.
- If the new arguments for a process are too long, return ENOMEM instead of
returning success but not doing the actual copy.
Submitted by: bde
hold hold it across the check to avoid extra lock operations in the
common case.
- Copy in the new args to a temporary pargs structure before we drop the
reference to the old one. Thus, if the copyin() fails, the process
arguments are unchanged rather than being deleted. Also, p_args is no
longer NULL during the sysctl operation.
- Provide a routine in sched_4bsd to add this functionality.
- Use sched_pctcpu() in kern_proc, which is the one place outside of
sched_4bsd where the old pctcpu value was accessed directly.
Approved by: re
data in the scheduler independant structures (proc, ksegrp, kse, thread).
- Implement unused stubs for this mechanism in sched_4bsd.
Approved by: re
Reviewed by: luigi, trb
Tested on: x86, alpha
Add code to free KSEs and KSEGRPs on exit.
Sort KSE prototypes in proc.h.
Add the missing kse_exit() syscall.
ksetest now does not leak KSEs and KSEGRPS.
Submitted by: (parts) davidxu
processes forked with RFTHREAD.
- Use a goto to a label for common code when exiting from fork1() in case
of an error.
- Move the RFTHREAD linkage setup code later in fork since the ppeers_lock
cannot be locked while holding a proc lock. Handle the race of a task
leader exiting and killing its peers while a peer is forking a new child.
In that case, go ahead and let the peer process proceed normally as the
parent is about to kill it. However, the task leader may have already
gone to sleep to wait for the peers to die, so the new child process may
not receive a SIGKILL from the task leader. Rather than try to destruct
the new child process, just go ahead and send it a SIGKILL directly and
add it to the p_peers list. This ensures that the task leader will wait
until both the peer process doing the fork() and the new child process
have received their KILL signals and exited.
Discussed with: truckman (earlier versions)
in specific situations. The owner thread must be blocked, and the
borrower can not proceed back to user space with the borrowed KSE.
The borrower will return the KSE on the next context switch where
teh owner wants it back. This removes a lot of possible
race conditions and deadlocks. It is consceivable that the
borrower should inherit the priority of the owner too.
that's another discussion and would be simple to do.
Also, as part of this, the "preallocatd spare thread" is attached to the
thread doing a syscall rather than the KSE. This removes the need to lock
the scheduler when we want to access it, as it's now "at hand".
DDB now shows a lot mor info for threaded proceses though it may need
some optimisation to squeeze it all back into 80 chars again.
(possible JKH project)
Upcalls are now "bound" threads, but "KSE Lending" now means that
other completing syscalls can be completed using that KSE before the upcall
finally makes it back to the UTS. (getting threads OUT OF THE KERNEL is
one of the highest priorities in the KSE system.) The upcall when it happens
will present all the completed syscalls to the KSE for selection.
name instead. (e.g., SLOCK instead of SMTX, TD_ON_LOCK() instead of
TD_ON_MUTEX()) Eventually a turnstile abstraction will be added that
will be shared with mutexes and other types of locks. SLOCK/TDI_LOCK will
be used internally by the turnstile code and will not be specific to
mutexes. Making the change now ensures that turnstiles can be dropped
in at a later date without affecting the ABI of userland applications.
doesn't give them enough stack to do much before blowing away the pcb.
This adds MI and MD code to allow the allocation of an alternate kstack
who's size can be speficied when calling kthread_create. Passing the
value 0 prevents the alternate kstack from being created. Note that the
ia64 MD code is missing for now, and PowerPC was only partially written
due to the pmap.c being incomplete there.
Though this patch does not modify anything to make use of the alternate
kstack, acpi and usb are good candidates.
Reviewed by: jake, peter, jhb
gets signals operating based on a TailQ, and is good enough to run X11,
GNOME, and do job control. There are some intricate parts which could be
more refined to match the sigset_t versions, but those require further
evaluation of directions in which our signal system can expand and contract
to fit our needs.
After this has been in the tree for a while, I will make in kernel API
changes, most notably to trapsignal(9) and sendsig(9), to use ksiginfo
more robustly, such that we can actually pass information with our
(queued) signals to the userland. That will also result in using a
struct ksiginfo pointer, rather than a signal number, in a lot of
kern_sig.c, to refer to an individual pending signal queue member, but
right now there is no defined behaviour for such.
CODAFS is unfinished in this regard because the logic is unclear in
some places.
Sponsored by: New Gold Technology
Reviewed by: bde, tjr, jake [an older version, logic similar]
from stopping another thread from completing a syscall, and this allows it to
release its resources etc. Probably more related commits to follow (at least
one I know of)
Initial concept by: julian, dillon
Submitted by: davidxu
- Use ucontext_t's to store KSE thread state.
- Synthesize state for the UTS upon each upcall, rather than
saving and copying a trapframe.
- Deliver signals to KSE-aware processes via upcall.
- Rename kse mailbox structure fields to be more BSD-like.
- Store the UTS's stack in struct proc in a stack_t.
Reviewed by: bde, deischen, julian
Approved by: -arch
next step is to allow > 1 to be allocated per process. This would give
multi-processor threads. (when the rest of the infrastructure is
in place)
While doing this I noticed libkvm and sys/kern/kern_proc.c:fill_kinfo_proc
are diverging more than they should.. corrective action needed soon.
PS_STRINGS and USRSTACK is. This is necessary in order to decode a.out
core dumps. kern_proc.c was already referring to both of these values
but was missing the #include "opt_kstack_pages.h". Make the sysctl
variables visible so that certain kld modules can see how their parent
kernel was configured.
The process allocator now caches and hands out complete process structures
*including substructures* .
i.e. it get's the process structure with the first thread (and soon KSE)
already allocated and attached, all in one hit.
For the average non threaded program (non KSE that is) the allocated thread and its stack remain attached to the process, even when the process is
unused and in the process cache. This saves having to allocate and attach it
later, effectively bringing us (hopefully) close to the efficiency
of pre-KSE systems where these were a single structure.
Reviewed by: davidxu@freebsd.org, peter@freebsd.org
(I skipped those in contrib/, gnu/ and crypto/)
While I was at it, fixed a lot more found by ispell that I
could identify with certainty to be errors. All of these
were in comments or text, not in actual code.
Suggested by: bde
MFC after: 3 days
SYSCTL_OUT() from blocking while locks are held. This should
only be done when it would be inconvenient to make a temporary copy of
the data and defer calling SYSCTL_OUT() until after the locks are
released.
formulated. The correct states should be:
IDLE: On the idle KSE list for that KSEG
RUNQ: Linked onto the system run queue.
THREAD: Attached to a thread and slaved to whatever state the thread is in.
This means that most places where we were adjusting kse state can go away
as it is just moving around because the thread is..
The only places we need to adjust the KSE state is in transition to and from
the idle and run queues.
Reviewed by: jhb@freebsd.org
pmap_swapin_proc/pmap_swapout_proc functions from the MD pmap code
and use a single equivalent MI version. There are other cleanups
needed still.
While here, use the UMA zone hooks to keep a cache of preinitialized
proc structures handy, just like the thread system does. This eliminates
one dependency on 'struct proc' being persistent even after being freed.
There are some comments about things that can be factored out into
ctor/dtor functions if it is worth it. For now they are mostly just
doing statistics to get a feel of how it is working.
The ability to schedule multiple threads per process
(one one cpu) by making ALL system calls optionally asynchronous.
to come: ia64 and power-pc patches, patches for gdb, test program (in tools)
Reviewed by: Almost everyone who counts
(at various times, peter, jhb, matt, alfred, mini, bernd,
and a cast of thousands)
NOTE: this is still Beta code, and contains lots of debugging stuff.
expect slight instability in signals..
pointer instead of a proc pointer and require the process pointed to
by the second argument to be locked. We now use the thread ucred reference
for the credential checks in p_can*() as a result. p_canfoo() should now
no longer need Giant.
be done internally.
Ensure that no one can fsetown() to a dying process/pgrp. We need
to check the process for P_WEXIT to see if it's exiting. Process
groups are already safe because there is no such thing as a pgrp
zombie, therefore the proctree lock completely protects the pgrp
from having sigio structures associated with it after it runs
funsetownlst.
Add sigio lock to witness list under proctree and allproc, but over
proc and pgrp.
Seigo Tanimura helped with this.
sx lock. Trying to get the lock order between these locks was getting
too complicated as the locking in wait1() was being fixed.
- leavepgrp() now requires an exclusive lock of proctree_lock to be held
when it is called.
- fixjobc() no longer gets a shared lock of proctree_lock now that it
requires an xlock be held by the caller.
- Locking notes in sys/proc.h are adjusted to note that everything that
used to be protected by the pgrpsess_lock is now protected by the
proctree_lock.
is called.
- Change sysctl_out_proc() to require that the process is locked when it
is called and to drop the lock before it returns. If this proves too
complex we can change sysctl_out_proc() to simply acquire the lock at
the very end and have the calling code drop the lock right after it
returns.
- Lock the process we are going to export before the p_cansee() in the
loop in sysctl_kern_proc() and hold the lock until we call
sysctl_out_proc().
- Don't call p_cansee() on the process about to be exported twice in
the aforementioned loop.