critical_enter() and critical_exit() are now solely a mechanism for
deferring kernel preemptions. They no longer have any affect on
interrupts. This means that standalone critical sections are now very
cheap as they are simply unlocked integer increments and decrements for the
common case.
Spin mutexes now use a separate KPI implemented in MD code: spinlock_enter()
and spinlock_exit(). This KPI is responsible for providing whatever MD
guarantees are needed to ensure that a thread holding a spin lock won't
be preempted by any other code that will try to lock the same lock. For
now all archs continue to block interrupts in a "spinlock section" as they
did formerly in all critical sections. Note that I've also taken this
opportunity to push a few things into MD code rather than MI. For example,
critical_fork_exit() no longer exists. Instead, MD code ensures that new
threads have the correct state when they are created. Also, we no longer
try to fixup the idlethreads for APs in MI code. Instead, each arch sets
the initial curthread and adjusts the state of the idle thread it borrows
in order to perform the initial context switch.
This change is largely a big NOP, but the cleaner separation it provides
will allow for more efficient alternative locking schemes in other parts
of the kernel (bare critical sections rather than per-CPU spin mutexes
for per-CPU data for example).
Reviewed by: grehan, cognet, arch@, others
Tested on: i386, alpha, sparc64, powerpc, arm, possibly more
session in tprintf(). SESSRELE() needs to properly dispose of the
sessions mutex.
Add sessrele() which does the proper cleanup and have SESSRELE() call it.
Use SESSRELE also in pgdelete().
Found by: Coverity (ID:526)
the raw values including for child process statistics and only compute the
system and user timevals on demand.
- Fix the various kern_wait() syscall wrappers to only pass in a rusage
pointer if they are going to use the result.
- Add a kern_getrusage() function for the ABI syscalls to use so that they
don't have to play stackgap games to call getrusage().
- Fix the svr4_sys_times() syscall to just call calcru() to calculate the
times it needs rather than calling getrusage() twice with associated
stackgap, etc.
- Add a new rusage_ext structure to store raw time stats such as tick counts
for user, system, and interrupt time as well as a bintime of the total
runtime. A new p_rux field in struct proc replaces the same inline fields
from struct proc (i.e. p_[isu]ticks, p_[isu]u, and p_runtime). A new p_crux
field in struct proc contains the "raw" child time usage statistics.
ruadd() has been changed to handle adding the associated rusage_ext
structures as well as the values in rusage. Effectively, the values in
rusage_ext replace the ru_utime and ru_stime values in struct rusage. These
two fields in struct rusage are no longer used in the kernel.
- calcru() has been split into a static worker function calcru1() that
calculates appropriate timevals for user and system time as well as updating
the rux_[isu]u fields of a passed in rusage_ext structure. calcru() uses a
copy of the process' p_rux structure to compute the timevals after updating
the runtime appropriately if any of the threads in that process are
currently executing. It also now only locks sched_lock internally while
doing the rux_runtime fixup. calcru() now only requires the caller to
hold the proc lock and calcru1() only requires the proc lock internally.
calcru() also no longer allows callers to ask for an interrupt timeval
since none of them actually did.
- calcru() now correctly handles threads executing on other CPUs.
- A new calccru() function computes the child system and user timevals by
calling calcru1() on p_crux. Note that this means that any code that wants
child times must now call this function rather than reading from p_cru
directly. This function also requires the proc lock.
- This finishes the locking for rusage and friends so some of the Giant locks
in exit1() and kern_wait() are now gone.
- The locking in ttyinfo() has been tweaked so that a shared lock of the
proctree lock is used to protect the process group rather than the process
group lock. By holding this lock until the end of the function we now
ensure that the process/thread that we pick to dump info about will no
longer vanish while we are trying to output its info to the console.
Submitted by: bde (mostly)
MFC after: 1 month
UMA_ZONE_NOFREE to guarantee type stability, so proc_fini() should
never be called. Move an assertion from proc_fini() to proc_dtor()
and garbage-collect the rest of the unreachable code. I have retained
vm_proc_dispose(), since I consider its disuse a bug.
but with slightly cleaned up interfaces.
The KSE structure has become the same as the "per thread scheduler
private data" structure. In order to not make the diffs too great
one is #defined as the other at this time.
The KSE (or td_sched) structure is now allocated per thread and has no
allocation code of its own.
Concurrency for a KSEGRP is now kept track of via a simple pair of counters
rather than using KSE structures as tokens.
Since the KSE structure is different in each scheduler, kern_switch.c
is now included at the end of each scheduler. Nothing outside the
scheduler knows the contents of the KSE (aka td_sched) structure.
The fields in the ksegrp structure that are to do with the scheduler's
queueing mechanisms are now moved to the kg_sched structure.
(per ksegrp scheduler private data structure). In other words how the
scheduler queues and keeps track of threads is no-one's business except
the scheduler's. This should allow people to write experimental
schedulers with completely different internal structuring.
A scheduler call sched_set_concurrency(kg, N) has been added that
notifies teh scheduler that no more than N threads from that ksegrp
should be allowed to be on concurrently scheduled. This is also
used to enforce 'fainess' at this time so that a ksegrp with
10000 threads can not swamp a the run queue and force out a process
with 1 thread, since the current code will not set the concurrency above
NCPU, and both schedulers will not allow more than that many
onto the system run queue at a time. Each scheduler should eventualy develop
their own methods to do this now that they are effectively separated.
Rejig libthr's kernel interface to follow the same code paths as
linkse for scope system threads. This has slightly hurt libthr's performance
but I will work to recover as much of it as I can.
Thread exit code has been cleaned up greatly.
exit and exec code now transitions a process back to
'standard non-threaded mode' before taking the next step.
Reviewed by: scottl, peter
MFC after: 1 week
threads consuming the result of pfind() will not need to check for a NULL
credential pointer or other signs of an incompletely created process.
However, this also means that pfind() cannot be used to test for the
existence or find such a process. Annotate pfind() to indicate that this
is the case. A review of curent consumers seems to indicate that this is
not a problem for any of them. This closes a number of race conditions
that could result in NULL pointer dereferences and related failure modes.
Other related races continue to exist, especially during iteration of the
allproc list without due caution.
Discussed with: tjr, green
so that they know whether the allocation is supposed to be able to sleep
or not.
* Allow uma_zone constructors and initialation functions to return either
success or error. Almost all of the ones in the tree currently return
success unconditionally, but mbuf is a notable exception: the packet
zone constructor wants to be able to fail if it cannot suballocate an
mbuf cluster, and the mbuf allocators want to be able to fail in general
in a MAC kernel if the MAC mbuf initializer fails. This fixes the
panics people are seeing when they run out of memory for mbuf clusters.
* Allow debug.nosleepwithlocks on WITNESS to be disabled, without changing
the default.
Both bmilekic and jeff have reviewed the changes made to make failable
zone allocations work.
ki_childutime, and ki_emul. Also uses the timevaladd() routine to
correct the calculation of ki_childtime. That will correct the value
returned when ki_childtime.tv_usec > 1,000,000.
This also implements a new KERN_PROC_GID option for kvm_getprocs().
(there will be a similar update to lib/libkvm/kvm_proc.c)
Submitted by: Cyrille Lefevre
The big lines are:
NODEV -> NULL
NOUDEV -> NODEV
udev_t -> dev_t
udev2dev() -> findcdev()
Various minor adjustments including handling of userland access to kernel
space struct cdev etc.
Add two new functions: ttyref() and ttyrel(). ttymalloc() creates a struct
tty with a reference count of one. when ttyrel sees the count go to zero,
struct tty is freed.
Hold references for open ttys and for ttys which are controlling terminal
for sessions.
Until drivers start using ttyrel(), this commit will make no difference.
KERN_PROC_SESSION option which had been previously defined but
never implemented.
PR: bin/65803 (a very tiny piece of the PR)`
Submitted by: Cyrille Lefevre
not quite well by me - if kern.ps_argsopen was set to 0, users weren't
permitted to see arguments of even own processes.
But kern.ps_argsopen is going away, so just remove this check and leave
security checks for p_cansee() function.
Without this fix it is possible to cheat policies like:
- sysctl security.bsd.see_other_[gu]ids=0,
- mac_seeotheruids(4),
- jail(2)
and get full processes list with their arguments.
This problem exists from revision 1.62 of kern_proc.c when it was
introduced.
Reviewed by: nectar, rwatson.
the syscall arguments and does the suser() permission check, and
kern_mlock(), which does the resource limit checking and calls
vm_map_wire(). Split munlock() in a similar way.
Enable the RLIMIT_MEMLOCK checking code in kern_mlock().
Replace calls to vslock() and vsunlock() in the sysctl code with
calls to kern_mlock() and kern_munlock() so that the sysctl code
will obey the wired memory limits.
Nuke the vslock() and vsunlock() implementations, which are no
longer used.
Add a member to struct sysctl_req to track the amount of memory
that is wired to handle the request.
Modify sysctl_wire_old_buffer() to return an error if its call to
kern_mlock() fails. Only wire the minimum of the length specified
in the sysctl request and the length specified in its argument list.
It is recommended that sysctl handlers that use sysctl_wire_old_buffer()
should specify reasonable estimates for the amount of data they
want to return so that only the minimum amount of memory is wired
no matter what length has been specified by the request.
Modify the callers of sysctl_wire_old_buffer() to look for the
error return.
Modify sysctl_old_user to obey the wired buffer length and clean up
its implementation.
Reviewed by: bms
an appropriate error number after a failure condition.
In particular, three of the changed statements return ESRCH for a
failed pfind(), and in also three places a non-zero return
from p_cansee() will be passed back,
Also noticed by: rwatson
- Move struct sigacts out of the u-area and malloc() it using the
M_SUBPROC malloc bucket.
- Add a small sigacts_*() API for managing sigacts structures: sigacts_alloc(),
sigacts_free(), sigacts_copy(), sigacts_share(), and sigacts_shared().
- Remove the p_sigignore, p_sigacts, and p_sigcatch macros.
- Add a mutex to struct sigacts that protects all the members of the struct.
- Add sigacts locking.
- Remove Giant from nosys(), kill(), killpg(), and kern_sigaction() now
that sigacts is locked.
- Several in-kernel functions such as psignal(), tdsignal(), trapsignal(),
and thread_stopped() are now MP safe.
Reviewed by: arch@
Approved by: re (rwatson)
fini routines instead of in fork() and wait(). This has the nice side
benefit that the proc lock of any process on the allproc list is always
valid and sched_lock doesn't have to be used to test against PRS_NEW
anymore.
uptime. Where necessary, convert it back to Unix time by adding boottime
to it. This fixes a potential problem in the accounting code, which would
compute the elapsed time incorrectly if the Unix time was stepped during
the lifetime of the process.
Instead of applying the adjustment to processes with a start time of 1,
apply it to all processes with a start time of less than 3600.
None of this would be necessary if the start times were recorded in ticks
instead of seconds and microseconds.
don't include the kernel stacks of swapped-out threads in the page count,
but do include the alternate kernel stack. jhb provided some helpful
comments on this.
PR: 49102