freebsd-skq

Author	SHA1	Message	Date
kib	16d32c998e	When a zombie gets reparented due to the parent exit, send SIGCHLD to the reaper. The traditional reaper init(8) is aware of zombies silently reparented to it after the parents exit, it loops around waitpid(2) to collect them. For other reapers, the silent reparenting is surprising and collecting zombies requires a thread blocking in waitpid(2) just for that purpose. It seems that sending second SIGCHLD is a better workaround than forcing all reapers to obey the setup. Reported by: Michael Zuo <muh.muhten@gmail.com>, jilles PR: 213928 Reviewed by: jilles (previous version) Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-12-12 11:11:50 +00:00
kib	aefd35d8c2	Restructure the code to handle reporting of non-exited processes from wait(2). - Do not acquire the process spinlock if neither WTRAPPED nor WUNTRACED options were passed [1]. - Extract the code to report alive process into a new helper report_alive_proc() and use it for trapped, stopped and continued childrens. Note that the process spinlock is required around the WTRAPPED and WUNTRACED tests, because P_STOPPED_TRACE and P_STOPPED_SIG flags are set before other threads are stopped at the suspension point, and that threads increment p_suspcount while owning only the process spinlock, the process lock is dropped by them. If the spinlock is not taken for tests, the syscall thread might miss both p_suspcount increment and wakeup in wakeup in thread_suspend_switch(). Based on the submission by: mjg [1] Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-12-04 20:44:58 +00:00
mjg	cbc182c172	wait: avoid relocking the child if proc_to_reap returns 1 proc_to_reap would always unlock. However, if it returned 1, kern_wait6 would immediately lock it again. Save the dance. Reviewed by: kib	2016-11-24 18:21:48 +00:00
emaste	00b67b15b9	Renumber license clauses in sys/kern to avoid skipping #3	2016-09-15 13:16:20 +00:00
jilles	b8f715c74f	wait: Do not copyout uninitialized status/rusage/wrusage. If wait4() or wait6() return 0 because of WNOHANG, the status, rusage and wrusage information should not be returned. PR: 212048 Reported by: Casey Lucas MFC after: 2 weeks	2016-09-09 21:58:48 +00:00
kib	6ebb9a02fc	When a debugger attaches to the process, SIGSTOP is sent to the target. Due to a way issignal() selects the next signal to deliver and report, if the simultaneous or already pending another signal exists, that signal might be reported by the next waitpid(2) call. This causes minor annoyance for debuggers, which must be prepared to take any signal as the first event, then filter SIGSTOP later. More importantly, for tools like gcore(1), which attach and then detach without processing events, SIGSTOP might leak to be delivered after PT_DETACH. This results in the process being unintentionally stopped after detach, which is fatal for automatic tools. The solution is to force SIGSTOP to be the first signal reported after the attach. Attach code is modified to set P2_PTRACE_FSTP to indicate that the attaching ritual was not yet finished, and issignal() prefers SIGSTOP in that condition. Also, the thread which handles P2_PTRACE_FSTP is made to guarantee to own p_xthread during the first waitpid(2). All that ensures that SIGSTOP is consumed first. Additionally, if P2_PTRACE_FSTP is still set on detach, which means that waitpid(2) was not called at all, SIGSTOP is removed from the queue, ensuring that the process is resumed on detach. In issignal(), when acting on STOPing signals, remove the signal from queue before suspending. Otherwise parallel attach could result in ptracestop() acting on that STOP as if it was the STOP signal from the attach. Then SIGSTOP from attach leaks again. As a minor refactoring, some bits of the common attach code is moved to new helper proc_set_traced(). Reported by: markj Reviewed by: jhb, markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D7256	2016-07-28 08:41:13 +00:00
kib	368a94767c	Fix another bug after r302350. Reported and tested by: pho PR: 210884 Sponsored by: The FreeBSD Foundation MFC after: 3 days	2016-07-18 04:30:34 +00:00
jhb	91d07047c4	Add a mask of optional ptrace() events. ptrace() now stores a mask of optional events in p_ptevents. Currently this mask is a single integer, but it can be expanded into an array of integers in the future. Two new ptrace requests can be used to manipulate the event mask: PT_GET_EVENT_MASK fetches the current event mask and PT_SET_EVENT_MASK sets the current event mask. The current set of events include: - PTRACE_EXEC: trace calls to execve(). - PTRACE_SCE: trace system call entries. - PTRACE_SCX: trace syscam call exits. - PTRACE_FORK: trace forks and auto-attach to new child processes. - PTRACE_LWP: trace LWP events. The S_PT_SCX and S_PT_SCE events in the procfs p_stops flags have been replaced by PTRACE_SCE and PTRACE_SCX. PTRACE_FORK replaces P_FOLLOW_FORK and PTRACE_LWP replaces P2_LWP_EVENTS. The PT_FOLLOW_FORK and PT_LWP_EVENTS ptrace requests remain for compatibility but now simply toggle corresponding flags in the event mask. While here, document that PT_SYSCALL, PT_TO_SCE, and PT_TO_SCX both modify the event mask and continue the traced process. Reviewed by: kib MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D7044	2016-07-15 15:32:09 +00:00
kib	15841a7afe	When filt_proc() removes event from the knlist due to the process exiting (NOTE_EXIT->knlist_remove_inevent()), two things happen: - knote kn_knlist pointer is reset - INFLUX knote is removed from the process knlist. And, there are two consequences: - KN_LIST_UNLOCK() on such knote is nop - there is nothing which would block exit1() from processing past the knlist_destroy() (and knlist_destroy() resets knlist lock pointers). Both consequences result either in leaked process lock, or dereferencing NULL function pointers for locking. Handle this by stopping embedding the process knlist into struct proc. Instead, the knlist is allocated together with struct proc, but marked as autodestroy on the zombie reap, by knlist_detach() function. The knlist is freed when last kevent is removed from the list, in particular, at the zombie reap time if the list is empty. As result, the knlist_remove_inevent() is no longer needed and removed. Other changes: In filt_procattach(), clear NOTE_EXEC and NOTE_FORK desired events from kn_sfflags for knote registered by kernel to only get NOTE_CHILD notifications. The flags leak resulted in excessive NOTE_EXEC/NOTE_FORK reports. Fix immediate note activation in filt_procattach(). Condition should be either the immediate CHILD_NOTE activation, or immediate NOTE_EXIT report for the exiting process. In knote_fork(), do not perform racy check for KN_INFLUX before kq lock is taken. Besides being racy, it did not accounted for notes just added by scan (KN_SCAN). Some minor and incomplete style fixes. Analyzed and tested by: Eric Badger <eric@badgerio.us> Reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Approved by: re (gjb) Differential revision: https://reviews.freebsd.org/D6859	2016-06-27 21:52:17 +00:00
kib	8da898f26c	Add implementation of robust mutexes, hopefully close enough to the intention of the POSIX IEEE Std 1003.1TM-2008/Cor 1-2013. A robust mutex is guaranteed to be cleared by the system upon either thread or process owner termination while the mutex is held. The next mutex locker is then notified about inconsistent mutex state and can execute (or abandon) corrective actions. The patch mostly consists of small changes here and there, adding neccessary checks for the inconsistent and abandoned conditions into existing paths. Additionally, the thread exit handler was extended to iterate over the userspace-maintained list of owned robust mutexes, unlocking and marking as terminated each of them. The list of owned robust mutexes cannot be maintained atomically synchronous with the mutex lock state (it is possible in kernel, but is too expensive). Instead, for the duration of lock or unlock operation, the current mutex is remembered in a special slot that is also checked by the kernel at thread termination. Kernel must be aware about the per-thread location of the heads of robust mutex lists and the current active mutex slot. When a thread touches a robust mutex for the first time, a new umtx op syscall is issued which informs about location of lists heads. The umtx sleep queues for PP and PI mutexes are split between non-robust and robust. Somewhat unrelated changes in the patch: 1. Style. 2. The fix for proper tdfind() call use in umtxq_sleep_pi() for shared pi mutexes. 3. Removal of the userspace struct pthread_mutex m_owner field. 4. The sysctl kern.ipc.umtx_vnode_persistent is added, which controls the lifetime of the shared mutex associated with a vnode' page. Reviewed by: jilles (previous version, supposedly the objection was fixed) Discussed with: brooks, Martin Simmons <martin@lispworks.com> (some aspects) Tested by: pho Sponsored by: The FreeBSD Foundation	2016-05-17 09:56:22 +00:00
mjg	eabf748a18	session: avoid proctree lock on proc exit when possible We can get away with the common case with only proc lock held. Reviewed by: kib	2016-01-20 23:33:58 +00:00
markj	fa1b8e9a4f	Fix style issues around existing SDT probes. - Use SDT_PROBE<N>() instead of SDT_PROBE(). This has no functional effect at the moment, but will be needed for some future changes. - Don't hardcode the module component of the probe identifier. This is set automatically by the SDT framework. MFC after: 1 week	2015-12-16 23:39:27 +00:00
kib	753bf3d282	Enforce the maxproc limitation before allocating struct proc, initial struct thread and kernel stack for the thread. Otherwise, a load similar to a fork bomb would exhaust KVA and possibly kmem, mostly due to the struct proc being type-stable. The nprocs counter is changed from being protected by allproc_lock sx to be an atomic variable. Note that ddb/db_ps.c:db_ps() use of nprocs was unsafe before, and is still unsafe, but it seems that the only possible undesired consequence is the harmless warning printed when allproc linked list length does not match nprocs. Diagnosed by: Svatopluk Kraus <onwahe@gmail.com> Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-10-08 11:07:09 +00:00
avg	425c0bb088	save some bytes by using more concise SDT_PROBE<n> instead of SDT_PROBE SDT_PROBE requires 5 parameters whereas SDT_PROBE<n> requires n parameters where n is typically smaller than 5. Perhaps SDT_PROBE should be made a private implementation detail. MFC after: 20 days	2015-09-28 12:14:16 +00:00
oshogbo	7dae5ce371	When the wait*(2) syscalls wait for any process (P_ALL), they should ignore processes created with the pdfork(2) syscall. PR: 201054 Approved by: pjd (mentor) Discussed with: emaste, rwatson	2015-08-12 20:08:54 +00:00
kib	48ccbdea81	The si_status field of the siginfo_t, provided by the waitid(2) and SIGCHLD signal, should keep full 32 bits of the status passed to the _exit(2). Split the combined p_xstat of the struct proc into the separate exit status p_xexit for normal process exit, and signalled termination information p_xsig. Kernel-visible macro KW_EXITCODE() reconstructs old p_xstat from p_xexit and p_xsig. p_xexit contains complete status and copied out into si_status. Requested by: Joerg Schilling Reviewed by: jilles (previous version), pho Tested by: pho Sponsored by: The FreeBSD Foundation	2015-07-18 09:02:50 +00:00
adrian	41db4b88e0	Add an initial NUMA affinity/policy configuration for threads and processes. This is based on work done by jeff@ and jhb@, as well as the numa.diff patch that has been circulating when someone asks for first-touch NUMA on -10 or -11. * Introduce a simple set of VM policy and iterator types. * tie the policy types into the vm_phys path for now, mirroring how the initial first-touch allocation work was enabled. * add syscalls to control changing thread and process defaults. * add a global NUMA VM domain policy. * implement a simple cascade policy order - if a thread policy exists, use it; if a process policy exists, use it; use the default policy. * processes inherit policies from their parent processes, threads inherit policies from their parent threads. * add a simple tool (numactl) to query and modify default thread/process policities. * add documentation for the new syscalls, for numa and for numactl. * re-enable first touch NUMA again by default, as now policies can be set in a variety of methods. This is only relevant for very specific workloads. This doesn't pretend to be a final NUMA solution. The previous defaults in -HEAD (with MAXMEMDOM set) can be achieved by 'sysctl vm.default_policy=rr'. This is only relevant if MAXMEMDOM is set to something other than 1. Ie, if you're using GENERIC or a modified kernel with non-NUMA, then this is a glorified no-op for you. Thank you to Norse Corp for giving me access to rather large (for FreeBSD!) NUMA machines in order to develop and verify this. Thank you to Dell for providing me with dual socket sandybridge and westmere v3 hardware to do NUMA development with. Thank you to Scott Long at Netflix for providing me with access to the two-socket, four-domain haswell v3 hardware. Thank you to Peter Holm for running the stress testing suite against the NUMA branch during various stages of development! Tested: * MIPS (regression testing; non-NUMA) * i386 (regression testing; non-NUMA GENERIC) * amd64 (regression testing; non-NUMA GENERIC) * westmere, 2 socket (thankyou norse!) * sandy bridge, 2 socket (thankyou dell!) * ivy bridge, 2 socket (thankyou norse!) * westmere-EX, 4 socket / 1TB RAM (thankyou norse!) * haswell, 2 socket (thankyou norse!) * haswell v3, 2 socket (thankyou dell) * haswell v3, 2x18 core (thankyou scott long / netflix!) * Peter Holm ran a stress test suite on this work and found one issue, but has not been able to verify it (it doesn't look NUMA related, and he only saw it once over many testing runs.) * I've tested bhyve instances running in fixed NUMA domains and cpusets; all seems to work correctly. Verified: * intel-pcm - pcm-numa.x and pcm-memory.x, whilst selecting different NUMA policies for processes under test. Review: This was reviewed through phabricator (https://reviews.freebsd.org/D2559) as well as privately and via emails to freebsd-arch@. The git history with specific attributes is available at https://github.com/erikarn/freebsd/ in the NUMA branch (https://github.com/erikarn/freebsd/compare/local/adrian_numa_policy). This has been reviewed by a number of people (stas, rpaulo, kib, ngie, wblock) but not achieved a clear consensus. My hope is that with further exposure and testing more functionality can be implemented and evaluated. Notes: * The VM doesn't handle unbalanced domains very well, and if you have an overly unbalanced memory setup whilst under high memory pressure, VM page allocation may fail leading to a kernel panic. This was a problem in the past, but it's much more easily triggered now with these tools. * This work only controls the path through vm_phys; it doesn't yet strongly/predictably affect contigmalloc, KVA placement, UMA, etc. So, driver placement of memory isn't really guaranteed in any way. That's next on my plate. Sponsored by: Norse Corp, Inc.; Dell	2015-07-11 15:21:37 +00:00
ed	ed913b3a6c	Don't clobber td->td_retval[0] in proc_reap(). While writing tests for CloudABI, I noticed that close() on process descriptors returns the process ID of the child process. This is interesting, as close() is only allowed to return 0 or -1. It turns out that we clobber td->td_retval[0] in proc_reap(), so that wait*() properly returns the process ID. Change proc_reap() to leave td->td_retval[0] alone. Set the return value in kern_wait6() instead, by keeping track of the PID before we (potentially) reap the process. Differential Revision: https://reviews.freebsd.org/D3032 Reviewed by: kib	2015-07-09 12:04:45 +00:00
kib	e2f56205b5	Remove several write-only variables, all reported by the gcc 4.9 buildkernel run. Some of them were write-only under some kernel options, e.g. variables keeping values only used by CTR() macros. It costs nothing to the code readability and correctness to eliminate the warnings in those cases too by removing the local cached values used only for single-access. Review: https://reviews.freebsd.org/D2665 Reviewed by: rodrigc Looked at by: bjk Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-05-29 13:24:17 +00:00
kib	ff588ae9b0	Currently, softupdate code detects overstepping on the workitems limits in the code which is deep in the call stack, and owns several critical system resources, like vnode locks. Attempt to wait while the per-mount softupdate thread cleans up the backlog may deadlock, because the thread might need to lock the same vnode which is owned by the waiting thread. Instead of synchronously waiting for the worker, perform the worker' tickle and pause until the backlog is cleaned, at the safe point during return from kernel to usermode. A new ast request to call softdep_ast_cleanup() is created, the SU code now only checks the size of queue and schedules ast. There is no ast delivery for the kernel threads, so they are exempted from the mechanism, except NFS daemon threads. NFS server loop explicitely checks for the request, and informs the schedule_cleanup() that it is capable of handling the requests by the process P2_AST_SU flag. This is needed because nfsd may be the sole cause of the SU workqueue overflow. But, to not cause nsfd to spawn additional threads just because we slow down existing workers, only tickle su threads, without waiting for the backlog cleanup. Reviewed by: jhb, mckusick Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-05-27 09:20:42 +00:00
jhb	588ce3236a	Do not allow a process to reap an orphan (a child currently being traced by another process such as a debugger). The parent process does need to check for matching orphan pids to avoid returning ECHILD if an orphan has exited, but it should not return the exited status for the child until after the debugger has detached from the orphan process either explicitly or implicitly via wait(). Add two tests for for this case: one where the debugger is the direct child (thus the parent has a non-empty children list) and one where the debugger is not a direct child (so the only "child" of the parent is the orphan). Differential Revision: https://reviews.freebsd.org/D2644 Reviewed by: kib MFC after: 2 weeks	2015-05-26 10:29:37 +00:00
jhb	01eefee0f3	Add KTR tracing for some MI ptrace events. Differential Revision: https://reviews.freebsd.org/D2643 Reviewed by: kib	2015-05-25 22:13:22 +00:00
jhb	5ad66fe3b2	Only reparent a traced process to its old parent if the tracing process is not the old parent. Otherwise, proc_reap() will leave the zombie in place resulting in the process' status being returned twice to its parent. Add test cases for PT_TRACE_ME and PT_ATTACH which are fixed by this change. Differential Revision: https://reviews.freebsd.org/D2594 Reviewed by: kib MFC after: 2 weeks	2015-05-22 11:04:54 +00:00
trasz	802017a04b	Add kern.racct.enable tunable and RACCT_DISABLED config option. The point of this is to be able to add RACCT (with RACCT_DISABLED) to GENERIC, to avoid having to rebuild the kernel to use rctl(8). Differential Revision: https://reviews.freebsd.org/D2369 Reviewed by: kib@ MFC after: 1 month Relnotes: yes Sponsored by: The FreeBSD Foundation	2015-04-29 10:23:02 +00:00
mjg	8c033d13b5	proc: get rid of proc lock + unlock pair in proc_reap A comment in the code stated we PROC_LOCK and as a side effect guarantee all writers released process lock. But at that point such lock was already taken while we were removing the process from all lists, so it should be already unreachable.	2015-03-16 01:09:49 +00:00
mjg	054f9cab59	cred: add proc_set_cred helper The goal here is to provide one place altering process credentials. This eases debugging and opens up posibilities to do additional work when such an action is performed.	2015-03-16 00:10:03 +00:00
kib	8e556fe98a	The umtx_lock mutex is used by top-half of the kernel, but is currently a spin lock. Apparently, the only reason for this is that umtx_thread_exit() is called under the process spinlock, which put the requirement on the umtx_lock. Note that the witness static order list is wrong for the umtx_lock, umtx_lock is explicitely before any thread lock, so it is also before sleepq locks. Change umtx_lock to be the sleepable mutex. For the reason above, the calls to umtx_thread_exit() are moved from thread_exit() earlier in each caller, when the process spin lock is not yet taken. Discussed with: jhb Tested by: pho (previous version) Sponsored by: The FreeBSD Foundation MFC after: 3 weeks	2015-02-28 04:19:02 +00:00
kib	c014fd46ec	Add a facility for non-init process to declare itself the reaper of the orphaned descendants. Base of the API is modelled after the same feature from the DragonFlyBSD. Requested by: bapt Reviewed by: jilles (previous version) Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 3 weeks	2014-12-15 12:01:42 +00:00
kib	d41bf48327	Add facility to stop all userspace processes. The supposed use of the feature is to quisce the system before suspend. Stop is implemented by reusing the thread_single(9) with the special mode SINGLE_ALLPROC. SINGLE_ALLPROC differs from the existing single-threading modes by allowing (requiring) caller to operate on other process. Interruptible sleeps for !TDF_SBDRY threads are suspended like SIGSTOP does it, instead of aborting the sleep, like SINGLE_NO_EXIT, to avoid spurious EINTRs on resume. Provide debugging sysctl debug.stop_all_proc, which causes total stop and suspends syncer, while waiting for variable reset for resume. It is used for debugging; should be removed after the real use of the interface is added. In collaboration with: pho Discussed with: avg Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2014-12-13 16:18:29 +00:00
kib	efcd39f691	When process is exiting, check for suspension regardless of multithreaded status of the process. The stopped state must be cleared before P_WEXIT is set. A stop signal delivered just before first PROC_LOCK() block in exit1(9) would put the process into pending stop with P_WEXIT set or assertion triggered. Also recheck for the suspension after failed thread_single(9) call, since process lock could be dropped. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-12-08 16:02:02 +00:00
kib	11cee2ecf7	The process spin lock currently has the following distinct uses: - Threads lifetime cycle, in particular, counting of the threads in the process, and interlocking with process mutex and thread lock. The main reason of this is that turnstile locks are after thread locks, so you e.g. cannot unlock blockable mutex (think process mutex) while owning thread lock. - Virtual and profiling itimers, since the timers activation is done from the clock interrupt context. Replace the p_slock by p_itimmtx and PROC_ITIMLOCK(). - Profiling code (profil(2)), for similar reason. Replace the p_slock by p_profmtx and PROC_PROFLOCK(). - Resource usage accounting. Need for the spinlock there is subtle, my understanding is that spinlock blocks context switching for the current thread, which prevents td_runtime and similar fields from changing (updates are done at the mi_switch()). Replace the p_slock by p_statmtx and PROC_STATLOCK(). The split is done mostly for code clarity, and should not affect scalability. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-11-26 14:10:00 +00:00
mjg	e42745632d	Avoid unnecessary ppeers_lock acquisition in exit1. MFC after: 1 week	2014-10-05 07:21:41 +00:00
mjg	25686bf39b	Fix up proc_realparent to always return correct process. Prior to the change it would always return initproc for non-traced processes. This fixes ps apparently always returning 1 as ppid. Pointy hat: mjg Reported by: many MFC after: 1 week	2014-09-03 06:25:34 +00:00
mjg	eda187d10a	Properly reparent traced processes when the tracer dies. Previously they were uncoditionally reparented to init. In effect it was possible that tracee was never returned to original parent. Reviewed by: kib MFC after: 1 week	2014-08-24 09:02:16 +00:00
markj	1bd3bd5a7e	Correct the order of arguments passed to LIST_INSERT_AFTER(). Reviewed by: kib X-MFC-With: r269656	2014-08-15 15:42:58 +00:00
kib	0b059d23c7	Correct the problems with the ptrace(2) making the debuggee an orphan. One problem is inferior(9) looping due to the process tree becoming a graph instead of tree if the parent is traced by child. Another issue is due to the use of p_oppid to restore the original parent/child relationship, because real parent could already exited and its pid reused (noted by mjg). Add the function proc_realparent(9), which calculates the parent for given process. It uses the flag P_TREE_FIRST_ORPHAN to detect the head element of the p_orphan list and than stepping back to its container to find the parent process. If the parent has already exited, the init(8) is returned. Move the P_ORPHAN and the new helper flag from the p_flag* to new p_treeflag field of struct proc, which is protected by proctree lock instead of proc lock, since the orphans relationship is managed under the proctree_lock already. The remaining uses of p_oppid in ptrace(PT_DETACH) and process reapping are replaced by proc_realparent(9). Phabric: D417 Reviewed by: jhb Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2014-08-07 05:47:53 +00:00
mjg	fec26f1bbf	Eliminate plim and vtmp local vars in exit1. No functional changes. MFC after: 1 week	2014-07-10 22:54:38 +00:00
rwatson	33fdc14c0c	Update kernel inclusions of capability.h to use capsicum.h instead; some further refinement is required as some device drivers intended to be portable over FreeBSD versions rely on __FreeBSD_version to decide whether to include capability.h. MFC after: 3 weeks	2014-03-16 10:55:57 +00:00
mjg	6bec98d395	proc exit: don't take PROC_LOCK while freeing rlimits Code wishing to check rlimits of some process should check whether it is exiting first, which current consumers do. MFC after: 2 weeks	2013-12-15 04:11:43 +00:00
pjd	4ac2e7d8d9	Make process descriptors standard part of the kernel. rwhod(8) already requires process descriptors to work and having PROCDESC in GENERIC seems not enough, especially that we hope to have more and more consumers in the base. MFC after: 3 days	2013-11-30 15:08:35 +00:00
avg	71889a5eff	dtrace sdt: remove the ugly sname parameter of SDT_PROBE_DEFINE In its stead use the Solaris / illumos approach of emulating '-' (dash) in probe names with '__' (two consecutive underscores). Reviewed by: markj MFC after: 3 weeks	2013-11-26 08:46:27 +00:00
attilio	7ee4e910ce	- For kernel compiled only with KDTRACE_HOOKS and not any lock debugging option, unbreak the lock tracing release semantic by embedding calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined version of the releasing functions for mutex, rwlock and sxlock. Failing to do so skips the lockstat_probe_func invokation for unlocking. - As part of the LOCKSTAT support is inlined in mutex operation, for kernel compiled without lock debugging options, potentially every consumer must be compiled including opt_kdtrace.h. Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES is linked there and it is only used as a compile-time stub [0]. [0] immediately shows some new bug as DTRACE-derived support for debug in sfxge is broken and it was never really tested. As it was not including correctly opt_kdtrace.h before it was never enabled so it was kept broken for a while. Fix this by using a protection stub, leaving sfxge driver authors the responsibility for fixing it appropriately [1]. Sponsored by: EMC / Isilon storage division Discussed with: rstone [0] Reported by: rstone [1] Discussed with: philip	2013-11-25 07:38:45 +00:00
jilles	10685f105c	Fix siginfo_t.si_status for wait6/waitid/SIGCHLD. Per POSIX, si_status should contain the value passed to exit() for si_code==CLD_EXITED and the signal number for other si_code. This was incorrect for CLD_EXITED and CLD_DUMPED. This is still not fully POSIX-compliant (Austin group issue #594 says that the full value passed to exit() shall be returned via si_status, not just the low 8 bits) but is sufficient for a si_status-related test in libnih (upstart, Debian/kFreeBSD). PR: kern/184002 Reported by: Dmitrijs Ledkovs Tested by: Dmitrijs Ledkovs	2013-11-17 22:31:23 +00:00
markj	6c24c9fb32	Specify SDT probe argument types in the probe definition itself rather than using SDT_PROBE_ARGTYPE(). This will make it easy to extend the SDT(9) API to allow probes with dynamically-translated types. There is no functional change. MFC after: 2 weeks	2013-08-15 04:08:55 +00:00
mjg	d467dfaf08	Remove cr_prison NULL check from proc_to_reap. Userspace processes always have a prison. MFC after: 2 weeks	2013-07-22 02:07:15 +00:00
pjd	f07ebb8888	Merge Capsicum overhaul: - Capability is no longer separate descriptor type. Now every descriptor has set of its own capability rights. - The cap_new(2) system call is left, but it is no longer documented and should not be used in new code. - The new syscall cap_rights_limit(2) should be used instead of cap_new(2), which limits capability rights of the given descriptor without creating a new one. - The cap_getrights(2) syscall is renamed to cap_rights_get(2). - If CAP_IOCTL capability right is present we can further reduce allowed ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed ioctls can be retrived with cap_ioctls_get(2) syscall. - If CAP_FCNTL capability right is present we can further reduce fcntls that can be used with the new cap_fcntls_limit(2) syscall and retrive them with cap_fcntls_get(2). - To support ioctl and fcntl white-listing the filedesc structure was heavly modified. - The audit subsystem, kdump and procstat tools were updated to recognize new syscalls. - Capability rights were revised and eventhough I tried hard to provide backward API and ABI compatibility there are some incompatible changes that are described in detail below: CAP_CREATE old behaviour: - Allow for openat(2)+O_CREAT. - Allow for linkat(2). - Allow for symlinkat(2). CAP_CREATE new behaviour: - Allow for openat(2)+O_CREAT. Added CAP_LINKAT: - Allow for linkat(2). ABI: Reuses CAP_RMDIR bit. - Allow to be target for renameat(2). Added CAP_SYMLINKAT: - Allow for symlinkat(2). Removed CAP_DELETE. Old behaviour: - Allow for unlinkat(2) when removing non-directory object. - Allow to be source for renameat(2). Removed CAP_RMDIR. Old behaviour: - Allow for unlinkat(2) when removing directory. Added CAP_RENAMEAT: - Required for source directory for the renameat(2) syscall. Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR): - Allow for unlinkat(2) on any object. - Required if target of renameat(2) exists and will be removed by this call. Removed CAP_MAPEXEC. CAP_MMAP old behaviour: - Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and PROT_WRITE. CAP_MMAP new behaviour: - Allow for mmap(2)+PROT_NONE. Added CAP_MMAP_R: - Allow for mmap(PROT_READ). Added CAP_MMAP_W: - Allow for mmap(PROT_WRITE). Added CAP_MMAP_X: - Allow for mmap(PROT_EXEC). Added CAP_MMAP_RW: - Allow for mmap(PROT_READ \| PROT_WRITE). Added CAP_MMAP_RX: - Allow for mmap(PROT_READ \| PROT_EXEC). Added CAP_MMAP_WX: - Allow for mmap(PROT_WRITE \| PROT_EXEC). Added CAP_MMAP_RWX: - Allow for mmap(PROT_READ \| PROT_WRITE \| PROT_EXEC). Renamed CAP_MKDIR to CAP_MKDIRAT. Renamed CAP_MKFIFO to CAP_MKFIFOAT. Renamed CAP_MKNODE to CAP_MKNODEAT. CAP_READ old behaviour: - Allow pread(2). - Disallow read(2), readv(2) (if there is no CAP_SEEK). CAP_READ new behaviour: - Allow read(2), readv(2). - Disallow pread(2) (CAP_SEEK was also required). CAP_WRITE old behaviour: - Allow pwrite(2). - Disallow write(2), writev(2) (if there is no CAP_SEEK). CAP_WRITE new behaviour: - Allow write(2), writev(2). - Disallow pwrite(2) (CAP_SEEK was also required). Added convinient defines: #define CAP_PREAD (CAP_SEEK \| CAP_READ) #define CAP_PWRITE (CAP_SEEK \| CAP_WRITE) #define CAP_MMAP_R (CAP_MMAP \| CAP_SEEK \| CAP_READ) #define CAP_MMAP_W (CAP_MMAP \| CAP_SEEK \| CAP_WRITE) #define CAP_MMAP_X (CAP_MMAP \| CAP_SEEK \| 0x0000000000000008ULL) #define CAP_MMAP_RW (CAP_MMAP_R \| CAP_MMAP_W) #define CAP_MMAP_RX (CAP_MMAP_R \| CAP_MMAP_X) #define CAP_MMAP_WX (CAP_MMAP_W \| CAP_MMAP_X) #define CAP_MMAP_RWX (CAP_MMAP_R \| CAP_MMAP_W \| CAP_MMAP_X) #define CAP_RECV CAP_READ #define CAP_SEND CAP_WRITE #define CAP_SOCK_CLIENT \ (CAP_CONNECT \| CAP_GETPEERNAME \| CAP_GETSOCKNAME \| CAP_GETSOCKOPT \| \ CAP_PEELOFF \| CAP_RECV \| CAP_SEND \| CAP_SETSOCKOPT \| CAP_SHUTDOWN) #define CAP_SOCK_SERVER \ (CAP_ACCEPT \| CAP_BIND \| CAP_GETPEERNAME \| CAP_GETSOCKNAME \| \ CAP_GETSOCKOPT \| CAP_LISTEN \| CAP_PEELOFF \| CAP_RECV \| CAP_SEND \| \ CAP_SETSOCKOPT \| CAP_SHUTDOWN) Added defines for backward API compatibility: #define CAP_MAPEXEC CAP_MMAP_X #define CAP_DELETE CAP_UNLINKAT #define CAP_MKDIR CAP_MKDIRAT #define CAP_RMDIR CAP_UNLINKAT #define CAP_MKFIFO CAP_MKFIFOAT #define CAP_MKNOD CAP_MKNODAT #define CAP_SOCK_ALL (CAP_SOCK_CLIENT \| CAP_SOCK_SERVER) Sponsored by: The FreeBSD Foundation Reviewed by: Christoph Mallon <christoph.mallon@gmx.de> Many aspects discussed with: rwatson, benl, jonathan ABI compatibility discussed with: kib	2013-03-02 00:53:12 +00:00
kib	de5bb93ec1	When vforked child is traced, the debugging events are not generated until child performs exec(). The behaviour is reasonable when a debugger is the real parent, because the parent is stopped until exec(), and sending a debugging event to the debugger would deadlock both parent and child. On the other hand, when debugger is not the parent of the vforked child, not sending debugging signals makes it impossible to debug across vfork. Fix the issue by declining generating debug signals only when vfork() was done and child called ptrace(PT_TRACEME). Set a new process flag P_PPTRACE from the attach code for PT_TRACEME, if P_PPWAIT flag is set, which indicates that the process was created with vfork() and still did not execed. Check P_PPTRACE from issignal(), instead of refusing the trace outright for the P_PPWAIT case. The scope of P_PPTRACE is exactly contained in the scope of P_PPWAIT. Found and tested by: zont Reviewed by: pluknet MFC after: 2 weeks	2013-02-07 15:34:22 +00:00
kib	e81abcec51	The case of pid == WAIT_MYPGRP for the kern_wait() is already handled in kern_wait6(), which is called by kern_wait(). Remove the redundand check, introduced in r243136, and add a comment noting this, to make the code less confusing. The blank lines are added to properly delineate the scope of the preceeding comments. Noted by: "Jukka A. Ukkonen" <jau@iki.fi> MFC after: 1 week	2013-01-30 13:14:15 +00:00
kib	69c365794c	Protect the p->p_pgrp dereference with the process lock. MFC after: 3 days	2013-01-06 15:10:10 +00:00
kib	8123554495	Restore the proper handling of the pid 0 for waitpid(2). Fix the style around. Reported and reviewed by: bde (previous version) MFC after: 28 days	2012-11-16 06:32:38 +00:00

1 2 3 4 5 ...

405 Commits