Commit Graph

430 Commits

Author SHA1 Message Date
mjg
af8321b07f proc: handle sdt exit probe before taking the proc lock
Sponsored by:	The FreeBSD Foundation
2018-12-08 06:31:43 +00:00
mjg
73f834bfbb proc: when exiting move to zombproc before taking proctree
The kernel was already doing this prior to r329615. It was changed
to reduce contention on allproc. However, introduction of pidhash
locks and removal of proctree -> allproc ordering from fork thanks
to bitmaps fixed things enough to make this change pessimal.

waitpid takes proctree on each call and this change (now) causes
avoidable stalls if allproc is held.

Sponsored by:	The FreeBSD Foundation
2018-12-07 12:32:25 +00:00
mjg
6b8dd99954 Manage process-related IDs with bitmaps
Currently unique pid allocation on fork often requires a full walk of
process, group, session lists to make sure it is not used by anything.
This has a side effect of requiring proctree to be held along with allproc,
which adds more contention in poudriere -j 128.

The patch below implements trivial bitmaps which gets rid of the problem.
Dedicated lock is introduced to manage IDs.

While here a bug was discovered: all processes would inherit reap id from
the first process spawned by init. This had a side effect of keeping the
ID used and when allocation rolls over to the beginning it keeps being
skipped.

The patch is loosely based on initial work by mjoras@.

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
2018-12-07 12:22:32 +00:00
mjg
db4060a9d0 proc: create a dedicated lock for zombproc to ligthen the load on allproc_lock
waitpid always takes proctree to evaluate the list, but only takes allproc
if it can reap. With this patch allproc is no longer taken, which helps during
poudriere -j 128.

Discussed with: kib
Sponsored by:	The FreeBSD Foundation
2018-11-29 02:52:08 +00:00
mjg
2dadc8e0dd Revert "fork: fix use-after-free with vfork"
This unreliably breaks libc handling of vfork where forking succeded,
but execve did not.

vfork code in libc performs waitpid with WNOHANG in case of failed exec.
With the fix exit codepath was waking up the parent before the child
fully transitioned to a zombie. Woken up parent would waitpid, which
could find a not-yet-zombie child and fail to reap it due to the WNOHANG
flag.

While removing the flag fixes the problem, it is not an option due to older
releases which would still suffer from the kernel change.

Revert the fix until a solution can be worked out.

Note that while use-after-free which gets back due to the revert is a real
bug, it's side-effects are limited due to the fact that struct proc memory
is never released by UMA.
2018-11-23 04:38:50 +00:00
mjg
75deef51a7 fork: fix use-after-free with vfork
The pointer to the child is stored without any reference held. Then it is
blindly used to wait until P_PPWAIT is cleared. However, if the child is
autoreaped it could have exited and get freed before the parent started
waiting.

Use the existing hold mechanism to mitigate the problem. Most common case
of doing exec remains unchanged. The corner case of doing exit performs
wake up before waiting for holds to clear.

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18295
2018-11-22 21:08:37 +00:00
mjg
b51585a153 proc: update list manipulation comment on process exit
Processes stay in the hash until they get reaped.

This code does not unlink the child from the parent, so remove
the claim that it does.

Sponsored by:	The FreeBSD Foundation
2018-11-21 22:16:10 +00:00
mjg
71aabf21a1 proc: implement pid hash locks and an iterator
forks, exits and waits are frequently stalled during poudriere -j 128 runs
due to killpg and process list exports performed for each package.

Both uses take the allproc lock. The latter case can be modified to iterate
over the hash with finer grained locking instead.

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17817
2018-11-21 18:56:15 +00:00
mjg
4493b1d3a8 proc: always store parent pid in p_oppid
Doing so removes the dependency on proctree lock from sysctl process list
export which further reduces contention during poudriere -j 128 runs.

Reviewed by:	kib (previous version)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17825
2018-11-16 17:07:54 +00:00
kib
a8cb340144 Add PROC_PDEATHSIG_SET to procctl interface.
Allow processes to request the delivery of a signal upon death of
their parent process.  Supposed consumer of the feature is PostgreSQL.

Submitted by:	Thomas Munro
Reviewed by:	jilles, mjg
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D15106
2018-04-18 21:31:13 +00:00
brooks
9d79658aab Move most of the contents of opt_compat.h to opt_global.h.
opt_compat.h is mentioned in nearly 180 files. In-progress network
driver compabibility improvements may add over 100 more so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.

Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c.  A fake _COMPAT_LINUX option ensure opt_compat.h
is created on all architectures.

Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.

Reviewed by:	kib, cem, jhb, jtl
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14941
2018-04-06 17:35:35 +00:00
mjg
25e9d331dc Reduce contention on the proctree lock during heavy package build.
There is a proctree -> allproc ordering established.

Most of the time it is either xlock -> xlock or slock -> slock.

On fork however there is a slock -> xlock pair which results in
pathological wait times due to threads keeping proctree held for
reading and all waiting on allproc. Switch this to xlock -> xlock.
Longer term fix would get rid of proctree in this place to begin with.
Right now it is necessary to walk the session/process group lists to
determine which id is free. The walk can be avoided e.g. with bitmaps.

The exit path used to have one place which dealt with allproc and
then with proctree. Move the allproc acquire into the section protected
by proctree. This reduces contention against threads waiting on proctree
in the fork codepath - the fork proctree holder does not have to wait
for allproc as often.

Finally, move tidhash manipulation outside of the area protected by
either of these locks. The removal from the hash was already unprotected.
There is no legitimate reason to look up thread ids for a process still
under construction.

This results in about 50% wait time reduction during -j 128 package build.
2018-02-20 02:18:30 +00:00
mjg
7fcf37e646 Fix process exit vs reap race introduced in r329449
The race manifested itself mostly in terms of crashes with "spin lock
held too long".

Relevant parts of respective code paths:

exit:				reap:
PROC_LOCK(p);
PROC_SLOCK(p);
p->p_state == PRS_ZOMBIE
PROC_UNLOCK(p);
				PROC_LOCK(p);
/* exit work */
				if (p->p_state == PRS_ZOMBIE) /* true */
					proc_reap()
					free proc
/* more exit work */
PROC_SUNLOCK(p);

Thus a still exiting process is reaped.

Prior to the change the zombie check was followed by slock/sunlock trip
which prevented the problem.

Even code prior to this commit has a bug: the proc is still accessed for
statistic collection purposes. However, the severity is rather small and
the bug may be fixed in a future commit.

Reported by:	many
Tested by:	allanjude
2018-02-19 00:54:08 +00:00
mjg
7669ad8f23 exit: get rid of PROC_SLOCK when checking a process to report, take #2
The suspension counter needs synchronisation through slock, but we don't
need it to check if inspecting the counter is necessary to begin with.
In the common case it is not, thus avoid the lock if possible.

Reviewed by:	kib
Tested by:	pho
2018-02-18 21:07:15 +00:00
mjg
075146c37f Revert r329448.
Turns out is is actually racy, reproducible with stress2/misc/truss.sh

Requested by:	kib
2018-02-17 17:23:43 +00:00
mjg
018f299fa7 exit: stop doing PROC_SLOCK just to call proc_reap
It immediately does PROC_SUNLOCK anyway and the lock plays no role.
2018-02-17 09:03:11 +00:00
mjg
fc7474d6fb exit: get rid of PROC_SLOCK when checking a process to report
All accessed fields are protected with already held process lock.
2018-02-17 08:48:45 +00:00
mjg
7616e6b9b0 On process exit signal the parent after dropping the proctree lock. 2018-02-17 00:24:50 +00:00
mjg
6b281480bb Unref the prison after proctree is dropped. 2018-02-17 00:23:56 +00:00
mjg
c1f86f952d Tidy up kern_wait6
- don't relock curproc in msleep
- don't relock proctree if P_STATCHILD is spotted
- reformat the proc_to_reap call in the main loop
2018-02-17 00:21:50 +00:00
jeff
94c7af8ca2 Implement 'domainset', a cpuset based NUMA policy mechanism. This allows
userspace to control NUMA policy administratively and programmatically.

Implement domainset based iterators in the page layer.

Remove the now legacy numa_* syscalls.

Cleanup some header polution created by having seq.h in proc.h.

Reviewed by:	markj, kib
Discussed with:	alc
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D13403
2018-01-12 22:48:23 +00:00
pfg
4736ccfd9c sys: further adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
2017-11-20 19:43:44 +00:00
mjoras
9c18ca3bd2 Introduce EVENTHANDLER_LIST and some users.
This introduces a facility to EVENTHANDLER(9) for explicitly defining a
reference to an event handler list. This is useful since previously all
invokers of events had to do a locked traversal of the global list of
event handler lists in order to find the appropriate event handler list.
By keeping a pointer to the appropriate list an invoker can avoid this
traversal completely. The pointer is initialized with SYSINIT(9) during
the eventhandler stage. Users registering interest in events do not need
to know if the event is backed by such a list, since the list is added
to the global list of lists. As with lists that are not pre-defined it
is safe to register for the events before the list has been created.

This converts the process_* and thread_* events to using the new
facility, as these are events whose locked traversals end up showing up
significantly in ports build workflows (and presumably other workflows
with many short lived threads/procs). It may be advantageous to convert
other events to using the new facility.

The el_flags field is now unused, but leave it be so that this revision
can be MFC'd.

Reviewed by:	bdrewery, markj, mjg
Approved by:	rstone (mentor)
In collaboration with:  ian
MFC after:      4 weeks
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D12814
2017-11-09 22:51:48 +00:00
kib
34f9433195 Avoid reusing p_ksi while it is on queue.
When sending SIGCHLD informing reaper that a zombie was reparented to
it, we might race with the situation where the previous parent still
not finished delivering SIGCHLD and having its p_ksi structure on the
signal queue.  While on queue, the ksi should not be used for another
send.

Fix this by copying p_ksi into newly allocated ksi, which is directly
put onto reaper sigqueue.  The later ensures that siginfo for reaper
SIGCHLD is always present, similar to guarantees for siginfo of child.

Reported by:	bdrewery
Discussed with:	jilles
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-03-12 13:58:51 +00:00
kib
9973ae075e Do not leak mount references for dying threads.
Thread might create a condition for delayed SU cleanup, which creates
a reference to the mount point in td_su, but exit without returning
through userret(), e.g. when terminating due to single-threading or
process exit.  In this case, td_su reference is not dropped and mount
point cannot be freed.

Handle the situation by clearing td_su also in the thread destructor
and in exit1().  softdep_ast_cleanup() has to receive the thread as
argument, since e.g. thread destructor is executed in different
context.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2017-02-25 10:38:18 +00:00
kib
16d32c998e When a zombie gets reparented due to the parent exit, send SIGCHLD to
the reaper.

The traditional reaper init(8) is aware of zombies silently reparented
to it after the parents exit, it loops around waitpid(2) to collect
them.  For other reapers, the silent reparenting is surprising and
collecting zombies requires a thread blocking in waitpid(2) just for
that purpose.  It seems that sending second SIGCHLD is a better
workaround than forcing all reapers to obey the setup.

Reported by:	 Michael Zuo <muh.muhten@gmail.com>, jilles
PR:	213928
Reviewed by:	jilles (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-12-12 11:11:50 +00:00
kib
aefd35d8c2 Restructure the code to handle reporting of non-exited processes from
wait(2).
- Do not acquire the process spinlock if neither WTRAPPED nor WUNTRACED
  options were passed [1].
- Extract the code to report alive process into a new helper
  report_alive_proc() and use it for trapped, stopped and continued
  childrens.

Note that the process spinlock is required around the WTRAPPED and
WUNTRACED tests, because P_STOPPED_TRACE and P_STOPPED_SIG flags are
set before other threads are stopped at the suspension point, and that
threads increment p_suspcount while owning only the process spinlock,
the process lock is dropped by them.  If the spinlock is not taken for
tests, the syscall thread might miss both p_suspcount increment and
wakeup in wakeup in thread_suspend_switch().

Based on the submission by:	mjg [1]
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-12-04 20:44:58 +00:00
mjg
cbc182c172 wait: avoid relocking the child if proc_to_reap returns 1
proc_to_reap would always unlock. However, if it returned 1, kern_wait6
would immediately lock it again. Save the dance.

Reviewed by:	kib
2016-11-24 18:21:48 +00:00
emaste
00b67b15b9 Renumber license clauses in sys/kern to avoid skipping #3 2016-09-15 13:16:20 +00:00
jilles
b8f715c74f wait: Do not copyout uninitialized status/rusage/wrusage.
If wait4() or wait6() return 0 because of WNOHANG, the status, rusage and
wrusage information should not be returned.

PR:		212048
Reported by:	Casey Lucas
MFC after:	2 weeks
2016-09-09 21:58:48 +00:00
kib
6ebb9a02fc When a debugger attaches to the process, SIGSTOP is sent to the
target.  Due to a way issignal() selects the next signal to deliver
and report, if the simultaneous or already pending another signal
exists, that signal might be reported by the next waitpid(2) call.
This causes minor annoyance for debuggers, which must be prepared to
take any signal as the first event, then filter SIGSTOP later.

More importantly, for tools like gcore(1), which attach and then
detach without processing events, SIGSTOP might leak to be delivered
after PT_DETACH.  This results in the process being unintentionally
stopped after detach, which is fatal for automatic tools.

The solution is to force SIGSTOP to be the first signal reported after
the attach.  Attach code is modified to set P2_PTRACE_FSTP to indicate
that the attaching ritual was not yet finished, and issignal() prefers
SIGSTOP in that condition.  Also, the thread which handles
P2_PTRACE_FSTP is made to guarantee to own p_xthread during the first
waitpid(2).  All that ensures that SIGSTOP is consumed first.

Additionally, if P2_PTRACE_FSTP is still set on detach, which means
that waitpid(2) was not called at all, SIGSTOP is removed from the
queue, ensuring that the process is resumed on detach.

In issignal(), when acting on STOPing signals, remove the signal from
queue before suspending.  Otherwise parallel attach could result in
ptracestop() acting on that STOP as if it was the STOP signal from the
attach.  Then SIGSTOP from attach leaks again.

As a minor refactoring, some bits of the common attach code is moved
to new helper proc_set_traced().

Reported by:	markj
Reviewed by:	jhb, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D7256
2016-07-28 08:41:13 +00:00
kib
368a94767c Fix another bug after r302350.
Reported and tested by:	pho
PR:	210884
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2016-07-18 04:30:34 +00:00
jhb
91d07047c4 Add a mask of optional ptrace() events.
ptrace() now stores a mask of optional events in p_ptevents.  Currently
this mask is a single integer, but it can be expanded into an array of
integers in the future.

Two new ptrace requests can be used to manipulate the event mask:
PT_GET_EVENT_MASK fetches the current event mask and PT_SET_EVENT_MASK
sets the current event mask.

The current set of events include:
- PTRACE_EXEC: trace calls to execve().
- PTRACE_SCE: trace system call entries.
- PTRACE_SCX: trace syscam call exits.
- PTRACE_FORK: trace forks and auto-attach to new child processes.
- PTRACE_LWP: trace LWP events.

The S_PT_SCX and S_PT_SCE events in the procfs p_stops flags have
been replaced by PTRACE_SCE and PTRACE_SCX.  PTRACE_FORK replaces
P_FOLLOW_FORK and PTRACE_LWP replaces P2_LWP_EVENTS.

The PT_FOLLOW_FORK and PT_LWP_EVENTS ptrace requests remain for
compatibility but now simply toggle corresponding flags in the
event mask.

While here, document that PT_SYSCALL, PT_TO_SCE, and PT_TO_SCX both
modify the event mask and continue the traced process.

Reviewed by:	kib
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D7044
2016-07-15 15:32:09 +00:00
kib
15841a7afe When filt_proc() removes event from the knlist due to the process
exiting (NOTE_EXIT->knlist_remove_inevent()), two things happen:
- knote kn_knlist pointer is reset
- INFLUX knote is removed from the process knlist.
And, there are two consequences:
- KN_LIST_UNLOCK() on such knote is nop
- there is nothing which would block exit1() from processing past the
  knlist_destroy() (and knlist_destroy() resets knlist lock pointers).
Both consequences result either in leaked process lock, or
dereferencing NULL function pointers for locking.

Handle this by stopping embedding the process knlist into struct proc.
Instead, the knlist is allocated together with struct proc, but marked
as autodestroy on the zombie reap, by knlist_detach() function.  The
knlist is freed when last kevent is removed from the list, in
particular, at the zombie reap time if the list is empty.  As result,
the knlist_remove_inevent() is no longer needed and removed.

Other changes:

In filt_procattach(), clear NOTE_EXEC and NOTE_FORK desired events
from kn_sfflags for knote registered by kernel to only get NOTE_CHILD
notifications.  The flags leak resulted in excessive
NOTE_EXEC/NOTE_FORK reports.

Fix immediate note activation in filt_procattach().  Condition should
be either the immediate CHILD_NOTE activation, or immediate NOTE_EXIT
report for the exiting process.

In knote_fork(), do not perform racy check for KN_INFLUX before kq
lock is taken.  Besides being racy, it did not accounted for notes
just added by scan (KN_SCAN).

Some minor and incomplete style fixes.

Analyzed and tested by:	Eric Badger <eric@badgerio.us>
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Approved by:	re (gjb)
Differential revision:	https://reviews.freebsd.org/D6859
2016-06-27 21:52:17 +00:00
kib
8da898f26c Add implementation of robust mutexes, hopefully close enough to the
intention of the POSIX IEEE Std 1003.1TM-2008/Cor 1-2013.

A robust mutex is guaranteed to be cleared by the system upon either
thread or process owner termination while the mutex is held.  The next
mutex locker is then notified about inconsistent mutex state and can
execute (or abandon) corrective actions.

The patch mostly consists of small changes here and there, adding
neccessary checks for the inconsistent and abandoned conditions into
existing paths.  Additionally, the thread exit handler was extended to
iterate over the userspace-maintained list of owned robust mutexes,
unlocking and marking as terminated each of them.

The list of owned robust mutexes cannot be maintained atomically
synchronous with the mutex lock state (it is possible in kernel, but
is too expensive).  Instead, for the duration of lock or unlock
operation, the current mutex is remembered in a special slot that is
also checked by the kernel at thread termination.

Kernel must be aware about the per-thread location of the heads of
robust mutex lists and the current active mutex slot.  When a thread
touches a robust mutex for the first time, a new umtx op syscall is
issued which informs about location of lists heads.

The umtx sleep queues for PP and PI mutexes are split between
non-robust and robust.

Somewhat unrelated changes in the patch:
1. Style.
2. The fix for proper tdfind() call use in umtxq_sleep_pi() for shared
   pi mutexes.
3. Removal of the userspace struct pthread_mutex m_owner field.
4. The sysctl kern.ipc.umtx_vnode_persistent is added, which controls
   the lifetime of the shared mutex associated with a vnode' page.

Reviewed by:	jilles (previous version, supposedly the objection was fixed)
Discussed with:	brooks, Martin Simmons <martin@lispworks.com> (some aspects)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2016-05-17 09:56:22 +00:00
mjg
eabf748a18 session: avoid proctree lock on proc exit when possible
We can get away with the common case with only proc lock held.

Reviewed by:	kib
2016-01-20 23:33:58 +00:00
markj
fa1b8e9a4f Fix style issues around existing SDT probes.
- Use SDT_PROBE<N>() instead of SDT_PROBE(). This has no functional effect
  at the moment, but will be needed for some future changes.
- Don't hardcode the module component of the probe identifier. This is
  set automatically by the SDT framework.

MFC after:	1 week
2015-12-16 23:39:27 +00:00
kib
753bf3d282 Enforce the maxproc limitation before allocating struct proc, initial
struct thread and kernel stack for the thread.  Otherwise, a load
similar to a fork bomb would exhaust KVA and possibly kmem, mostly due
to the struct proc being type-stable.

The nprocs counter is changed from being protected by allproc_lock sx
to be an atomic variable.  Note that ddb/db_ps.c:db_ps() use of nprocs
was unsafe before, and is still unsafe, but it seems that the only
possible undesired consequence is the harmless warning printed when
allproc linked list length does not match nprocs.

Diagnosed by:	Svatopluk Kraus <onwahe@gmail.com>
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-10-08 11:07:09 +00:00
avg
425c0bb088 save some bytes by using more concise SDT_PROBE<n> instead of SDT_PROBE
SDT_PROBE requires 5 parameters whereas SDT_PROBE<n> requires n parameters
where n is typically smaller than 5.

Perhaps SDT_PROBE should be made a private implementation detail.

MFC after:	20 days
2015-09-28 12:14:16 +00:00
oshogbo
7dae5ce371 When the wait*(2) syscalls wait for any process (P_ALL), they should
ignore processes created with the pdfork(2) syscall.

PR:		201054
Approved by:	pjd (mentor)
Discussed with:	emaste, rwatson
2015-08-12 20:08:54 +00:00
kib
48ccbdea81 The si_status field of the siginfo_t, provided by the waitid(2) and
SIGCHLD signal, should keep full 32 bits of the status passed to the
_exit(2).

Split the combined p_xstat of the struct proc into the separate exit
status p_xexit for normal process exit, and signalled termination
information p_xsig.  Kernel-visible macro KW_EXITCODE() reconstructs
old p_xstat from p_xexit and p_xsig.  p_xexit contains complete status
and copied out into si_status.

Requested by:	Joerg Schilling
Reviewed by:	jilles (previous version), pho
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2015-07-18 09:02:50 +00:00
adrian
41db4b88e0 Add an initial NUMA affinity/policy configuration for threads and processes.
This is based on work done by jeff@ and jhb@, as well as the numa.diff
patch that has been circulating when someone asks for first-touch NUMA
on -10 or -11.

* Introduce a simple set of VM policy and iterator types.
* tie the policy types into the vm_phys path for now, mirroring how
  the initial first-touch allocation work was enabled.
* add syscalls to control changing thread and process defaults.
* add a global NUMA VM domain policy.
* implement a simple cascade policy order - if a thread policy exists, use it;
  if a process policy exists, use it; use the default policy.
* processes inherit policies from their parent processes, threads inherit
  policies from their parent threads.
* add a simple tool (numactl) to query and modify default thread/process
  policities.
* add documentation for the new syscalls, for numa and for numactl.
* re-enable first touch NUMA again by default, as now policies can be
  set in a variety of methods.

This is only relevant for very specific workloads.

This doesn't pretend to be a final NUMA solution.

The previous defaults in -HEAD (with MAXMEMDOM set) can be achieved by
'sysctl vm.default_policy=rr'.

This is only relevant if MAXMEMDOM is set to something other than 1.
Ie, if you're using GENERIC or a modified kernel with non-NUMA, then
this is a glorified no-op for you.

Thank you to Norse Corp for giving me access to rather large
(for FreeBSD!) NUMA machines in order to develop and verify this.

Thank you to Dell for providing me with dual socket sandybridge
and westmere v3 hardware to do NUMA development with.

Thank you to Scott Long at Netflix for providing me with access
to the two-socket, four-domain haswell v3 hardware.

Thank you to Peter Holm for running the stress testing suite
against the NUMA branch during various stages of development!

Tested:

* MIPS (regression testing; non-NUMA)
* i386 (regression testing; non-NUMA GENERIC)
* amd64 (regression testing; non-NUMA GENERIC)
* westmere, 2 socket (thankyou norse!)
* sandy bridge, 2 socket (thankyou dell!)
* ivy bridge, 2 socket (thankyou norse!)
* westmere-EX, 4 socket / 1TB RAM (thankyou norse!)
* haswell, 2 socket (thankyou norse!)
* haswell v3, 2 socket (thankyou dell)
* haswell v3, 2x18 core (thankyou scott long / netflix!)

* Peter Holm ran a stress test suite on this work and found one
  issue, but has not been able to verify it (it doesn't look NUMA
  related, and he only saw it once over many testing runs.)

* I've tested bhyve instances running in fixed NUMA domains and cpusets;
  all seems to work correctly.

Verified:

* intel-pcm - pcm-numa.x and pcm-memory.x, whilst selecting different
  NUMA policies for processes under test.

Review:

This was reviewed through phabricator (https://reviews.freebsd.org/D2559)
as well as privately and via emails to freebsd-arch@.  The git history
with specific attributes is available at https://github.com/erikarn/freebsd/
in the NUMA branch (https://github.com/erikarn/freebsd/compare/local/adrian_numa_policy).

This has been reviewed by a number of people (stas, rpaulo, kib, ngie,
wblock) but not achieved a clear consensus.  My hope is that with further
exposure and testing more functionality can be implemented and evaluated.

Notes:

* The VM doesn't handle unbalanced domains very well, and if you have an overly
  unbalanced memory setup whilst under high memory pressure, VM page allocation
  may fail leading to a kernel panic.  This was a problem in the past, but it's
  much more easily triggered now with these tools.

* This work only controls the path through vm_phys; it doesn't yet strongly/predictably
  affect contigmalloc, KVA placement, UMA, etc.  So, driver placement of memory
  isn't really guaranteed in any way.  That's next on my plate.

Sponsored by:	Norse Corp, Inc.; Dell
2015-07-11 15:21:37 +00:00
ed
ed913b3a6c Don't clobber td->td_retval[0] in proc_reap().
While writing tests for CloudABI, I noticed that close() on process
descriptors returns the process ID of the child process. This is
interesting, as close() is only allowed to return 0 or -1. It turns out
that we clobber td->td_retval[0] in proc_reap(), so that wait*()
properly returns the process ID.

Change proc_reap() to leave td->td_retval[0] alone. Set the return value
in kern_wait6() instead, by keeping track of the PID before we
(potentially) reap the process.

Differential Revision:	https://reviews.freebsd.org/D3032
Reviewed by:	kib
2015-07-09 12:04:45 +00:00
kib
e2f56205b5 Remove several write-only variables, all reported by the gcc 4.9
buildkernel run.

Some of them were write-only under some kernel options, e.g. variables
keeping values only used by CTR() macros.  It costs nothing to the
code readability and correctness to eliminate the warnings in those
cases too by removing the local cached values used only for
single-access.

Review:	https://reviews.freebsd.org/D2665
Reviewed by:	rodrigc
Looked at by:	bjk
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-05-29 13:24:17 +00:00
kib
ff588ae9b0 Currently, softupdate code detects overstepping on the workitems
limits in the code which is deep in the call stack, and owns several
critical system resources, like vnode locks.  Attempt to wait while
the per-mount softupdate thread cleans up the backlog may deadlock,
because the thread might need to lock the same vnode which is owned by
the waiting thread.

Instead of synchronously waiting for the worker, perform the worker'
tickle and pause until the backlog is cleaned, at the safe point
during return from kernel to usermode.  A new ast request to call
softdep_ast_cleanup() is created, the SU code now only checks the size
of queue and schedules ast.

There is no ast delivery for the kernel threads, so they are exempted
from the mechanism, except NFS daemon threads.  NFS server loop
explicitely checks for the request, and informs the schedule_cleanup()
that it is capable of handling the requests by the process P2_AST_SU
flag.  This is needed because nfsd may be the sole cause of the SU
workqueue overflow.  But, to not cause nsfd to spawn additional
threads just because we slow down existing workers, only tickle su
threads, without waiting for the backlog cleanup.

Reviewed by:	jhb, mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-05-27 09:20:42 +00:00
jhb
588ce3236a Do not allow a process to reap an orphan (a child currently being
traced by another process such as a debugger). The parent process does
need to check for matching orphan pids to avoid returning ECHILD if an
orphan has exited, but it should not return the exited status for the
child until after the debugger has detached from the orphan process
either explicitly or implicitly via wait().

Add two tests for for this case: one where the debugger is the direct
child (thus the parent has a non-empty children list) and one where
the debugger is not a direct child (so the only "child" of the parent
is the orphan).

Differential Revision:	https://reviews.freebsd.org/D2644
Reviewed by:	kib
MFC after:	2 weeks
2015-05-26 10:29:37 +00:00
jhb
01eefee0f3 Add KTR tracing for some MI ptrace events.
Differential Revision:	https://reviews.freebsd.org/D2643
Reviewed by:	kib
2015-05-25 22:13:22 +00:00
jhb
5ad66fe3b2 Only reparent a traced process to its old parent if the tracing process is
not the old parent. Otherwise, proc_reap() will leave the zombie in place
resulting in the process' status being returned twice to its parent.

Add test cases for PT_TRACE_ME and PT_ATTACH which are fixed by
this change.

Differential Revision:	https://reviews.freebsd.org/D2594
Reviewed by:	kib
MFC after:	2 weeks
2015-05-22 11:04:54 +00:00
trasz
802017a04b Add kern.racct.enable tunable and RACCT_DISABLED config option.
The point of this is to be able to add RACCT (with RACCT_DISABLED)
to GENERIC, to avoid having to rebuild the kernel to use rctl(8).

Differential Revision:	https://reviews.freebsd.org/D2369
Reviewed by:	kib@
MFC after:	1 month
Relnotes:	yes
Sponsored by:	The FreeBSD Foundation
2015-04-29 10:23:02 +00:00
mjg
8c033d13b5 proc: get rid of proc lock + unlock pair in proc_reap
A comment in the code stated we PROC_LOCK and as a side effect guarantee
all writers released process lock. But at that point such lock was already
taken while we were removing the process from all lists, so it should be already
unreachable.
2015-03-16 01:09:49 +00:00