Commit Graph

403 Commits

Author SHA1 Message Date
Conrad Meyer
f3bae413e9 random(9): Deprecate random(9), remove meaningless srandom(9)
srandom(9) is meaningless on SMP systems or any system with, say,
interrupts.  One could never rely on random(9) to produce a reproducible
sequence of outputs on the basis of a specific srandom() seed because the
global state was shared by all kernel contexts.  As such, removing it is
literally indistinguishable to random(9) consumers (as compared with
retaining it).

Mark random(9) as deprecated and slated for quick removal.  This is not to
say we intend to remove all fast, non-cryptographic PRNG(s) in the kernel.
It/they just won't be random(9), as it exists today, in either name or
implementation.

Before random(9) is removed, a replacement will be provided and in-tree
consumers will be converted.

Note that despite the name, the random(9) interface does not bear any
resemblance to random(3).  Instead, it is the same crummy 1988 Park-Miller
LCG used in libc rand(3).
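
For reference, the Park-Miller "minimal standard" generator named above can be sketched in userland C as follows. This is an illustration under stated assumptions, not the kernel's random(9) source: the names pm_state, pm_seed and pm_next are hypothetical, and a 64-bit product is used where older code typically used Schrage's method. The single shared state word is what makes a seed meaningless once several contexts draw from the same sequence.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; a sketch of the Park-Miller "minimal standard" LCG. */
#define PM_MULTIPLIER	16807u		/* 7^5 */
#define PM_MODULUS	2147483647u	/* 2^31 - 1 */

/* One shared state word: every caller advances the same sequence. */
static uint32_t pm_state = 1;

static void
pm_seed(uint32_t seed)
{
	/* Valid states are 1 .. modulus - 1; a state of 0 would stay 0 forever. */
	seed %= PM_MODULUS;
	pm_state = (seed == 0) ? 1 : seed;
}

static uint32_t
pm_next(void)
{
	/* The 64-bit product cannot overflow, so no Schrage decomposition is needed. */
	pm_state = (uint32_t)(((uint64_t)pm_state * PM_MULTIPLIER) % PM_MODULUS);
	assert(pm_state != 0);
	return (pm_state);
}
```

If an interrupt handler calls pm_next() between two calls made by another consumer, that consumer sees a different sequence for the same seed, which is the non-reproducibility the message describes.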
2019-12-26 19:41:09 +00:00
Jeff Roberson
61a74c5ccd schedlock 1/4
Eliminate recursion from most thread_lock consumers.  Return from
sched_add() without the thread_lock held.  This eliminates unnecessary
atomics and lock word loads as well as reducing the hold time for
scheduler locks.  This will eventually allow for lockless remote adds.

Discussed with:	kib
Reviewed by:	jhb
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D22626
2019-12-15 21:11:15 +00:00
Gleb Smirnoff
279b9aabe3 Remove epoch tracker from struct thread. It was an ugly crutch to emulate
locking semantics for if_addr_rlock() and if_maddr_rlock().
2019-10-21 18:19:32 +00:00
Konstantin Belousov
b706be23b4 Update comment explaining create_init().
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2019-08-08 16:42:53 +00:00
Conrad Meyer
e2e050c8ef Extract eventhandler declarations to sys/_eventhandler.h
This allows replacing "sys/eventhandler.h" includes with "sys/_eventhandler.h"
in other header files (e.g., sys/{bus,conf,cpu}.h) and reduces header
pollution substantially.

EVENTHANDLER_DECLARE and EVENTHANDLER_LIST_DECLAREs were moved out of .c
files into appropriate headers (e.g., sys/proc.h, powernv/opal.h).

As a side effect of reduced header pollution, many .c files and headers no
longer contain needed definitions.  The remainder of the patch addresses
adding appropriate includes to fix those files.

LOCK_DEBUG and LOCK_FILE_LINE_ARG are moved to sys/_lock.h, as required by
sys/mutex.h since r326106 (but silently protected by header pollution prior
to this change).

No functional change (intended).  Of course, any out of tree modules that
relied on header pollution for sys/eventhandler.h, sys/lock.h, or
sys/mutex.h inclusion need to be fixed.  __FreeBSD_version has been bumped.
2019-05-20 00:38:23 +00:00
Edward Tomasz Napierala
1699546def Remove sv_pagesize, originally introduced with r100384.
In all of the architectures we have today, we always use PAGE_SIZE.
While in theory one could define different things, none of the
current architectures do, even the ones that have transitioned from
32-bit to 64-bit like i386 and arm. Some ancient mips binaries on
other systems used 8k instead of 4k, but we don't support running
those and likely never will due to their age and obscurity.

Reviewed by:	imp (who also contributed the commit message)
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D19280
2019-03-01 16:16:38 +00:00
Matt Macy
983ed4f9f1 lkpi: allow late binding of linux_alloc_current
Some consumers may be loosely coupled with the lkpi.
This allows them to call linux_alloc_current without
having a static dependency.

Reviewed by:	hps@
MFC after:	1 week
Sponsored by:	iX Systems
Differential Revision:	https://reviews.freebsd.org/D19257
2019-02-22 23:15:32 +00:00
Mateusz Guzik
628888f0e0 Remove iBCS2, part2: general kernel
Reviewed by:	kib (previous version)
Sponsored by:	The FreeBSD Foundation
2018-12-19 21:57:58 +00:00
Brooks Davis
827c3852fe Further simplify arguments to init.
With the removal of BOOTCDROM and fastboot support, this code always
passed "-s" or "--". The latter simply terminates getopt(3) processing
in init so we only need to pass "-s" in the single user case, or nothing
in other cases.

The passing of "--" seems to have been done to ensure that the number of
arguments passed to init was always the same and thus that argc was the
same.

Also GC the write-only variable pathlen (not in reviewed version).

Reviewed by:	kib, jhb
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D18441
2018-12-05 19:18:16 +00:00
Brooks Davis
68ea829fe7 Remove never enabled support for "fastboot".
This has been ifdef notyet since the import of BSD 4.4 Lite Kernel
Sources in r1541.

Sponsored by:	DARPA, AFRL
2018-12-05 17:35:15 +00:00
Brooks Davis
7a5db3a770 Remove ifdef BOOTCDROM option to start init.
When BOOTCDROM is defined (via CFLAGS as there is no config option)
it causes -C to be passed to init, but our init and the version of
sysinstall I glanced at in 6.x don't support -C. The last plausibly
related support was removed from the tree in 1995.

Reviewed by:	kib
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D18431
2018-12-05 17:29:14 +00:00
Brooks Davis
3a325dec32 Remove a needlessly clever hack to start init with sys_exec().
Construct a struct image_args with the help of new exec_args_*() helper
functions and call kern_execve().

The previous code mapped a page in userspace, copied arguments out
to it one at a time, and then constructed a struct execve_args all so
that sys_execve() can call exec_copyin_args() to copy the data back in
to a struct image_args.

Opencode the part of pre_execve()/post_execve() that releases a
reference to the initial vmspace. We don't need to stop threads like
they do.

Reviewed by:	kib, jhb (prior version)
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D15469
2018-12-04 00:15:47 +00:00
Gleb Smirnoff
905837ebe7 Initialize compatibility epoch tracker for thread0. Fixes
panics for drivers that call if_maddr_lock() during startup.

Reported by:	cy
2018-11-14 19:10:35 +00:00
Bjoern A. Zeeb
0fb9f29bae With the introduction of reapers and reaplists in r275800,
proc0 and init are set up as a circular dependency.

create_init() calls fork1() which calls do_fork(). There the
newproc (initproc) is set up with a reaper of proc0 whose reaper
points to itself. The newproc (initproc) is then put on its
reaper's (proc0) p_reaplist (initproc is a descendant of proc0
for proc0 to reap). Upon return to create_init(), proc0 is
added to initproc's p_reaplist (which would mean proc0 is a
descendant of init, for init to reap). This creates a
circular dependency which eventually leads to LIST corruptions
when trying to kill init and a proc0.

For the base system we never really hit this case during reboot.
The problem only became visible after adding more virtual process
spaces which could go away cleanly (work existing in an experimental
branch).

Reviewed by:	kib
Sponsored by:	iXsystems, Inc.
Differential Revision:	https://reviews.freebsd.org/D15924
2018-07-05 16:16:28 +00:00
Kyle Evans
c7962400c9 Add debug.verbose_sysinit tunable for VERBOSE_SYSINIT
VERBOSE_SYSINIT is currently an all-or-nothing option. debug.verbose_sysinit
adds an option to have the code compiled in but quiet by default so that
getting this information from a device in the field doesn't necessarily
require distributing a recompiled kernel.

Its default is VERBOSE_SYSINIT's value as defined in the kernconf. As such,
the default behavior for simply omitting or including this option is
unchanged.

MFC after:	1 week
2018-06-20 19:23:56 +00:00
Bryan Drewery
03bd1b693e proc0_post: Fix some locking issues
- Filter out PRS_NEW procs as rufetch() tries taking the thread lock
  which may not yet be initialized.
- Hold PROC_LOCK to ensure stability of iterating the threads.
- p_rux fields are protected by the process statlock as well.

MFC after:	2 weeks
Reviewed by:	kib
Sponsored by:	Dell EMC
Differential Revision:	https://reviews.freebsd.org/D15809
2018-06-15 00:36:41 +00:00
Brooks Davis
dedc82ae26 Use strsep() to parse init_path in start_init().
This simplifies the use of the path variable by making it NUL
terminated.  This is a prerequisite for further cleanups.
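
A minimal userland sketch of the strsep() approach, assuming a colon-separated path list such as the one init_path holds. split_init_path() is a hypothetical helper, not the code in start_init(). Each component comes back NUL-terminated because strsep() overwrites the delimiter in place.

```c
#define _DEFAULT_SOURCE	/* exposes strsep() on glibc; not needed on the BSDs */
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Split a colon-separated list in place and store up to max pointers to
 * the NUL-terminated components in out[].  Returns the number stored.
 * Hypothetical helper for illustration only.
 */
static size_t
split_init_path(char *list, char **out, size_t max)
{
	char *path;
	size_t n = 0;

	assert(out != NULL);
	while (n < max && (path = strsep(&list, ":")) != NULL) {
		if (*path != '\0')	/* skip empty components such as "a::b" */
			out[n++] = path;
	}
	return (n);
}
```

Because the buffer is modified, the caller has to pass a writable copy of the list rather than a string literal.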

Reviewed by:	imp
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D15467
2018-05-17 23:07:51 +00:00
Warner Losh
d5292812f8 Remove Giant from init creation and vfs_mountroot.
Sponsored by: Netflix
Discussed with: kib@, mckusick@
Differential Review: https://reviews.freebsd.org/D14712
2018-03-21 14:46:54 +00:00
Jeff Roberson
e2068d0bcd Use per-domain locks for vm page queue free. Move paging control from
global to per-domain state.  Protect reservations with the free lock
from the domain that they belong to.  Refactor to make vm domains more
of a first class object.

Reviewed by:    markj, kib, gallatin
Tested by:      pho
Sponsored by:   Netflix, Dell/EMC Isilon
Differential Revision:  https://reviews.freebsd.org/D14000
2018-02-06 22:10:07 +00:00
Pedro F. Giffuni
ac2fffa4b7 Revert r327828, r327949, r327953, r328016-r328026, r328041:
Uses of mallocarray(9).

The use of mallocarray(9) has rocketed the required swap to build FreeBSD.
This is likely caused by the allocation size attributes which put extra pressure
on the compiler.

Given that most of these checks are superfluous we have to choose better
where to use mallocarray(9). We still have more uses of mallocarray(9) but
hopefully this is enough to bring swap usage to a reasonable level.

Reported by:	wosch
PR:		225197
2018-01-21 15:42:36 +00:00
Pedro F. Giffuni
a18a2290cd kern: make some use of mallocarray(9).
Focus on code where we are doing multiplications within malloc(9). None of
these are likely to overflow, however the change is still useful as some
static checkers can benefit from the allocation attributes we use for
mallocarray.
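
The idea of checking the multiplication before allocating can be sketched in userland C. alloc_array() below is a hypothetical stand-in and makes no claim about how mallocarray(9) itself reports overflow; it simply refuses a request whose byte count would wrap.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Userland sketch of an overflow-checked array allocation.
 * alloc_array() is a hypothetical name, not mallocarray(9).
 */
static void *
alloc_array(size_t nmemb, size_t size)
{
	/* If nmemb * size would exceed SIZE_MAX, refuse instead of under-allocating. */
	if (size != 0 && nmemb > SIZE_MAX / size)
		return (NULL);
	return (malloc(nmemb * size));
}

/* Usage sketch: a small request is passed through, a wrapping one is refused. */
static void
alloc_array_demo(void)
{
	int *v = alloc_array(8, sizeof(*v));

	assert(alloc_array(SIZE_MAX, 2) == NULL);
	free(v);
}
```

Open-coding malloc(n * size) gives the compiler and static checkers no hint that a product is involved; routing it through one function makes the check uniform.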

This initial sweep only covers malloc(9) calls with M_NOWAIT. No good
reason but I started doing the changes before r327796 and at that time it
was convenient to make sure the surrounding code could handle NULL values.

X-Differential revision: https://reviews.freebsd.org/D13837
2018-01-15 21:18:04 +00:00
Jeff Roberson
3f289c3fcf Implement 'domainset', a cpuset based NUMA policy mechanism. This allows
userspace to control NUMA policy administratively and programmatically.

Implement domainset based iterators in the page layer.

Remove the now legacy numa_* syscalls.

Clean up some header pollution created by having seq.h in proc.h.

Reviewed by:	markj, kib
Discussed with:	alc
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D13403
2018-01-12 22:48:23 +00:00
Colin Percival
6032e08810 Use the TSLOG framework to record entry/exit timestamps for machine
independent functions with important roles in the early boot process:
mi_startup (with the "exit" recorded when it becomes swapper),
start_init (with the "exit" recorded when the thread is about to
"return" into the newly created init process), vfs_mountroot, and
vfs_mountroot_wait.
2017-12-31 09:22:31 +00:00
Ed Schouten
814629dd64 Don't let cpu_set_syscall_retval() clobber exec_setregs().
Upon successful completion, the execve() system call invokes
exec_setregs() to initialize the registers of the initial thread of the
newly executed process. What is weird is that when execve() returns, it
still goes through the normal system call return path, clobbering the
registers with the system call's return value (td->td_retval).

Though this doesn't seem to be problematic for x86 most of the time (as
the value of eax/rax doesn't matter upon startup), this can be pretty
frustrating for architectures where function argument and return
registers overlap (e.g., ARM). On these systems, exec_setregs() also
needs to initialize td_retval.

Even worse are architectures where cpu_set_syscall_retval() sets
registers to values not derived from td_retval. On these architectures,
there is no way cpu_set_syscall_retval() can set registers to the way it
wants them to be upon the start of execution.

To get rid of this madness, let sys_execve() return EJUSTRETURN. This
will cause cpu_set_syscall_retval() to leave registers intact. This
makes process execution easier to understand. It also eliminates the
difference between execution of the initial process and successive ones.
The initial call to sys_execve() is not performed through a system call
context.

Reviewed by:	kib, jhibbits
Differential Revision:	https://reviews.freebsd.org/D13180
2017-11-24 07:35:08 +00:00
Pedro F. Giffuni
df57947f08 spdx: initial adoption of licensing ID tags.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
supersede or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.

Initially, only tag files that use BSD 4-Clause "Original" license.

RelNotes:	yes
Differential Revision:	https://reviews.freebsd.org/D13133
2017-11-18 14:26:50 +00:00
Matt Joras
2ca45184dc Introduce EVENTHANDLER_LIST and some users.
This introduces a facility to EVENTHANDLER(9) for explicitly defining a
reference to an event handler list. This is useful since previously all
invokers of events had to do a locked traversal of the global list of
event handler lists in order to find the appropriate event handler list.
By keeping a pointer to the appropriate list an invoker can avoid this
traversal completely. The pointer is initialized with SYSINIT(9) during
the eventhandler stage. Users registering interest in events do not need
to know if the event is backed by such a list, since the list is added
to the global list of lists. As with lists that are not pre-defined it
is safe to register for the events before the list has been created.
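
The difference between looking a list up by name on every invocation and keeping a pointer to it can be sketched as below. This is a simplified userland model with hypothetical sk_ names; it omits locking and the handler callbacks themselves, and it is not the EVENTHANDLER(9) implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One named event handler list; the handlers themselves are omitted. */
struct sk_handler_list {
	const char		*name;
	int			 invocations;
	struct sk_handler_list	*next;
};

static struct sk_handler_list *sk_all_lists;	/* global list of lists */
static unsigned sk_lookups;			/* number of name-based traversals */

static void
sk_list_register(struct sk_handler_list *l)
{
	l->next = sk_all_lists;
	sk_all_lists = l;
}

static struct sk_handler_list *
sk_list_find(const char *name)
{
	struct sk_handler_list *l;

	sk_lookups++;
	for (l = sk_all_lists; l != NULL; l = l->next)
		if (strcmp(l->name, name) == 0)
			return (l);
	return (NULL);
}

/* Without a saved pointer: every invocation pays for a traversal. */
static void
sk_invoke_by_name(const char *name)
{
	struct sk_handler_list *l = sk_list_find(name);

	if (l != NULL)
		l->invocations++;
}

/* With a saved pointer: the list was resolved once, so no traversal happens. */
static void
sk_invoke_direct(struct sk_handler_list *l)
{
	assert(l != NULL);
	l->invocations++;
}
```

Since the list is still registered globally, code that registers handlers by name keeps working without knowing whether the invoker holds a direct pointer.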

This converts the process_* and thread_* events to using the new
facility, as these are events whose locked traversals end up showing up
significantly in ports build workflows (and presumably other workflows
with many short lived threads/procs). It may be advantageous to convert
other events to using the new facility.

The el_flags field is now unused, but leave it be so that this revision
can be MFC'd.

Reviewed by:	bdrewery, markj, mjg
Approved by:	rstone (mentor)
In collaboration with:  ian
MFC after:      4 weeks
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D12814
2017-11-09 22:51:48 +00:00
Mateusz Guzik
c0b5261b55 Save on loginclass list locking by checking if caller already uses the struct
2017-11-01 06:12:14 +00:00
Mateusz Guzik
5949c7e504 Save on uihash table locking by checking if the caller already uses the struct
In particular with poudriere this saves about 90% of lookups.
2017-11-01 05:51:20 +00:00
Konstantin Belousov
2d88da2f06 Move struct syscall_args syscall arguments parameters container into
struct thread.

For all architectures, the syscall trap handlers have to allocate the
structure on the stack.  The structure takes 88 bytes on 64bit arches
which is not negligible.  Also, it cannot be easily found by other
code, which e.g. caused duplication of some members of the structure
to struct thread already.  The change removes td_dbg_sc_code and
td_dbg_sc_nargs which were directly copied from syscall_args.

The structure is put into the copied on fork part of the struct thread
to make the syscall arguments information correct in the child after
fork.

This move will also allow several more uses shortly.

Reviewed by:	jhb (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
X-Differential revision:	https://reviews.freebsd.org/D11080
2017-06-12 21:03:23 +00:00
Mateusz Guzik
f743ea9638 Bump struct thread alignment to 32.
This gives additional bits to use in locking primitives which store
the lock thread pointer in the lock value.
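
Why alignment frees up bits can be shown with a small sketch: an object aligned to 32 bytes has an address whose low five bits are zero, so a lock word holding that address can carry flags there. The struct and function names are hypothetical and the flag layout is an assumption, not the kernel's lock word format.

```c
#include <assert.h>
#include <stdint.h>

#define THREAD_ALIGN	32	/* alignment taken from the message above */
#define LOCK_FLAG_MASK	((uintptr_t)THREAD_ALIGN - 1)	/* low five bits, 0x1f */

/* Stand-in for struct thread; only its alignment matters here. */
struct aligned_thread {
	_Alignas(THREAD_ALIGN) int td_tid;
};

/* Pack an owner pointer and flag bits into a single lock word. */
static uintptr_t
lock_word_make(const struct aligned_thread *owner, uintptr_t flags)
{
	assert(((uintptr_t)owner & LOCK_FLAG_MASK) == 0);	/* guaranteed by alignment */
	assert((flags & ~LOCK_FLAG_MASK) == 0);			/* flags must fit in the low bits */
	return ((uintptr_t)owner | flags);
}

static struct aligned_thread *
lock_word_owner(uintptr_t word)
{
	return ((struct aligned_thread *)(word & ~LOCK_FLAG_MASK));
}

static uintptr_t
lock_word_flags(uintptr_t word)
{
	return (word & LOCK_FLAG_MASK);
}
```

Each doubling of the alignment yields one more flag bit at the cost of extra padding per allocated object.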

Discussed with:	kib
2017-02-07 17:03:22 +00:00
Bryan Drewery
28323add09 Fix improper use of "its".
Sponsored by:	Dell EMC Isilon
2016-11-08 23:59:41 +00:00
Kevin Lo
61f481fb7e Remove register keyword.
Reviewed by:	kib
2016-10-20 01:21:10 +00:00
Kevin Lo
7c68685366 Remove a sentence about putting initialization in init_proc.c or kern_proc.c
and useless comment.

Reviewed by:	kib
2016-10-20 01:19:37 +00:00
Nathan Whitehorn
8c636a11dc Remove assumptions in MI code that the BSP is CPU 0.
MFC after:	2 weeks
2016-07-11 21:25:28 +00:00
Konstantin Belousov
9e590ff04b When filt_proc() removes event from the knlist due to the process
exiting (NOTE_EXIT->knlist_remove_inevent()), two things happen:
- knote kn_knlist pointer is reset
- INFLUX knote is removed from the process knlist.
And, there are two consequences:
- KN_LIST_UNLOCK() on such knote is nop
- there is nothing which would block exit1() from processing past the
  knlist_destroy() (and knlist_destroy() resets knlist lock pointers).
Both consequences result either in leaked process lock, or
dereferencing NULL function pointers for locking.

Handle this by stopping embedding the process knlist into struct proc.
Instead, the knlist is allocated together with struct proc, but marked
as autodestroy on the zombie reap, by knlist_detach() function.  The
knlist is freed when last kevent is removed from the list, in
particular, at the zombie reap time if the list is empty.  As a result,
the knlist_remove_inevent() is no longer needed and removed.

Other changes:

In filt_procattach(), clear NOTE_EXEC and NOTE_FORK desired events
from kn_sfflags for knote registered by kernel to only get NOTE_CHILD
notifications.  The flags leak resulted in excessive
NOTE_EXEC/NOTE_FORK reports.

Fix immediate note activation in filt_procattach().  Condition should
be either the immediate NOTE_CHILD activation, or immediate NOTE_EXIT
report for the exiting process.

In knote_fork(), do not perform racy check for KN_INFLUX before kq
lock is taken.  Besides being racy, it did not account for notes
just added by scan (KN_SCAN).

Some minor and incomplete style fixes.

Analyzed and tested by:	Eric Badger <eric@badgerio.us>
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Approved by:	re (gjb)
Differential revision:	https://reviews.freebsd.org/D6859
2016-06-27 21:52:17 +00:00
Konstantin Belousov
5c2cf81845 Update comments for the MD functions managing contexts for new
threads, to make them less confusing and to use modern kernel terms.

Rename the functions to reflect current use of the functions, instead
of the historic KSE conventions:
  cpu_set_fork_handler -> cpu_fork_kthread_handler (for kthreads)
  cpu_set_upcall -> cpu_copy_thread (for forks)
  cpu_set_upcall_kse -> cpu_set_upcall (for new threads creation)

Reviewed by:	jhb (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (hrs)
Differential revision:	https://reviews.freebsd.org/D6731
2016-06-16 12:05:44 +00:00
Konstantin Belousov
93ccd6bf87 Get rid of struct proc p_sched and struct thread td_sched pointers.
p_sched is unused.

The struct td_sched is always co-allocated with the struct thread,
except for the thread0.  Avoid useless indirection, instead calculate
td_sched location using simple pointer arithmetic in td_get_sched(9).
For thread0, which is statically allocated, create a structure to
emulate layout of the dynamic allocation.
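
The pointer arithmetic described above can be sketched in userland C. The fake_ types and functions are hypothetical stand-ins with one field each; the point is only that scheduler data placed directly after the thread can be reached as &td[1], and that a statically allocated thread needs a wrapper with the same layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct fake_thread { int td_tid; };		/* stand-in for struct thread */
struct fake_td_sched { int ts_priority; };	/* stand-in for struct td_sched */

/* The scheduler data sits directly after the thread, so &td[1] reaches it. */
static struct fake_td_sched *
fake_td_get_sched(struct fake_thread *td)
{
	return ((struct fake_td_sched *)&td[1]);
}

/* Dynamic case: allocate both structures as one zeroed block. */
static struct fake_thread *
fake_thread_alloc(void)
{
	return (calloc(1, sizeof(struct fake_thread) +
	    sizeof(struct fake_td_sched)));
}

/* Static case: a wrapper structure reproduces the co-allocated layout. */
struct fake_thread0_storage {
	struct fake_thread	t0st_thread;
	struct fake_td_sched	t0st_sched;
};
static_assert(offsetof(struct fake_thread0_storage, t0st_sched) ==
    sizeof(struct fake_thread),
    "scheduler data must directly follow the thread");

static struct fake_thread0_storage fake_thread0_st;
```

Dropping the stored pointer removes one memory load on every access to the scheduler data, at the price of a layout rule that both allocation paths must honour.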

Reviewed by:	jhb (previous version)
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D6711
2016-06-05 17:04:03 +00:00
Pedro F. Giffuni
d9c9c81c08 sys: use our roundup2/rounddown2() macros when param.h is available.
rounddown2 tends to produce longer lines than the original code
and when the code has a high indentation level it was not really
advantageous to do the replacement.

This tries to strike a balance between readability using the macros
and flexibility of having the expressions, so not everything is
converted.
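
As an illustration of the kind of expression being replaced, the sketch below defines power-of-two rounding macros under SKETCH_ names (assumed here so as not to clash with the real ones in param.h) and wraps them in two hypothetical helpers. Both macros require the second argument to be a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Power-of-two rounding; y must be a power of two. */
#define SKETCH_ROUNDDOWN2(x, y)	((x) & ~((y) - 1))
#define SKETCH_ROUNDUP2(x, y)	(((x) + ((y) - 1)) & ~((y) - 1))

/* Round an address down to the start of its page. */
static uintptr_t
page_trunc(uintptr_t addr, uintptr_t pagesz)
{
	assert(pagesz != 0 && (pagesz & (pagesz - 1)) == 0);
	return (SKETCH_ROUNDDOWN2(addr, pagesz));	/* was: addr & ~(pagesz - 1) */
}

/* Round an address up to the next page boundary (unchanged if already aligned). */
static uintptr_t
page_round(uintptr_t addr, uintptr_t pagesz)
{
	assert(pagesz != 0 && (pagesz & (pagesz - 1)) == 0);
	return (SKETCH_ROUNDUP2(addr, pagesz));
}
```

The macro form names the intent, while the open-coded mask is shorter; that is the trade-off the message weighs for deeply indented code.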
2016-04-21 19:57:40 +00:00
Konstantin Belousov
db57c70a5b Rename P_KTHREAD struct proc p_flag to P_KPROC.
I left as is an apparent bug in ntoskrnl_var.h:AT_PASSIVE_LEVEL()
definition.

Suggested by:	jhb
Sponsored by:	The FreeBSD Foundation
2016-02-09 16:30:16 +00:00
John Baldwin
6270fa5f72 Mark proc0 as a kernel process via the P_KTHREAD flag.
All other kernel processes have this flag set and all threads in proc0
(including thread0) have the similar TDP_KTHREAD flag set.

PR:		204999
Submitted by:	Oliver Pinter @ HardenedBSD
Reviewed by:	kib
MFC after:	1 week
2016-02-08 23:06:27 +00:00
Mateusz Guzik
33fd9b9a2b fork: pass arguments to fork1 in a dedicated structure
Suggested by:	kib
2016-02-04 04:22:18 +00:00
Dmitry Chagin
038c720553 Implement vsyscall hack. Prior to 2.13 glibc uses vsyscall
instead of vdso. An upcoming linux_base-c6 needs it.

Differential Revision:  https://reviews.freebsd.org/D1090

Reviewed by:	kib, trasz
MFC after:	1 week
2016-01-09 20:18:53 +00:00
Konstantin Belousov
724f4b62b0 Remove sv_prepsyscall, sv_sigsize and sv_sigtbl members of the struct
sysent.

sv_prepsyscall is unused.

sv_sigsize and sv_sigtbl translate signal number from the FreeBSD
namespace into the ABI domain.  It is only utilized on i386 for iBCS2
binaries.  The issue with this approach is that signals for iBCS2 were
delivered with the FreeBSD signal frame layout, which does not follow
iBCS2.  The same note is true for any other potential user of
sv_sigtbl.  In other words, if ABI needs signal number translation, it
really needs custom sv_sendsig method instead.

Sponsored by:	The FreeBSD Foundation
2015-11-28 08:49:07 +00:00
Warner Losh
de830d432c Remove now obsolete comment.
MFC After: 2 days
2015-08-28 20:06:58 +00:00
Warner Losh
3f27281613 Per overwhelming sentiment in the code review, use FEATURE instead.
Differential Revision: https://reviews.freebsd.org/D3488
MFC After: 2 days
2015-08-28 19:53:19 +00:00
Warner Losh
135342777c When the kernel is compiled with INVARIANTS, export that as
debug.invariants.

Differential Revision: https://reviews.freebsd.org/D3488
MFC after: 3 days
2015-08-26 23:58:03 +00:00
Ed Schouten
367a13f905 Limit rights on process descriptors.
On CloudABI, the rights bits returned by cap_rights_get() match up with
the operations that you can actually perform on the file descriptor.

Limiting the rights is good, because it makes it easier to get uniform
behaviour across different operating systems. If process descriptors on
FreeBSD would suddenly gain support for any new file operation, this
wouldn't become exposed to CloudABI processes without first extending
the rights.

Extend fork1() to gain a 'struct filecaps' argument that allows you to
construct process descriptors with custom rights. Use this in
cloudabi_sys_proc_fork() to limit the rights to just fstat() and
pdwait().

Obtained from:	https://github.com/NuxiNL/freebsd
2015-07-31 10:21:58 +00:00
Mateusz Guzik
cd672ca60f Get rid of lim_update_thread and cred_update_thread.
Their primary use was in thread_cow_update to free up old resources.
Freeing had to be done with proc lock held and _cow_ funcs already knew
how to free old structs.
2015-07-16 14:30:11 +00:00
Adrian Chadd
6520495abc Add an initial NUMA affinity/policy configuration for threads and processes.
This is based on work done by jeff@ and jhb@, as well as the numa.diff
patch that has been circulating when someone asks for first-touch NUMA
on -10 or -11.

* Introduce a simple set of VM policy and iterator types.
* tie the policy types into the vm_phys path for now, mirroring how
  the initial first-touch allocation work was enabled.
* add syscalls to control changing thread and process defaults.
* add a global NUMA VM domain policy.
* implement a simple cascade policy order - if a thread policy exists, use it;
  if a process policy exists, use it; use the default policy.
* processes inherit policies from their parent processes, threads inherit
  policies from their parent threads.
* add a simple tool (numactl) to query and modify default thread/process
  policies.
* add documentation for the new syscalls, for numa and for numactl.
* re-enable first touch NUMA again by default, as now policies can be
  set in a variety of methods.
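
The cascade order in the list above (thread policy, then process policy, then the global default) can be sketched as follows. The enum values and sketch_ names are hypothetical and do not reproduce the identifiers or the set of policies in the actual change.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical policy identifiers; POLICY_NONE means "nothing set at this level". */
enum sketch_policy {
	POLICY_NONE = 0,
	POLICY_ROUND_ROBIN,
	POLICY_FIRST_TOUCH,
	POLICY_FIXED_DOMAIN
};

struct sketch_proc {
	enum sketch_policy	p_policy;
};

struct sketch_thread {
	enum sketch_policy	 td_policy;
	struct sketch_proc	*td_proc;
};

/* Global default, used when neither the thread nor the process set a policy. */
static enum sketch_policy sketch_default_policy = POLICY_FIRST_TOUCH;

/* Cascade: thread policy first, then process policy, then the default. */
static enum sketch_policy
sketch_policy_lookup(const struct sketch_thread *td)
{
	assert(td != NULL && td->td_proc != NULL);
	if (td->td_policy != POLICY_NONE)
		return (td->td_policy);
	if (td->td_proc->p_policy != POLICY_NONE)
		return (td->td_proc->p_policy);
	return (sketch_default_policy);
}
```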

This is only relevant for very specific workloads.

This doesn't pretend to be a final NUMA solution.

The previous defaults in -HEAD (with MAXMEMDOM set) can be achieved by
'sysctl vm.default_policy=rr'.

This is only relevant if MAXMEMDOM is set to something other than 1.
I.e., if you're using GENERIC or a modified non-NUMA kernel, then
this is a glorified no-op for you.

Thank you to Norse Corp for giving me access to rather large
(for FreeBSD!) NUMA machines in order to develop and verify this.

Thank you to Dell for providing me with dual socket sandybridge
and westmere v3 hardware to do NUMA development with.

Thank you to Scott Long at Netflix for providing me with access
to the two-socket, four-domain haswell v3 hardware.

Thank you to Peter Holm for running the stress testing suite
against the NUMA branch during various stages of development!

Tested:

* MIPS (regression testing; non-NUMA)
* i386 (regression testing; non-NUMA GENERIC)
* amd64 (regression testing; non-NUMA GENERIC)
* westmere, 2 socket (thankyou norse!)
* sandy bridge, 2 socket (thankyou dell!)
* ivy bridge, 2 socket (thankyou norse!)
* westmere-EX, 4 socket / 1TB RAM (thankyou norse!)
* haswell, 2 socket (thankyou norse!)
* haswell v3, 2 socket (thankyou dell)
* haswell v3, 2x18 core (thankyou scott long / netflix!)

* Peter Holm ran a stress test suite on this work and found one
  issue, but has not been able to verify it (it doesn't look NUMA
  related, and he only saw it once over many testing runs.)

* I've tested bhyve instances running in fixed NUMA domains and cpusets;
  all seems to work correctly.

Verified:

* intel-pcm - pcm-numa.x and pcm-memory.x, whilst selecting different
  NUMA policies for processes under test.

Review:

This was reviewed through phabricator (https://reviews.freebsd.org/D2559)
as well as privately and via emails to freebsd-arch@.  The git history
with specific attributes is available at https://github.com/erikarn/freebsd/
in the NUMA branch (https://github.com/erikarn/freebsd/compare/local/adrian_numa_policy).

This has been reviewed by a number of people (stas, rpaulo, kib, ngie,
wblock) but has not achieved a clear consensus.  My hope is that with further
exposure and testing more functionality can be implemented and evaluated.

Notes:

* The VM doesn't handle unbalanced domains very well, and if you have an overly
  unbalanced memory setup whilst under high memory pressure, VM page allocation
  may fail leading to a kernel panic.  This was a problem in the past, but it's
  much more easily triggered now with these tools.

* This work only controls the path through vm_phys; it doesn't yet strongly/predictably
  affect contigmalloc, KVA placement, UMA, etc.  So, driver placement of memory
  isn't really guaranteed in any way.  That's next on my plate.

Sponsored by:	Norse Corp, Inc.; Dell
2015-07-11 15:21:37 +00:00
Mateusz Guzik
4ea6a9a28f Generalised support for copy-on-write structures shared by threads.
Thread credentials are maintained as follows: each thread has a pointer to
creds and a reference on them. The pointer is compared with proc's creds on
userspace<->kernel boundary and updated if needed.

This patch introduces a counter which can be compared instead, so that more
structures can use this scheme without adding more comparisons on the boundary.
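
The counter scheme can be sketched as below. This is a simplified model with hypothetical sketch_ names that leaves out reference counting and locking; it shows only that a single generation comparison on the boundary replaces one pointer comparison per shared structure.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_cred {
	int cr_uid;
};

struct sketch_proc {
	struct sketch_cred	*p_ucred;
	unsigned		 p_cowgen;	/* bumped whenever a shared structure is replaced */
};

struct sketch_thread {
	struct sketch_proc	*td_proc;
	struct sketch_cred	*td_ucred;	/* thread's private pointer to the creds */
	unsigned		 td_cowgen;	/* generation the thread last synchronized with */
};

/* Replace the process credentials and advance the generation. */
static void
sketch_proc_set_cred(struct sketch_proc *p, struct sketch_cred *cr)
{
	p->p_ucred = cr;
	p->p_cowgen++;
}

/*
 * Boundary check: one counter comparison, however many structures are shared.
 * Returns 1 if the thread's pointers had to be refreshed, 0 otherwise.
 */
static int
sketch_thread_cow_check(struct sketch_thread *td)
{
	assert(td != NULL && td->td_proc != NULL);
	if (td->td_cowgen == td->td_proc->p_cowgen)
		return (0);
	td->td_ucred = td->td_proc->p_ucred;
	td->td_cowgen = td->td_proc->p_cowgen;
	return (1);
}
```

Adding another shared structure, such as resource limits, would mean one more assignment in the slow path and no extra work in the common case where the generations match.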
2015-06-10 10:43:59 +00:00